Bellman update

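Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') - Q(s, a)), where α is the learning rate, γ the discount factor, r the reward, and s' the next state. Off-policy: the target takes the max over next actions regardless of which action the behavior policy actually selects.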
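
A minimal sketch of this update on a NumPy table, assuming integer-indexed states and actions. The function name `q_update`, the default values of `alpha` and `gamma`, and the `done` flag (which drops the bootstrap term at terminal states) are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Off-policy target: greedy value of the next state, whatever the
    # behavior policy does next. No bootstrap term on terminal transitions.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical 3-state, 2-action table.
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]
q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])  # target = 1.0 + 0.9 * 2.0 = 2.8, so Q[0, 1] moves halfway: 1.4
```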

Exploration

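ε-greedy: take a random action with probability ε, the greedy action otherwise. Alternatives: Boltzmann (softmax) exploration, which samples actions with probability proportional to exp(Q/τ), and UCB, which adds a bonus to rarely tried actions. All of them balance exploration against exploitation.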
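
The three rules could be sketched as follows for a single state's action values. The function names, the `temperature` and `c` parameters, and the UCB1-style bonus c · sqrt(ln t / N(a)) are assumptions chosen for illustration; other variants of each rule exist.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    # Subtract the max before exponentiating for numerical stability.
    z = (q_values - np.max(q_values)) / temperature
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def ucb(q_values, counts, t, c=2.0):
    """UCB1-style choice: value estimate plus a bonus for rarely tried actions."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])  # try every action once before using the bonus
    return int(np.argmax(q_values + c * np.sqrt(np.log(t) / counts)))

rng = np.random.default_rng(0)
q = np.array([0.0, 1.0, 0.5])
print(epsilon_greedy(q, epsilon=0.1, rng=rng))
print(boltzmann(q, temperature=0.5, rng=rng))
print(ucb(q, counts=np.array([5, 5, 1]), t=11))
```

A high temperature makes Boltzmann sampling close to uniform, while a temperature near zero makes it close to greedy.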

Convergence

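Tabular Q-learning is proven to converge to the optimal Q* with probability 1 under mild conditions: every state-action pair is visited infinitely often, and the step sizes satisfy Σ α = ∞ and Σ α² < ∞ (with bounded rewards and γ < 1).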
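
As a rough illustration of these conditions, the sketch below runs Q-learning on a made-up deterministic 2-state, 2-action MDP and compares the result with Q* from value iteration. The MDP, the uniformly random behavior policy (which keeps visiting every pair), and the per-pair step size α = N(s, a)^(-0.6) (which satisfies both step-size sums) are all assumptions chosen for the demo.

```python
import numpy as np

# Hypothetical MDP: taking action a moves to state a; rewards depend on (s, a).
next_state = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 1.0], [2.0, 0.0]])
gamma = 0.9

def value_iteration(n_iters=1000):
    """Compute Q* by repeatedly applying the Bellman optimality operator."""
    Q = np.zeros((2, 2))
    for _ in range(n_iters):
        Q = reward + gamma * Q.max(axis=1)[next_state]
    return Q

def q_learning(n_steps=100_000, seed=0):
    """Tabular Q-learning under a uniformly random behavior policy."""
    rng = np.random.default_rng(seed)
    actions = rng.integers(0, 2, size=n_steps)  # behavior ignores Q entirely
    Q = np.zeros((2, 2))
    visits = np.zeros((2, 2))
    s = 0
    for a in actions:
        visits[s, a] += 1
        alpha = visits[s, a] ** -0.6  # sum(alpha) diverges, sum(alpha^2) converges
        s_next = next_state[s, a]
        Q[s, a] += alpha * (reward[s, a] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

Q_star = value_iteration()
Q_learned = q_learning()
print(np.max(np.abs(Q_learned - Q_star)))  # gap to Q* after 100,000 steps
```

Because the behavior policy is random while the target uses the max, the run also shows the off-policy property: the learned table approaches Q* even though the agent never acts greedily.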

Deep Q-Network (DQN)

A neural network approximates Q(s, a). Experience replay and a separate, periodically updated target network stabilize training. Reached human-level performance on many Atari games from raw pixels (2015).
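
The two stabilizers can be sketched without a deep learning library. In the sketch below a linear map `s @ W` stands in for the neural network, which is an assumption made to keep the example self-contained; `ReplayBuffer`, `dqn_targets`, and all parameter values are hypothetical names and choices, not taken from the DQN paper's code.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly to decorrelate updates."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.stack(states), np.array(actions),
                np.array(rewards, dtype=float),
                np.stack(next_states), np.array(dones, dtype=float))

def dqn_targets(rewards, next_states, dones, target_weights, gamma=0.99):
    """Regression targets r + gamma * max_a' Q_target(s', a'), zeroed at terminals."""
    # Linear stand-in for the target network: Q_target(s, .) = s @ W.
    next_q = next_states @ target_weights
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

buffer = ReplayBuffer(capacity=100)
buffer.push(np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]), False)
buffer.push(np.array([0.0, 1.0]), 1, 1.0, np.array([1.0, 0.0]), True)

target_weights = np.array([[1.0, 2.0], [3.0, 4.0]])  # frozen copy of the online weights
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
print(dqn_targets(rewards, next_states, dones, target_weights, gamma=0.5))
```

In a full training loop the online weights would be fit to these targets by gradient descent, and `target_weights` would be overwritten with a copy of the online weights every fixed number of steps.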