Bellman update
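Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') - Q(s, a)), with learning rate α and discount factor γ. Off-policy: the target takes the max over next actions regardless of which action the behavior policy actually chooses.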
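A minimal sketch of this update on a table stored as a list of rows (one row per state, one entry per action). The function name `q_update`, the default values for `alpha` and `gamma`, and the `done` flag for terminal transitions are illustrative assumptions, not part of the notes above.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place; return the TD error."""
    # Off-policy target: max over next actions, whatever the behavior policy does.
    # Assumption: terminal transitions (done=True) carry no bootstrapped value.
    target = r if done else r + gamma * max(Q[s_next])
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Three states, two actions; state 1 already has some value estimates.
Q = [[0.0, 0.0], [0.5, 2.0], [0.0, 0.0]]
td = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(td, Q[0][1])
```

Here the target is r + γ · max(0.5, 2.0), so the entry Q[0][1] moves half of the way (α = 0.5) from 0 toward that target.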
Exploration
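ε-greedy: take a random action with probability ε, the greedy action otherwise. Alternatives: Boltzmann (softmax) exploration, which samples actions with probability proportional to exp(Q/τ), and UCB, which adds a bonus to rarely tried actions. All of them balance exploration against exploitation.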
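The three strategies can be sketched as small action-selection helpers. The function names, the `temperature` parameter, and the UCB1-style bonus with exploration constant `c` are assumptions chosen for illustration; other variants of each rule exist.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; a lower temperature gives greedier choices."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def ucb_action(q_values, counts, t, c=2.0):
    """UCB1-style choice: value estimate plus a bonus for rarely tried actions."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action at least once
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )


probs = boltzmann_probs([1.0, 2.0, 3.0], temperature=0.5)
action = random.choices(range(3), weights=probs)[0]  # sample from the softmax
```

With `epsilon=0` the first helper is purely greedy, and with `epsilon=1` it is uniformly random; annealing ε from high to low over training is a common way to shift from exploring to exploiting.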
Convergence
Tabular Q-learning is proven to converge to the optimal Q* with probability 1 under mild conditions: every state-action pair must be visited infinitely often, and the step sizes must satisfy Σ α_t = ∞ and Σ α_t² < ∞.
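One schedule that meets the usual step-size requirements for this result is α = 1/N(s, a), where N(s, a) counts visits to the pair: the harmonic series diverges while the sum of its squares stays finite. The class name `VisitCountStepSize` below is hypothetical, and this is only one possible schedule, not the one any particular proof prescribes.

```python
from collections import defaultdict


class VisitCountStepSize:
    """Per-pair step size alpha = 1 / N(s, a), decaying with each visit."""

    def __init__(self):
        self.counts = defaultdict(int)  # visit counts, keyed by (state, action)

    def __call__(self, s, a):
        self.counts[(s, a)] += 1
        return 1.0 / self.counts[(s, a)]


alpha = VisitCountStepSize()
first = alpha(0, 1)   # first visit to (0, 1)
second = alpha(0, 1)  # second visit to the same pair
other = alpha(2, 0)   # a different pair keeps its own counter
```

In practice a small constant α is often used instead; it tracks non-stationary targets better but no longer satisfies the second condition.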
Deep Q-Network (DQN)
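A neural network approximates Q(s, a). Experience replay decorrelates training samples, and a periodically updated target network keeps the bootstrap target stable. Reached human-level performance on many Atari games directly from pixels (2015).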
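The two stabilizing pieces can be sketched without any deep learning library. `ReplayBuffer`, `td_targets`, and the stand-in `target_q` callable are hypothetical names; in a real DQN `target_q` would be a frozen copy of the network, and the online network would be regressed toward these targets by gradient descent.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Regression targets computed from the (frozen) target network."""
    return [
        r if done else r + gamma * max(target_q(s_next))
        for (_, _, r, s_next, done) in batch
    ]


# Stand-in for a target network: returns fixed Q-values for two actions.
def target_q(state):
    return [0.0, 1.0]


buf = ReplayBuffer(capacity=2)
buf.push(0, 0, 1.0, 1, False)
buf.push(1, 1, 0.0, 2, False)
buf.push(2, 0, 2.0, 3, True)  # capacity is 2, so the first transition is evicted
targets = td_targets(buf.sample(2), target_q, gamma=0.5)
```

The target network's weights would be copied from the online network only every so many steps, so the targets do not shift with every gradient update.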