1992 QLearning

Subject Headings: Q-Learning.

Notes

This paper presents and proves in detail a convergence theorem for [math]\displaystyle{ \cal Q }[/math]-learning based on that outlined in Watkins (1989). We show that [math]\displaystyle{ \cal Q }[/math]-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many [math]\displaystyle{ \cal Q }[/math] values can be changed each iteration, rather than just one.
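The setting described above (a discrete table of action-values, with every action sampled repeatedly in every state) can be illustrated with a minimal sketch. The two-state environment, the discount factor, the uniform random exploration, and the per-pair learning rate of 1/n are all assumptions chosen for illustration. They are not taken from the paper, and this is not the authors' implementation.

```python
import random

# Hypothetical 2-state, 2-action environment (an illustrative assumption).
# TRANSITIONS[state][action] = (next_state, reward); deterministic for simplicity.
TRANSITIONS = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (0, 0.0), 1: (1, 2.0)},
}
GAMMA = 0.5  # discount factor (assumed value)


def q_learning(steps=100_000, seed=0):
    """Tabular Q-learning with uniform random exploration."""
    rng = random.Random(seed)
    # Discrete (lookup-table) representation of the action-values.
    q = {s: {a: 0.0 for a in acts} for s, acts in TRANSITIONS.items()}
    visits = {s: {a: 0 for a in acts} for s, acts in TRANSITIONS.items()}
    state = 0
    for _ in range(steps):
        # Uniform exploration keeps sampling every action in every state.
        action = rng.choice(list(TRANSITIONS[state]))
        next_state, reward = TRANSITIONS[state][action]
        # Learning rate 1/n per (state, action) pair: the rates sum to
        # infinity while their squares sum to a finite value.
        visits[state][action] += 1
        alpha = 1.0 / visits[state][action]
        # One-step update toward reward plus discounted best next value.
        target = reward + GAMMA * max(q[next_state].values())
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


if __name__ == "__main__":
    for s, row in q_learning().items():
        print(s, {a: round(v, 3) for a, v in row.items()})
```

For this toy environment the optimal action-values can be worked out by hand: always taking action 1 in state 1 is worth 2 / (1 - 0.5) = 4, so the optimal values are 1.5 and 3 in state 0, and 1.5 and 4 in state 1. The learned table should approach these values as the number of steps grows.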


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
1992 QLearning	Christopher J. C. H. Watkins, Peter Dayan			Technical Note: Q-Learning				10.1007/BF00992698		1992