Understanding Q-Learning's Exploration Strategy: Mastering One of Reinforcement Learning's Most-Used Algorithms
In a given state s, the greedy choice is the action with the highest Q-value. But at time t the true value Q(s, a) is unknown. Writing Q(t, s, a) for the estimate available at time t, we have Q(s, a) = Q(t, s, a) + (Q(s, a) − Q(t, s, a)), where (Q(s, a) − Q(t, s, a)) is the estimation error. Hoeffding's inequality is a natural tool for bounding this kind of error. In fact, for any time t, with high probability:

Q(s, a) ≤ Q(t, s, a) + √( β · log(t + 1) / (2 · N(t, s, a)) )

where N(t, s, a) is the number of times action a has been taken in state s up to time t, and β controls the width of the confidence bound. Calling the right-hand side Q⁺(t, s, a), the resulting optimistic policy can be written as argmax { Q⁺(t, s, a) : a ∈ A }. With β > 0 the agent keeps exploring; with β = 0 it only exploits its current estimates.

This confidence-bound approach is the most commonly used one today, and many refinements build on it, including UCB-V, UCB*, KL-UCB, Bayes-UCB and BESA [4]. Below is a Python implementation of the classic UCB policy, followed by its effect when applied to Q-Learning.

```python
import numpy as np

def UCB_exploration(Q, num_actions, beta=1):
    """Return a UCB policy: exploit the Q-table plus a per-action exploration bonus."""
    def UCB_exp(state, N, t):
        probs = np.zeros(num_actions, dtype=float)
        # normalized Q estimates plus the bonus sqrt(beta * log(t+1) / (2 * N(s, a)));
        # assumes max(Q[state, :]) is positive (e.g. optimistic initialization)
        Q_ = Q[state, :] / max(Q[state, :]) + np.sqrt(beta * np.log(t + 1) / (2 * N[state]))
        best_action = Q_.argmax()
        probs[best_action] = 1
        return probs
    return UCB_exp
```

△ Reward curves

UCB exploration should reach high rewards quickly, although the early exploration phase weighs noticeably on training. It is a promising approach for more complex multi-armed bandit problems, because it helps the agent escape locally optimal actions. Below is a comparison of the two strategies.

Summary and Outlook

Q-Learning is one of the most widely used algorithms in reinforcement learning. In this post we discussed why the exploration strategy matters and how the classic ε-greedy exploration can be replaced with a UCB-based strategy. More refined exploration policies can be plugged into Q-Learning to strike a better balance between exploitation and exploration. (A minimal sketch of how the UCB policy above fits into a Q-Learning training loop is appended after the references.)

References

[1] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms", Neural Computation, vol. 6, no. 6, pp. 1185–1201, 1994.
[2] P. Auer, "Using Confidence Bounds for Exploitation-Exploration Trade-offs", Journal of Machine Learning Research, 3, 397–422, 2002.
[3] E. Even-Dar and Y. Mansour, "Learning Rates for Q-learning", Journal of Machine Learning Research, 5, 1–25, 2003.
[4] A. Baransi, O.-A. Maillard, and S. Mannor, "Sub-sampling for multi-armed bandits", Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 115–131, 2014.

Original post: https://medium.com/sequential-learning/optimistic-q-learning-b9304d079e11
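As promised above, here is a minimal sketch of how the UCB policy could drive a tabular Q-Learning loop. It assumes the UCB_exploration function defined in this post is in scope; the ToyChainEnv environment, the table sizes, and the hyperparameters (alpha, gamma, beta) are illustrative assumptions of mine, not part of the original article.

```python
import numpy as np

# Illustrative sketch only: a toy chain MDP plus a tabular Q-Learning loop driven by the
# UCB_exploration policy defined earlier in this post (assumed to be in scope).
# The environment, table sizes and hyperparameters below are assumptions, not the author's setup.

class ToyChainEnv:
    """Toy chain of states; reward 1 only when the agent reaches the right end."""
    def __init__(self, n_states=16):
        self.n_states = n_states
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        # action 0 moves left, action 1 moves right, any other action stays in place
        move = {0: -1, 1: 1}.get(action, 0)
        self.pos = min(max(self.pos + move, 0), self.n_states - 1)
        done = self.pos == self.n_states - 1
        return self.pos, float(done), done, {}

n_states, n_actions = 16, 4
alpha, gamma, beta = 0.1, 0.99, 1.0

Q = np.ones((n_states, n_actions))   # optimistic init; also keeps max(Q[state, :]) > 0
N = np.ones((n_states, n_actions))   # per (state, action) visit counts; start at 1 to avoid /0
policy = UCB_exploration(Q, n_actions, beta=beta)

env, t = ToyChainEnv(n_states), 0
for episode in range(200):
    state = env.reset()
    for _ in range(100):                     # cap episode length
        probs = policy(state, N, t)          # one-hot distribution over actions
        action = int(np.random.choice(n_actions, p=probs))
        next_state, reward, done, _ = env.step(action)

        # standard tabular Q-Learning update
        target = reward + gamma * Q[next_state].max() * (0.0 if done else 1.0)
        Q[state, action] += alpha * (target - Q[state, action])

        N[state, action] += 1
        state, t = next_state, t + 1
        if done:
            break
```

The closure returned by UCB_exploration reads Q and N by reference, so the in-place updates in the loop are immediately reflected in the action scores; initializing Q and N to ones keeps both the normalization max(Q[state, :]) and the 1/N bonus well defined from the first step.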