Reinforcement Learning Overview
Reinforcement learning (RL) is a major branch of machine learning in which an agent learns an optimal policy through interaction with its environment.
🎯 Core Concepts
Basic elements:
- Agent: the entity that learns and makes decisions
- Environment: the external world the agent operates in
- State: the current situation of the environment
- Action: an operation the agent can perform
- Reward: the environment's feedback on the agent's action
📊 Markov Decision Processes (MDP)
Reinforcement learning problems are typically modeled as Markov decision processes:
Definition: MDP = (S, A, P, R, γ)
- S: state space
- A: action space
- P: state-transition probabilities P(s'|s,a)
- R: reward function R(s,a,s')
- γ: discount factor (0 ≤ γ ≤ 1)
Markov property:
$$P(S_{t+1} = s' \mid S_t = s, A_t = a, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$
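To make the tuple concrete, here is a hypothetical two-state MDP written in the nested-dict layout env.P[s][a] = [(prob, next_state, reward, done), ...] that the Gym-style environments in the code further below expose. All states, rewards, and probabilities here are made up for illustration.

```python
# Hypothetical two-state MDP in the Gym-style P[s][a] = [(prob, next_state, reward, done), ...] layout.
# States: 0 ("cool") and 1 ("hot"); actions: 0 ("wait") and 1 ("run").
P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                       # wait: stay cool, no reward
        1: [(0.7, 0, 2.0, False), (0.3, 1, 2.0, False)], # run: reward 2, 30% chance of heating up
    },
    1: {
        0: [(1.0, 0, 0.0, False)],                       # wait: cool back down
        1: [(1.0, 1, -10.0, True)],                      # run while hot: penalty, episode ends
    },
}
gamma = 0.9  # discount factor
```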
🎲 Value Functions
State-Value Function
$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$
Action-Value Function (Q-Function)
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
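Both definitions are written in terms of the discounted return $G_t$; noting that this return satisfies a one-step recursion is the step that turns the definitions above into the Bellman equations below:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}$$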
Bellman Equations
Bellman equation for the state-value function:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
Bellman equation for the Q-function:
$$Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s',a')\right]$$
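The two forms are linked by the following identities, which follow directly from the definitions and are useful sanity checks when implementing the algorithms below:

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s,a), \qquad Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$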
🧮 Classic Algorithms
1. Dynamic Programming
Policy Evaluation
```python
import numpy as np

def policy_evaluation(policy, env, theta=1e-6, gamma=0.9):
    """Apply the Bellman expectation equation iteratively until the value estimates converge."""
    V = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            v = V[s]
            # Expected value over actions (weighted by the policy) and over transitions.
            V[s] = sum(policy[s][a] * sum(p * (r + gamma * V[s_])
                                          for p, s_, r, _ in env.P[s][a])
                       for a in range(env.nA))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # stop once the largest update falls below the tolerance
            break
    return V
```
Policy Improvement
```python
def policy_improvement(V, env, gamma=0.9):
    """Make the policy greedy with respect to the current value estimates."""
    policy = np.ones([env.nS, env.nA]) / env.nA  # start from a uniform policy
    for s in range(env.nS):
        action_values = np.zeros(env.nA)
        for a in range(env.nA):
            # One-step lookahead using the model P and the current V.
            action_values[a] = sum(p * (r + gamma * V[s_])
                                   for p, s_, r, _ in env.P[s][a])
        best_action = np.argmax(action_values)
        policy[s] = np.eye(env.nA)[best_action]  # deterministic (one-hot) greedy choice
    return policy
```
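In practice, evaluation and improvement are alternated until the greedy policy stops changing. A minimal policy-iteration loop assembled from the two functions above could look like this (the uniform starting policy and the stability check are my own choices, not from the original):

```python
def policy_iteration(env, gamma=0.9):
    # Start from a uniform random policy.
    policy = np.ones([env.nS, env.nA]) / env.nA
    while True:
        V = policy_evaluation(policy, env, gamma=gamma)
        new_policy = policy_improvement(V, env, gamma=gamma)
        if np.array_equal(new_policy, policy):  # greedy policy is stable -> done
            return new_policy, V
        policy = new_policy
```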
2. Monte Carlo Methods
First-Visit MC
```python
from collections import defaultdict
import numpy as np

def first_visit_mc_prediction(policy, env, num_episodes, gamma=0.9):
    """Estimate V^pi by averaging returns observed from the first visit to each state."""
    V = defaultdict(float)
    returns = defaultdict(list)
    for episode in range(num_episodes):
        states, actions, rewards = generate_episode(policy, env)
        G = 0
        for t in reversed(range(len(states))):
            G = gamma * G + rewards[t]
            # First-visit check: only record G if this state does not occur earlier in the episode.
            if states[t] not in states[:t]:
                returns[states[t]].append(G)
                V[states[t]] = np.mean(returns[states[t]])
    return V
```
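The function above relies on a generate_episode helper that the original does not show. A plausible sketch, assuming an old-style Gym environment (reset() returns a state, step() returns four values) and a tabular policy of per-state action probabilities:

```python
import numpy as np

def generate_episode(policy, env):
    """Roll out one episode following `policy`, a table of action probabilities per state."""
    states, actions, rewards = [], [], []
    state = env.reset()
    while True:
        action = np.random.choice(len(policy[state]), p=policy[state])
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        if done:
            return states, actions, rewards
        state = next_state
```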
3. Temporal-Difference Learning
TD(0)
```python
from collections import defaultdict

def td_prediction(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    """TD(0): move V toward the one-step bootstrapped target after every transition."""
    V = defaultdict(float)
    for episode in range(num_episodes):
        state = env.reset()
        while True:
            action = choose_action(policy, state)  # sample an action from the policy (helper not shown)
            next_state, reward, done, _ = env.step(action)
            # TD(0) update: move V(s) toward r + gamma * V(s').
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            if done:
                break
            state = next_state
    return V
```
Q-Learning
```python
import random
from collections import defaultdict
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap from the greedy action."""
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for episode in range(num_episodes):
        state = env.reset()
        while True:
            # Epsilon-greedy behavior policy.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])
            next_state, reward, done, _ = env.step(action)
            # The target takes the max over next actions, so Q learns about the greedy policy.
            Q[state][action] += alpha * (
                reward + gamma * np.max(Q[next_state]) - Q[state][action]
            )
            if done:
                break
            state = next_state
    return Q
```
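Once training finishes, the estimated optimal policy is obtained by acting greedily with respect to the learned Q-table; in the tabular setting this is a one-liner (a sketch, assuming the Q returned above):

```python
# Greedy policy from the learned Q-table: pick the highest-valued action in each visited state.
greedy_policy = {state: int(np.argmax(action_values)) for state, action_values in Q.items()}
```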
SARSA
```python
from collections import defaultdict
import numpy as np

def sarsa(env, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """On-policy TD control: bootstrap from the action the agent actually takes next."""
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for episode in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy_action(Q, state, epsilon)  # helper sketched below
        while True:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, epsilon)
            # The target uses the (state, action) pair the behavior policy actually chose.
            Q[state][action] += alpha * (
                reward + gamma * Q[next_state][next_action] - Q[state][action]
            )
            if done:
                break
            state, action = next_state, next_action
    return Q
```
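The epsilon_greedy_action helper used above is not defined in the original; a minimal version consistent with how it is called (and with the ε-greedy branch inside q_learning) would be:

```python
import random
import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit the current Q estimates."""
    if random.random() < epsilon:
        return random.randrange(len(Q[state]))
    return int(np.argmax(Q[state]))
```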
🚀 Deep Reinforcement Learning
Core Ideas of DQN
Experience Replay:
```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly to decorrelate training data."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```
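Inside a training loop the buffer is filled after every environment step and sampled in mini-batches once it holds enough transitions; a typical usage pattern (the batch size of 64 is an arbitrary choice) is:

```python
buffer = ReplayBuffer(capacity=100_000)
# ... after each env.step(...):
# buffer.add(state, action, reward, next_state, done)
if len(buffer.buffer) >= 64:
    batch = buffer.sample(64)
    states, actions, rewards, next_states, dones = zip(*batch)  # unpack into per-field tuples
```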
Target Network:
```python
# Two networks with the same architecture: one is trained, the other provides stable TD targets.
q_network = DQN(state_size, action_size)
target_network = DQN(state_size, action_size)

# Periodically copy the online network's weights into the target network.
if step % update_target == 0:
    target_network.load_state_dict(q_network.state_dict())
```
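Combining experience replay with the target network, a single DQN gradient step roughly follows the sketch below. This is a PyTorch-style illustration rather than code from the original; the tensor batching, the MSE loss, and the dqn_update name are my assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_network, target_network, optimizer, batch, gamma=0.99):
    # Assumes the batch fields have already been converted to tensors (dones as 0/1 floats).
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions that were actually taken.
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # TD target uses the frozen target network: r + gamma * max_a' Q_target(s', a').
        next_q = target_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1 - dones) * next_q
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```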
Policy Gradient Basics
The REINFORCE algorithm:
```python
import torch

def reinforce(policy_net, optimizer, env, gamma=0.99):
    """One REINFORCE update: roll out an episode, then ascend the policy-gradient estimate."""
    log_probs = []
    rewards = []
    state = env.reset()
    while True:
        state_tensor = torch.as_tensor(state, dtype=torch.float32)
        action_probs = policy_net(state_tensor)       # probability distribution over actions
        action = torch.multinomial(action_probs, 1)   # sample an action
        log_prob = torch.log(action_probs[action])
        next_state, reward, done, _ = env.step(action.item())
        log_probs.append(log_prob)
        rewards.append(reward)
        if done:
            break
        state = next_state

    # Compute the discounted return G_t for every step of the episode.
    discounted_rewards = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        discounted_rewards.insert(0, G)

    # Policy-gradient loss: -sum_t log pi(a_t|s_t) * G_t.
    policy_loss = []
    for log_prob, G in zip(log_probs, discounted_rewards):
        policy_loss.append(-log_prob * G)
    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()
```
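The original does not define policy_net. Any network that maps a state to a probability distribution over discrete actions works with the sampling code above; a minimal hypothetical PyTorch version:

```python
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small MLP that outputs action probabilities for a discrete action space."""
    def __init__(self, state_size, action_size, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1),  # probabilities, as expected by torch.multinomial above
        )

    def forward(self, state):
        return self.net(state)
```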
📝 Key Takeaways
Theoretical foundations
- MDP framework: understand states, actions, rewards, and transition probabilities
- Bellman equations: the recursive structure of value functions
- Principle of optimality: the properties that characterize optimal policies
Algorithm comparison

| Algorithm | Type | Learning style | Typical use case |
| --- | --- | --- | --- |
| Dynamic programming | Model-based | Exact computation | Small state spaces with a known model |
| Monte Carlo | Model-free | Full-episode returns | Episodic tasks |
| TD learning | Model-free | Per-step updates | Continuing tasks |
| Q-Learning | Off-policy | Learns about the greedy policy while exploring | Tasks that demand heavy exploration |
| SARSA | On-policy | Learns about the policy it actually follows | Tasks where safe behavior during learning matters |
Practical advice
Tuning tips:
- Learning rate α: affects convergence speed and stability
- Discount factor γ: balances short-term and long-term rewards
- Exploration rate ε: balances exploration and exploitation (a simple decay schedule is sketched after this list)
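As referenced in the ε item above, the exploration rate is usually annealed from a large value toward a small floor rather than held fixed; a simple linear schedule (all numbers are illustrative) might look like:

```python
def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps, then hold it."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```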
Algorithm selection:
- Small state spaces: tabular methods
- Large state spaces: function-approximation methods
- Continuous control: Actor-Critic methods
Reinforcement learning is a theory-heavy but widely applicable field; mastering it takes combining theoretical study with hands-on practice! 🎯