Reinforcement Learning Overview

Reinforcement Learning (RL) is a major branch of machine learning in which an agent learns an optimal policy by interacting with an environment.

🎯 Core Concepts

Basic elements (a minimal interaction-loop sketch follows this list):

  • Agent: the entity that learns and makes decisions
  • Environment: the external world the agent interacts with
  • State: the current situation of the environment
  • Action: an operation the agent can perform
  • Reward: the environment's feedback on the agent's action
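
These pieces interact in a simple loop: the agent observes a state, picks an action, and the environment answers with a reward and a next state. Here is a minimal sketch using the classic gym API (CartPole-v1 and the random action choice are only placeholders; newer gymnasium versions return slightly different tuples from reset and step):

import gym  # classic gym API used throughout this post: reset() -> state, step() -> (obs, reward, done, info)

env = gym.make("CartPole-v1")
state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()                  # placeholder policy: act at random
    next_state, reward, done, info = env.step(action)   # the environment's feedback
    total_reward += reward
    state = next_state
print("episode return:", total_reward)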

📊 Markov Decision Process (MDP)

Reinforcement learning problems are usually modeled as a Markov decision process (a toy numeric example follows the definition below):

Definition: MDP = (S, A, P, R, γ)

  • S: the state space
  • A: the action space
  • P: the state transition probabilities P(s'|s,a)
  • R: the reward function R(s,a,s')
  • γ: the discount factor (0 ≤ γ ≤ 1)
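
As a concrete illustration (all numbers made up), a tiny two-state, two-action MDP can be written down directly as transition and reward arrays; for simplicity the reward here depends only on (s, a), i.e. it is already averaged over s':

import numpy as np

n_states, n_actions = 2, 2
# P[s, a, s'] = probability of moving from s to s' when taking action a in state s
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) row is a valid probability distribution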

Markov property:

P(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} = s' | S_t = s, A_t = a)

🎲 Value Functions

State-Value Function

V^π(s) = E_π[G_t | S_t = s] = E_π[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]

Action-Value Function (Q-Function)

Q^π(s,a) = E_π[G_t | S_t = s, A_t = a] = E_π[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a]

Bellman Equations

Bellman equation for the state-value function:

V^π(s) = ∑_a π(a|s) ∑_{s'} P(s'|s,a)[R(s,a,s') + γV^π(s')]

Bellman equation for the Q-function (a small numeric backup example follows):

Q^π(s,a) = ∑_{s'} P(s'|s,a)[R(s,a,s') + γ ∑_{a'} π(a'|s')Q^π(s',a')]
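
To make the backup concrete, here is a single sweep of the state-value Bellman equation on the toy MDP defined earlier (it reuses P, R, gamma and assumes a uniform random policy; because R here is R(s,a), the inner bracket simplifies to R[s,a] + γ Σ_s' P[s,a,s'] V(s')):

pi = np.full((n_states, n_actions), 0.5)   # pi(a|s) = 0.5 for both actions
V = np.zeros(n_states)                     # start from V = 0
V_new = np.array([
    sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V) for a in range(n_actions))
    for s in range(n_states)
])
print(V_new)   # [0.5, 1.0]: with V = 0 this is just the expected immediate reward under pi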

🧮 Classic Algorithms

1. Dynamic Programming

Policy Evaluation

# Iterative policy evaluation for a tabular MDP, using the old gym-style interface:
# env.nS, env.nA, and env.P[s][a] = [(prob, next_state, reward, done), ...]
import numpy as np

def policy_evaluation(policy, env, theta=1e-6, gamma=0.9):
    V = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            v = V[s]
            # Bellman expectation backup: average over actions and transitions
            V[s] = sum(policy[s][a] * sum(p * (r + gamma * V[s_])
                                          for p, s_, r, _ in env.P[s][a])
                       for a in range(env.nA))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # stop once the value function has converged
            break
    return V

Policy Improvement

# Greedy policy improvement with respect to a given value function V.
def policy_improvement(V, env, gamma=0.9):
    policy = np.ones([env.nS, env.nA]) / env.nA  # start from a uniform policy (rows are overwritten below)
    for s in range(env.nS):
        action_values = np.zeros(env.nA)
        for a in range(env.nA):
            action_values[a] = sum(p * (r + gamma * V[s_])
                                   for p, s_, r, _ in env.P[s][a])
        best_action = np.argmax(action_values)
        policy[s] = np.eye(env.nA)[best_action]  # deterministic one-hot policy for state s
    return policy
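
Alternating these two routines yields policy iteration. A minimal usage sketch, assuming an environment object with the same nS/nA/P interface used above (e.g. gym's classic FrozenLake accessed via env.unwrapped):

def policy_iteration(env, gamma=0.9):
    policy = np.ones([env.nS, env.nA]) / env.nA
    while True:
        V = policy_evaluation(policy, env, gamma=gamma)
        new_policy = policy_improvement(V, env, gamma=gamma)
        if np.array_equal(new_policy, policy):  # converged: the greedy policy no longer changes
            return new_policy, V
        policy = new_policy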

2. Monte Carlo Methods

First-Visit MC

import numpy as np
from collections import defaultdict

def first_visit_mc_prediction(policy, env, num_episodes, gamma=0.9):
    V = defaultdict(float)
    returns = defaultdict(list)

    for episode in range(num_episodes):
        states, actions, rewards = generate_episode(policy, env)
        G = 0

        # Walk the episode backwards, accumulating the discounted return
        for t in reversed(range(len(states))):
            G = gamma * G + rewards[t]
            # First-visit: only record G if the state does not occur earlier in the episode
            if states[t] not in states[:t]:
                returns[states[t]].append(G)
                V[states[t]] = np.mean(returns[states[t]])

    return V
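
The generate_episode helper is assumed rather than shown above. One plausible sketch, under the same old-gym API and with policy[s] taken to be a probability vector over actions (the max_steps cap is an added safeguard):

def generate_episode(policy, env, max_steps=1000):
    """Roll out one episode and return parallel lists of states, actions, rewards."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        action = np.random.choice(len(policy[state]), p=policy[state])
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        if done:
            break
        state = next_state
    return states, actions, rewards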

3. Temporal-Difference Learning

TD(0)

def td_prediction(policy, env, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)

    for episode in range(num_episodes):
        state = env.reset()
        while True:
            action = choose_action(policy, state)
            next_state, reward, done, _ = env.step(action)

            # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])

            if done:
                break
            state = next_state

    return V
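
choose_action is likewise left undefined in the original. One reasonable reading, again assuming policy[state] is a probability vector over actions, is simply to sample from it:

def choose_action(policy, state):
    # Sample an action according to the policy's action probabilities for this state
    return np.random.choice(len(policy[state]), p=policy[state])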

Q-Learning

import random

def q_learning(env, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for episode in range(num_episodes):
        state = env.reset()
        while True:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])

            next_state, reward, done, _ = env.step(action)

            # Q-Learning update: bootstrap from the greedy action in the next state (off-policy)
            Q[state][action] += alpha * (
                reward + gamma * np.max(Q[next_state]) - Q[state][action]
            )

            if done:
                break
            state = next_state

    return Q
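
A hypothetical training call under the old-gym API; the environment name, episode count, and hyperparameters are illustrative only:

env = gym.make("FrozenLake-v1")   # any small discrete-state environment works
Q = q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1)
greedy_policy = {s: int(np.argmax(q)) for s, q in Q.items()}   # extract the greedy policy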

SARSA

def sarsa(env, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for episode in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy_action(Q, state, epsilon)

        while True:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, epsilon)

            # SARSA update: bootstrap from the action actually taken next (on-policy)
            Q[state][action] += alpha * (
                reward + gamma * Q[next_state][next_action] - Q[state][action]
            )

            if done:
                break
            state, action = next_state, next_action

    return Q
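
The epsilon_greedy_action helper is not defined in the original either; a straightforward sketch:

def epsilon_greedy_action(Q, state, epsilon):
    # Explore uniformly with probability epsilon, otherwise exploit the current Q estimates
    if random.random() < epsilon:
        return random.randrange(len(Q[state]))
    return int(np.argmax(Q[state]))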

🚀 Deep Reinforcement Learning

Core Ideas of DQN

Experience Replay:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
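
A hypothetical usage sketch; the dummy transitions only illustrate the interface (in a real loop they come from env.step):

buffer = ReplayBuffer(capacity=10000)
for i in range(100):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
batch = buffer.sample(32)
states, actions, rewards, next_states, dones = zip(*batch)   # unpack into per-field tuples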

Target Network:

# Online (main) network
q_network = DQN(state_size, action_size)
# Target network: a lagged copy used to compute TD targets
target_network = DQN(state_size, action_size)
target_network.load_state_dict(q_network.state_dict())  # start from the same weights

# Periodically sync the target network with the online network
if step % update_target == 0:
    target_network.load_state_dict(q_network.state_dict())
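
The DQN class referenced above is not shown in the original. Below is a minimal PyTorch sketch of such a network plus one training step that ties replay sampling and the target network together; the layer sizes, the MSE loss, and the dqn_update name are assumptions, not the post's own code:

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_size, action_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),   # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_network, target_network, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch   # float tensors; actions is int64
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # targets use the frozen target network
        max_next_q = target_network(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (1 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()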

Policy Gradient Basics

The REINFORCE algorithm:

# Note: env and gamma are passed in explicitly here (the original relied on globals).
def reinforce(policy_net, optimizer, env, gamma=0.99):
    log_probs = []
    rewards = []

    # Collect one full episode with the current policy
    state = env.reset()
    while True:
        state_tensor = torch.as_tensor(state, dtype=torch.float32)
        action_probs = policy_net(state_tensor)
        action = torch.multinomial(action_probs, 1)
        log_prob = torch.log(action_probs[action])

        next_state, reward, done, _ = env.step(action.item())
        log_probs.append(log_prob)
        rewards.append(reward)

        if done:
            break
        state = next_state

    # Compute the discounted return G_t for every step of the episode
    discounted_rewards = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        discounted_rewards.insert(0, G)

    # Policy gradient update: maximize sum_t G_t * log pi(a_t | s_t)
    policy_loss = []
    for log_prob, G in zip(log_probs, discounted_rewards):
        policy_loss.append(-log_prob * G)

    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()
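
policy_net is assumed to map a state to a probability distribution over actions; a minimal PyTorch sketch (the class name and layer sizes are arbitrary):

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)   # probability distribution over actions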

📝 Key Takeaways

Theoretical Foundations

  1. The MDP framework: understand states, actions, rewards, and transition probabilities
  2. The Bellman equations: the recursive structure of value functions
  3. The principle of optimality: the properties that optimal policies satisfy

Algorithm Comparison

| Algorithm | Type | Update style | Typical use case |
|---|---|---|---|
| Dynamic programming | Model-based | Exact sweeps over the full model | Small state spaces with a known model |
| Monte Carlo | Model-free | Complete-episode returns | Episodic tasks |
| TD learning | Model-free | One-step bootstrapped updates | Continuing tasks |
| Q-Learning | Model-free, off-policy | Targets the greedy action while following an exploratory policy | When aggressive exploration is needed |
| SARSA | Model-free, on-policy | Targets the action actually taken by the behavior policy | When safe behavior during learning matters |

Practical Advice

  1. Hyperparameter tuning

    • Learning rate α: controls convergence speed and stability
    • Discount factor γ: balances short-term and long-term rewards
    • Exploration rate ε: balances exploration and exploitation (a decay-schedule sketch follows this list)
  2. Choosing an algorithm

    • Small state spaces: tabular methods
    • Large state spaces: function approximation
    • Continuous control: Actor-Critic methods
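
For the exploration rate in particular, a common pattern is to decay ε over training rather than keep it fixed; a small sketch with made-up constants:

def epsilon_by_episode(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    # Exponential decay: explore heavily at first, settle near-greedy later
    return max(eps_end, eps_start * decay ** episode)

print([round(epsilon_by_episode(e), 3) for e in (0, 100, 500, 1000)])   # 1.0, 0.606, 0.082, 0.05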

Reinforcement learning is a theory-heavy but widely applicable field; mastering it takes combining theoretical study with hands-on practice! 🎯