7 Common Reinforcement Learning Algorithms and Their Code Implementations

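Popular reinforcement learning algorithms today include Q-learning, SARSA, DDPG, A2C, PPO, DQN, and TRPO. They are used in applications such as games, robotics, and decision-making, and they continue to be actively developed and improved. This article gives a brief introduction to each.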

1. Q-learning

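Q-learning is a model-free, off-policy reinforcement learning algorithm. It estimates the optimal action-value function with the Bellman equation, iteratively updating the estimated value of each state-action pair. Q-learning is known for its simplicity. In the tabular form shown here it needs discrete state and action spaces small enough to enumerate; large or continuous state spaces require function approximation, as in DQN below.

Below is a simple example of implementing Q-learning in Python.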

import numpy as np

# Define the Q-table and the learning rate
Q = np.zeros((state_space_size, action_space_size))
alpha = 0.1

# Define the exploration rate and discount factor
epsilon = 0.1
gamma = 0.99

for episode in range(num_episodes):
    current_state = initial_state
    done = False  # reset the termination flag at the start of every episode
    while not done:
        # Choose an action using an epsilon-greedy policy
        if np.random.uniform(0, 1) < epsilon:
            action = np.random.randint(0, action_space_size)
        else:
            action = np.argmax(Q[current_state])

        # Take the action and observe the next state and reward
        next_state, reward, done = take_action(current_state, action)

        # Update the Q-table using the Bellman equation
        # (a terminal next state contributes no future value)
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[current_state, action] += alpha * (target - Q[current_state, action])

        current_state = next_state

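In the example above, state_space_size and action_space_size are the numbers of states and actions in the environment, num_episodes is the number of episodes to run, and initial_state is the environment's starting state. take_action(current_state, action) is a function that takes the current state and an action and returns the next state, the reward, and a Boolean indicating whether the episode has finished. The snippet does not define these names; the reader must supply them.

Inside the while loop, an epsilon-greedy policy selects an action based on the current state: with probability epsilon a random action is chosen, and with probability 1 - epsilon the action with the highest Q-value in the current state is chosen.

After taking the action, we observe the next state and reward, update the Q-table with the Bellman equation, and then set the current state to the next state. This is only a simple example of Q-learning; it does not consider how the Q-table is initialized or the specifics of the problem being solved.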
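
To make the loop above concrete, here is a minimal self-contained sketch that runs tabular Q-learning on a made-up five-state corridor. The corridor environment, the body of take_action, the tie-breaking rule, and all hyperparameter values are assumptions chosen for illustration; only the update rule itself comes from the text. It uses plain Python lists instead of NumPy so it runs with the standard library alone.

```python
import random

# Hypothetical environment: a corridor of 5 states (0..4).
# Action 0 moves left, action 1 moves right; reaching state 4 ends the
# episode with reward 1, every other step gives reward 0.
STATE_SPACE_SIZE = 5
ACTION_SPACE_SIZE = 2
GOAL_STATE = 4


def take_action(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL_STATE, state + 1)
    done = next_state == GOAL_STATE
    return next_state, (1.0 if done else 0.0), done


def train_q_learning(num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * ACTION_SPACE_SIZE for _ in range(STATE_SPACE_SIZE)]
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection (ties broken at random)
            if rng.random() < epsilon:
                action = rng.randrange(ACTION_SPACE_SIZE)
            else:
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(ACTION_SPACE_SIZE) if q[state][a] == best]
                )
            next_state, reward, done = take_action(state, action)
            # Q-learning target: bootstrap from the best next action unless terminal
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train_q_learning()
# Greedy action in each non-terminal state after training
greedy_policy = [
    max(range(ACTION_SPACE_SIZE), key=lambda a: q_table[s][a])
    for s in range(GOAL_STATE)
]
print(greedy_policy)
```

After training, the greedy action in every non-terminal state is 1 (move right), which is the shortest route to the goal in this toy problem.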

2. SARSA

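SARSA is a model-free, on-policy reinforcement learning algorithm. It also uses the Bellman equation to estimate the action-value function, but its update is based on the value of the next action actually chosen by the current policy rather than the value of the optimal next action as in Q-learning. SARSA is known for coping well with problems that have stochastic dynamics.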

import numpy as np

# Define the Q-table and the learning rate
Q = np.zeros((state_space_size, action_space_size))
alpha = 0.1

# Define the exploration rate and discount factor
epsilon = 0.1
gamma = 0.99

# epsilon_greedy_policy(epsilon, Q, state) is assumed to be defined elsewhere

for episode in range(num_episodes):
    current_state = initial_state
    done = False  # reset the termination flag at the start of every episode
    action = epsilon_greedy_policy(epsilon, Q, current_state)
    while not done:
        # Take the action and observe the next state and reward
        next_state, reward, done = take_action(current_state, action)
        # Choose next action using epsilon-greedy policy
        next_action = epsilon_greedy_policy(epsilon, Q, next_state)
        # Update the Q-table using the Bellman equation
        # (a terminal next state contributes no future value)
        target = reward + gamma * Q[next_state, next_action] * (not done)
        Q[current_state, action] += alpha * (target - Q[current_state, action])
        current_state = next_state
        action = next_action

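state_space_size and action_space_size are the numbers of states and actions in the environment, num_episodes is the number of episodes to run SARSA for, and initial_state is the environment's initial state. take_action(current_state, action) is a function that takes the current state and an action and returns the next state, the reward, and a Boolean indicating whether the episode has finished.

Actions are selected with an epsilon-greedy policy defined in a separate function, epsilon_greedy_policy(epsilon, Q, current_state): with probability epsilon a random action is chosen, and with probability 1 - epsilon the action with the highest Q-value for the given state is chosen. The first action of an episode is selected before the while loop, and each subsequent action inside it.

The procedure mirrors Q-learning, with one difference: after taking the action and observing the next state and reward, SARSA selects the next action with the same epsilon-greedy policy and uses that action's Q-value, not the maximum Q-value, in the Bellman update of the Q-table.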
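
The SARSA snippet calls epsilon_greedy_policy without defining it. The sketch below shows one way such a helper could look, together with the two update targets side by side so the on-policy/off-policy difference is visible. The rng parameter, the helper names sarsa_target and q_learning_target, and the toy Q-table values are assumptions made for illustration.

```python
import random


def epsilon_greedy_policy(epsilon, Q, state, rng=random):
    """Pick a random action with probability epsilon, else a greedy one.

    Q is indexed as Q[state][action]. The rng parameter is an added
    convenience (not part of the original call) so results can be seeded.
    """
    num_actions = len(Q[state])
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return max(range(num_actions), key=lambda a: Q[state][a])


def sarsa_target(Q, reward, next_state, next_action, gamma, done):
    """On-policy target: uses the action that will actually be taken next."""
    return reward if done else reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma, done):
    """Off-policy target: uses the best next action, whatever is taken."""
    return reward if done else reward + gamma * max(Q[next_state])


# Tiny hand-made Q-table: 2 states, 2 actions (values are arbitrary).
Q = [[0.0, 1.0], [2.0, 5.0]]
print(epsilon_greedy_policy(0.0, Q, 0))          # greedy choice in state 0
print(sarsa_target(Q, 1.0, 1, 0, 0.9, False))    # bootstraps from Q[1][0]
print(q_learning_target(Q, 1.0, 1, 0.9, False))  # bootstraps from max(Q[1])
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps, which is why SARSA's estimates reflect the cost of exploration.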

3. DDPG

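DDPG (Deep Deterministic Policy Gradient) is a model-free, off-policy algorithm for continuous action spaces. It is an actor-critic algorithm: an actor network selects actions and a critic network evaluates them. DDPG is particularly useful for robot control and other continuous control tasks.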

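import random

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense

gamma = 0.99     # discount factor
batch_size = 64  # number of experiences per update

# Define the actor model: maps a state to an action in [-1, 1]
actor = tf.keras.Sequential([
    tf.keras.Input(shape=(state_space_size,)),
    Dense(32, activation='relu'),
    Dense(32, activation='relu'),
    Dense(action_space_size, activation='tanh'),
])

# Define the critic model: maps a (state, action) pair to a scalar Q-value
critic = tf.keras.Sequential([
    tf.keras.Input(shape=(state_space_size + action_space_size,)),
    Dense(32, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='linear'),
])

actor_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Define the replay buffer
replay_buffer = []

# Define the exploration noise.
# OrnsteinUhlenbeckProcess is not defined in this snippet; it is assumed to be
# a user-supplied class whose sample() method returns one noise value per action.
exploration_noise = OrnsteinUhlenbeckProcess(size=action_space_size, theta=0.15, mu=0, sigma=0.2)

for episode in range(num_episodes):
    current_state = initial_state
    done = False
    while not done:
        # Select an action using the actor model and add exploration noise
        state_batch = np.array([current_state], dtype=np.float32)
        action = actor(state_batch).numpy()[0] + exploration_noise.sample()
        action = np.clip(action, -1, 1)

        # Take the action and observe the next state and reward
        next_state, reward, done = take_action(current_state, action)

        # Add the experience to the replay buffer
        replay_buffer.append((current_state, action, reward, next_state, done))
        current_state = next_state

        # Wait until the buffer holds at least one full batch
        if len(replay_buffer) < batch_size:
            continue

        # Sample a batch of experiences from the replay buffer
        batch = random.sample(replay_buffer, batch_size)
        states = np.array([x[0] for x in batch], dtype=np.float32)
        actions = np.array([x[1] for x in batch], dtype=np.float32)
        rewards = np.array([x[2] for x in batch], dtype=np.float32).reshape(-1, 1)
        next_states = np.array([x[3] for x in batch], dtype=np.float32)
        dones = np.array([x[4] for x in batch], dtype=np.float32).reshape(-1, 1)

        # Update the critic model toward the one-step TD target.
        # (Full DDPG also keeps slowly updated target copies of both
        # networks; they are omitted here for brevity.)
        next_q = critic(tf.concat([next_states, actor(next_states)], axis=1))
        target_q_values = rewards + gamma * (1.0 - dones) * next_q
        with tf.GradientTape() as tape:
            q_values = critic(tf.concat([states, actions], axis=1))
            critic_loss = tf.reduce_mean(tf.square(target_q_values - q_values))
        grads = tape.gradient(critic_loss, critic.trainable_variables)
        critic_optimizer.apply_gradients(zip(grads, critic.trainable_variables))

        # Update the actor model: raise the critic's value of the actor's actions
        with tf.GradientTape() as tape:
            actor_q = critic(tf.concat([states, actor(states)], axis=1))
            actor_loss = -tf.reduce_mean(actor_q)
        grads = tape.gradient(actor_loss, actor.trainable_variables)
        actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))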

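In this example, state_space_size and action_space_size are the dimensions of the environment's state and action vectors, num_episodes is the number of episodes, and initial_state is the environment's initial state. take_action(current_state, action) is a function that takes the current state and an action and returns the next state, the reward, and a Boolean indicating whether the episode has finished.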
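
The DDPG snippet relies on an OrnsteinUhlenbeckProcess class that it never defines. As a rough sketch, such a noise source could be written with the standard library as follows. The constructor arguments size, theta, mu, and sigma mirror the call in the snippet; dt, seed, and reset() are assumptions added here, and the discretization shown is one common choice, not necessarily the one the original author had in mind.

```python
import math
import random


class OrnsteinUhlenbeckProcess:
    """Temporally correlated noise that drifts back toward the mean mu.

    Each step applies:
        x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
    """

    def __init__(self, size, theta=0.15, mu=0.0, sigma=0.2, dt=1e-2, seed=None):
        self.size = size
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (e.g. at the start of an episode)."""
        self.x = [float(self.mu)] * self.size

    def sample(self):
        """Advance the process one step and return one value per action dimension."""
        self.x = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


noise = OrnsteinUhlenbeckProcess(size=2, theta=0.15, mu=0, sigma=0.2, seed=0)
print([noise.sample() for _ in range(3)])
```

Because consecutive samples are correlated, the perturbation pushes the actor in a consistent direction for several steps, which tends to suit physical control tasks better than independent noise at every step.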

4. A2C

A2C (Advantage Actor-Critic) is an on-policy actor-critic algorithm that uses the advantage function to update the policy. It is simple to implement and can handle both discrete and continuous action spaces.

import numpy as np
from keras.models import Model
from keras.layers import Dense, Input
from keras.optimizers import Adam
from keras.utils import to_categorical

gamma = 0.99  # discount factor

# Define the actor and critic models
state_input = Input(shape=(state_space_size,))
actor = Dense(32, activation='relu')(state_input)
actor = Dense(32, activation='relu')(actor)
actor = Dense(action_space_size, activation='softmax')(actor)
actor_model = Model(inputs=state_input, outputs=actor)
actor_model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))

state_input = Input(shape=(state_space_size,))
critic = Dense(32, activation='relu')(state_input)
critic = Dense(32, activation='relu')(critic)
critic = Dense(1, activation='linear')(critic)
critic_model = Model(inputs=state_input, outputs=critic)
critic_model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))

for episode in range(num_episodes):
    current_state = initial_state
    done = False
    while not done:
        # Sample an action from the actor's softmax distribution
        state_batch = np.array([current_state])
        action_probs = actor_model.predict(state_batch, verbose=0)[0].astype(np.float64)
        action_probs /= action_probs.sum()  # guard against float32 rounding
        action = np.random.choice(action_space_size, p=action_probs)

        # Take the action and observe the next state and reward
        next_state, reward, done = take_action(current_state, action)

        # Calculate the TD target and the advantage
        # (a terminal next state has value zero)
        next_value = 0.0 if done else critic_model.predict(np.array([next_state]), verbose=0)[0][0]
        target_value = reward + gamma * next_value
        advantage = target_value - critic_model.predict(state_batch, verbose=0)[0][0]

        # Update the actor model: cross-entropy against advantage-weighted
        # one-hot targets equals -advantage * log pi(action | state)
        action_one_hot = to_categorical(action, action_space_size)
        actor_model.train_on_batch(state_batch, np.array([advantage * action_one_hot]))

        # Update the critic model toward the TD target
        critic_model.train_on_batch(state_batch, np.array([[target_value]]))

        current_state = next_state

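In this example, the actor model is a neural network with two hidden layers of 32 neurons each, ReLU activations, and an output layer with a softmax activation. The critic model is also a neural network with two hidden layers of 32 neurons each and ReLU activations, followed by an output layer with a linear activation.

The actor model is trained with a categorical cross-entropy loss and the critic model with a mean squared error loss. Actions are sampled from the probability distribution predicted by the actor model; this sampling is what provides exploration, and no separate noise is added.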
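
The arithmetic behind the two updates can be checked without a neural network. The sketch below computes the one-step advantage and shows that a cross-entropy loss against an advantage-weighted one-hot target reduces to the usual policy-gradient term. The function names and the example numbers are assumptions made for illustration.

```python
import math


def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate A = r + gamma * V(s') - V(s).

    The value of a terminal next state is treated as zero.
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def actor_loss(action_probs, action, advantage):
    """Policy-gradient loss for one step: -A * log pi(a | s)."""
    return -advantage * math.log(action_probs[action])


def categorical_crossentropy(y_true, y_pred):
    """Plain cross-entropy: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


probs = [0.2, 0.8]  # hypothetical actor output for one state
action = 1
advantage = td_advantage(reward=1.0, value_s=0.5, value_next=2.0, gamma=0.9)

# Advantage-weighted one-hot target, as used to train the actor
weighted_one_hot = [advantage if a == action else 0.0 for a in range(len(probs))]

print(actor_loss(probs, action, advantage))
print(categorical_crossentropy(weighted_one_hot, probs))
```

Both printed values agree, which is why a stock cross-entropy loss can stand in for a hand-written policy-gradient loss in small examples.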

5. PPO

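PPO (Proximal Policy Optimization) is an on-policy algorithm that keeps each policy update close to the previous policy by clipping the probability ratio in its surrogate objective, a simpler stand-in for the explicit trust-region constraint used by TRPO. It is particularly useful in environments with high-dimensional observations or continuous action spaces. PPO is known for its stability and relative ease of implementation.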

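import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

gamma = 0.99         # discount factor
clip_epsilon = 0.2   # PPO clipping range
update_epochs = 4    # gradient steps per collected episode

# Define the policy model
state_input = Input(shape=(state_space_size,))
policy = Dense(32, activation='relu')(state_input)
policy = Dense(32, activation='relu')(policy)
policy = Dense(action_space_size, activation='softmax')(policy)
policy_model = Model(inputs=state_input, outputs=policy)

# Define the value model (its own hidden layers, scalar output)
value = Dense(32, activation='relu')(state_input)
value = Dense(32, activation='relu')(value)
value = Dense(1, activation='linear')(value)
value_model = Model(inputs=state_input, outputs=value)

# Define the optimizers
policy_optimizer = Adam(learning_rate=0.001)
value_optimizer = Adam(learning_rate=0.001)

for episode in range(num_episodes):
    # Collect one episode with the current (old) policy
    states, actions, rewards, old_probs = [], [], [], []
    current_state = initial_state
    done = False
    while not done:
        # Select an action using the policy model
        action_probs = policy_model(np.array([current_state], dtype=np.float32)).numpy()[0]
        action_probs = action_probs.astype(np.float64)
        action_probs /= action_probs.sum()  # guard against float32 rounding
        action = np.random.choice(action_space_size, p=action_probs)

        # Take the action and observe the next state and reward
        next_state, reward, done = take_action(current_state, action)

        states.append(current_state)
        actions.append(action)
        rewards.append(reward)
        old_probs.append(action_probs[action])
        current_state = next_state

    # Discounted returns, computed backwards through the episode
    returns = np.zeros(len(rewards), dtype=np.float32)
    running_return = 0.0
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        returns[t] = running_return

    states = np.array(states, dtype=np.float32)
    old_probs = np.array(old_probs, dtype=np.float32)
    # Calculate the advantage: observed return minus the value estimate
    advantages = returns - value_model(states).numpy()[:, 0]

    for _ in range(update_epochs):
        # Update the policy model with the clipped surrogate objective
        with tf.GradientTape() as tape:
            probs = policy_model(states)
            new_probs = tf.reduce_sum(probs * tf.one_hot(actions, action_space_size), axis=1)
            ratio = new_probs / old_probs
            clipped_ratio = tf.clip_by_value(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
            policy_loss = -tf.reduce_mean(tf.minimum(ratio * advantages, clipped_ratio * advantages))
        grads = tape.gradient(policy_loss, policy_model.trainable_variables)
        policy_optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))

        # Update the value model by regressing toward the observed returns
        with tf.GradientTape() as tape:
            value_loss = tf.reduce_mean(tf.square(value_model(states)[:, 0] - returns))
        grads = tape.gradient(value_loss, value_model.trainable_variables)
        value_optimizer.apply_gradients(zip(grads, value_model.trainable_variables))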
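
The clipped surrogate term at the heart of PPO can be examined on its own with plain numbers. The following is a small sketch under the assumption of a single action sample; the function name clipped_surrogate, the default clip range of 0.2, and the example probabilities are illustrative choices.

```python
def clipped_surrogate(new_prob, old_prob, advantage, clip_epsilon=0.2):
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio new_prob / old_prob of the action taken.
    """
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A,
# so there is no incentive to push the policy further in one update.
print(clipped_surrogate(0.75, 0.5, advantage=1.0))

# Ratio 0.5 with a negative advantage: the objective is capped at 0.8 * A.
print(clipped_surrogate(0.25, 0.5, advantage=-1.0))

# Ratio 1.5 with a negative advantage: not clipped, the full penalty applies.
print(clipped_surrogate(0.75, 0.5, advantage=-1.0))
```

Taking the minimum makes the objective a pessimistic bound: moves that would improve it are capped once the ratio leaves the clip range, while moves that make it worse are always counted in full.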

6. DQN

DQN (Deep Q-Network) is a model-free, off-policy algorithm that uses a neural network to approximate the Q-function. It is particularly useful for Atari games and similar problems where the state space is too high-dimensional for a Q-table.

import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input
from keras.optimizers import Adam

# Hyperparameters (illustrative values)
epsilon = 0.1
gamma = 0.99
batch_size = 32
replay_buffer_size = 10000

# Define the Q-network model
model = Sequential()
model.add(Input(shape=(state_space_size,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(action_space_size, activation='linear'))
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))

# Define the replay buffer
replay_buffer = deque(maxlen=replay_buffer_size)

for episode in range(num_episodes):
    current_state = initial_state
    done = False
    while not done:
        # Select an action using an epsilon-greedy policy
        if np.random.rand() < epsilon:
            action = np.random.randint(0, action_space_size)
        else:
            action = np.argmax(model.predict(np.array([current_state]), verbose=0)[0])

        # Take the action and observe the next state and reward
        next_state, reward, done = take_action(current_state, action)

        # Add the experience to the replay buffer
        replay_buffer.append((current_state, action, reward, next_state, done))
        current_state = next_state

        # Wait until the buffer holds at least one full batch
        if len(replay_buffer) < batch_size:
            continue

        # Sample a batch of experiences from the replay buffer
        batch = random.sample(list(replay_buffer), batch_size)

        # Prepare the inputs and targets for the Q-network.
        # The b_ names avoid overwriting the loop's own action/reward/done.
        # (Full DQN computes next_q_values with a separate target network.)
        inputs = np.array([x[0] for x in batch])
        next_inputs = np.array([x[3] for x in batch])
        targets = model.predict(inputs, verbose=0)
        next_q_values = model.predict(next_inputs, verbose=0)
        for i, (_, b_action, b_reward, _, b_done) in enumerate(batch):
            if b_done:
                targets[i, b_action] = b_reward
            else:
                targets[i, b_action] = b_reward + gamma * np.max(next_q_values[i])

        # Update the Q-network
        model.train_on_batch(inputs, targets)

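In the code above, the Q-network has two hidden layers of 32 neurons each with ReLU activations. The network is trained with a mean squared error loss and the Adam optimizer.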
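
The two pieces of bookkeeping in the DQN loop, the replay buffer and the training targets, can be isolated from the network and tried out with the standard library. The sketch below is one possible arrangement; the ReplayBuffer class, the dqn_targets helper, and the sample values are assumptions made for illustration, with plain lists standing in for the Q-values a network would predict.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a uniformly sampled batch, transposed into per-field tuples."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Target per sample: r if terminal, else r + gamma * max_a Q(s', a)."""
    return [
        r if d else r + gamma * max(q)
        for r, q, d in zip(rewards, next_q_values, dones)
    ]


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, step == 4)
print(len(buffer))  # capacity limits the buffer to the 3 newest entries

states, actions, rewards, next_states, dones = buffer.sample(2)
print(states, dones)

print(dqn_targets([1.0, 2.0], [[0.0, 1.0], [3.0, 4.0]], [False, True], gamma=0.5))
```

Sampling uniformly from past experience breaks the correlation between consecutive steps, which is the main reason DQN trains on replayed batches instead of the most recent transition.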

7. TRPO

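TRPO (Trust Region Policy Optimization) is a model-free, on-policy algorithm that updates the policy with a trust-region optimization method. It is particularly useful in environments with high-dimensional observations and continuous action spaces.

TRPO is a complex algorithm whose implementation requires multiple steps and components. It is not a simple algorithm that can be implemented in a few lines of code.

So here we use an existing library that implements TRPO, such as OpenAI Baselines, which provides a variety of pre-implemented reinforcement learning algorithms, including TRPO.

To use TRPO from OpenAI Baselines, we first need to install it: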

pip install baselines

You can then use the trpo_mpi module of the baselines library to train a TRPO agent in your environment. Here is a simple example:

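import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.trpo_mpi import trpo_mpi

# Initialize the environment
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])

# Train the TRPO model with the library's built-in MLP policy network.
# NOTE: the keyword arguments below assume the learn() signature found in
# recent versions of the baselines repository (older releases take a
# policy_fn argument instead); check the version you have installed.
# total_timesteps is an illustrative training budget.
model = trpo_mpi.learn(network='mlp', env=env, total_timesteps=100000)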

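We initialize the environment with the Gym library, choose a policy network, and call the learn() function of the trpo_mpi module to train the model.

TRPO can also be implemented directly on top of frameworks such as TensorFlow or PyTorch. Below is a simplified sample training loop written with TF 2.0; it shows the policy network and the gradient update but leaves out TRPO's trust-region step.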

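import gym
import numpy as np
import tensorflow as tf

# NOTE: this loop assumes the classic Gym API, where reset() returns only the
# observation and step() returns four values.


# Define the policy network
class PolicyNetwork(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(16, activation='relu')
        self.dense2 = tf.keras.layers.Dense(16, activation='relu')
        # One unnormalized log-probability (logit) per action
        self.logits = tf.keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.logits(x)


# Initialize the environment
env = gym.make("CartPole-v1")

# Initialize the policy network
policy_network = PolicyNetwork(env.action_space.n)

# Define the optimizer
optimizer = tf.keras.optimizers.Adam()

# Set the discount factor and the maximum number of iterations (episodes)
gamma = 0.99
max_iters = 1000

# Start the training loop
for i in range(max_iters):
    observations, actions, rewards = [], [], []
    observation = env.reset()
    done = False
    while not done:
        # Sample an action from the policy network
        logits = policy_network(np.array([observation], dtype=np.float32))
        action = int(tf.random.categorical(logits, 1)[0, 0])

        # Take a step in the environment
        next_observation, reward, done, _ = env.step(action)
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        observation = next_observation

    # Discounted, normalized returns for the finished episode
    returns = np.zeros(len(rewards), dtype=np.float32)
    running_return = 0.0
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        returns[t] = running_return
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    with tf.GradientTape() as tape:
        # Compute the loss: -log pi(a | s) weighted by the return.
        # This is a plain policy-gradient loss, not TRPO's constrained update.
        logits = policy_network(np.array(observations, dtype=np.float32))
        neg_log_probs = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=np.array(actions), logits=logits)
        loss = tf.reduce_mean(neg_log_probs * returns)

    # Compute the gradients
    grads = tape.gradient(loss, policy_network.trainable_variables)

    # Perform the update step
    optimizer.apply_gradients(zip(grads, policy_network.trainable_variables))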

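In this example, we first define a policy network with TensorFlow's Keras API. We then initialize the environment with the Gym library, create the policy network, and define the optimizer and the loss used to train it.

In the training loop, we sample actions from the policy network, step through the environment, and use TensorFlow's GradientTape to compute the loss and its gradients. We then perform the update step with the optimizer.

This is a simple example that only shows the skeleton of a policy-optimization loop in TensorFlow 2.0. TRPO is a very complex algorithm, and the example does not cover its defining details, namely the KL-divergence trust-region constraint, the conjugate-gradient step, and the line search, but it is a reasonable starting point for experimenting toward TRPO.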
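
To give a feel for the trust-region idea itself, the sketch below implements only the constraint check: it measures the KL divergence between the old and the candidate policy and shrinks a proposed update until the divergence is within a limit. It is a deliberately reduced illustration, not TRPO. The function names, the max_kl value of 0.01, the halving factor, and the use of raw logits for a single state are all assumptions; real TRPO obtains the update direction from a conjugate-gradient solve and also requires the surrogate objective to improve.

```python
import math


def softmax(logits):
    """Convert logits to a categorical probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def backtracking_step(logits, direction, max_kl=0.01, shrink=0.5, max_backtracks=10):
    """Shrink a proposed update until the new policy stays inside the KL limit.

    Returns the accepted logits and the step size used. If no step size is
    acceptable, the update is rejected and the original logits are returned.
    """
    old_probs = softmax(logits)
    step = 1.0
    for _ in range(max_backtracks):
        candidate = [l + step * d for l, d in zip(logits, direction)]
        if kl_divergence(old_probs, softmax(candidate)) <= max_kl:
            return candidate, step
        step *= shrink
    return list(logits), 0.0


old_logits = [0.0, 0.0]          # uniform policy over two actions
proposed_direction = [1.0, -1.0]  # hypothetical update favoring action 0

new_logits, step = backtracking_step(old_logits, proposed_direction)
print(step)
print(softmax(new_logits))
print(kl_divergence(softmax(old_logits), softmax(new_logits)))
```

In this example the full step and the next two halvings all violate the KL limit, so the update is accepted only at one eighth of the proposed size. Bounding the change in the policy's action distribution, not the change in its parameters, is what distinguishes a trust-region update from a plain gradient step.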