Train an Agent in a Reinforcement Learning Environment
This example demonstrates how to train a simple neural net to maximize its reward in the "Simulated Cart Pole" environment using the REINFORCE method (Williams, 1992). The cart pole environment consists of a cart that moves along a frictionless one-dimensional track and a weighted pole attached to the cart by a hinge (also known as an inverted pendulum). The cart starts with some initial velocity, so the pole will fall over without intervention. The aim of the agent is to keep the pole upright for as long as possible, which it does by learning which of two possible actions (move left or move right) to perform at any given time.
Load and render the environment in its initial state.
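The original example loads a ready-made environment. As a rough stand-in, the following Python sketch simulates the cart-pole dynamics directly. The constants, the termination thresholds, the reward of 1 per step, and the function names reset and step are all assumptions chosen for illustration (they follow values commonly used in cart-pole benchmarks), not the settings of the environment loaded here.

```python
import math
import random

# Assumed physical constants (typical benchmark values, not from this example).
GRAVITY = 9.8            # m/s^2
CART_MASS = 1.0          # kg
POLE_MASS = 0.1          # kg
POLE_HALF_LENGTH = 0.5   # m
FORCE = 10.0             # N, magnitude of each push
TIME_STEP = 0.02         # s, integration step
MAX_POSITION = 2.4       # m, assumed track limit
MAX_ANGLE = 12 * math.pi / 180  # rad, assumed "pole has fallen" threshold


def reset(rng):
    """Return a start state (x, x_dot, theta, theta_dot) with small random values."""
    return tuple(rng.uniform(-0.05, 0.05) for _ in range(4))


def step(state, action):
    """Advance one time step. action is 0 (push left) or 1 (push right).

    Returns (next_state, reward, done).
    """
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == 1 else -FORCE
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    # Standard frictionless cart-pole equations of motion.
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    # Explicit Euler integration.
    x += TIME_STEP * x_dot
    x_dot += TIME_STEP * x_acc
    theta += TIME_STEP * theta_dot
    theta_dot += TIME_STEP * theta_acc

    done = abs(x) > MAX_POSITION or abs(theta) > MAX_ANGLE
    return (x, x_dot, theta, theta_dot), 1.0, done
```

Calling reset(random.Random(0)) gives a start state, and repeated calls to step advance it until done becomes True.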
Define a simple net that will learn a policy for choosing whether to move the cart left or right.
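To make the idea of a policy concrete, here is a minimal Python sketch of a stochastic two-action policy. It assumes a single linear layer followed by a logistic function, which is only an illustration: the architecture of the net used in this example is not shown here, and the names probability_right and choose_action are hypothetical.

```python
import math
import random


def probability_right(weights, bias, state):
    """Probability of choosing "move right" under a logistic policy (assumed form)."""
    z = bias + sum(w * s for w, s in zip(weights, state))
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def choose_action(weights, bias, state, rng):
    """Sample an action: 1 (move right) or 0 (move left)."""
    return 1 if rng.random() < probability_right(weights, bias, state) else 0
```

With all weights at zero the policy picks each action with equal probability. Training shifts the weights so that the preferred action depends on the cart and pole state.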
Define a loss function for policy gradient learning.
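For reference, one common way to write a REINFORCE-style loss is given below. The notation is an assumption, not taken from this example: \pi_\theta is the policy with parameters \theta, s_t and a_t are the state and action at step t, r_t is the reward, \gamma is the discount factor, and T is the last step of the episode. The exact loss used here may differ, for instance in how the returns are normalized.

```latex
G_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}, \qquad
L(\theta) = -\sum_{t=0}^{T} G_t \,\log \pi_\theta(a_t \mid s_t), \qquad
\nabla_\theta L(\theta) = -\sum_{t=0}^{T} G_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

Minimizing L raises the log-probability of actions that were followed by a high discounted return G_t.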
Define a generator function that will sample training data for the net.
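The sampling step can be sketched in Python as a generator that plays episodes and attaches a discounted return to every recorded step. The callable run_episode is a hypothetical placeholder assumed to return three parallel lists (states, actions, rewards) for one episode. The batch layout and default values are likewise assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def training_batches(run_episode, gamma=0.99, episodes_per_batch=4):
    """Yield training batches forever; each batch pools several episodes.

    run_episode is a hypothetical callable returning (states, actions, rewards).
    """
    while True:
        states, actions, returns = [], [], []
        for _ in range(episodes_per_batch):
            ep_states, ep_actions, ep_rewards = run_episode()
            states.extend(ep_states)
            actions.extend(ep_actions)
            returns.extend(discounted_returns(ep_rewards, gamma))
        yield {"states": states, "actions": actions, "returns": returns}
```

Because the generator replays the environment with the current policy on every call, each batch reflects the policy as it stands at that point in training.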
Train the policy net, measuring the average discounted reward.
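As an illustration of what such a training run does, the following Python sketch implements a bare-bones REINFORCE loop and records the discounted return of each episode. It is a sketch under assumptions rather than the training procedure used here: the policy is a logistic function of a linear score, the mean return of the episode is subtracted as a baseline, and reset(rng) and step(state, action) are hypothetical environment callables, where step returns (next_state, reward, done).

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(reset, step, episodes=200, gamma=0.99, learning_rate=0.01,
          max_steps=200, seed=0):
    """Run REINFORCE with an assumed logistic policy.

    Returns (weights, bias, history), where history holds the discounted
    return from the first step of every episode.
    """
    rng = random.Random(seed)
    weights, bias, history = None, 0.0, []
    for _ in range(episodes):
        state = reset(rng)
        if weights is None:
            weights = [0.0] * len(state)

        # Play one episode with the current policy.
        trajectory = []
        for _ in range(max_steps):
            p_right = _sigmoid(bias + sum(w * s for w, s in zip(weights, state)))
            action = 1 if rng.random() < p_right else 0
            next_state, reward, done = step(state, action)
            trajectory.append((state, action, p_right, reward))
            state = next_state
            if done:
                break

        # Discounted return for every step of the episode.
        returns, running = [], 0.0
        for _, _, _, reward in reversed(trajectory):
            running = reward + gamma * running
            returns.append(running)
        returns.reverse()
        history.append(returns[0])

        # Gradient ascent on sum_t (G_t - baseline) * log pi(a_t | s_t).
        # For a logistic policy, d log pi / d score = action - p_right.
        baseline = sum(returns) / len(returns)
        for (s, action, p_right, _), g in zip(trajectory, returns):
            scale = learning_rate * (g - baseline) * (action - p_right)
            weights = [w + scale * x for w, x in zip(weights, s)]
            bias += scale
    return weights, bias, history
```

Averaging the entries of history over a sliding window gives a smoothed training curve. A rising average indicates that the policy is keeping episodes alive for longer.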
Animate the environment with the trained policy net (click the following image to see an animation). Notice that the pole stays upright.
Compare this to an agent taking random actions in the environment (click the following image to see an animation).