; put in obsidian later.
Gymnasium is a project that provides an API for all single-agent reinforcement learning environments.
Gist:
At the core of Gymnasium is Env, a high-level Python class representing a Markov decision process (MDP) (not a perfect reconstruction; it is missing several components). Every environment specifies the format of valid actions and observations with the action_space and observation_space attributes. Gymnasium can automatically load environments, pre-wrapped with several important wrappers, through the gymnasium.make() function.
env = gym.make('CartPole-v1', render_mode="human")
max_episode_steps: maximum length of an episode (a make() keyword argument).
Env
The class encapsulates an environment with arbitrary behind-the-scenes dynamics through its step() and reset() functions. An environment can be partially or fully observed by a single agent.
action_space:
The Space object corresponding to valid actions; all valid actions should be contained within the space.
observation_space:
The Space object corresponding to valid observations; all valid observations should be contained within the space.
Gymnasium has support for a wide range of spaces that users might need:
Box: describes a bounded space with upper and lower limits of any n-dimensional shape. Specifically, a Box represents the Cartesian product of n closed intervals.
Discrete: describes a discrete space where {0, 1, ..., n-1} are the possible values our observation or action can take.
- and so on.
reset()
Resets the environment to an initial internal state, returning an initial observation and info. This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment.
returns:
observation: Observation of the initial state. This will be an element of observation_space and is analogous to the observation returned by step(). An example is a numpy array containing the positions and velocities of the cart and pole in CartPole.
info: A dictionary containing auxiliary diagnostic information complementing observation (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent's performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward.
step(action)
action: an action provided by the agent to update the environment state.
Runs one timestep of the environment’s dynamics using the agent actions. When the end of an episode is reached, it is necessary to call reset() to reset this environment’s state for the next episode.
returns:
observation: Observation of the environment's state after the action, an element of observation_space (analogous to reset()).
reward: The reward as a result of taking the action.
terminated: Whether the agent reaches the terminal state (as defined under the MDP of the task), which can be positive or negative.
truncated: Whether a truncation condition outside the scope of the MDP is satisfied. Typically this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached.
info: Auxiliary diagnostic information, as in reset().
Wrappers
Wrappers are a convenient way to modify an existing environment without altering the underlying code directly. Using wrappers lets you avoid a lot of boilerplate code and makes your environment more modular.