OpenAI GYM's env.step(): what are the values?

Question:

I am getting to know OpenAI’s Gym (0.25.1) using Python 3.10, with the environment set to 'FrozenLake-v1' (code below).

According to the documentation, calling env.step() should return a tuple containing 4 values (observation, reward, done, info). However, when running my code accordingly, I get a ValueError:

Problematic code:

observation, reward, done, info = env.step(new_action)

Error:

      3 new_action = env.action_space.sample()
----> 5 observation, reward, done, info = env.step(new_action)
      7 # here's a look at what we get back
      8 print(f"observation: {observation}, reward: {reward}, done: {done}, info: {info}")

ValueError: too many values to unpack (expected 4)

Adding one more variable fixes the error:

a, b, c, d, e = env.step(new_action)
print(a, b, c, d, e)

Output:

5 0 True True {'prob': 1.0}

My interpretation:

  • 5 should be observation
  • 0 is reward
  • prob: 1.0 is info
  • One of the True values is done

So what does the leftover boolean stand for?

Thank you for your help!


Complete code:

import gym

env = gym.make('FrozenLake-v1', new_step_api=True, render_mode='ansi') # build environment

current_obs = env.reset() # start new episode

for e in env.render():
    print(e)
    
new_action = env.action_space.sample() # random action

observation, reward, done, info = env.step(new_action) # perform action, ValueError!

for e in env.render():
    print(e)
Asked By: doesnotcompile


Answers:

From the code’s docstrings:

       Returns:
           observation (object): this will be an element of the environment's :attr:`observation_space`.
               This may, for instance, be a numpy array containing the positions and velocities of certain objects.
           reward (float): The amount of reward returned as a result of taking the action.
           terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached.
               In this case further step() calls could return undefined results.
           truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied.
               Typically a timelimit, but could also be used to indicate agent physically going out of bounds.
               Can be used to end the episode prematurely before a `terminal state` is reached.
           info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging).
               This might, for instance, contain: metrics that describe the agent's performance state, variables that are
               hidden from observations, or individual reward terms that are combined to produce the total reward.
               It also can contain information that distinguishes truncation and termination, however this is deprecated in favour
               of returning two booleans, and will be removed in a future version.
           (deprecated)
           done (bool): A boolean value for if the episode has ended, in which case further :meth:`step` calls will return undefined results.
               A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully,
               a certain timelimit was exceeded, or the physics simulation has entered an invalid state.

It appears that the first boolean is terminated, i.e. "whether a terminal state (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results."

It appears that the second boolean is truncated, i.e. whether the episode was cut short outside the task's MDP, for example because your agent went out of bounds or a time limit was hit. From the docstring:

"whether a truncation condition outside the scope of the MDP is satisfied. Typically a timelimit, but could also be used to indicate agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached."

Answered By: Jason R Stevens CFA

You may want to consider using the new API for creating the env, because only a temporary compatibility wrapper is provided for old code, and it may cease to be supported some day. Using the new API has minor ramifications for your code (in one line: don't simply do done = truncated).

Let us quickly understand the change.

To use the new API, add the new_step_api=True option, e.g.

env = gym.make('MountainCar-v0', new_step_api=True)

This causes the env.step() method to return five items instead of four. What is this extra one?

  • Well, in the old API, done was returned as True if the episode ended in any way.
  • In the new API, done is split into two parts (see the sketch after this list):
      • terminated=True if the environment terminates (e.g. due to task completion, failure, etc.)
      • truncated=True if the episode truncates due to a time limit or some other reason that is not defined as part of the task MDP.
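
To see the two flags diverge, run random actions in MountainCar-v0: a random policy essentially never reaches the goal, so the episode ends at the 200-step time limit with terminated=False and truncated=True. A minimal sketch (gym 0.25.x):

import gym

env = gym.make('MountainCar-v0', new_step_api=True)
env.reset()
while True:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        # Expect terminated=False, truncated=True (time limit reached).
        print(f"terminated={terminated}, truncated={truncated}")
        break
env.close()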

This is done to remove the ambiguity in the done signal: done=True in the old API did not distinguish between the environment terminating and the episode truncating. Previously this was worked around by having the TimeLimit wrapper set info['TimeLimit.truncated'] when a timelimit was hit. None of that is required now, and the env.step() function returns us:

next_state, reward, terminated, truncated, info = env.step(action)
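
For contrast, under the old API the same information had to be read out of info. A sketch of old-style code, using the info key named above (the rest is illustrative):

obs, reward, done, info = env.step(action)  # old 4-tuple API
timed_out = info.get('TimeLimit.truncated', False)  # set by the TimeLimit wrapper on timeout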

How could this impact your code:
If your game has some kind of max_steps or timeout, you should read the truncated variable IN ADDITION to the terminated variable to see if your game ended. Depending on the kind of rewards you have, you may want to tweak things slightly. The simplest option is just to do

done = truncated or terminated

and then proceed to reuse your old code.
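
Putting it together, a minimal sketch of an adapted loop (gym 0.25.x; the environment and random policy here are illustrative):

import gym

env = gym.make('FrozenLake-v1', new_step_api=True)
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # stand-in for your policy
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # collapse back to the old-style done flag
    state = next_state
env.close()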

Answered By: Allohvk