tf_agents.bandits.environments.piecewise_bernoulli_py_environment.PiecewiseBernoulliPyEnvironment

Implements piecewise stationary finite-armed Bernoulli Bandits.

Inherits From: BanditPyEnvironment, PyEnvironment

This class implements a piecewise stationary, finite-armed, non-contextual Bernoulli bandit environment as a subclass of BanditPyEnvironment. Unlike the stationary Bernoulli environment, the reward distribution parameters change abruptly at given time steps. The environment keeps track of the current time and increments it by one at each call of _apply_action. Within each stationary piece, the reward is 0/1 (Bernoulli) with the parameter p valid for that piece.
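
As a rough illustration of the reward model described above, the sketch below draws a 0/1 reward for a chosen arm from the means of a given piece. The helper name `sample_reward` and the use of Python's `random` module are assumptions made for illustration only; they are not part of the TF-Agents API and do not reproduce its implementation.

```python
import random

def sample_reward(piece_means, piece_index, action, rng=random):
  """Draws a Bernoulli reward for `action` under one piece (hypothetical helper)."""
  p = piece_means[piece_index][action]  # Success probability of the chosen arm.
  return 1.0 if rng.random() < p else 0.0

means = [[0.1, 0.5], [0.5, 0.1], [0.5, 0.5]]  # 3 pieces, 2 arms.
rng = random.Random(0)  # Seeded generator so the sketch is repeatable.
rewards = [sample_reward(means, 0, 1, rng) for _ in range(1000)]
empirical_mean = sum(rewards) / len(rewards)  # Should be close to p = 0.5.
```

Every reward is either 0.0 or 1.0, and the empirical mean over many draws approaches the parameter p of the selected arm in the selected piece.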

Examples:

import random

means = [[0.1, 0.5], [0.5, 0.1], [0.5, 0.5]]  # 3 pieces, 2 arms.

def constant_duration_gen(delta):
  # Every piece lasts exactly `delta` steps.
  while True:
    yield delta

env_piecewise_10_steps = PiecewiseBernoulliPyEnvironment(
    means, constant_duration_gen(10))

def random_duration_gen(a, b):
  # Every piece lasts a random number of steps in [a, b] (inclusive).
  while True:
    yield random.randint(a, b)

env_rnd_piecewise_10_to_20_steps = PiecewiseBernoulliPyEnvironment(
    means, random_duration_gen(10, 20))

For a reference on bandits, see e.g. Example 1.1 in "A Tutorial on Thompson Sampling" by Russo et al. (https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf). A paper using piecewise stationary environments is Qingyun Wu, Naveen Iyer, Hongning Wang, "Learning Contextual Bandits in a Non-stationary Environment," Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (https://arxiv.org/pdf/1805.09365.pdf).

Args
piece_means A matrix (list of lists) with shape (num_pieces, num_arms) containing floats in [0, 1]. Each inner list contains the mean rewards of the num_arms actions for one of the num_pieces pieces. The list wraps around after the last piece.
change_duration_generator A generator of the time durations. If this yields the values d0, d1, d2, ..., then the reward parameters change at steps d0, d0 + d1, d0 + d1 + d2, ..., as follows:
  • piece_means[0] for 0 <= t < d0
  • piece_means[1] for d0 <= t < d0 + d1
  • piece_means[2] for d0 + d1 <= t < d0 + d1 + d2
  • ...
Note that the generated values must be non-negative. A value of zero means that the corresponding parameters in piece_means are skipped, i.e. the duration of that piece is zero steps. If the generator ends (e.g. if it is obtained with iter()) and the step goes beyond the last piece, a StopIteration exception is raised.
batch_size If specified, this is the batch size for observation and actions.
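
The mapping from a time step t to the active row of piece_means, as described for change_duration_generator, can be sketched in plain Python. The helper name `active_piece_index` is hypothetical and the code is only an illustration of the documented semantics (cumulative boundaries, skipped zero-length pieces, wrap-around, StopIteration on exhaustion), not the library's implementation.

```python
import itertools

def active_piece_index(t, durations, num_pieces):
  """Returns the index into piece_means active at step t (hypothetical helper).

  `durations` is an iterable of non-negative durations d0, d1, d2, ...
  Raises StopIteration if the durations run out before step t is covered.
  """
  iterator = iter(durations)
  boundary = 0  # First step that is not covered by the pieces seen so far.
  for piece in itertools.count():
    boundary += next(iterator)  # Raises StopIteration on an exhausted iterator.
    if t < boundary:
      return piece % num_pieces  # Wrap around after the last piece.

# Durations 3, 0, 2, 4 with 3 pieces: piece 1 has zero length and is skipped,
# and the fourth duration wraps around to piece 0 again.
schedule = [active_piece_index(t, [3, 0, 2, 4], num_pieces=3) for t in range(7)]
# schedule == [0, 0, 0, 2, 2, 0, 0]
```

With a finite list of durations, asking for a step beyond the last covered one (step 9 in this example) raises StopIteration, which matches the behavior described for generators that end.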

Attributes
batch_size The batch size of the environment.
batched Whether the environment is batched or not.

If the environment supports batched observations and actions, override this property to return True.

A batched environment takes in a batched set of actions and returns a batched set of observations. This means for all numpy arrays in the input and output nested structures, the first dimension is the batch size.

When batched, the left-most dimension is not part of the action_spec or the observation_spec and corresponds to the batch dimension.

When the environment is batched and handle_auto_reset is enabled, it checks np.all(steps.is_last()) to decide whether to reset.
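
The batched reset check can be pictured with plain Python lists standing in for NumPy arrays. In this sketch the list length plays the role of the batch dimension, and the helper name `all_last` is an assumption for illustration; the library itself uses np.all(steps.is_last()).

```python
def all_last(is_last_flags):
  """Mirrors np.all(steps.is_last()) using a plain list (hypothetical helper)."""
  return all(is_last_flags)

batch_size = 3
# One flag per batch member: True if that member's TimeStep is LAST.
partly_done = [True, False, True]
fully_done = [True, True, True]

reset_partly = all_last(partly_done)  # False: one member is still running.
reset_fully = all_last(fully_done)    # True: every member has finished.
```

Under this check, a batch in which only some members have reached LAST does not trigger a reset.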

name

Methods

action_spec

Defines the actions that should be provided to step().

May use a subclass of ArraySpec that specifies additional properties such as min and max bounds on the values.

Returns
An ArraySpec, or a nested dict, list or tuple of ArraySpecs.

close

Frees any resources used by the environment.

Implement this method for an environment backed by an external process.

This method can be used directly:

env = Env(...)
# Use env.
env.close()

or via a context manager:

with Env(...) as env:
  # Use env.
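
To make the relationship between close() and the with-statement concrete, here is a minimal sketch of a class that supports both usage patterns. `ToyResourceEnv` is a hypothetical stand-in, not a TF-Agents class, and the assumption that leaving the with-block calls close() is stated here for illustration.

```python
class ToyResourceEnv:
  """Hypothetical environment that records whether close() was called."""

  def __init__(self):
    self.closed = False

  def close(self):
    self.closed = True  # A real environment would release resources here.

  def __enter__(self):
    return self  # The object bound by `with ... as env`.

  def __exit__(self, exc_type, exc_value, traceback):
    self.close()  # Runs even if the with-block raised an exception.
    return False  # Do not suppress exceptions.

with ToyResourceEnv() as env:
  inside_closed = env.closed  # Still open inside the block.
after_closed = env.closed     # Closed once the block has exited.
```

The context-manager form is usually the safer choice because the cleanup runs even when the body raises.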

current_time_step

Returns the current timestep.

discount_spec

Defines the discounts that are returned by step().

Override this method to define an environment that uses non-standard discount values, for example an environment with array-valued discounts.

Returns
An ArraySpec, or a nested dict, list or tuple of ArraySpecs.

get_info

Returns the environment info returned on the last step.

Returns
Info returned by last call to step(). None by default.

Raises
NotImplementedError If the environment does not use info.

get_state

Returns the state of the environment.

The state contains everything required to restore the environment to the current configuration. This can contain e.g.

  • The current time_step.
  • The number of steps taken in the environment (for finite horizon MDPs).
  • Hidden state (for POMDPs).

Callers should not assume anything about the contents or format of the returned state. It should be treated as a token that can be passed back to set_state() later.

Note that the returned state handle should not be modified by the environment later on, and ensuring this (e.g. using copy.deepcopy) is the responsibility of the environment.

Returns
state The current state of the environment.
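
The snapshot contract described above (an opaque token that the environment must not modify later) can be sketched with copy.deepcopy. `ToyStatefulEnv` and its `advance` method are hypothetical names used only for this illustration; whether a concrete TF-Agents environment implements get_state() and set_state() depends on that environment.

```python
import copy

class ToyStatefulEnv:
  """Hypothetical environment whose whole state is a step counter."""

  def __init__(self):
    self._state = {"steps": 0}

  def advance(self):
    self._state["steps"] += 1

  def get_state(self):
    # Return a copy so later steps cannot mutate the handle given to the caller.
    return copy.deepcopy(self._state)

  def set_state(self, state):
    # Copy again so the caller can reuse the same token several times.
    self._state = copy.deepcopy(state)

env = ToyStatefulEnv()
env.advance()
snapshot = env.get_state()  # Treated as an opaque token by callers.
env.advance()
env.set_state(snapshot)     # Back to the configuration after one step.
# Inspected here only to check the sketch; real callers should not do this.
restored_steps = env.get_state()["steps"]
```

Because both directions copy, advancing the environment after taking the snapshot leaves the snapshot unchanged.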

observation_spec

Defines the observations provided by the environment.

May use a subclass of ArraySpec that specifies additional properties such as min and max bounds on the values.

Returns
An ArraySpec, or a nested dict, list or tuple of ArraySpecs.

render

Renders the environment.

Args
mode One of ['rgb_array', 'human']. Renders to a NumPy array, or brings up a window where the environment can be visualized.

Returns
An ndarray of shape [width, height, 3] denoting an RGB image if mode is rgb_array. Otherwise returns nothing and renders directly to a display window.

Raises
NotImplementedError If the environment does not support rendering.

reset

Starts a new sequence and returns the first TimeStep of this sequence.

Returns
A TimeStep namedtuple containing:
  • step_type: A StepType of FIRST.
  • reward: 0.0, indicating the reward.
  • discount: 1.0, indicating the discount.
  • observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec().

reward_spec

Defines the rewards that are returned by step().

Override this method to define an environment that uses non-standard reward values, for example an environment with array-valued rewards.

Returns
An ArraySpec, or a nested dict, list or tuple of ArraySpecs.

seed

Seeds the environment.

Args
seed Value to use as seed for the environment.

set_state

Restores the environment to a given state.

See definition of state in the documentation for get_state().

Args
state A state to restore the environment to.

should_reset

Whether the Environment should reset given the current timestep.

By default it only resets when all time_steps are LAST.

Args
current_time_step The current TimeStep.

Returns
A bool indicating whether the Environment should reset or not.

step

Updates the environment according to the action and returns a TimeStep.

If the environment returned a TimeStep with StepType.LAST at the previous step, the implementation of _step in the environment should call reset to start a new sequence and ignore action.

This method will start a new sequence if called after the environment has been constructed and reset has not been called. In this case action will be ignored.

If should_reset(current_time_step) is True, then this method will reset by itself. In this case action will be ignored.

Args
action A NumPy array, or a nested dict, list or tuple of arrays corresponding to action_spec().

Returns
A TimeStep namedtuple containing:
  • step_type: A StepType value.
  • reward: A NumPy array, reward value for this timestep.
  • discount: A NumPy array, discount in the range [0, 1].
  • observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec().
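
The reset rules for step() can be illustrated with a toy, unbatched environment. Everything in this sketch is hypothetical: `ToyEpisodeEnv`, the dict-based time steps, and the two-step episode are assumptions chosen to show the control flow, not the TF-Agents implementation.

```python
class ToyEpisodeEnv:
  """Hypothetical two-step episodic environment illustrating auto-reset."""

  def __init__(self):
    self._current = None  # No TimeStep yet: reset() has not been called.

  def reset(self):
    self._current = {"step_type": "FIRST", "reward": 0.0, "discount": 1.0}
    return self._current

  def should_reset(self, time_step):
    return time_step["step_type"] == "LAST"

  def step(self, action):
    # Start a new sequence, ignoring the action, if there is no current
    # TimeStep yet or the previous one was LAST.
    if self._current is None or self.should_reset(self._current):
      return self.reset()
    # Toy dynamics: every episode ends on the step right after FIRST.
    self._current = {"step_type": "LAST", "reward": float(action),
                     "discount": 0.0}
    return self._current

env = ToyEpisodeEnv()
step_types = [env.step(1)["step_type"] for _ in range(4)]
# step_types == ["FIRST", "LAST", "FIRST", "LAST"]
```

The first call to step() behaves like reset() because no sequence has been started, and the third call resets again because the previous TimeStep was LAST; in both cases the action is ignored.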

time_step_spec

Describes the TimeStep fields returned by step().

Override this method to define an environment that uses non-standard values for any of the items returned by step(). For example, an environment with array-valued rewards.

Returns
A TimeStep namedtuple containing (possibly nested) ArraySpecs defining the step_type, reward, discount, and observation structure.

__enter__

Allows the environment to be used in a with-statement context.

__exit__

Allows the environment to be used in a with-statement context.