Implements piecewise stationary finite-armed Bernoulli Bandits.
Inherits From: BanditPyEnvironment, PyEnvironment
tf_agents.bandits.environments.piecewise_bernoulli_py_environment.PiecewiseBernoulliPyEnvironment(
    piece_means: np.ndarray,
    change_duration_generator: Callable[[], int],
    batch_size: Optional[int] = 1
)
This environment implements a piecewise stationary, finite-armed, non-contextual Bernoulli bandit as a subclass of BanditPyEnvironment. Unlike the stationary Bernoulli environment, the reward distribution parameters undergo abrupt changes at given time steps. The environment keeps the current time and increments it by one at each call of _apply_action. Within each stationary piece, rewards are 0/1 (Bernoulli) draws with the parameter p valid for that piece.
Examples:

means = [[0.1, 0.5], [0.5, 0.1], [0.5, 0.5]]  # 3 pieces, 2 arms.

def constant_duration_gen(delta):
  while True:
    yield delta

env_piecewise_10_steps = PiecewiseBernoulliPyEnvironment(
    means, constant_duration_gen(10))

def random_duration_gen(a, b):
  while True:
    yield random.randint(a, b)

env_rnd_piecewise_10_to_20_steps = PiecewiseBernoulliPyEnvironment(
    means, random_duration_gen(10, 20))
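The snippet below extends the first example into a minimal end-to-end interaction sketch. It is a hedged illustration, not part of the original docstring: it assumes the environment is stepped with an array of integer arm indices of length batch_size and reads shape/dtype details off the specs rather than hard-coding them.

import numpy as np
from tf_agents.bandits.environments import piecewise_bernoulli_py_environment

means = [[0.1, 0.5], [0.5, 0.1], [0.5, 0.5]]  # 3 pieces, 2 arms.

def constant_duration_gen(delta):
  while True:
    yield delta

env = piecewise_bernoulli_py_environment.PiecewiseBernoulliPyEnvironment(
    means, constant_duration_gen(10))

time_step = env.reset()  # first TimeStep: step_type FIRST, reward 0.0
# Assumed action format: one integer arm index per batch entry.
action = np.zeros(env.batch_size, dtype=env.action_spec().dtype)  # always pull arm 0
time_step = env.step(action)
print(time_step.reward)  # 0 or 1, drawn with p = 0.1 during the first 10-step piece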
For a reference on bandits see, e.g., Example 1.1 in "A Tutorial on Thompson Sampling" by Russo et al. (https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf). A paper using piecewise stationary environments is Qingyun Wu, Naveen Iyer, Hongning Wang, "Learning Contextual Bandits in a Non-stationary Environment," Proceedings of the 2017 ACM Conference on Information and Knowledge Management (https://arxiv.org/pdf/1805.09365.pdf).
Methods
action_spec
action_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the actions that should be provided to step().

May use a subclass of ArraySpec that specifies additional properties such as min and max bounds on the values.

Returns
  An ArraySpec, or a nested dict, list or tuple of ArraySpecs.
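As a hedged illustration (using the env instance from the examples above), the spec can be inspected at runtime rather than assuming its exact shape or bounds:

spec = env.action_spec()
print(spec)  # expected: a bounded integer spec over arm indices
if hasattr(spec, 'minimum'):
  # BoundedArraySpec-style specs expose the valid action range.
  print(spec.minimum, spec.maximum)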
close
close() -> None
Frees any resources used by the environment.
Implement this method for an environment backed by an external process.
This method can be used directly:
env = Env(...)
# Use env.
env.close()
or via a context manager
with Env(...) as env:
  # Use env.
current_time_step
current_time_step() -> tf_agents.trajectories.TimeStep
Returns the current timestep.
discount_spec
discount_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the discounts that are returned by step().

Override this method to define an environment that uses non-standard discount values, for example an environment with array-valued discounts.

Returns
  An ArraySpec, or a nested dict, list or tuple of ArraySpecs.
get_info
get_info() -> tf_agents.typing.types.NestedArray
Returns the environment info returned on the last step.
Returns
  Info returned by the last call to step(). None by default.

Raises
  NotImplementedError: If the environment does not use info.
get_state
get_state() -> Any
Returns the state of the environment.

The state contains everything required to restore the environment to the current configuration. This can contain, e.g.:
- The current time_step.
- The number of steps taken in the environment (for finite horizon MDPs).
- Hidden state (for POMDPs).

Callers should not assume anything about the contents or format of the returned state. It should be treated as a token that can be passed back to set_state() later.

Note that the returned state handle should not be modified by the environment later on, and ensuring this (e.g. using copy.deepcopy) is the responsibility of the environment.

Returns
  state: The current state of the environment.
observation_spec
observation_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the observations provided by the environment.
May use a subclass of ArraySpec that specifies additional properties such as min and max bounds on the values.

Returns
  An ArraySpec, or a nested dict, list or tuple of ArraySpecs.
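Since this bandit is non-contextual, the observation carries no decision-relevant context; its exact shape and dtype are best read off the spec. A hedged inspection sketch using the env instance from the examples above:

obs_spec = env.observation_spec()
print(obs_spec)                          # shape/dtype of the (placeholder) observation
print(env.time_step_spec().observation)  # the same spec, nested inside the TimeStep spec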
render
render(
mode: Text = 'rgb_array'
) -> Optional[types.NestedArray]
Renders the environment.
Args
  mode: One of ['rgb_array', 'human']. Renders to a numpy array, or brings up a window where the environment can be visualized.

Returns
  An ndarray of shape [width, height, 3] denoting an RGB image if mode is 'rgb_array'. Otherwise returns nothing and renders directly to a display window.

Raises
  NotImplementedError: If the environment does not support rendering.
reset
reset() -> tf_agents.trajectories.TimeStep
Starts a new sequence and returns the first TimeStep of this sequence.

Returns
  A TimeStep namedtuple containing:
    step_type: A StepType of FIRST.
    reward: 0.0, indicating the reward.
    discount: 1.0, indicating the discount.
    observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec().
reward_spec
reward_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the rewards that are returned by step().

Override this method to define an environment that uses non-standard reward values, for example an environment with array-valued rewards.

Returns
  An ArraySpec, or a nested dict, list or tuple of ArraySpecs.
seed
seed(
seed: tf_agents.typing.types.Seed
) -> Any
Seeds the environment.
Args
  seed: Value to use as seed for the environment.
set_state
set_state(
state: Any
) -> None
Restores the environment to a given state.

See the definition of state in the documentation for get_state().

Args
  state: A state to restore the environment to.
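A hedged snapshot-and-restore sketch with the env instance from the examples above, treating the returned state purely as an opaque token as the get_state() documentation requires:

snapshot = env.get_state()  # opaque token for the current configuration
# ... interact with the environment here ...
env.set_state(snapshot)     # restore the saved configuration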
should_reset
should_reset(
current_time_step: tf_agents.trajectories.TimeStep
) -> bool
Whether the Environment should reset given the current timestep.

By default it only resets when all time_steps are LAST.

Args
  current_time_step: The current TimeStep.

Returns
  A bool indicating whether the Environment should reset or not.
step
step(
action: tf_agents.typing.types.NestedArray
) -> tf_agents.trajectories.TimeStep
Updates the environment according to the action and returns a TimeStep.

If the environment returned a TimeStep with StepType.LAST at the previous step, the implementation of _step in the environment should call reset to start a new sequence and ignore action.

This method will start a new sequence if called after the environment has been constructed and reset has not been called. In this case action will be ignored.

If should_reset(current_time_step) is True, then this method will reset by itself. In this case action will be ignored.

Args
  action: A NumPy array, or a nested dict, list or tuple of arrays corresponding to action_spec().

Returns
  A TimeStep namedtuple containing:
    step_type: A StepType value.
    reward: A NumPy array, reward value for this timestep.
    discount: A NumPy array, discount in the range [0, 1].
    observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec().
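A hedged sketch of an interaction loop that honors the reset semantics above (env as constructed earlier, numpy imported as np, and the same assumed action format as in the earlier sketch):

time_step = env.reset()
for _ in range(20):
  if time_step.is_last().all():
    # After a LAST step the next call to step() would reset and ignore the
    # action, so reset explicitly to keep the loop easy to follow.
    time_step = env.reset()
  action = np.zeros(env.batch_size, dtype=env.action_spec().dtype)  # always pull arm 0
  time_step = env.step(action)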
time_step_spec
time_step_spec() -> tf_agents.trajectories.TimeStep
Describes the TimeStep fields returned by step().

Override this method to define an environment that uses non-standard values for any of the items returned by step(), for example an environment with array-valued rewards.

Returns
  A TimeStep namedtuple containing (possibly nested) ArraySpecs defining the step_type, reward, discount, and observation structure.
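A hedged sketch printing each spec field of the returned TimeStep (env as above); the field names follow the namedtuple structure listed in the Returns description:

spec = env.time_step_spec()
print(spec.step_type)    # StepType spec
print(spec.reward)       # reward spec
print(spec.discount)     # discount spec
print(spec.observation)  # observation spec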
__enter__
__enter__()
Allows the environment to be used in a with-statement context.
__exit__
__exit__(
unused_exception_type, unused_exc_value, unused_traceback
)
Allows the environment to be used in a with-statement context.