Ensemble Model based TV-RL
Motivation
In non-stationary environments, RL approaches need to adapt to new plant parameters. The context can be inferred from past transitions. If the system changes at discrete points in time, Bayesian context length inference can be used to compute a probability distribution over the context length. Computing this distribution requires a model. MBRL also uses a model, in that case to generate new data. The first question is whether Dyna-style MBRL can be combined with non-stationary RL. The second question is whether ideas similar to MACURA can be used to stop model rollouts not only in different areas of the state space but also for different model beliefs.
Details
Problem setup
Standard RL setup (e.g. MuJoCo HalfCheetah), except that the environment (or, later, the reward function) can change at any point in time, including during training.
Environment ideas:
- HalfCheetah with changing ground
- HalfCheetah with changing target velocity and direction
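The changing-target idea can be sketched without any simulator. The class below is a hypothetical illustration, not an existing environment API: the name `PiecewiseStationaryTarget`, the candidate target velocities, and the constant per-step change probability `hazard` are all assumptions. In practice the same logic would sit inside a wrapper around the HalfCheetah environment and replace its reward.

```python
import random


class PiecewiseStationaryTarget:
    """Hypothetical context schedule: the target velocity stays fixed within a
    context and is resampled at random change points (constant hazard rate)."""

    def __init__(self, targets=(-2.0, -1.0, 1.0, 2.0), hazard=0.01, seed=0):
        self.targets = targets
        self.hazard = hazard            # assumed per-step probability of a change
        self.rng = random.Random(seed)
        self.target = self.rng.choice(targets)
        self.context_length = 0         # steps since the last change

    def step(self):
        """Advance one environment step and return the active target velocity."""
        if self.rng.random() < self.hazard:
            # Pick a different target so every change point is a real change.
            others = [t for t in self.targets if t != self.target]
            self.target = self.rng.choice(others)
            self.context_length = 0
        else:
            self.context_length += 1
        return self.target

    def reward(self, forward_velocity, control_cost=0.0):
        """Negative tracking error; the sign of the target encodes the direction."""
        return -abs(forward_velocity - self.target) - control_cost


schedule = PiecewiseStationaryTarget(hazard=0.05, seed=1)
trace = [schedule.step() for _ in range(200)]
num_changes = sum(a != b for a, b in zip(trace, trace[1:]))
```

Because the change points are drawn with a fixed seed, the same context sequence can be replayed for every method in a comparison.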
Main idea
- Combine Dyna-style MBRL with non-stationary environments that have a piecewise-stationary context.
- First, generate a context vector from past transitions to condition the model, value network, and policy.
- Use the model (the one used in Dyna-style MBRL) to compute the context length belief.
- Keep track of a distribution over possible context vectors. Sample from it to adapt the policy to unseen plant parameters.
- Use MACURA-style ideas to stop rollouts for context beliefs that the model is uncertain about.
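One way to realize the context length belief is Bayesian online changepoint detection in the style of Adams and MacKay, which maintains a posterior over the number of steps since the last change. The sketch below is a minimal illustration under strong assumptions: a scalar observation with known noise variance and a conjugate Gaussian prior stands in for the learned dynamics model, and the constant `hazard` rate as well as all function and parameter names are hypothetical. In the intended setup, the predictive probability would come from the MBRL model conditioned on the transitions of each candidate context length.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def context_length_beliefs(observations, hazard=0.05, prior_mean=0.0,
                           prior_var=4.0, noise_var=0.25):
    """Return, for every time step, the belief over the current context length."""
    belief = [1.0]                                   # before any data: length 0
    means, variances = [prior_mean], [prior_var]     # one posterior per length
    history = []
    for x in observations:
        # Predictive probability of x under each context-length hypothesis.
        pred = [gaussian_pdf(x, m, v + noise_var) for m, v in zip(means, variances)]
        # Either the context continues (length grows by one) or it changes (length 0).
        grow = [b * p * (1.0 - hazard) for b, p in zip(belief, pred)]
        change = sum(b * p * hazard for b, p in zip(belief, pred))
        belief = [change] + grow
        total = sum(belief)
        belief = [b / total for b in belief]
        # Conjugate update of the context parameter for every grown hypothesis.
        new_means, new_vars = [prior_mean], [prior_var]
        for m, v in zip(means, variances):
            post_var = 1.0 / (1.0 / v + 1.0 / noise_var)
            new_means.append(post_var * (m / v + x / noise_var))
            new_vars.append(post_var)
        means, variances = new_means, new_vars
        history.append(belief)
    return history


# Toy data: the hidden parameter jumps from 0 to 3 after 20 steps.
observations = [0.0] * 20 + [3.0] * 10
beliefs = context_length_beliefs(observations)
final = beliefs[-1]
most_likely_length = max(range(len(final)), key=final.__getitem__)
```

On this toy sequence the belief after the last observation peaks at a context length of 10, the number of steps since the jump. The per-length posteriors kept in `means` and `variances` play the role of the distribution over context vectors that the policy could be conditioned on.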
Open questions
- When sampled context encodings are used to generate training data, e.g. for SAC, how can we make sure the encoder produces the same encoding when the corresponding transitions are encountered in the real world?
- Which context encoding is best? With or without the Markov assumption?
  - CNP-like
  - Graph-based
  - RNN
  - e.g. with autoencoding like in chenAdaptiveDeepRL2022
Related work
- Continuous Meta-Learning without Tasks
- An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context
- Trust the Model Where It Trusts Itself – Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption
Project Outline
- Implement MBRL
- Implement non-stationary environments
- Combine MBRL with Bayesian context length belief
  - Sample already visited contexts
  - Find a way to sample new contexts
- Combine with MACURA
- Evaluation
  - Compare impact of context extrapolation
  - Compare impact of stopping if uncertain
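The "stop if uncertain" variant in the evaluation can be sketched as a model rollout that is truncated once the ensemble members disagree too much. This is a simplified stand-in and not MACURA's actual criterion: the uncertainty measure here is the plain variance of the ensemble predictions compared against a fixed `threshold`, and the toy ensemble, the policy, and all names are hypothetical. A context vector could be appended to the state so that the same check also covers uncertain context beliefs.

```python
from statistics import pvariance


def ensemble_disagreement(predictions):
    """Mean per-dimension variance across the ensemble's next-state predictions."""
    dims = len(predictions[0])
    return sum(pvariance([p[d] for p in predictions]) for d in range(dims)) / dims


def rollout(ensemble, policy, state, max_steps, threshold):
    """Unroll the ensemble mean and stop once the disagreement exceeds the threshold."""
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        predictions = [member(state, action) for member in ensemble]
        if ensemble_disagreement(predictions) > threshold:
            break  # the model does not trust itself here, so discard this step
        next_state = [sum(p[d] for p in predictions) / len(predictions)
                      for d in range(len(state))]
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


def make_member(bias):
    # Toy 1-D dynamics: members agree near the origin and diverge further away.
    return lambda state, action: [state[0] + action + bias * state[0] ** 2]


ensemble = [make_member(b) for b in (-0.1, 0.0, 0.1)]
policy = lambda state: 1.0

short = rollout(ensemble, policy, [0.0], max_steps=10, threshold=0.05)
full = rollout(ensemble, policy, [0.0], max_steps=10, threshold=float("inf"))
```

Comparing agents trained on `short`-style and `full`-style rollouts isolates the effect of uncertainty-based stopping from the rest of the method.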