Reinforcement learning (RL) has recently gained popularity through its use in training large language models (LLMs). RL denotes a family of algorithms in which an agent learns to make decisions by interacting with an environment, with the objective of maximizing cumulative reward over time.
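To make the interaction loop concrete, here is a minimal sketch in Python. The environment is a hypothetical one-dimensional random walk invented for illustration (the class name, dynamics, and rewards are all assumptions, not a standard benchmark); the agent follows an untrained random policy.

import random

# A minimal agent-environment loop on a hypothetical random-walk
# environment: the agent moves left or right, and is rewarded for
# reaching the right edge. All details here are illustrative.
class RandomWalkEnv:
    def __init__(self, size=5):
        self.size = size
        self.state = size // 2   # start in the middle

    def step(self, action):
        # action: -1 (left) or +1 (right)
        self.state = max(0, min(self.size - 1, self.state + action))
        done = self.state in (0, self.size - 1)
        reward = 1.0 if self.state == self.size - 1 else 0.0
        return self.state, reward, done

env = RandomWalkEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])        # a random (untrained) policy
    state, reward, done = env.step(action)
    total_reward += reward
print("episode return:", total_reward)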
Each learning step the agent takes can update the value function, which estimates the expected cumulative reward the agent can achieve starting from a specific state (or state-action pair) while following a particular policy. The value function thus serves as a guide for evaluating how desirable different states or actions are under that policy.
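As a concrete illustration, the sketch below estimates the value function by Monte Carlo: the value of a state is approximated by the average discounted return observed over many episodes that start there. The three-state chain, its transition probabilities, and its rewards are illustrative assumptions, not taken from the post.

import random

# Hypothetical chain with states A, B and terminal state T.
# transitions[s] lists (next_state, reward, probability) triples
# induced by a fixed policy; the numbers are illustrative.
transitions = {
    "A": [("B", 0.0, 0.8), ("A", 0.0, 0.2)],
    "B": [("T", 1.0, 0.8), ("A", 0.0, 0.2)],
}
gamma = 0.9

def run_episode(start):
    """Return the discounted return observed from `start` under the policy."""
    state, ret, discount = start, 0.0, 1.0
    while state != "T":
        r, cum = random.random(), 0.0
        for nxt, reward, prob in transitions[state]:
            cum += prob
            if r <= cum:
                ret += discount * reward
                discount *= gamma
                state = nxt
                break
    return ret

# Monte Carlo value estimate: average the returns over many episodes.
for s in ("A", "B"):
    estimate = sum(run_episode(s) for _ in range(10_000)) / 10_000
    print(f"V({s}) is approximately {estimate:.3f}")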
Conceptually, an RL algorithm consists of two steps, policy evaluation and policy improvement, which run iteratively until the value function reaches its best attainable level. In this post we limit our attention to the concept of normalization within the policy evaluation framework.
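The following sketch shows the two steps alternating on a tiny, fully known MDP: iterative policy evaluation sweeps the Bellman expectation update until the values stabilize, then policy improvement acts greedily with respect to those values. The two-state, two-action MDP (transition table P and reward table R) is an assumed toy example.

import numpy as np

# Hypothetical 2-state, 2-action MDP. P[s][a] lists (next_state, probability)
# pairs; R[s][a] is the expected immediate reward. Numbers are illustrative.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 0.7), (0, 0.3)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
gamma, n_states, n_actions = 0.9, 2, 2

policy = np.zeros(n_states, dtype=int)   # start with an arbitrary policy
V = np.zeros(n_states)

stable = False
while not stable:
    # Policy evaluation: sweep the Bellman expectation update to convergence.
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            v_new = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < 1e-8:
            break
    # Policy improvement: act greedily with respect to the current values.
    stable = True
    for s in range(n_states):
        q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
             for a in range(n_actions)]
        best = int(np.argmax(q))
        if best != policy[s]:
            policy[s], stable = best, False

print("greedy policy:", policy, "values:", V)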
Policy evaluation is closely tied to the concept of state. A state represents the current situation of the environment that the agent observes and uses to decide on its next action. It is typically described by a set of variables whose values characterize the present condition of the environment.
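In code, such a set of variables is often bundled into a fixed structure that the agent turns into an observation vector. The cart-pole-style fields below are an assumed example of what those variables might be.

from dataclasses import dataclass, astuple

@dataclass
class CartPoleState:
    # Variables whose values together characterize the environment's condition.
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_angular_velocity: float

state = CartPoleState(0.0, 0.1, 0.02, -0.3)
observation = astuple(state)   # the vector the agent conditions its action on
print(observation)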