The current state is represented by a tuple (alpha, beta), where alpha is the current on-hand inventory (items in stock) and beta is the current on-order inventory (items ordered but not yet received). The variable init_inv is the total initial inventory, obtained by summing alpha and beta.

Next, we simulate customer demand by drawing from a Poisson distribution with rate parameter `self.poisson_lambda`. This draw captures the randomness of customer demand:

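```python
alpha, beta = state        # on-hand and on-order inventory
init_inv = alpha + beta    # total inventory available this period
demand = np.random.poisson(self.poisson_lambda)  # random customer demand
```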

**Note**: A Poisson distribution is used here to model demand; it is a common choice for counts of random events such as customer arrivals. Alternatively, the agent can be trained on historical demand data or through live interaction with the environment in real time. At its core, reinforcement learning is about learning from data, and a model-free method such as Q-learning does not require prior knowledge of a model of the environment.
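To make the two demand sources in the note concrete, here is a minimal sketch that draws demand either from a Poisson distribution or by resampling past observations. The function names, the `historical_demand` list, and the lambda value of 4 are illustrative assumptions and do not come from the original code.

```python
import numpy as np

def sample_poisson_demand(rng, poisson_lambda):
    """Draw one period's demand from a Poisson distribution."""
    return int(rng.poisson(poisson_lambda))

def sample_historical_demand(rng, historical_demand):
    """Draw one period's demand by resampling past observations."""
    return int(rng.choice(historical_demand))

rng = np.random.default_rng(seed=0)  # seeded generator for reproducibility

# Assumed values, for illustration only.
poisson_lambda = 4
historical_demand = [3, 5, 2, 4, 6, 4, 3]

poisson_draws = [sample_poisson_demand(rng, poisson_lambda) for _ in range(10_000)]
historical_draws = [sample_historical_demand(rng, historical_demand) for _ in range(10_000)]

print(np.mean(poisson_draws))     # close to poisson_lambda
print(np.mean(historical_draws))  # close to the mean of the historical sample
```

Either sampler could stand in for the demand draw inside the simulation; the rest of the transition logic would not need to change.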

Now, the “next alpha”, which is the on-hand inventory after demand has been served, can be written as max(0, init_inv - demand). If demand exceeds the initial inventory, the new alpha is zero; otherwise it is init_inv - demand.

The **cost** comes in two parts. The **holding cost** is calculated by multiplying the number of bikes left in the store by the per-unit holding cost. The **stockout cost** is what we pay for missed demand: the number of units demanded beyond the available inventory, multiplied by the per-unit stockout cost. Both parts enter the “reward” with a negative sign, so maximizing the reward with a reinforcement learning method is the same as minimizing the total cost.

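```python
new_alpha = max(0, init_inv - demand)

# Costs are negative so that they sum into a reward to maximize.
holding_cost = -new_alpha * self.holding_cost
stockout_cost = 0
if demand > init_inv:
    stockout_cost = -(demand - init_inv) * self.stockout_cost

reward = holding_cost + stockout_cost
# The order placed now (action) becomes the next on-order inventory.
next_state = (new_alpha, action)
```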
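The transition and reward logic described above can be collected into one standalone function and checked by hand. This is a minimal sketch, not the article's class method: demand is passed in as an argument so the arithmetic is deterministic, and the cost values (1.0 per unit held, 10.0 per unit short) are assumed for illustration.

```python
def simulate_transition_and_reward(state, action, demand,
                                   holding_cost=1.0, stockout_cost=10.0):
    """One inventory step for a given demand; returns (next_state, reward)."""
    alpha, beta = state                    # on-hand, on-order
    init_inv = alpha + beta                # stock available to meet demand
    new_alpha = max(0, init_inv - demand)  # leftover stock, never negative

    # Costs are expressed as negative rewards (assumed per-unit values).
    holding = -new_alpha * holding_cost
    stockout = -max(0, demand - init_inv) * stockout_cost

    # The order placed now becomes the next on-order inventory.
    next_state = (new_alpha, action)
    return next_state, holding + stockout

# Demand below available stock: 2 units left over, holding cost only.
surplus_case = simulate_transition_and_reward((3, 2), action=4, demand=3)
# Demand above available stock: 2 units of missed demand, stockout cost only.
shortage_case = simulate_transition_and_reward((3, 2), action=4, demand=7)

print(surplus_case)
print(shortage_case)
```

With 5 units available, a demand of 3 leaves 2 units and costs 2 × 1.0 in holding, while a demand of 7 empties the store and costs 2 × 10.0 in missed sales.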

## Exploration — Exploitation in Q-Learning

Choosing an action in Q-learning involves some degree of exploration, so that the agent gathers Q-value estimates across the states and actions in the Q table. To do that, each time an action is chosen there is a probability ϵ that we explore and select an action at random, and a probability 1-ϵ that we exploit and take the best action according to the current Q table.

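```python
def choose_action(self, state):
    # Epsilon-greedy action selection
    if np.random.rand() < self.epsilon:
        # Explore: random order quantity that fits in the remaining capacity
        return np.random.choice(self.user_capacity - (state[0] + state[1]) + 1)
    else:
        # Exploit: action with the highest Q-value for this state
        return max(self.Q[state], key=self.Q[state].get)
```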
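The epsilon-greedy rule can be tried in isolation with a toy Q table. The sketch below is a standalone variant of the method above, assuming a nested dictionary `Q[state][action]` and a seeded NumPy generator; the Q-values and the capacity of 5 are made up for illustration.

```python
import numpy as np

def choose_action(Q, state, epsilon, capacity, rng):
    """Epsilon-greedy selection over order quantities that fit within capacity."""
    alpha, beta = state
    n_actions = capacity - (alpha + beta) + 1  # feasible orders: 0 .. remaining room
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))    # explore: uniform random feasible order
    return max(Q[state], key=Q[state].get)     # exploit: highest Q-value

# Toy Q table for a single state; all values are illustrative.
capacity = 5
state = (2, 1)
Q = {state: {0: -8.0, 1: -3.0, 2: -5.0}}
rng = np.random.default_rng(0)

# epsilon = 0 always exploits, epsilon = 1 always explores.
greedy = choose_action(Q, state, epsilon=0.0, capacity=capacity, rng=rng)
explored = {choose_action(Q, state, epsilon=1.0, capacity=capacity, rng=rng)
            for _ in range(200)}

print(greedy)    # the action with the largest Q-value
print(explored)  # the set of actions seen under pure exploration
```

Setting ϵ to 0 or 1 is only useful for checking the two branches; in training, a value in between trades off trying new order quantities against using what the table already suggests.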

## Training RL Agent

The RL agent is trained by the “train” function, which works as follows. First, we initialize Q (a dictionary structure). Then, experiences are collected in a batch during each episode (self.batch.append((state, action, reward, next_state))), and the Q table is updated at the end of each batch (self.update_Q(self.batch)). The number of actions in each episode, and therefore the size of each batch, is limited to “max_actions_per_episode”. The number of episodes is the number of separate runs in which the agent interacts with the environment to learn the optimal policy.

Each episode starts from a randomly assigned state, and data collection for that batch continues while the number of actions taken is lower than max_actions_per_episode.
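The article does not show `initialize_Q`, so the following is only one plausible sketch of the dictionary structure, under the assumption that every feasible state (alpha + beta no larger than the capacity) maps to a dictionary of feasible order quantities, all starting at 0.0. The standalone signature with a `capacity` argument is hypothetical.

```python
def initialize_Q(capacity):
    """Build a nested dict Q[state][action] = 0.0 over feasible states and orders."""
    Q = {}
    for alpha in range(capacity + 1):              # on-hand inventory
        for beta in range(capacity + 1 - alpha):   # on-order inventory
            remaining = capacity - (alpha + beta)  # room left in the store
            Q[(alpha, beta)] = {a: 0.0 for a in range(remaining + 1)}
    return Q

Q = initialize_Q(capacity=3)
print(len(Q))     # number of feasible (alpha, beta) states
print(Q[(1, 1)])  # order quantities available with one unit of room left
```

Restricting actions to the remaining room matches the range used in the exploration branch of `choose_action`, and it keeps every next state (new_alpha, action) inside the table.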

```python
def train(self):
    self.Q = self.initialize_Q()  # Reinitialize Q-table for each training run

    for episode in range(self.episodes):
        # Start each episode from a random feasible state
        alpha_0 = random.randint(0, self.user_capacity)
        beta_0 = random.randint(0, self.user_capacity - alpha_0)
        state = (alpha_0, beta_0)
        #total_reward = 0
        self.batch = []  # Reset the batch at the start of each episode
        action_taken = 0

        while action_taken < self.max_actions_per_episode:
            action = self.choose_action(state)
            next_state, reward = self.simulate_transition_and_reward(state, action)
            self.batch.append((state, action, reward, next_state))  # Collect experience
            state = next_state
            action_taken += 1

        self.update_Q(self.batch)  # Update Q-table using the batch
```
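The body of `update_Q` is not shown here. As a rough sketch of what a batch update could look like, the function below applies the standard Q-learning rule, Q(s, a) ← Q(s, a) + lr · (r + γ · max Q(s', ·) − Q(s, a)), to each stored transition. The standalone signature and the `learning_rate` and `discount_factor` values are assumptions for illustration, not the author's settings.

```python
def update_Q(Q, batch, learning_rate=0.1, discount_factor=0.9):
    """Apply the standard Q-learning update to every transition in a batch."""
    for state, action, reward, next_state in batch:
        best_next = max(Q[next_state].values())  # value of acting greedily next
        td_target = reward + discount_factor * best_next
        Q[state][action] += learning_rate * (td_target - Q[state][action])
    return Q

# Toy table with two states; all numbers are illustrative.
Q = {
    (2, 1): {0: 0.0, 1: 0.0},
    (1, 1): {0: -4.0, 1: -2.0},
}
batch = [((2, 1), 1, -3.0, (1, 1))]  # (state, action, reward, next_state)
update_Q(Q, batch)

print(Q[(2, 1)][1])  # approximately -0.48
```

In this toy case the target is -3.0 + 0.9 × (-2.0) = -4.8, and the entry moves one tenth of the way from 0.0 toward it.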
