• About
  • Advertise
  • Privacy & Policy
  • Contact
Friday, January 2, 2026
  • Login
  • Home
    • Home – Layout 1
    • Home – Layout 2
    • Home – Layout 3
    • Home – Layout 4
    • Home – Layout 5
    • Home – Layout 6
  • News
    • All
    • Business
    • Politics
    • Science
    • World
    Hillary Clinton in white pantsuit for Trump inauguration

    Hillary Clinton in white pantsuit for Trump inauguration

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    Trending Tags

    • Trump Inauguration
    • United Stated
    • White House
    • Market Stories
    • Election Results
  • Tech
    • All
    • Apps
    • Gadget
    • Mobile
    • Startup
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Trending Tags

    • Nintendo Switch
    • CES 2017
    • Playstation 4 Pro
    • Mark Zuckerberg
  • Entertainment
    • All
    • Gaming
    • Movie
    • Music
    • Sports
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    So you want to be a startup investor? Here are things you should know

    So you want to be a startup investor? Here are things you should know

  • Lifestyle
    • All
    • Fashion
    • Food
    • Health
    • Travel
    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    How couples can solve lighting disagreements for good

    How couples can solve lighting disagreements for good

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Trending Tags

    • Golden Globes
    • Game of Thrones
    • MotoGP 2017
    • eSports
    • Fashion Week
  • Review
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    Intel Core i7-7700K ‘Kaby Lake’ review

    Intel Core i7-7700K ‘Kaby Lake’ review

No Result
View All Result
Ai News
Advertisement
  • Home
    • Home – Layout 1
    • Home – Layout 2
    • Home – Layout 3
    • Home – Layout 4
    • Home – Layout 5
    • Home – Layout 6
  • News
    • All
    • Business
    • Politics
    • Science
    • World
    Hillary Clinton in white pantsuit for Trump inauguration

    Hillary Clinton in white pantsuit for Trump inauguration

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    Trending Tags

    • Trump Inauguration
    • United Stated
    • White House
    • Market Stories
    • Election Results
  • Tech
    • All
    • Apps
    • Gadget
    • Mobile
    • Startup
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Trending Tags

    • Nintendo Switch
    • CES 2017
    • Playstation 4 Pro
    • Mark Zuckerberg
  • Entertainment
    • All
    • Gaming
    • Movie
    • Music
    • Sports
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    So you want to be a startup investor? Here are things you should know

    So you want to be a startup investor? Here are things you should know

  • Lifestyle
    • All
    • Fashion
    • Food
    • Health
    • Travel
    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    How couples can solve lighting disagreements for good

    How couples can solve lighting disagreements for good

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Trending Tags

    • Golden Globes
    • Game of Thrones
    • MotoGP 2017
    • eSports
    • Fashion Week
  • Review
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    Intel Core i7-7700K ‘Kaby Lake’ review

    Intel Core i7-7700K ‘Kaby Lake’ review

No Result
View All Result
Ai News
No Result
View All Result
Home Machine Learning

Building a Regression Model to Predict Delivery Durations: A Practical Guide | by Jimin Kang | Dec, 2024

AiNEWS2025 by AiNEWS2025
2025-01-27
in Machine Learning
0
Building a Regression Model to Predict Delivery Durations: A Practical Guide | by Jimin Kang | Dec, 2024
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter


Data Preparation & Exploratory Analysis

Now that we’ve outlined our approach, let’s take a look at our data and what kind of features we’re working with.

From the above, we see our data contains ~197,000 deliveries, with a variety of numeric & non-numeric features. None of the features are missing a large percentage of values (lowest non-null count ~181,000), so we likely won’t have to worry about dropping any features entirely.

Let’s check if our data contains any duplicated deliveries, and if there are any observations that we cannot compute delivery time for.

print(f"Number of duplicates: {df.duplicated().sum()} \n")

print(pd.DataFrame({'Missing Count': df[['created_at', 'actual_delivery_time']].isna().sum()}))

We see that all the deliveries are unique. However, there are 7 deliveries that are missing a value for actual_delivery_time, which means we won’t be able to compute the delivery duration for these orders. Since there’s only a handful of these, we’ll remove these observations from our data.

Now, let’s create our prediction target. We want to predict the delivery duration (in seconds), which is the elapsed time between when the customer placed the order (‘created_at’) and when they recieved the order (‘actual_delivery_time’).

# convert columns to datetime 
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

# create prediction target
df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()

The last thing we’ll do before splitting our data into train/test is check for missing values. We already viewed the non-null counts for each feature above, but let’s view the proportions to get a better picture.

We see that the market features (‘onshift_dashers’, ‘busy_dashers’, ‘outstanding_orders’) have the highest percentage of missing values (~8% missing). The feature with the second-highest missing data rate is ‘store_primary_category’ (~2%). All other features have < 1% missing.

Since none of the features have a high missing count, we won’t remove any of them. Later on, we will look at the feature distributions to help us decide how to appropriately deal with missing observations for each feature.

But first, let’s split our data into train/test. We will proceed with an 80/20 split, and we’ll write this test data to a separate file which we won’t touch until evaluating our final model.

from sklearn.model_selection import train_test_split
import os

# shuffle
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

# split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# write test data to separate file
directory = 'datasets'
file_name = 'test_data.csv'
file_path = os.path.join(directory, file_name)
os.makedirs(directory, exist_ok=True)
test_df.to_csv(file_path, index=False)

Now, let’s dive into the specifics of our train data. We’ll establish our numeric & categorical features, to make it clear which columns are being referenced in later exploratory steps.

categorical_feats = [
'market_id',
'store_id',
'store_primary_category',
'order_protocol'
]

numeric_feats = [
'total_items',
'subtotal',
'num_distinct_items',
'min_item_price',
'max_item_price',
'total_onshift_dashers',
'total_busy_dashers',
'total_outstanding_orders',
'estimated_order_place_duration',
'estimated_store_to_consumer_driving_duration'
]

Let’s revisit the categorical features with missing values (‘market_id’, ‘store_primary_category’, ‘order_protocol’). Since there was little missing data among those features (< 3%), we will simply impute those missing values with an “unknown” category.

  • This way, we won’t have to remove data from other features.
  • Perhaps the absence of feature values holds some predictive power for delivery duration i.e. these features are not missing at random.
  • Additionally, we will add this imputation step to our preprocessing pipeline during modeling, so that we won’t have to manually duplicate this work on our test set.
missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")

Let’s look at our categorical features.

pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')

Since ‘market_id’ & ‘order_protocol’ have low cardinality, we can visualize their distributions easily. On the other hand, ‘store_id’ & ‘store_primary_category’ are high cardinality features. We’ll take a deeper look at those later.

import seaborn as sns
import matplotlib.pyplot as plt

categorical_feats_subset = [
'market_id',
'order_protocol'
]

# Set up the grid
fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)

# Create barplots for each variable
for i, col in enumerate(categorical_feats_subset):
sns.countplot(x=col, data=train_df, ax=axes[i])
axes[i].set_title(f"Frequencies: {col}")

# Adjust layout
plt.tight_layout()
plt.show()

Some key things to note:

  • ~70% of orders placed have ‘market_id’ of 1, 2, 4
  • < 1% of orders have ‘order_protocol’ of 6 or 7

Unfortunately, we don’t have any additional information about these variables, such as which ‘market_id’ values are associated with which cities/locations, and what each ‘order_protocol’ number represents. At this point, asking for additional data concerning this information may be a good idea, as it may help for investigating trends in delivery duration across broader region/location categorizations.

Let’s look at our higher cardinality categorical features. Perhaps each ‘store_primary_category’ has an associated ‘store_id’ range? If so, we may not need ‘store_id’, as ‘store_primary_category’ would already encapsulate a lot of the information about the store being ordered from.

store_info = train_df[['store_id', 'store_primary_category']]

store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])

Clearly not the case: we see that ‘store_id’ ranges overlap across levels of ‘store_primary_category’.

A quick look at the distinct values and associated frequencies for ‘store_id’ & ‘store_primary_category’ shows that these features have high cardinality and are sparsely distributed. In general, high cardinality categorical features may be problematic in regression tasks, particularly for regression algorithms that require solely numeric data. When these high cardinality features are encoded, they may enlarge the feature space drastically, making the available data sparse and decreasing the model’s ability to generalize to new observations in that feature space. For a better & more professional explanation of the phenomena, you can read more about it here.

Let’s get a sense of how sparsely distributed these features are.

store_id_values = train_df['store_id'].value_counts()

# Plot the histogram
plt.figure(figsize=(8, 5))
plt.bar(store_id_values.index, store_id_values.values, color='skyblue')

# Add titles and labels
plt.title('Value Counts: store_id', fontsize=14)
plt.xlabel('store_id', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

We see that there are a handful of stores that have hundreds of orders, but the majority of them have much less than 100.

To handle the high cardinality of ‘store_id’, we’ll create another feature, ‘store_id_freq’, that groups the ‘store_id’ values by frequency.

  • We’ll group the ‘store_id’ values into five different percentile bins shown below.
  • ‘store_id_freq’ will have much lower cardinality than ‘store_id’, but will retain relevant information regarding the popularity of the store the delivery was ordered from.
  • For more inspiration behind this logic, check out this thread.
def encode_frequency(freq, percentiles) -> str:
if freq < percentiles[0]:
return '[0-50)'
elif freq < percentiles[1]:
return '[50-75)'
elif freq < percentiles[2]:
return '[75-90)'
elif freq < percentiles[3]:
return '[90-99)'
else:
return '99+'

value_counts = train_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99])

# apply encode_frequency to each store_id based on their number of orders
train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

pd.DataFrame({'Count':train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')

Our encoding shows us that ~60,000 deliveries were ordered from stores catgorized in the 90–99th percentile in terms of popularity, whereas ~12,000 deliveries were ordered from stores that were in the 0–50th percentile in popularity.

Now that we’ve (attempted) to capture relevant ‘store_id’ information in a lower dimension, let’s try to do something similar with ‘store_primary_category’.

Let’s look at the most popular ‘store_primary_category’ levels.

A quick look shows us that many of these ‘store_primary_category’ levels are not exclusive to each other (ex: ‘american’ & ‘burger’). Further investigation shows many more examples of this kind of overlap.

So, let’s try to map these distinct store categories into a few basic, all-encompassing groups.

store_category_map = {
'american': ['american', 'burger', 'sandwich', 'barbeque'],
'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
'mexican': ['mexican'],
'italian': ['italian', 'pizza'],
}

def map_to_category_type(category: str) -> str:
for category_type, categories in store_category_map.items():
if category in categories:
return category_type
return "other"

train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))

value_counts = train_df['store_category_type'].value_counts()

# Plot pie chart
plt.figure(figsize=(6, 6))
value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
plt.title('Category Distribution')
plt.ylabel('') # Hide y-axis label for aesthetics
plt.show()

This grouping is probably brutally simple, and there may very well be a better way to group these store categories. Let’s proceed with it for now for simplicity.

We’ve done a good deal of investigation into our categorical features. Let’s look at the distributions for our numeric features.

# Create grid for boxplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15)) # Adjust figure size
axes = axes.flatten() # Flatten the 5x2 axes into a 1D array for easier iteration

# Generate boxplots for each numeric feature
for i, column in enumerate(numeric_feats):
sns.boxplot(y=train_df[column], ax=axes[i])
axes[i].set_title(f"Boxplot for {column}")
axes[i].set_ylabel(column)

# Remove any unused subplots (if any)
for i in range(len(numeric_feats), len(axes)):
fig.delaxes(axes[i])

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

Boxplots for a subset of our numeric features

Many of the distributions appear to be more right skewed then they are due to the presence of outliers.

In particular, there seems to be an order with 400+ items. This seems strange as the next largest order is less than 100 items.

Let’s look more into that 400+ item order.

train_df[train_df['total_items']==train_df['total_items'].max()]

Source link

#Building #Regression #Model #Predict #Delivery #Durations #Practical #Guide #Jimin #Kang #Dec

Previous Post

FCC chair helps ISPs and landlords make deals that renters can’t escape

Next Post

Useful quantum computing is inevitable—and increasingly imminent

AiNEWS2025

AiNEWS2025

Next Post
Useful quantum computing is inevitable—and increasingly imminent

Useful quantum computing is inevitable—and increasingly imminent

Stay Connected test

  • 23.9k Followers
  • 99 Subscribers
  • Trending
  • Comments
  • Latest
A tiny new open source AI model performs as well as powerful big ones

A tiny new open source AI model performs as well as powerful big ones

0
Water Cooler Small Talk: The Birthday Paradox 🎂🎉 | by Maria Mouschoutzi, PhD | Sep, 2024

Water Cooler Small Talk: The Birthday Paradox 🎂🎉 | by Maria Mouschoutzi, PhD | Sep, 2024

0
Ghost of Yōtei: The acclaimed Ghost of Tsushima is getting a sequel

Ghost of Yōtei: The acclaimed Ghost of Tsushima is getting a sequel

0
Best Headphones for Working Out (2024): Bose, Shokz, JLab

Best Headphones for Working Out (2024): Bose, Shokz, JLab

0
Deep Reinforcement Learning: The Actor-Critic Method

Deep Reinforcement Learning: The Actor-Critic Method

2026-01-01
“Streaming stops feeling infinite”: What subscribers can expect in 2026

“Streaming stops feeling infinite”: What subscribers can expect in 2026

2026-01-01
Net neutrality was back, until it wasn’t

Net neutrality was back, until it wasn’t

2026-01-01
Man Operating Robot Accidentally Makes It Kick Him Directly in the Nutsack

Man Operating Robot Accidentally Makes It Kick Him Directly in the Nutsack

2026-01-01

Recent News

Deep Reinforcement Learning: The Actor-Critic Method

Deep Reinforcement Learning: The Actor-Critic Method

2026-01-01
“Streaming stops feeling infinite”: What subscribers can expect in 2026

“Streaming stops feeling infinite”: What subscribers can expect in 2026

2026-01-01
Net neutrality was back, until it wasn’t

Net neutrality was back, until it wasn’t

2026-01-01
Man Operating Robot Accidentally Makes It Kick Him Directly in the Nutsack

Man Operating Robot Accidentally Makes It Kick Him Directly in the Nutsack

2026-01-01
Footer logo

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow Us

Browse by Category

  • AI & Cloud Computing
  • AI & Cybersecurity
  • AI & Sentiment Analysis
  • AI Applications
  • AI Ethics
  • AI Future Predictions
  • AI in Education
  • AI in Fintech
  • AI in Gaming
  • AI in Healthcare
  • AI in Startups
  • AI Innovations
  • AI News
  • AI Research
  • AI Tools & Automation
  • Apps
  • AR/VR & AI
  • Business
  • Deep Learning
  • Emerging Technologies
  • Entertainment
  • Fashion
  • Food
  • Gadget
  • Gaming
  • Health
  • Lifestyle
  • Machine Learning
  • Mobile
  • Movie
  • Music
  • News
  • Politics
  • Review
  • Robotics & Smart Systems
  • Science
  • Sports
  • Startup
  • Tech
  • Travel
  • World

Recent News

Deep Reinforcement Learning: The Actor-Critic Method

Deep Reinforcement Learning: The Actor-Critic Method

2026-01-01
“Streaming stops feeling infinite”: What subscribers can expect in 2026

“Streaming stops feeling infinite”: What subscribers can expect in 2026

2026-01-01
  • About
  • Advertise
  • Privacy & Policy
  • Contact

© 2026 JNews - Premium WordPress news & magazine theme by Jegtheme.

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In
No Result
View All Result

© 2026 JNews - Premium WordPress news & magazine theme by Jegtheme.