Welcome to the start of a major data journey that I’m calling “EDA in Public.” For those who know me, I believe the best way to learn anything is to tackle a real-world problem and share the entire messy process — including mistakes, victories, and everything in between. If you’ve been looking to level up your Pandas and data analysis skills, this is the series for you.
We’re going to act as data analysts for a fictional, mid-sized e-commerce company I’ll call NovaShop. They handed us a raw, messy sales CSV and asked a simple question: “How are we doing?”
The goal of Part 1 is foundational: we’ll clean this challenging e-commerce dataset, explore its basic structure, and master the core EDA skills in Pandas that every data scientist uses on a daily basis. This series is structured to take you from a beginner (Part 1) to an advanced data analyst (Part 3), so feel free to jump in wherever you’re at.
Before we jump into the code, let’s nail down our core motivation. For NovaShop, we need to answer some simple, yet powerful questions: Which products drive the most revenue? Which countries generate the most sales? Let’s find out.
Dataset Overview: Unpacking the Sales Data
To start our analysis for NovaShop, we’ll be using the UCI Online Retail Dataset. This is a realistic, unpolished dataset that captures all transactions for a UK-based, non-store online retail company between late 2010 and late 2011.
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
The dataset contains over half a million rows, and it’s full of the kinds of anomalies you encounter in the real world — missing values, negative numbers, and inconsistent text formatting. This is exactly what we want!
Here are the eight key columns we’ll be working with, and what they tell us from a business perspective:
- InvoiceNo: This is the Invoice number. A unique 6-digit number assigned to each transaction. If the code starts with ‘C’, it indicates a cancellation (a return).
- StockCode: Product code. A unique 5-digit code assigned to each distinct product.
- Description: The name of the item. This often needs cleaning (extra spaces, inconsistent cases).
- Quantity: The number of units purchased per transaction. It can be negative for returns.
- InvoiceDate: The date and time of the transaction. This is essential for time-series analysis later on.
- UnitPrice: Product price per unit in Sterling (GBP). It can sometimes be 0 or negative due to errors or free items.
- CustomerID: A unique 5-digit number assigned to each registered customer. Crucially, this is often missing, meaning we have many transactions from guests.
- Country: Name of the country where the customer resides. This would be great for segmenting international sales.
Let's get our hands on the data. First, we'll import the dataset into Pandas and see how many rows we're dealing with.
import pandas as pd
import numpy as np
df = pd.read_csv('Online Retail.csv')
df.shape

Output:

(541909, 8)

That's quite a lot of rows. I might have to slice the rows a bit.
Data Loading and Slicing: Dealing with Volume
For Part 1, we're going to take a 10% random sample of the full dataset. This gives us a much more manageable size of around 54,000 rows while preserving the characteristics of the original data. Sampling like this is a common, practical technique when working with very large datasets or when prototyping.
# Data Loading and Slicing
FILE_PATH = 'Online Retail.csv'
SAMPLE_FRACTION = 0.1  # We will sample 10% of the data
full_df = pd.read_csv(FILE_PATH, encoding='unicode_escape')
# Take a random 10% sample for faster processing in Part 1
df = full_df.sample(frac=SAMPLE_FRACTION, random_state=42).reset_index(drop=True)

Now, let's check the shape again with df.shape.

Output:

(54191, 8)

Perfect! Now we can begin.
Initial Data Inspection: Getting Our Hands Dirty
My first step will be to discover what we're cleaning. I need to answer these questions:
- What does the dataset contain?
- What are the data types and non-null counts?
- What do the numbers look like?
Visual Check: df.head() and df.tail()
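For reference, the visual check is just two quick calls; here's a minimal sketch:

# Inspect the first and last few rows of the sample
print(df.head())
print(df.tail())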
Looking at the first and last few rows, I could confirm a few things:
- Cancellations & Returns: I noticed some InvoiceNo values starting with 'C', and the corresponding Quantity is negative. This confirms returns are included, and for revenue analysis I'll need to separate or exclude them.
- Missingness: I could visually see NaN values in the CustomerID column, confirming the mass of guest transactions.
- Inconsistent Text: I checked the Description column for stray whitespace. From the handful of rows I eyeballed I couldn't really tell, but it doesn't hurt to address it during data cleaning, and I might also keep the capitalisation consistent. It's always best practice to strip leading/trailing whitespace from all string columns to prevent subtle errors when grouping (a quick programmatic check is sketched after this list).
- Country Names: I also noticed some capitalisation inconsistencies in the Country column. I spotted a country named EIRE; it presumably means Ireland, so I might need to change that.
What Are the Data Types and Non-Null Counts? (.info())
The next essential step is to check the structure using the .info() method. This tells me the data type of each column and, most importantly, how many non-null (not missing) values we have.
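For completeness, the call itself is a one-liner; I'm omitting the printed table here since the findings below summarise what it showed on my sample:

# Column data types and non-null counts
df.info()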
Key Findings
- Missing Values: After taking our 10% sample, I could still see a massive gap in CustomerID: roughly 25% of the customer IDs are missing. I also noticed a few hundred missing values in Description that I'll need to address.
- Data Types: The InvoiceDate column is still listed as an object (string) type, so I'll have to convert it to a proper Pandas datetime object. CustomerID is also currently a float, likely because it contains those NaN values! This is a little detail I'll need to remember if I ever want to use it as a true integer ID (see the sketch after this list).
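If I ever need CustomerID as a true integer while keeping the missing values, pandas' nullable integer dtype is one way to do it. A minimal sketch, not something I'll apply in Part 1 since we don't group by customer yet:

# Convert CustomerID to pandas' nullable integer dtype so NaN can coexist with integer IDs
df['CustomerID'] = df['CustomerID'].astype('Int64')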
What Do the Numbers Look Like? (.describe())
Next, I used .describe() to get a quick statistical summary of the numerical columns, chiefly Quantity and UnitPrice.
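The call itself is a single line; the two findings that matter are summarised below:

# Statistical summary of the numeric columns
print(df.describe())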
- Quantity (Quantity): The minimum quantity is -2472. This confirms that returns exist, and the scale of that minimum suggests an extreme outlier transaction. For basic sales analysis, I'll need to filter out these negative values and potentially the extreme positive and negative outliers.
- Unit Price (UnitPrice): The minimum price is 0. This means some products were given away for free or are placeholder entries. Since a product should have a positive price in a normal sale, filtering out any rows where UnitPrice is zero or less is best practice for calculating revenue accurately.
Based on these quick inspections, this data is far from clean. We have massive missingness, incorrect data types, highly problematic negative and zero values in our core numerical columns, and text inconsistencies to tackle.
Handling Missing Values
Now that we know where our data is missing, we can begin discussing strategies for handling null values. I’m mostly concerned with Description and CustomerID.
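Before choosing a strategy, it's worth quantifying the gaps. A per-column count of missing values is enough for that:

# Count missing values in each column
print(df.isnull().sum())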
Missing Descriptions
The Description tells us what product was sold, so losing it renders the transaction meaningless. In our sample, less than 1% of the rows have descriptions missing. Dropping those rows completely makes sense, as they are useless for analysis at the product level.
# Drop rows where the Description is missing
df.dropna(subset=['Description'], inplace=True)
# checking for null counts
df['Description'].isnull().sum()

Output:

np.int64(0)

Alright, perfect! The missing descriptions are gone.
Missing Customer IDs
The missing CustomerIDs (around 25% of our sample) are a much bigger deal. If I drop all of them, I’ll lose almost a quarter of our sales data, which will give NovaShop a dangerously skewed view of its total revenue.
For simple revenue and product analysis (the goal of Part 1), I don’t really need the CustomerID. We can proceed with the analysis on all rows even without an ID. We only need the ID if we plan on doing customer segmentation (like RFM analysis), which I’ll save for Part 2!
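If I wanted to keep that door open without dropping anything, one lightweight option is to flag guest transactions for later. This is purely illustrative; the IsGuest column is a helper I'm inventing here, not part of the dataset:

# Flag transactions with no registered customer (hypothetical helper column)
df['IsGuest'] = df['CustomerID'].isnull()
print(f"Guest transactions: {df['IsGuest'].mean():.1%} of the sample")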
Data Type Conversion and Duplicate Removal
Now that we’ve handled the missing descriptions, the next two immediate issues are dealing with fully duplicated rows and fixing our essential InvoiceDate column.
Fixing the Invoice Date Type
Remember how .info() showed InvoiceDate as a string (object)? We need to fix this immediately so Pandas knows how to sort and aggregate our data chronologically.
# Convert InvoiceDate from object (string) to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
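A quick way to confirm the conversion worked is to check the new dtype and the date range, which should span late 2010 to late 2011. A minimal sketch:

# Sanity check: dtype should now be datetime64 and the range should match the dataset's time span
print(df['InvoiceDate'].dtype)
print(df['InvoiceDate'].min(), "->", df['InvoiceDate'].max())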
Removing Duplicates

If two rows are identical across all columns, they are duplicates and will artificially inflate our sales numbers. Let's check for them and remove them.
# Remove Duplicates
num_duplicates = df.duplicated().sum()
print(f"Found {num_duplicates} fully duplicated rows.")

Output:

Found 62 fully duplicated rows.

Let's drop them.
# Remove duplicates, keeping the first instance
df.drop_duplicates(inplace=True)
print(f"Dataframe size after removing duplicates: {len(df)} rows.")

Output:

Dataframe size after removing duplicates: 53980 rows.

Filtering Out Returns and Errors
Our .describe() method showed us some serious red flags: negative quantities (returns) and zero/negative unit prices (errors/free items). For Part 1, we want to calculate net revenue from sales, so we must filter this noise out.
Handling Negative Quantities and Prices
We will keep only the transactions where:
- Quantity is strictly greater than zero (filtering out all returns and cancellations).
- UnitPrice is strictly greater than zero (filtering out errors and free items).
# Filter: Keep only transactions where Quantity is positive (i.e., sales, not returns)
df = df[df['Quantity'] > 0]
# Filter: Keep only transactions where UnitPrice is positive (i.e., not free or an error)
df = df[df['UnitPrice'] > 0]
print(f"Dataframe size after filtering: {len(df)} rows.")

Output:

Dataframe size after filtering: 52933 rows.

Cleaning Text Columns
Finally, let’s tackle those text standardization issues we spotted visually, like the EIRE country code and any potential hidden whitespace in Description.
- Strip Whitespace: We use .str.strip() to remove leading/trailing spaces from both Description and Country.
- Standardize Country: We manually map inconsistencies like 'EIRE' to 'Ireland'.
# Cleaning Text Columns
# Clean Description and Country columns
df['Description'] = df['Description'].str.strip()
df['Country'] = df['Country'].str.strip()
# Handle specific country name inconsistencies
# EIRE is a common inconsistency in this dataset for Ireland
# Assign the result back rather than using inplace=True on a single column, which is deprecated
df['Country'] = df['Country'].replace('EIRE', 'Ireland')
print("Text columns cleaned and standardized.")

Feature Engineering & First Insight
The data is now clean. Now we can finally start asking the fun questions. The single most important metric for any sales dataset is revenue. Since our original data only has Quantity and UnitPrice, we need to engineer the Revenue column ourselves.
Feature Engineering: Creating the Revenue Column
Revenue for a transaction is simply the number of items sold multiplied by the price of each item.
df['Revenue'] = df['Quantity'] * df['UnitPrice']
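A quick sanity check on the new column never hurts. Here's a minimal sketch that just eyeballs a few rows and the overall total:

# Spot-check the engineered column and the total revenue in the sample
print(df[['Quantity', 'UnitPrice', 'Revenue']].head())
print(f"Total sample revenue: £{df['Revenue'].sum():,.2f}")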
First Insight: Which Countries Drive Revenue?

Let's use our clean data to answer NovaShop's question: "Which countries are driving our sales?"
We’ll use a powerful three-step combination:
- Group by the Country column.
- Aggregate (sum) the Revenue within each group.
- Sort the results from highest revenue to lowest.
# Group by Country, sum the Revenue, and sort for the top 10
top_countries = df.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10)
print("\n--- Top 10 Countries by Revenue (GBP) ---")
print(top_countries)

Output:
--- Top 10 Countries by Revenue (GBP) ---
Country
United Kingdom 941268.661
Netherlands 27435.830
Ireland 26066.000
France 23645.330
Germany 22389.510
Australia 12429.990
Spain 5600.900
Switzerland 5483.890
Hong Kong 3597.850
Belgium 3593.510
Name: Revenue, dtype: float64

The output shows a massive dominance by the United Kingdom. This is expected, since the company is UK-based. The next several countries give us a quick roadmap of where NovaShop's international focus should be.
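To put a number on that dominance, a quick follow-up (assuming the top_countries Series from above is still in scope) computes the UK's share of total sample revenue:

# What fraction of the sample's revenue comes from the UK?
uk_share = top_countries['United Kingdom'] / df['Revenue'].sum()
print(f"UK share of revenue: {uk_share:.1%}")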
Conclusion
We did it. We took a raw, half-a-million-row dataset, sliced it for manageability, battled missing values, fixed data types, filtered out errors, and calculated our first key business metric. The hard, foundational work is done.
Here’s a quick recap of what we conquered in Part 1:
- Data Vetting: We used .head(), .info(), and .describe() to identify crucial issues like the negative prices/quantities, missing Customer IDs, and the incorrect datetime format.
- Data Cleaning: We systematically removed nulls and duplicates, converted InvoiceDate to a proper datetime object, and filtered out non-sale transactions (returns and free items).
- Feature Engineering: We created the critical Revenue column.
- First Insight: We generated the Top 10 Countries by Revenue for NovaShop, giving them their first actionable data point from the clean set.
The clean dataset is now primed for more sophisticated analysis. In Part 2, we’re going to dive deep into Product Analysis and Time-Series Decomposition. We’ll figure out which items are the true money-makers and analyse how sales volume changes hour-by-hour and month-by-month.
I'm excited to keep going! If you enjoyed this article, feel free to let me know; your feedback means a lot to me.