Why you should read this article
If you are planning to go into data science, whether as a graduate, a professional looking for a career change, or a manager in charge of establishing best practices, this article is for you.
Data science attracts a variety of different backgrounds. From my professional experience, I’ve worked with colleagues who were once:
- Nuclear physicists
- Post-docs researching gravitational waves
- PhDs in computational biology
- Linguists
just to name a few.
It is wonderful to meet such a diverse set of backgrounds, and I have seen this variety of minds lead to the growth of a creative and effective data science function.
However, I have also seen one big downside to this variety:
Everyone has had different levels of exposure to key Software Engineering concepts, resulting in a patchwork of coding skills.
As a result, I have seen work done by some data scientists that is brilliant, but is:
- Unreadable — you have no idea what they are trying to do.
- Flaky — it breaks the moment someone else tries to run it.
- Unmaintainable — code quickly becomes obsolete or breaks easily.
- Un-extensible — code is single-use and its behaviour cannot be extended
which ultimately dampens the impact their work can have and creates all sorts of issues down the line.
So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored to be necessities for data scientists.
They are simple concepts, but the difference between knowing them vs not knowing them clearly draws the line between amateur and professional.
Today’s concept: Abstract classes
Abstract classes are an extension of class inheritance, and they can be a very useful tool for data scientists if used correctly.
If you need a refresher on class inheritance, see my article on it here.
Like we did for class inheritance, I won't bother with a formal definition. Looking back to when I first started coding, I found it hard to decipher the vague and abstract (no pun intended) definitions out there on the Internet.
It’s much easier to illustrate it by going through a practical example.
So, let’s go straight into an example that a data scientist is likely to encounter to demonstrate how they are used, and why they are useful.
Example: Preparing data for ingestion into a feature generation pipeline
Let’s say we are a consultancy that specialises in fraud detection for financial institutions.
We work with a number of different clients, and we have a set of features that carry a consistent signal across different client projects because they embed domain knowledge gathered from subject matter experts.
So it makes sense to build these features for each project, even if they are dropped during feature selection or are replaced with bespoke features built for that client.
The challenge
We data scientists know that working across different projects, environments, and clients means that the input data is never the same:

- Clients may provide different file types: CSV, Parquet, JSON, tar, to name a few.
- Different environments may require different sets of credentials.
- And each dataset almost certainly has its own quirks, so each one requires different data cleaning steps.
Therefore, you may think that we would need to build a new feature generation pipeline for each and every client.
How else would you handle the intricacies of each dataset?
No, there is a better way
Given that we know we're going to be building the same set of useful features for each client, we can build one feature generation pipeline that can be reused across clients. The only new problem we then need to solve for each client is cleaning the input data.

Our problem can therefore be broken down into the following stages:
- The data cleaning pipeline: responsible for handling any cleaning and processing that is unique to a given client, in order to format the dataset into the standardised schema dictated by the feature generation pipeline.
- The feature generation pipeline: implements the feature engineering logic, assuming the input data follows a fixed schema, to output our useful set of features.
Given a fixed input data schema, building the feature generation pipeline is trivial.
Therefore, we have boiled down our problem to the following:
How do we ensure the quality of the data cleaning pipelines such that their outputs always adhere to the downstream requirements?
The real problem we are solving
Our problem of 'ensuring the outputs always adhere to downstream requirements' is not just about getting code to run. That's the easy part.
The hard part is designing code that is robust to a myriad of external, non-technical factors such as:
- Human error: people naturally forget small details or prior assumptions. They may build a data cleaning pipeline whilst overlooking certain requirements.
- Leavers: over time, your team inevitably changes. Your colleagues may have knowledge that they assumed was obvious and therefore never bothered to document. Once they have left, that knowledge is lost, and only through trial and error and hours of debugging will your team recover it.
- New joiners: meanwhile, new joiners have no knowledge of prior assumptions that were once considered obvious, so their code usually requires a lot of debugging and rewriting.
This is where abstract classes really shine.
Input data requirements
We mentioned that we can fix the schema for the feature generation pipeline input data, so let’s define this for our example.
Let's say that our pipeline expects to read in parquet files containing the following columns:

- `row_id`: int, a unique ID for every transaction.
- `timestamp`: str, in ISO 8601 format; the timestamp at which the transaction was made.
- `amount`: int, the transaction amount denominated in pennies (for our US readers, the equivalent will be cents).
- `direction`: str, the direction of the transaction, one of `['OUTBOUND', 'INBOUND']`.
- `account_holder_id`: str, a unique identifier for the entity that owns the account the transaction was made on.
- `account_id`: str, a unique identifier for the account the transaction was made on.

Let's also add in the requirement that the dataset must be ordered by `timestamp`.
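To make the specification concrete, here is a minimal example of a dataset that satisfies it. The values are made up purely for illustration:

```python
import polars as pl

# Two dummy rows that conform to the feature generation pipeline's input spec.
example = pl.DataFrame(
    {
        "row_id": [1, 2],                                             # int, unique per transaction
        "timestamp": ["2024-01-31T09:15:00", "2024-01-31T10:02:41"],  # str, ISO 8601, ordered
        "amount": [1250, 499],                                        # int, pennies / cents
        "direction": ["OUTBOUND", "INBOUND"],                         # one of 'OUTBOUND', 'INBOUND'
        "account_holder_id": ["AH-001", "AH-002"],                    # str
        "account_id": ["AC-001", "AC-002"],                           # str
    }
)
```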
The abstract class
Now, time to define our abstract class.
An abstract class is essentially a blueprint from which we can inherit to create child classes, otherwise known as 'concrete' classes.
Let’s spec out the different methods we may need for our data cleaning blueprint.
```python
import os
from abc import ABC, abstractmethod

import polars as pl


class BaseRawDataPipeline(ABC):
    def __init__(
        self,
        input_data_path: str | os.PathLike,
        output_data_path: str | os.PathLike,
    ):
        self.input_data_path = input_data_path
        self.output_data_path = output_data_path

    @abstractmethod
    def transform(self, raw_data):
        """Transform the raw data.

        Args:
            raw_data: The raw data to be transformed.
        """
        ...

    @abstractmethod
    def load(self):
        """Load in the raw data."""
        ...

    def save(self, transformed_data):
        """Save the transformed data."""
        ...

    def validate(self, transformed_data):
        """Validate the transformed data."""
        ...

    def run(self):
        """Run the data cleaning pipeline."""
        ...
```
You can see that we have imported the `ABC` class from the `abc` module, which allows us to create abstract classes in Python.
Pre-defined behaviour
Let’s now add some pre-defined behaviour to our abstract class.
Remember, this behaviour will be made available to all child classes that inherit from this class, so this is where we bake in the behaviour we want to enforce for all future projects.
For our example, the behaviour that needs to be fixed across all projects relates to how we output the processed dataset.
1. The `run` method

First, we define the `run` method. This is the method that will be called to run the data cleaning pipeline.
```python
def run(self):
    """Run the data cleaning pipeline."""
    raw_data = self.load()
    output = self.transform(raw_data)
    self.validate(output)
    self.save(output)
```
The run method acts as a single point of entry for all future child classes.
This standardises how any data cleaning pipeline will be run, which enables us to then build new functionality around any pipeline without worrying about the underlying implementation.
You can imagine how incorporating such pipelines into an orchestrator or scheduler will be easier if all pipelines are executed through the same `run` method, as opposed to having to handle many different names such as `run`, `execute`, `process`, `fit`, `transform`, etc.
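To make that concrete, here is a rough sketch of what a uniform entry point buys us. The two concrete pipeline classes and the file paths are hypothetical:

```python
# Hypothetical concrete pipelines for two different clients; both inherit
# from BaseRawDataPipeline and therefore expose the same `run` entry point.
pipelines: list[BaseRawDataPipeline] = [
    ClientARawDataPipeline("clients/a/raw.csv", "clients/a/clean.parquet"),
    ClientBRawDataPipeline("clients/b/raw.json", "clients/b/clean.parquet"),
]

# Orchestration code never needs to know what each pipeline does internally.
for pipeline in pipelines:
    pipeline.run()
```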
2. The `save` method
Next, we fix how we output the transformed data.
```python
def save(self, transformed_data: pl.LazyFrame):
    """Save the transformed data to parquet."""
    transformed_data.sink_parquet(
        self.output_data_path,
    )
```
We’re assuming we will use `polars` for data manipulation, and the output is saved as `parquet` files as per our specification for the feature generation pipeline.
3. The `validate` method

Finally, we populate the `validate` method, which will check that the dataset adheres to our expected output format before it is saved.
```python
@property
def output_schema(self):
    return dict(
        row_id=pl.Int64,
        timestamp=pl.Datetime,
        amount=pl.Int64,
        direction=pl.Categorical,
        account_holder_id=pl.Categorical,
        account_id=pl.Categorical,
    )

def validate(self, transformed_data):
    """Validate the transformed data."""
    schema = transformed_data.collect_schema()
    assert self.output_schema == schema, (
        f"Expected {self.output_schema} but got {schema}"
    )
```
We've created a property called `output_schema`. This ensures that all child classes will have this available, whilst preventing it from being accidentally removed or overridden if it was, for example, defined in `__init__`.
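As a quick aside, this is the kind of accident the property guards against. The minimal concrete pipeline below is purely for illustration:

```python
class MinimalPipeline(BaseRawDataPipeline):
    """A minimal concrete pipeline, defined only to illustrate the property."""

    def load(self):
        return pl.scan_csv(self.input_data_path)

    def transform(self, raw_data):
        return raw_data


pipeline = MinimalPipeline("raw.csv", "clean.parquet")

pipeline.output_schema        # fine: returns the schema dict from the base class
pipeline.output_schema = {}   # AttributeError: the property has no setter, so it
                              # cannot be silently overwritten on an instance
```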
Project-specific behaviour
In our example, the `load` and `transform` methods are where project-specific behaviour will be held, so we leave them blank in the base class. The implementation is deferred to the future data scientist in charge of writing this logic for the project.
You will also notice that we have used the `abstractmethod` decorator on the `transform` and `load` methods. This decorator ensures that these methods must be implemented by any child class; if a user forgets to define them, a `TypeError` will be raised as soon as the class is instantiated.
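As a quick illustration of that enforcement, here is a hypothetical child class that forgets to implement `transform`:

```python
class IncompletePipeline(BaseRawDataPipeline):
    # `load` is implemented, but `transform` is not.
    def load(self):
        return pl.scan_csv(self.input_data_path)


# Fails at instantiation time, before any data is touched:
# TypeError: Can't instantiate abstract class IncompletePipeline ...
# (the exact wording varies between Python versions)
pipeline = IncompletePipeline("raw.csv", "clean.parquet")
```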
Let's now move on to an example project where we can define the `transform` and `load` methods.
Example project
The client in this project sends us their dataset as CSV files with the following structure:
- `event_id`: str
- `unix_timestamp`: int
- `user_uuid`: int
- `wallet_uuid`: int
- `payment_value`: float
- `country`: str
We learn from them that:

- Each transaction is uniquely identified by the combination of `event_id` and `unix_timestamp`.
- The `wallet_uuid` is the equivalent identifier for the 'account'.
- The `user_uuid` is the equivalent identifier for the 'account holder'.
- The `payment_value` is the transaction amount, denominated in Pound Sterling (or Dollars).
- The CSV file is separated by `|` and has no header.
The concrete class
Now, we implement the `load` and `transform` methods in a child class of `BaseRawDataPipeline` to handle the unique complexities outlined above.
Remember, these two methods are all that need to be written by the data scientists working on this project. All the other methods are already pre-defined in the base class, so they need not worry about them, which reduces the amount of work your team needs to do.
1. Loading the data

The `load` method is quite simple:
```python
class Project1RawDataPipeline(BaseRawDataPipeline):

    def load(self):
        """Load in the raw data.

        Note:
            As per the client's specification, the CSV file is separated
            by `|` and has no header.
        """
        return pl.scan_csv(
            self.input_data_path,
            separator="|",
            has_header=False,
        )
```
We use polars' `scan_csv` function to stream the data lazily, with the appropriate arguments to handle the CSV file structure for our client.
2. Transforming the data
The `transform` method is also simple for this project, since we don't have any complex joins or aggregations to perform, so we can fit it all into a single method.
```python
class Project1RawDataPipeline(BaseRawDataPipeline):

    ...

    def transform(self, raw_data: pl.LazyFrame):
        """Transform the raw data.

        Args:
            raw_data (pl.LazyFrame):
                The raw data to be transformed. Must contain the following columns:
                    - 'event_id'
                    - 'unix_timestamp'
                    - 'user_uuid'
                    - 'wallet_uuid'
                    - 'payment_value'

        Returns:
            pl.LazyFrame:
                The transformed data.

        Operations:
            1. row_id is constructed by concatenating event_id and unix_timestamp.
            2. account_id and account_holder_id are renamed from wallet_uuid and
               user_uuid respectively.
            3. amount is converted from payment_value. The source data is
               denominated in £/$, so we need to convert to p/cents.
        """
        # select only the columns we need
        DESIRED_COLUMNS = [
            "event_id",
            "unix_timestamp",
            "user_uuid",
            "wallet_uuid",
            "payment_value",
        ]
        df = raw_data.select(DESIRED_COLUMNS)

        df = df.select(
            # concatenate event_id and unix_timestamp
            # to get a unique identifier for each row.
            pl.concat_str(
                [
                    pl.col("event_id"),
                    pl.col("unix_timestamp"),
                ],
                separator="-",
            ).alias("row_id"),

            # convert unix timestamp to ISO format string
            pl.from_epoch("unix_timestamp", time_unit="s")
            .dt.to_string("iso")
            .alias("timestamp"),

            # per the client's spec, wallet_uuid identifies the account and
            # user_uuid identifies the account holder
            pl.col("wallet_uuid").alias("account_id"),
            pl.col("user_uuid").alias("account_holder_id"),

            # convert from £ to p (or from $ to cents)
            (pl.col("payment_value") * 100).alias("amount"),
        )

        return df
```
Thus, by overriding these two methods, we've implemented all we need for our client project.

We know the output conforms to the requirements of the downstream feature engineering pipeline, so we automatically have assurance that our outputs are compatible.
No debugging required. No hassle. No fuss.
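Putting it all together, running the pipeline for this client is straightforward (the file paths below are purely illustrative):

```python
pipeline = Project1RawDataPipeline(
    input_data_path="data/project1/raw_transactions.csv",
    output_data_path="data/project1/clean_transactions.parquet",
)

# load -> transform -> validate -> save, in the order fixed by the base class
pipeline.run()
```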
Final summary: Why use abstract classes in data science pipelines?
Abstract classes offer a powerful way to bring consistency, robustness, and improved maintainability to data science projects. By using abstract classes as in our example, our data science team sees the following benefits:
1. No need to worry about compatibility
By defining a clear blueprint with abstract classes, the data scientist only needs to focus on implementing the `load` and `transform` methods specific to their client's data.
As long as these methods conform to the expected input/output types, compatibility with the downstream feature generation pipeline is guaranteed.
This separation of concerns simplifies the development process, reduces bugs, and accelerates development for new projects.
2. Easier to document
The structured format naturally encourages in-line documentation through method docstrings.
This proximity of design decisions and implementation makes it easier to communicate assumptions, transformations, and nuances for each client’s dataset.
Well-documented code is easier to read, maintain, and hand over, reducing the knowledge loss caused by team changes or turnover.
3. Improved code readability and maintainability
With abstract classes enforcing a consistent interface, the resulting codebase avoids the pitfalls of unreadable, flaky, or unmaintainable scripts.
Each child class adheres to a standardised method structure (`load`, `transform`, `validate`, `save`, `run`), making the pipelines more predictable and easier to debug.
4. Robustness to human factors
Abstract classes help reduce risks from human error, leavers, and new joiners by embedding essential behaviours in the base class. This ensures that critical steps are never skipped, even if individual contributors are unaware of all downstream requirements.
5. Extensibility and reusability
By isolating client-specific logic in concrete classes while sharing common behaviors in the abstract base, it becomes straightforward to extend pipelines for new clients or projects. You can add new data cleaning steps or support new file formats without rewriting the entire pipeline.
In summary, abstract classes level up your data science codebase from ad-hoc scripts to scalable, maintainable, production-grade code. Whether you're a data scientist, a team lead, or a manager, adopting these software engineering principles will significantly boost the impact and longevity of your work.
Related articles:
If you enjoyed this article, then have a look at some of my other related articles.
- Inheritance: A software engineering concept data scientists must know to succeed (here)
- Encapsulation: A software engineering concept data scientists must know to succeed (here)
- The Data Science Tool You Need For Efficient ML-Ops (here)
- DSLP: The data science project management framework that transformed my team (here)
- How to stand out in your data scientist interview (here)
- An Interactive Visualisation For Your Graph Neural Network Explanations (here)
- The New Best Python Package for Visualising Network Graphs (here)