
Journey to Full-Stack Data Scientist: Model Deployment | by Alex Davis | Jan, 2025


First, for our example, we need to develop a model. Since this article focuses on model deployment, we will not worry about the performance of the model. Instead, we will build a simple model with limited features to focus on learning model deployment.

In this example, we will predict a data professional’s salary based on a few features, such as experience, job title, company size, etc.

See data here: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries (CC0: Public Domain). I slightly modified the data to reduce the number of options for certain features.

#import packages for data manipulation
import pandas as pd
import numpy as np

#import packages for machine learning
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score

#import package for saving and loading the model
import joblib

First, let’s take a look at the data.

Image by Author
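The code in the rest of this article expects the data to already sit in a pandas DataFrame named salary_data. Here is a minimal sketch of one way to load it. The file name in the usage comment is hypothetical, and the column list is an assumption inferred from the columns referenced later in the article (the modified dataset may differ).

```python
import pandas as pd

# Columns referenced later in the walkthrough (assumed to be the full set)
EXPECTED_COLUMNS = ['experience_level', 'employment_type', 'job_title',
                    'company_size', 'salary_in_usd']

def load_salary_data(path):
    """Read the salary CSV and keep only the columns the model uses."""
    data = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return data[EXPECTED_COLUMNS]

# Hypothetical usage; the file name is an assumption, not the author's path
# salary_data = load_salary_data('ds_salaries_modified.csv')
# print(salary_data.head())
```

Checking for the expected columns up front means a renamed or missing column fails here, not halfway through the encoding step.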

Since all of our features are categorical, we will use encoding to transform our data to numerical. Below, we use ordinal encoders to encode experience level and company size. These are ordinal because they represent some kind of progression. Note that OrdinalEncoder numbers the categories from zero in the order we list them (0 = entry level, 1 = mid-level, etc.).

For job title and employment type, we will create dummy variables for each option (note we drop the first option of each to avoid perfect multicollinearity).

#use ordinal encoder to encode experience level
encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])

#use ordinal encoder to encode company size
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
salary_data['company_size_encoded'] = encoder.fit_transform(salary_data[['company_size']])

#encode employment type and job title using dummy columns
salary_data = pd.get_dummies(salary_data, columns = ['employment_type', 'job_title'], drop_first = True, dtype = int)

#drop original columns
salary_data = salary_data.drop(columns = ['experience_level', 'company_size'])
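To make the two encodings concrete, here is a small sketch that runs the same steps on a toy DataFrame. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    'experience_level': ['EN', 'SE', 'MI', 'EX'],
    'employment_type': ['FT', 'PT', 'FT', 'CT'],
})

# OrdinalEncoder assigns 0, 1, 2, ... following the order given in `categories`
encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
toy['experience_level_encoded'] = encoder.fit_transform(
    toy[['experience_level']]).ravel()

# drop_first=True drops the first category in sorted order ('CT' here),
# which becomes the baseline that the other dummy columns are compared to
toy = pd.get_dummies(toy, columns=['employment_type'],
                     drop_first=True, dtype=int)

print(toy)
```

In this toy frame, a row with zeros in every employment_type dummy column is a 'CT' row, which is why the dropped column carries no extra information.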

Now that we have transformed our model inputs, we can create our training and test sets. We will input these features into a simple linear regression model to predict the employee’s salary.

#define independent and dependent features
X = salary_data.drop(columns = 'salary_in_usd')
y = salary_data['salary_in_usd']

#split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 104, test_size = 0.2, shuffle = True)

#fit linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

#make predictions
y_pred = regr.predict(X_test)

#print the coefficients
print("Coefficients: \n", regr.coef_)

#print the MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

#print the R2 value (r2_score returns the plain R2, not the adjusted one)
print("R2: %.2f" % r2_score(y_test, y_pred))
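Note that r2_score returns the ordinary R-squared. If an adjusted value is wanted, it can be computed from the ordinary one with the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of features. A minimal sketch follows; the helper name adjusted_r2 is hypothetical and not part of scikit-learn.

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("need more observations than n_features + 1")
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Hypothetical usage with the variables from the walkthrough:
# print("Adjusted R2: %.2f" % adjusted_r2(y_test, y_pred, X_test.shape[1]))
```

With many dummy columns and a small test set, the adjusted value can sit noticeably below the plain R-squared.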

Let’s see how our model did.

Image by Author

Looks like our R-squared is 0.27, yikes. That means the model explains only about 27% of the variance in salary on the test set. A lot more work would need to be done with this model. We would likely need more data and additional information on the observations. But for the sake of this article, we will move forward and save our model.

#save model using joblib
joblib.dump(regr, 'lin_regress.sav')
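Saving is only half of the round trip: whatever serves the model later has to load the file back. Here is a minimal sketch of that step using joblib.load. To keep it runnable on its own, it trains a stand-in model on synthetic data and writes to a temporary directory; both are assumptions for illustration, not the author's setup.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import linear_model

# Stand-in model trained on synthetic data so the example runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0
demo_model = linear_model.LinearRegression().fit(X_demo, y_demo)

# Save the model, then load it back as a separate object
model_path = os.path.join(tempfile.mkdtemp(), 'lin_regress.sav')
joblib.dump(demo_model, model_path)
loaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same = np.allclose(demo_model.predict(X_demo), loaded_model.predict(X_demo))
print("Predictions match after reload:", same)
```

Two caveats apply when loading: joblib files are pickle-based, so only load files from a source you trust, and the loading environment should use a scikit-learn version compatible with the one that saved the model.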
