
PySpark Lasso Regression – Building, Tuning, and Evaluating Lasso Regression with PySpark MLlib

Written by Jagdeesh | 4 min read

Let’s explore how to build, tune, and evaluate a Lasso Regression model using PySpark MLlib, a powerful library for machine learning and data processing in Apache Spark.

Lasso regression is linear regression with an L1 penalty on the coefficients. The penalty can shrink some coefficients to exactly zero, which helps to identify the most important features in a dataset and allows for more effective model building.

In this blog post, we’ll be discussing how to build and evaluate Lasso Regression models using PySpark MLlib, with a focus on hyperparameter tuning.

We will cover the following topics in this post:

  1. Setting up the environment

  2. Loading and preprocessing the data

  3. Creating a Lasso Regression model

  4. Hyperparameter tuning

  5. Evaluating the model

  6. Example code

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

python
import findspark
findspark.init()

from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder \
    .appName("Lasso Regression with PySpark MLlib") \
    .getOrCreate()

2. Load the dataset

For this example, we will use the “Boston Housing” dataset. The following code downloads the CSV file from GitHub with addFile and then loads it into a PySpark DataFrame.

python
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("BostonHousing.csv"), header=True, inferSchema=True)
df.show(5)
python
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 5 rows

3. Prepare the data

Before building the model, we need to assemble the input features into a single feature vector using the VectorAssembler class. Then, we will split the dataset into a training set (80%) and a testing set (20%).

python
# Define the feature and label columns & Assemble the feature vector
assembler = VectorAssembler(
    inputCols=["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat"],
    outputCol="features")

data = assembler.transform(df)
final_data = data.select("features", "medv")

# Split the data into training and test sets
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)
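
Note that randomSplit treats the weights as sampling probabilities, so the 80/20 proportions are approximate rather than exact. The sketch below imitates that idea in plain Python (no Spark needed). The helper name is hypothetical, and the row count of 506 is the commonly cited size of the Boston Housing dataset, assumed here rather than taken from the code above. It illustrates per-row sampling only and is not Spark’s actual implementation.

```python
import random

def bernoulli_split(n_rows, train_fraction, seed):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in range(n_rows):
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

# 506 rows is an assumption; check df.count() on your own copy of the data.
train_rows, test_rows = bernoulli_split(n_rows=506, train_fraction=0.8, seed=42)

# The sizes land near 80% / 20% but are not guaranteed to match exactly.
print(len(train_rows), len(test_rows))
```

In the PySpark code, train_data.count() and test_data.count() report the actual sizes of the two sets.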

4. Build the Lasso Regression Model

Now that we’ve prepared the data, we can build the Lasso Regression model using PySpark MLlib’s LinearRegression class. Two parameters control the regularization: regParam sets the overall strength of the penalty, and elasticNetParam sets the mix between L1 and L2.

Set elasticNetParam to 1, which makes the model equivalent to Lasso Regression. Because regParam defaults to 0 (no penalty), we leave it to the tuning step that follows.

Ridge Regression : elasticNetParam is set to 0, the model will use only L2 regularization (Ridge)

Lasso Regression : elasticNetParam is set to 1, the model will use only L1 regularization (Lasso)

ElasticNet : elasticNetParam is set strictly between 0 and 1 (for example 0.5), the model will use a combination of L1 and L2 regularization

python
lasso_regression = LinearRegression(featuresCol="features", labelCol="medv", elasticNetParam=1)
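
To make the elasticNetParam settings listed above concrete, here is a small plain-Python sketch of the elastic net penalty in the form given by the Spark MLlib documentation: regParam * (elasticNetParam * L1 norm + (1 - elasticNetParam) / 2 * squared L2 norm). The function name and the sample weights are made up for illustration; Spark computes this term internally as part of its loss.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty term added to the squared-error loss (illustrative, not Spark code)."""
    l1 = sum(abs(w) for w in weights)          # L1 norm
    l2_squared = sum(w * w for w in weights)   # squared L2 norm
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [3.0, -4.0, 0.0]  # hypothetical coefficients

lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)  # L1 only
ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)  # L2 only
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)  # both
print(lasso, ridge, mixed)
```

With regParam at 0 the penalty vanishes for every elasticNetParam, which is why the regularization strength still has to be chosen.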

5. Hyperparameter tuning

To find the optimal hyperparameters for our Lasso Regression model, we’ll perform a grid search using cross-validation. We’ll use the ParamGridBuilder and CrossValidator classes from PySpark MLlib.

python
# Define the hyperparameter grid
param_grid = ParamGridBuilder() \
    .addGrid(lasso_regression.regParam, [0.001, 0.01, 0.1, 1.0]) \
    .build()

# Create the cross-validator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="medv", metricName="rmse")
cross_validator = CrossValidator(estimator=lasso_regression,
                                 estimatorParamMaps=param_grid,
                                 evaluator=evaluator,
                                 numFolds=5)

# Run the grid search; CrossValidator refits the best setting on all of train_data
cv_model = cross_validator.fit(train_data)
lasso_model = cv_model.bestModel
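
As a rough picture of what CrossValidator does with this grid, the plain-Python sketch below builds the folds and counts the model fits. It assigns rows to folds round-robin for simplicity, whereas Spark assigns them randomly. The helper name and the row count of 400 are hypothetical.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train, validation) index lists; every row is validated exactly once."""
    for fold in range(num_folds):
        validation = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, validation

reg_params = [0.001, 0.01, 0.1, 1.0]  # same grid as in the tuning code
num_folds = 5

folds = list(k_fold_indices(n_rows=400, num_folds=num_folds))  # 400 is made up

# One fit per (regParam, fold) pair, plus a final refit of the winner on all rows.
fits_during_search = len(reg_params) * len(folds)
total_fits = fits_during_search + 1
print(fits_during_search, total_fits)  # 20 21
```

Because RMSE is an error metric, the setting with the lowest average validation RMSE wins. After fitting, cv_model.avgMetrics holds one average score per grid entry, in the same order as param_grid.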

6. Inspect the model coefficients and intercept

To better understand the Lasso regression model, you can examine its coefficients and intercept. These values represent the weights assigned to each feature and the bias term, respectively.

python
coefficients = lasso_model.coefficients
intercept = lasso_model.intercept

print("Coefficients: ", coefficients)
print("Intercept: {:.3f}".format(intercept))
python
Coefficients:  [-0.10986353851134442,0.04693115522519695,0.00923125788497033,2.8158935193006362,-17.689219861357042,3.5366213824650012,0.0036808784625555046,-1.3994140203542347,0.31100395792952057,-0.012550106878985027,-0.9431349440290867,0.008499950375369186,-0.5166808112793146]
Intercept: 37.871
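
Lasso can drive some coefficients to exactly zero, which removes those features from the model. A quick way to check is sketched below in plain Python, using the coefficient values copied from the output above. The helper name and the tolerance are illustrative choices.

```python
feature_names = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                 "dis", "rad", "tax", "ptratio", "b", "lstat"]

# Values copied from the printed coefficients above.
printed_coefficients = [
    -0.10986353851134442, 0.04693115522519695, 0.00923125788497033,
    2.8158935193006362, -17.689219861357042, 3.5366213824650012,
    0.0036808784625555046, -1.3994140203542347, 0.31100395792952057,
    -0.012550106878985027, -0.9431349440290867, 0.008499950375369186,
    -0.5166808112793146,
]

def split_by_zero(names, coefficients, tolerance=1e-9):
    """Separate features the model kept from those it zeroed out."""
    kept = [n for n, c in zip(names, coefficients) if abs(c) > tolerance]
    dropped = [n for n, c in zip(names, coefficients) if abs(c) <= tolerance]
    return kept, dropped

kept, dropped = split_by_zero(feature_names, printed_coefficients)
print(len(kept), "kept;", len(dropped), "dropped:", dropped)  # 13 kept; 0 dropped: []
```

In this run no coefficient is exactly zero, so all 13 features remain in the model. With a live model you could pass assembler.getInputCols() and lasso_model.coefficients.toArray() instead of the copied values.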

7. Analyze feature importance

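To get a rough sense of which features contribute most to the model’s predictions, you can rank the absolute values of the coefficients. Keep in mind that the coefficients are reported on each feature’s original scale, so a feature measured in small units (such as nox) can have a large coefficient without necessarily having the largest effect on the target variable.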

python
# Pair each coefficient with its input feature name (same order as the assembler's inputCols)
feature_importance = sorted(zip(assembler.getInputCols(), map(abs, coefficients)), key=lambda x: x[1], reverse=True)

print("Feature Importance:")
for feature, importance in feature_importance:
    print("  {}: {:.3f}".format(feature, importance))
python
Feature Importance:
  nox: 17.689
  rm: 3.537
  chas: 2.816
  dis: 1.399
  ptratio: 0.943
  lstat: 0.517
  rad: 0.311
  crim: 0.110
  zn: 0.047
  tax: 0.013
  indus: 0.009
  b: 0.008
  age: 0.004
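
Raw coefficient size depends on each feature’s units, so one common adjustment is to multiply the absolute coefficient by the feature’s standard deviation. The plain-Python sketch below does this for nox and tax using only the five rows displayed by df.show(5), which is far too few for real conclusions; it is meant only to show how much the gap between two features can change. The helper name is hypothetical.

```python
from statistics import pstdev

# Five sample rows from the df.show(5) output (illustration only).
sample_values = {
    "nox": [0.538, 0.469, 0.469, 0.458, 0.458],
    "tax": [296, 242, 242, 222, 222],
}
# Coefficients copied from the printed model output.
coefficient = {"nox": -17.689219861357042, "tax": -0.012550106878985027}

def scaled_importance(coef, values):
    """Absolute change in the prediction for a one-standard-deviation move."""
    return abs(coef) * pstdev(values)

raw_ratio = abs(coefficient["nox"]) / abs(coefficient["tax"])
scaled_ratio = (scaled_importance(coefficient["nox"], sample_values["nox"])
                / scaled_importance(coefficient["tax"], sample_values["tax"]))
print(round(raw_ratio), round(scaled_ratio, 1))
```

On these five rows the raw coefficients differ by a factor of more than a thousand, while the scaled values are within a factor of two of each other. For a real comparison, compute the standard deviations on the full training data.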

8. Evaluating the model

To evaluate the performance of our Lasso Regression model, we’ll use the RegressionEvaluator class from PySpark MLlib.

python
# Make predictions on the test data
predictions = lasso_model.transform(test_data)

# Evaluate the model
rmse = evaluator.evaluate(predictions)
r2 = RegressionEvaluator(predictionCol="prediction", labelCol="medv", metricName="r2").evaluate(predictions)

print("Root Mean Squared Error (RMSE):", rmse)
print("Coefficient of Determination (R2):", r2)
python
Root Mean Squared Error (RMSE): 4.667802242354984
Coefficient of Determination (R2): 0.7935066845497476
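
For reference, both metrics are simple to compute by hand. The plain-Python sketch below uses the standard definitions of RMSE and R2 on a few made-up label/prediction pairs. The function names and numbers are hypothetical and unrelated to the results above.

```python
import math

def rmse_score(labels, predictions):
    """Square root of the mean squared prediction error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def r2_score(labels, predictions):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [20.0, 30.0, 40.0]        # made-up target values
predictions = [22.0, 30.0, 38.0]   # made-up model outputs

print(rmse_score(labels, predictions), r2_score(labels, predictions))
```

RMSE is expressed in the same units as medv, so lower is better. R2 is unitless, with 1.0 meaning a perfect fit.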

9. Save and load the model (optional)

If you want to reuse the model in the future, you can save it to disk and load it back when needed.

python
# Save the model (overwrite() avoids an error if the path already exists)
lasso_model.write().overwrite().save("lasso_model")

# Load the model
from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load("lasso_model")

Conclusion

In this blog post, we’ve demonstrated how to build and evaluate a Lasso Regression model using PySpark MLlib, with a focus on hyperparameter tuning to improve model performance.

The steps we covered include setting up the environment, importing required libraries, loading the dataset, data preprocessing, building the Lasso Regression model, performing hyperparameter tuning using cross-validation and grid search, and evaluating the model’s performance.

By tuning regParam with cross-validation, you let the data decide how strongly to penalize the coefficients instead of relying on a hand-picked value, which can lead to better performance and more accurate predictions.
