Install PySpark on Linux – A Step-by-Step Guide to Install PySpark on Linux with Example Code

Written by Jagdeesh | 2 min read

Introduction

PySpark is the Python API for Apache Spark, an open-source framework for large-scale data processing. It combines the power of Spark with Python’s simplicity, making it a popular choice among data scientists and engineers.

In this blog post, we will walk you through the installation process of PySpark on a Linux operating system and provide example code to get you started with your first PySpark project.

Prerequisites

Before installing PySpark, make sure that the following software is installed on your Linux machine. Steps 1 and 2 below cover the JDK and Apache Spark if you do not have them yet:

Python 3.6 or later

Java Development Kit (JDK) 8 or later

Apache Spark
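
The checklist above can be partly automated. The sketch below is a minimal, hypothetical helper (the name `check_prerequisites` and its result keys are illustrative, not part of PySpark) that reports whether the running interpreter meets the Python 3.6 minimum and whether a `java` executable is on the PATH. It checks only for presence, not for a particular Java version.

```python
import shutil
import sys

MIN_PYTHON = (3, 6)  # minimum Python version listed in the prerequisites


def check_prerequisites():
    """Return a dict describing which prerequisites look satisfied.

    Hypothetical helper for illustration; it only inspects the local machine.
    """
    return {
        # True when the interpreter running this script is new enough.
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        # Path to the java executable, or None if it is not on PATH.
        "java_path": shutil.which("java"),
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

If `java_path` comes back as `None`, step 1 below installs a JDK.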

1. Install Java Development Kit (JDK)

First, update the package index (the commands in this step assume a Debian- or Ubuntu-based distribution that uses apt):

bash
sudo apt update

Next, install the default JDK using the following command:

bash
sudo apt install default-jdk

Verify the installation by checking the Java version:

bash
java -version
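
`java -version` writes its report to standard error, and Java 8 identifies itself as `1.8.x` while later releases report `11.x`, `17.x`, and so on. Spark 3.2.x documents support for Java 8 and 11, so it may be worth confirming which major version `default-jdk` installed. As a rough sketch, the hypothetical helper `java_major_version` below extracts the major version from the first quoted version string it finds.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Returns an int such as 8 or 11, or None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report "1.8.0_xxx", so use the second field.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def detect_java_major():
    """Run `java -version` and parse its output; None if java is unavailable."""
    try:
        proc = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    # The version banner goes to stderr, not stdout.
    return java_major_version(proc.stderr)
```

For instance, `java_major_version('java version "1.8.0_292"')` returns 8 and `java_major_version('openjdk version "11.0.11" 2021-04-20')` returns 11. These sample strings are illustrative, not output captured from a particular machine.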

2. Install Apache Spark

Download Apache Spark from the official website (https://spark.apache.org/downloads.html). This guide uses Spark 3.2.0, the latest version at the time of writing. On the downloads page, choose the package type “Pre-built for Apache Hadoop 3.2 and later”.

Use the following commands to download and extract the Spark archive:

bash
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar -xvzf spark-3.2.0-bin-hadoop3.2.tgz
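
The download URL above appears to follow a predictable pattern on the Apache archive, which helps if you want a different release. The sketch below rebuilds that URL from its parts and computes a SHA-512 digest of a downloaded file, which you could compare by hand against the checksum Apache publishes with each release. The names `spark_archive_url` and `sha512_of_file` are hypothetical helpers, and the URL pattern is inferred from the single URL in this guide rather than from a documented scheme.

```python
import hashlib

ARCHIVE_ROOT = "https://archive.apache.org/dist/spark"


def spark_archive_url(spark_version, hadoop_profile):
    """Build the archive URL for a pre-built Spark package.

    Assumption: other releases follow the same naming pattern as the
    spark-3.2.0-bin-hadoop3.2.tgz file used in this guide.
    """
    name = f"spark-{spark_version}-bin-hadoop{hadoop_profile}.tgz"
    return f"{ARCHIVE_ROOT}/spark-{spark_version}/{name}"


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(spark_archive_url("3.2.0", "3.2"))
```

Reading the archive in chunks keeps memory use small even though the Spark package is a few hundred megabytes.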

Move the extracted folder to the /opt directory:

bash
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark

3. Set Up Environment Variables

Add the following lines to your ~/.bashrc file to set up the required environment variables:

bash
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Source the updated ~/.bashrc file to apply the changes:

bash
source ~/.bashrc
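
To confirm that the variables took effect, you could inspect them from Python. The following is a minimal sketch, assuming the layout used in this guide (a `bin/spark-submit` launcher under `SPARK_HOME`). The function name `spark_home_problems` is hypothetical, and the PATH check is a plain string comparison, so an entry written with a trailing slash would not match.

```python
import os


def spark_home_problems(environ=None):
    """Return a list of human-readable problems with the Spark environment.

    An empty list means SPARK_HOME is set, contains bin/spark-submit,
    and has its bin directory on PATH.
    """
    environ = os.environ if environ is None else environ
    problems = []
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    spark_submit = os.path.join(spark_home, "bin", "spark-submit")
    if not os.path.isfile(spark_submit):
        problems.append(f"{spark_submit} does not exist")
    # PATH entries are separated by os.pathsep (":" on Linux).
    path_entries = environ.get("PATH", "").split(os.pathsep)
    if os.path.join(spark_home, "bin") not in path_entries:
        problems.append("SPARK_HOME/bin is not on PATH")
    return problems


if __name__ == "__main__":
    issues = spark_home_problems()
    print("OK" if not issues else "\n".join(issues))
```

Run it in a new terminal, or after sourcing ~/.bashrc, so that it sees the updated environment.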

4. Install PySpark

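The pip package ships its own copy of Spark. If its version does not match the Spark release in /opt/spark, pin it explicitly (for example, pip install pyspark==3.2.0). Otherwise, install the latest release using pip: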

bash
pip install pyspark
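
Before running any Spark code, it can help to confirm which PySpark version pip installed and whether it lines up with the Spark release you downloaded. The sketch below uses the standard library's `importlib.metadata` (available in Python 3.8 and later). The helper names `installed_pyspark_version` and `same_minor_release` are hypothetical, and comparing only the major.minor components is a simplifying assumption rather than an official compatibility rule.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_pyspark_version():
    """Return the installed pyspark version string, or None if not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


def same_minor_release(version_a, version_b):
    """Compare two version strings on their major.minor components only."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


if __name__ == "__main__":
    found = installed_pyspark_version()
    print("pyspark is not installed" if found is None else f"pyspark {found}")
```

For example, `same_minor_release("3.2.0", "3.2.1")` is true, while `same_minor_release("3.2.0", "3.3.0")` is false.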

5. Verify PySpark Installation

Create a new Python file called pyspark_test.py and add the following code:

python
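from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session for this test.
spark = SparkSession.builder.appName("PySpark Test").getOrCreate()

# Sample rows and the column names to assign to them.
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]

# Build a DataFrame from the rows and print it as a table.
df = spark.createDataFrame(data, columns)
df.show()

# Release the session's resources.
spark.stop()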

Run the script (use python3 instead if your system does not provide a python command):

bash
python pyspark_test.py

If everything is set up correctly, you should see the following table, possibly surrounded by Spark log messages:

text
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+

Conclusion

Congratulations! You have successfully installed PySpark on your Linux operating system and executed a simple PySpark script. You can now start building more complex data processing pipelines using PySpark.

Don’t forget to explore the official PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) for more information and advanced use cases. Happy coding!
