
Install PySpark on Windows – A Step-by-Step Guide to Install PySpark on Windows with Code Examples

Written by Jagdeesh | 3 min read

Introduction

Apache Spark is an open-source, distributed computing system that provides a fast and general-purpose cluster-computing framework for big data processing.

PySpark is the Python library for Spark, and it enables you to use Spark with the Python programming language.

This blog post will guide you through the process of installing PySpark on your Windows operating system and provide code examples to help you get started.

Prerequisites

1. Python 3.6 or later: Download and install Python from the official website (https://www.python.org/downloads/). Make sure to add Python to your PATH during installation.

2. Java 8: Download and install the Java Development Kit (JDK) 8 from Oracle’s website (https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html). Set the JAVA_HOME environment variable to the installation directory.
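Before continuing, you can verify both prerequisites from a Command Prompt (output will vary with your installed versions):

```shell
:: Check that Python is on PATH and is 3.6 or later
python --version

:: Check that the JDK is installed and JAVA_HOME is set
java -version
echo %JAVA_HOME%
```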

1. Install Apache Spark

  1. Download the latest version of Apache Spark from the official website (https://spark.apache.org/downloads.html). Select the package type as “Pre-built for Apache Hadoop”.

  2. Extract the downloaded .tgz file to a directory, e.g., C:\spark.

  3. Set the SPARK_HOME environment variable to the extracted directory path, e.g., C:\spark.
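On Windows 10 and later, the bundled tar command can extract the archive, and setx persists the environment variable for future sessions. The archive filename below is an example; substitute the version you actually downloaded:

```shell
:: Create the target directory and extract the Spark archive into it
mkdir C:\spark
tar -xzf spark-3.5.0-bin-hadoop3.tgz -C C:\spark --strip-components=1

:: Persist SPARK_HOME for future Command Prompt sessions
setx SPARK_HOME "C:\spark"
```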

2. Install Hadoop

  1. Download the latest version of Hadoop from the official website (https://hadoop.apache.org/releases.html).

  2. Extract the downloaded .tar.gz file to a directory, e.g., C:\hadoop.

  3. Set the HADOOP_HOME environment variable to the extracted directory path, e.g., C:\hadoop.

3. Install PySpark using pip

Open a Command Prompt with administrative privileges and execute the following commands to install findspark and PySpark using the Python package manager pip:

```shell
pip install findspark
pip install pyspark
```
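A quick way to confirm both packages are importable is to check for them with the standard library's importlib; the small helper below is for illustration and is not part of PySpark:

```python
import importlib.util

def is_installed(package):
    """Return True if the current interpreter can find the package."""
    return importlib.util.find_spec(package) is not None

# Report the status of both packages installed above
for pkg in ("findspark", "pyspark"):
    status = "installed" if is_installed(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```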

4. Install winutils.exe

Since Hadoop is not natively supported on Windows, we need to use a utility called ‘winutils.exe’ to run Spark.

Download the appropriate version of winutils.exe for your Hadoop version from the following repository: https://github.com/steveloughran/winutils.

If you did not already create it in step 2, create a directory called ‘hadoop’ in your C: drive (C:\hadoop) with a subdirectory called ‘bin’ (C:\hadoop\bin). Place the downloaded ‘winutils.exe’ file in the ‘bin’ directory.
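The same setup can be done from a Command Prompt; the source path below assumes the file landed in your Downloads folder, so adjust it to wherever you saved winutils.exe:

```shell
:: Create the Hadoop bin directory (mkdir creates parent directories too)
mkdir C:\hadoop\bin

:: Move the downloaded winutils.exe into place (example download path)
move %USERPROFILE%\Downloads\winutils.exe C:\hadoop\bin\
```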

5. Set the Environment Variables

a) Open the System Properties dialog by right-clicking on ‘This PC’ or ‘Computer’, then selecting ‘Properties’.

b) Click on ‘Advanced system settings’ and then the ‘Environment Variables’ button.

c) Under ‘System variables’, click the ‘New’ button and add the following variables:

    Variable Name: HADOOP_HOME
    Variable Value: C:\hadoop

    Variable Name: SPARK_HOME
    Variable Value: %USERPROFILE%\AppData\Local\Programs\Python\Python{your_python_version}\Lib\site-packages\pyspark

    Replace {your_python_version} with your installed Python version, e.g., Python39 for Python 3.9. Note: this path applies when you rely on the pip-installed PySpark; if you extracted Spark manually in step 1, set SPARK_HOME to that directory instead, e.g., C:\spark.

d) Edit the ‘Path’ variable under ‘System variables’ and add the following entries:

    %HADOOP_HOME%\bin

    %SPARK_HOME%\bin

e) Click ‘OK’ to save the changes.
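After saving, open a new Command Prompt (already-open windows keep the old environment) and confirm the variables resolve as expected:

```shell
echo %HADOOP_HOME%
echo %SPARK_HOME%

:: winutils.exe should now be found via the updated Path
where winutils.exe
```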

6. Test the PySpark Installation

To test the PySpark installation, open a new Command Prompt and enter the following command:

```shell
pyspark
```

If everything is set up correctly, you should see the PySpark shell starting up, and you can begin using PySpark for your big data processing tasks.

7. Example Code

Here’s a simple example of using PySpark to count the number of occurrences of each word in a text file:

```python
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

# Configure Spark
conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

# Read the input file as an RDD of lines
text_file = sc.textFile("input.txt")

# Perform the word count
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Save the results to a directory
word_counts.saveAsTextFile("output")

# Stop the Spark context
sc.stop()
```

Save the script as word_count.py and create an input file named input.txt with some text content. Then run the script using the following command:

```shell
spark-submit word_count.py
```

After the script finishes executing, you should see an “output” folder containing the word count results.
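To see exactly what the flatMap/map/reduceByKey pipeline computes, here is the same logic in plain Python, with no Spark required; it is handy for checking expected results on small inputs:

```python
def word_count(lines):
    """Mirror the Spark pipeline: flatMap -> map -> reduceByKey."""
    # flatMap: split every line into individual words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts for each distinct word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

print(word_count(["to be or not", "to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```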

Conclusion

Congratulations! You have successfully installed PySpark on your Windows operating system and executed a simple word count example.

You can now start exploring the powerful features of PySpark to process large datasets and build scalable data processing pipelines.
