
Install PySpark on Windows – A Step-by-Step Guide to Install PySpark on Windows with Code Examples

Written by Jagdeesh | 3 min read

Introduction

Apache Spark is an open-source, distributed computing system that provides a fast and general-purpose cluster-computing framework for big data processing.

PySpark is the Python library for Spark, and it enables you to use Spark with the Python programming language.

This blog post will guide you through the process of installing PySpark on your Windows operating system and provide code examples to help you get started.

Prerequisites

1. Python 3.6 or later: Download and install Python from the official website (https://www.python.org/downloads/). Make sure to add Python to your PATH during installation.

2. Java 8: Download and install the Java Development Kit (JDK) 8 from Oracle’s website (https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html). Set the JAVA_HOME environment variable to the installation directory.
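Before continuing, you can verify both prerequisites from a Command Prompt (output will vary with your installed versions):

```shell
:: Check that Python is on PATH and is 3.6 or later
python --version

:: Check that the JDK is installed and JAVA_HOME is set
java -version
echo %JAVA_HOME%
```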

1. Install Apache Spark

  1. Download the latest version of Apache Spark from the official website (https://spark.apache.org/downloads.html). Select the package type as “Pre-built for Apache Hadoop”.

  2. Extract the downloaded .tgz file to a directory, e.g., C:\spark.

  3. Set the SPARK_HOME environment variable to the extracted directory path, e.g., C:\spark.
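On Windows 10 and later, the bundled tar command can extract the archive, and setx persists the environment variable for future sessions. The archive filename below is an example; substitute the version you actually downloaded:

```shell
:: Create the target directory and extract the Spark archive into it
mkdir C:\spark
tar -xzf spark-3.5.0-bin-hadoop3.tgz -C C:\spark --strip-components=1

:: Persist SPARK_HOME for future Command Prompt sessions
setx SPARK_HOME "C:\spark"
```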

2. Install Hadoop

  1. Download the latest version of Hadoop from the official website (https://hadoop.apache.org/releases.html).

  2. Extract the downloaded .tar.gz file to a directory, e.g., C:\hadoop.

  3. Set the HADOOP_HOME environment variable to the extracted directory path, e.g., C:\hadoop.

3. Install PySpark using pip

Open a Command Prompt with administrative privileges and execute the following commands to install findspark and PySpark using the Python package manager pip:

```shell
pip install findspark
pip install pyspark
```
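A quick way to confirm both packages are importable is to check for them with the standard library's importlib; the small helper below is for illustration and is not part of PySpark:

```python
import importlib.util

def is_installed(package):
    """Return True if the current interpreter can find the package."""
    return importlib.util.find_spec(package) is not None

# Report the status of both packages installed above
for pkg in ("findspark", "pyspark"):
    status = "installed" if is_installed(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```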

4. Install winutils.exe

Since Hadoop is not natively supported on Windows, we need to use a utility called ‘winutils.exe’ to run Spark.

Download the appropriate version of winutils.exe for your Hadoop version from the following repository: https://github.com/steveloughran/winutils.

If you did not already create it in step 2, create a directory called ‘hadoop’ in your C: drive (C:\hadoop) with a subdirectory called ‘bin’ (C:\hadoop\bin). Place the downloaded ‘winutils.exe’ file in the ‘bin’ directory.
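The same setup can be done from a Command Prompt; the source path below assumes the file landed in your Downloads folder, so adjust it to wherever you saved winutils.exe:

```shell
:: Create the Hadoop bin directory (mkdir creates parent directories too)
mkdir C:\hadoop\bin

:: Move the downloaded winutils.exe into place (example download path)
move %USERPROFILE%\Downloads\winutils.exe C:\hadoop\bin\
```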

5. Set the Environment Variables

a) Open the System Properties dialog by right-clicking on ‘This PC’ or ‘Computer’, then selecting ‘Properties’.

b) Click on ‘Advanced system settings’ and then the ‘Environment Variables’ button.

c) Under ‘System variables’, click the ‘New’ button and add the following variables:

    Variable Name: HADOOP_HOME
    Variable Value: C:\hadoop

    Variable Name: SPARK_HOME
    Variable Value: %USERPROFILE%\AppData\Local\Programs\Python\Python{your_python_version}\Lib\site-packages\pyspark

    Replace {your_python_version} with your installed Python version, e.g., Python39 for Python 3.9. Note: this path applies when you rely on the pip-installed PySpark; if you extracted Spark manually in step 1, set SPARK_HOME to that directory instead, e.g., C:\spark.

d) Edit the ‘Path’ variable under ‘System variables’ and add the following entries:

    %HADOOP_HOME%\bin

    %SPARK_HOME%\bin

e) Click ‘OK’ to save the changes.
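After saving, open a new Command Prompt (already-open windows keep the old environment) and confirm the variables resolve as expected:

```shell
echo %HADOOP_HOME%
echo %SPARK_HOME%

:: winutils.exe should now be found via the updated Path
where winutils.exe
```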

6. Test the PySpark Installation

To test the PySpark installation, open a new Command Prompt and enter the following command:

```shell
pyspark
```

If everything is set up correctly, you should see the PySpark shell starting up, and you can begin using PySpark for your big data processing tasks.

7. Example Code

Here’s a simple example of using PySpark to count the number of occurrences of each word in a text file:

```python
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

# Configure Spark
conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

# Read the input file as an RDD of lines
text_file = sc.textFile("input.txt")

# Perform the word count
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Save the results to a directory
word_counts.saveAsTextFile("output")

# Stop the Spark context
sc.stop()
```

Save the script as word_count.py and create an input file named input.txt with some text content. Then run the script using the following command:

```shell
spark-submit word_count.py
```

After the script finishes executing, you should see an “output” folder containing the word count results.
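To see exactly what the flatMap/map/reduceByKey pipeline computes, here is the same logic in plain Python, with no Spark required; it is handy for checking expected results on small inputs:

```python
def word_count(lines):
    """Mirror the Spark pipeline: flatMap -> map -> reduceByKey."""
    # flatMap: split every line into individual words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts for each distinct word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

print(word_count(["to be or not", "to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```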

Conclusion

Congratulations! You have successfully installed PySpark on your Windows operating system and executed a simple word count example.

You can now start exploring the powerful features of PySpark to process large datasets and build scalable data processing pipelines.
