
Read and Write files using PySpark – Multiple ways to Read and Write data using PySpark

Written by Jagdeesh | 3 min read

Introduction

PySpark is the Python API for Apache Spark, an open-source, distributed computing system designed for big data processing and analytics. It lets you program Spark from the Python programming language.

One of the most important tasks in data processing is reading and writing data to various file formats. In this blog post, we will explore multiple ways to read and write data using PySpark with code examples.

1. Prerequisites

To follow this tutorial, you’ll need PySpark installed on your system. You can install PySpark, along with findspark (used below to locate the Spark installation), using pip:

bash
pip install pyspark
pip install findspark
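
To confirm the installation, you can print the installed version from a Python shell (an optional check):

python
import pyspark
print(pyspark.__version__)  # e.g. 3.x.x, depending on the version pip installed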

2. Initializing Spark Session

Before we dive into reading and writing data, let’s initialize a SparkSession. The SparkSession is the entry point to PySpark and lets you interact with your data.

python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read and Write Data Using PySpark") \
    .getOrCreate()
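
If you want to verify that the session is up, you can print the Spark version it runs on; when you are finished, spark.stop() releases the resources:

python
print(spark.version)  # Spark version behind this SparkSession

# At the end of your script or notebook:
# spark.stop()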

3. Create a PySpark DataFrame

Creating a DataFrame from a Python list of tuples and a list of column names:

python
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]

columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

df.show()
Output:
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+
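
Spark inferred the column types here. If you prefer to declare them explicitly, you can pass a schema built from StructType; a minimal sketch using the same data:

python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),   # nullable string column
    StructField("Age", IntegerType(), True)    # nullable integer column
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()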

4. Reading and Writing CSV Files

To read a CSV file using PySpark, you can use the read.csv() method:

python
csv_file = "path/to/your/csv/file.csv"

df_csv = spark.read.csv(csv_file, header=True, inferSchema=True)

Now that you have your data in a DataFrame, you can write it back to a CSV file using the write.csv() method:

python
output_path = "path/to/output/csv/file.csv"

df_csv.write.csv(output_path, header=True, mode="overwrite")
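
Real CSV files often need more options than header and inferSchema. The snippet below sketches a few common ones; the semicolon delimiter and the "NA" null marker are assumptions you would adjust to your own file. Also note that write.csv() creates a directory of part files; coalesce(1) collapses the output to a single part file inside that directory:

python
df_csv2 = (spark.read
    .option("header", True)
    .option("sep", ";")          # assumed delimiter; change to match your file
    .option("nullValue", "NA")   # treat the string "NA" as null
    .csv(csv_file))

# Write the result as a single part file inside the output directory
df_csv2.coalesce(1).write.csv(output_path, header=True, mode="overwrite")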

5. Reading and Writing JSON Files

To read a JSON file using PySpark, you can use the read.json() method:

python
json_file = "path/to/your/json/file.json"

df_json = spark.read.json(json_file)

You can write the data back to a JSON file using the write.json() method:

python
output_path = "path/to/output/json/file.json"

df_json.write.json(output_path, mode="overwrite")
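
By default, read.json() expects JSON Lines, i.e. one JSON object per line. If your file holds a single multi-line JSON document or array, enable the multiLine option; a short sketch:

python
df_json_ml = spark.read.option("multiLine", True).json(json_file)
df_json_ml.printSchema()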

6. Reading and Writing Parquet Files

To read a Parquet file using PySpark, you can use the read.parquet() method:

python
parquet_file = "path/to/your/parquet/file.parquet"

df_parquet = spark.read.parquet(parquet_file)

To write the data back to a Parquet file, use the write.parquet() method:

python
output_path = "path/to/output/parquet/file.parquet"

df_parquet.write.parquet(output_path, mode="overwrite")
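
Parquet also pairs well with partitioned output. If your DataFrame has a low-cardinality column to split on (the "year" column below is a hypothetical example), partitionBy() writes one sub-directory per value, and later reads can skip partitions they don’t need:

python
# "year" is a hypothetical column; replace it with a real column in your data
df_parquet.write.partitionBy("year").parquet(output_path, mode="overwrite")

# Reading the partitioned directory back restores the partition column
df_partitioned = spark.read.parquet(output_path)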

7. Creating a SQL Table in PySpark

We’ll create a sample DataFrame from a list of dictionaries and register it as a temporary SQL table so we can run SQL queries against it:

python
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Los Angeles"}
]

df = spark.createDataFrame(data)

df.createOrReplaceTempView("people")

query = "SELECT * FROM people WHERE age >= 30"
result_df = spark.sql(query)
result_df.show()
Output:
+---+-----------+-------+
|age|       city|   name|
+---+-----------+-------+
| 30|   New York|  Alice|
| 35|Los Angeles|Charlie|
+---+-----------+-------+
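
A temporary view only exists for the lifetime of the current SparkSession. If you prefer not to write SQL, the same filter can be expressed with the DataFrame API; an equivalent sketch:

python
from pyspark.sql import functions as F

result_df2 = df.filter(F.col("age") >= 30)
result_df2.show()  # same rows as the SQL query above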

8. Convert a Pandas DataFrame to a PySpark DataFrame

python
import pandas as pd
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Los Angeles"}
]

pandasDF = pd.DataFrame(data, columns=['name', 'age', 'city'])

# Create a PySpark DataFrame from the pandas DataFrame
sparkDF = spark.createDataFrame(pandasDF)

sparkDF.printSchema()
sparkDF.show()
Output:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)

+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|  Alice| 30|     New York|
|    Bob| 25|San Francisco|
|Charlie| 35|  Los Angeles|
+-------+---+-------------+
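
The conversion also works in the opposite direction: toPandas() collects the distributed DataFrame to the driver as a pandas DataFrame, so use it only when the data comfortably fits in memory:

python
pandasDF2 = sparkDF.toPandas()

print(type(pandasDF2))  # <class 'pandas.core.frame.DataFrame'>
print(pandasDF2.head())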

Conclusion

In this blog post, we explored multiple ways to read and write data using PySpark, including CSV, JSON, and Parquet files, temporary SQL tables, and conversions between pandas and PySpark DataFrames.
