Run SQL Queries with PySpark – A Step-by-Step Guide with Example Code

Written by Jagdeesh | 3 min read

Introduction

One of the core features of Spark is its ability to run SQL queries on structured data. In this blog post, we will explore how to run SQL queries in PySpark and provide example code to get you started. By the end of this post, you should have a better understanding of how to work with SQL queries in PySpark.

Table of Contents

  1. Setting up PySpark

  2. Loading Data into a DataFrame

  3. Creating a Temporary View

  4. Running SQL Queries

  5. Example: Analyzing Sales Data

  6. Conclusion


1. Setting up PySpark

Before running SQL queries in PySpark, you’ll need to install it. You can install PySpark using pip:

```bash
pip install pyspark
```

To start a PySpark session, import the SparkSession class and create a new instance:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Running SQL Queries in PySpark") \
    .getOrCreate()
```

2. Loading Data into a DataFrame

To run SQL queries in PySpark, you’ll first need to load your data into a DataFrame. DataFrames are the primary data structure in Spark, and they can be created from various data sources, such as CSV, JSON, and Parquet files, as well as Hive tables and JDBC databases.

For example, to load a CSV file into a DataFrame, you can use the following code:

```python
csv_file = "path/to/your/csv_file.csv"

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(csv_file)
```

3. Creating a Temporary View

Once you have your data in a DataFrame, you can create a temporary view to run SQL queries against it. A temporary view is a named view of a DataFrame that is accessible only within the current Spark session.

To create a temporary view, use the createOrReplaceTempView method:

```python
df.createOrReplaceTempView("sales_data")
```

4. Running SQL Queries

With your temporary view created, you can now run SQL queries on your data using the spark.sql() method. This method returns the result of the query as a new DataFrame.

For example, to select all rows from the “sales_data” view:

```python
result = spark.sql("SELECT * FROM sales_data")
result.show()
```

5. Example: Analyzing Sales Data

Let’s analyze some sales data to see how SQL queries can be used in PySpark. Suppose we have the following sales data in a CSV file:

```text
OrderID,ProductID,Quantity,Price,OrderDate
1,101,3,100,2023-01-01
2,102,1,200,2023-01-02
3,101,2,100,2023-01-03
4,103,5,50,2023-01-04
```

We can calculate the total revenue for each product using the following code:

```python
# Load the data into a DataFrame
csv_file = "path/to/your/csv_file.csv"
df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_file)

# Create a temporary view
df.createOrReplaceTempView("sales_data")

# Calculate the total revenue for each product
query = """
SELECT
    ProductID,
    SUM(Quantity * Price) AS TotalRevenue
FROM sales_data
GROUP BY ProductID
"""

result = spark.sql(query)
result.show()
```

This query will output the following results:

```text
+---------+------------+
|ProductID|TotalRevenue|
+---------+------------+
|      101|         500|
|      102|         200|
|      103|         250|
+---------+------------+
```

You can also use more complex SQL queries to analyze your data. For example, to find the top 2 products with the highest revenue:

```python
query = """
SELECT
    ProductID,
    SUM(Quantity * Price) AS TotalRevenue
FROM
    sales_data
GROUP BY
    ProductID
ORDER BY
    TotalRevenue DESC
LIMIT 2
"""

result = spark.sql(query)
result.show()
```

This query will output the following results:

```text
+---------+------------+
|ProductID|TotalRevenue|
+---------+------------+
|      101|         500|
|      102|         200|
+---------+------------+
```

6. Conclusion

In this blog post, we have demonstrated how to execute SQL queries in PySpark using DataFrames and temporary views. This powerful feature allows you to leverage your SQL skills to analyze and manipulate large datasets in a distributed environment using Python.

By following the steps outlined in this guide, you can easily integrate SQL queries into your PySpark applications, enabling you to perform complex data analysis tasks with ease.
