Menu

PySpark Joins – A Comprehensive Guide on PySpark Joins with Example Code

Written by Jagdeesh | 4 min read

Welcome to our blog post on PySpark join types. As an expert in the field, I am excited to share my knowledge with you. PySpark, the Apache Spark library for Python, provides a powerful and flexible framework for big data processing.

One of the most essential operations in data processing is joining datasets, which enables you to combine data from different sources based on a common key.

In this blog post, we will discuss the various join types supported by PySpark, explain their use cases, and provide example code for each type. So let’s dive in!

Type of Joins

  1. Inner Join

  2. Outer (Full) Join

  3. Left Join

  4. Right Join

  5. Left Semi Join

  6. Left Anti Join

  7. Cross Join

Importing necessary libraries and creating a sample DataFrames

python
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.master("local").appName("PySpark Join Types").getOrCreate()

# Create sample dataframes
df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

1. Inner Join

An inner join returns rows from both dataframes that have matching keys. In other words, it returns only the rows that have common keys in both dataframes. This is the default join type in PySpark.

python
# Perform inner join
result = df1.join(df2, on="id", how="inner")

# Show result
result.show()
python
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+

2. Outer (Full) Join

An outer join, also known as a full join, returns all rows from both dataframes. If a key is present in one dataframe but not in the other, the missing values are filled with nulls.

python
# Perform outer join
result = df1.join(df2, on="id", how="outer")

# Show result
result.show()
python
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
|  3|     C|  null|
|  4|  null|     Z|
+---+------+------+

3. Left Join

A left join returns all rows from the left dataframe and the matched rows from the right dataframe. If no match is found for a key in the right dataframe, the result will contain null values.

python
# Perform left join
result = df1.join(df2, on="id", how="left")

# Show result
result.show()
python
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  3|     C|  null|
|  2|     B|     Y|
+---+------+------+

4. Right Join

A right join returns all rows from the right dataframe and the matched rows from the left dataframe. If no match is found for a key in the left dataframe, the result will contain null values.

python
# Perform right join
result = df1.join(df2, on="id", how="right")

# Show result
result.show()
python
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
|  4|  null|     Z|
+---+------+------+

5. Left Semi Join

A left semi join returns only the columns from the left dataframe for the rows with matching keys in both dataframes. It is similar to an inner join but only returns the columns from the left dataframe.

python
# Perform left semi join
result = df1.join(df2, on="id", how="left_semi")

# Show result
result.show()
python
+---+------+
| id|value1|
+---+------+
|  1|     A|
|  2|     B|
+---+------+

6. Left Anti Join

A left anti join returns the rows from the left dataframe that do not have matching keys in the right dataframe. It is the opposite of a left semi join.

python
# Perform left anti join
result = df1.join(df2, on="id", how="left_anti")

# Show result
result.show()
python
+---+------+
| id|value1|
+---+------+
|  3|     C|
+---+------+

7. Cross Join

A cross join, also known as a cartesian join, returns the cartesian product of both dataframes. It combines each row from the left dataframe with each row from the right dataframe.

python
# Perform cross join
result = df1.crossJoin(df2)

# Show result
result.show()
python
+---+------+---+------+
| id|value1| id|value2|
+---+------+---+------+
|  1|     A|  1|     X|
|  1|     A|  2|     Y|
|  1|     A|  4|     Z|
|  2|     B|  1|     X|
|  2|     B|  2|     Y|
|  2|     B|  4|     Z|
|  3|     C|  1|     X|
|  3|     C|  2|     Y|
|  3|     C|  4|     Z|
+---+------+---+------+

In this blog post, we have explored the various join types available in PySpark, including inner, outer, left, right, left semi, left anti, and cross joins. Each join type has its own unique use case, and understanding how to use them effectively can help you manipulate and analyze large datasets with ease.

With the help of the example code provided, you should now have a solid understanding of how to perform these join operations in PySpark. Happy data processing!

Free Course
Master Core Python — Your First Step into AI/ML

Build a strong Python foundation with hands-on exercises designed for aspiring Data Scientists and AI/ML Engineers.

Start Free Course
Trusted by 50,000+ learners
Jagdeesh
Written by
Related Course
Master PySpark — Hands-On
Join 5,000+ students at edu.machinelearningplus.com
Explore Course
Get the full course,
completely free.
Join 57,000+ students learning Python, SQL & ML. One year of access, all resources included.
📚 10 Courses
🐍 Python & ML
🗄️ SQL
📦 Downloads
📅 1 Year Access
No thanks
🎓
Free AI/ML Starter Kit
Python · SQL · ML · 10 Courses · 57,000+ students
🎉   You're in! Check your inbox (or Promotions/Spam) for the access link.
⚡ Before you go

Python.
SQL. NumPy.
All free.

Get the exact 10-course programming foundation that Data Science professionals use.

🐍
Core Python — from first line to expert level
📈
NumPy & Pandas — the #1 libraries every DS job needs
🗃️
SQL Levels I–III — basics to Window Functions
📄
Real industry data — Jupyter notebooks included
R A M S K
57,000+ students
★★★★★ Rated 4.9/5
⚡ Before you go
Python. SQL.
All Free.
R A M S K
57,000+ students  ★★★★★ 4.9/5
Get Free Access Now
10 courses. Real projects. Zero cost. No credit card.
New learners enrolling right now
🔒 100% free ☕ No spam, ever ✓ Instant access
🚀
You're in!
Check your inbox for your access link.
(Check Promotions or Spam if you don't see it)
Or start your first course right now:
Start Free Course →
Scroll to Top
Scroll to Top
Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science