
PySpark Filter vs Where – A Comprehensive Guide to Filtering Rows in a PySpark DataFrame

Written by Jagdeesh | 3 min read

Apache Spark is a popular open-source distributed data processing engine with APIs in Scala, Java, and Python. PySpark is its Python API, providing a high-level interface for handling large-scale data processing tasks in Python.

One of the most common tasks when working with PySpark DataFrames is filtering rows based on certain conditions. In this blog post, we’ll discuss different ways to filter rows in PySpark DataFrames, along with code examples for each method.

Different ways to filter rows in PySpark DataFrames

1. Filtering Rows Using ‘filter’ Function

2. Filtering Rows Using ‘where’ Function

3. Filtering Rows Using SQL Queries

4. Combining Multiple Filter Conditions

Before we dive into filtering rows, let’s quickly review some basics of PySpark DataFrames. To work with PySpark DataFrames, we first need to import the necessary modules and create a SparkSession:

python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Filtering Rows in PySpark DataFrames") \
    .getOrCreate()

Next, let’s create a simple DataFrame to use in our examples:

python
from pyspark.sql import Row

data = [
    Row(id=1, name="Alice", age=30),
    Row(id=2, name="Bob", age=25),
    Row(id=3, name="Charlie", age=35),
    Row(id=4, name="David", age=28)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()
python
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  2|    Bob| 25|
|  3|Charlie| 35|
|  4|  David| 28|
+---+-------+---+

1. Filtering Rows Using ‘filter’ Function

The filter function is one of the most straightforward ways to filter rows in a PySpark DataFrame. It takes a boolean expression as an argument and returns a new DataFrame containing only the rows that satisfy the condition.

Example: Filter rows with age greater than or equal to 30

python
filtered_df = df.filter(df.age >= 30)

filtered_df.show()
python
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  3|Charlie| 35|
+---+-------+---+

2. Filtering Rows Using ‘where’ Function

The where function is an alias for the ‘filter’ function and can be used interchangeably. It also takes a boolean expression as an argument and returns a new DataFrame containing only the rows that satisfy the condition.

Example: Filter rows whose name is in the list ["Alice", "Charlie"]:

python
filtered_df = df.where(df.name.isin(["Alice", "Charlie"]))

filtered_df.show()
python
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  3|Charlie| 35|
+---+-------+---+

3. Filtering Rows Using SQL Queries

PySpark also supports executing SQL queries to filter rows in a DataFrame. First, you need to register your DataFrame as a temporary table using the ‘createOrReplaceTempView’ function. Then, you can execute SQL queries using the ‘sql’ function.

Example: Filter rows with age less than or equal to 25

python
df.createOrReplaceTempView("people")

filtered_df = spark.sql("SELECT * FROM people WHERE age <= 25")
filtered_df.show()
python
+---+----+---+
| id|name|age|
+---+----+---+
|  2| Bob| 25|
+---+----+---+

4. Combining Multiple Filter Conditions

You can combine multiple filter conditions using the ‘&’ (and), ‘|’ (or), and ‘~’ (not) operators. Make sure to use parentheses to separate different conditions, as it helps maintain the correct order of operations.

Example: Filter rows with age greater than 25 and name not equal to “David”

python
filtered_df = df.filter((df.age > 25) & (df.name != "David"))

filtered_df.show()
python
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  3|Charlie| 35|
+---+-------+---+

In this post, we covered different ways to filter rows in PySpark DataFrames: the ‘filter’ and ‘where’ functions, SQL queries, and combining multiple filter conditions.
