PySpark Drop Columns – Eliminate Unwanted Columns in PySpark DataFrame with Ease

A detailed look at PySpark's drop() function for removing columns from a DataFrame, with use cases that show where it fits in data manipulation.

Written by Jagdeesh | 3 min read

Welcome to this detailed blog post on using PySpark’s drop() function to remove columns from a DataFrame. Let’s look at how drop() works and walk through several use cases to see where it fits in data manipulation.

This post is a perfect starting point for those looking to expand their understanding of PySpark and improve their data wrangling skills.

Creating a DataFrame

Before we dive into the drop() function, let’s create a DataFrame to work with. In this example, we will create a simple DataFrame with four columns: “name”, “age”, “city”, and “gender”.

python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Drop Column Example") \
    .getOrCreate()

data = [("Alice", 30, "New York", "F"),
        ("Bob", 28, "San Francisco", "M"),
        ("Cathy", 29, "Los Angeles", "F"),
        ("David", 32, "Chicago", "M")]

columns = ["name", "age", "city", "gender"]

df = spark.createDataFrame(data, columns)
df.show()
python
+-----+---+-------------+------+
| name|age|         city|gender|
+-----+---+-------------+------+
|Alice| 30|     New York|     F|
|  Bob| 28|San Francisco|     M|
|Cathy| 29|  Los Angeles|     F|
|David| 32|      Chicago|     M|
+-----+---+-------------+------+

Different ways to drop columns in PySpark DataFrame

  1. Dropping a Single Column
  2. Dropping Multiple Columns
  3. Dropping Columns Conditionally
  4. Dropping Columns Using Regex Pattern

1. Dropping a Single Column

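The drop() function can be used to remove a single column from a DataFrame. It returns a new DataFrame and leaves the original untouched, so the result is assigned to a new variable here to keep df available for the later examples. The syntax is as follows:

python
# drop() returns a new DataFrame; df itself still has all four columns
df_no_gender = df.drop("gender")

df_no_gender.show()
python
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+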

2. Dropping Multiple Columns

You can also use the drop() function to remove multiple columns from a DataFrame. Pass each column name as a separate argument to the function.

For example, let’s remove both the “age” and “gender” columns:

python
# Each column name is passed as its own argument
df_subset = df.drop("age", "gender")

df_subset.show()
python
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+

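Alternatively, you can keep the column names in a list and unpack it with the * operator, since drop() expects each name as a separate argument rather than a single list:

python
columns_to_drop = ["age", "gender"]

# The * operator unpacks the list into separate arguments
df_subset = df.drop(*columns_to_drop)

df_subset.show()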
python
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+

3. Dropping Columns Conditionally

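You might want to drop a column only when a specific condition holds, such as the column being present. When given a column name that does not exist, drop() does not raise an error; it returns the DataFrame unchanged. An explicit “if” check is therefore most useful when you want to make the intent visible or branch on whether the column exists:

python
df_checked = df

# Drop "gender" only if the DataFrame actually has that column
if "gender" in df_checked.columns:
    df_checked = df_checked.drop("gender")

df_checked.show()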
python
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+
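
When several columns are candidates for removal, the same membership check can be applied to a whole list of names. The sketch below is one way to do that, not a PySpark feature: split_present_missing is a hypothetical helper that works on a plain list of names such as the one df.columns returns, and “salary” is an invented name used only to show a missing column.

```python
def split_present_missing(columns, wanted):
    """Split `wanted` into names found in `columns` and names that are not."""
    available = set(columns)
    present = [c for c in wanted if c in available]
    missing = [c for c in wanted if c not in available]
    return present, missing


# Column names as df.columns returns them for the example DataFrame.
columns = ["name", "age", "city", "gender"]

# "salary" is a made-up name that the example DataFrame does not have.
present, missing = split_present_missing(columns, ["gender", "salary"])

print(present)  # ['gender']
print(missing)  # ['salary']
```

With a live DataFrame, the result could be used as df.drop(*present), while the missing list can be logged or raised as an error, depending on how strict the pipeline should be.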

4. Dropping Columns Using Regex Pattern

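The drop() function does not accept a regular expression (regex) directly, but you can use Python’s re module to pick the matching names from df.columns and pass them to drop(). Using re.fullmatch ensures that only columns whose whole name matches the pattern are removed.

python
import re

regex_pattern = "gender|age"

# Collect the columns whose full name matches the pattern, then drop them
columns_to_drop = [c for c in df.columns if re.fullmatch(regex_pattern, c)]
df_filtered = df.drop(*columns_to_drop)

df_filtered.show()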
python
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
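
One detail worth checking when filtering column names by pattern: re.match anchors only at the start of the string, so the pattern "gender|age" also matches a column named “age_group”. The sketch below compares the two behaviours on a plain list of names. columns_matching is a hypothetical helper, and “age_group” is an invented column added only to show the difference.

```python
import re


def columns_matching(columns, pattern):
    """Return the names in `columns` whose entire name matches `pattern`."""
    compiled = re.compile(pattern)
    return [c for c in columns if compiled.fullmatch(c)]


# The example columns plus a hypothetical "age_group" column.
columns = ["name", "age", "age_group", "city", "gender"]

# fullmatch: the whole name must match the pattern.
print(columns_matching(columns, "gender|age"))
# ['age', 'gender']

# match: only the start of the name must match, so "age_group" is caught too.
print([c for c in columns if re.match("gender|age", c)])
# ['age', 'age_group', 'gender']
```

With a live DataFrame, the helper could be used as df.drop(*columns_matching(df.columns, "gender|age")). If prefix matching is what you want, spelling it out in the pattern (for example "age.*") keeps the intent explicit.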

Conclusion

In this blog post, we learned about the PySpark drop() function and its various use cases. We explored how to remove single and multiple columns, drop columns conditionally, and remove columns whose names match a regex pattern.

With a solid understanding of the PySpark drop() function, you can now effectively manipulate your data to suit your needs.
