
Select columns in PySpark dataframe – A Comprehensive Guide to Selecting Columns in different ways in PySpark dataframe

Written by Jagdeesh | 3 min read

PySpark is the Python API for Apache Spark, a big data processing framework. It lets you process large volumes of data using the Python programming language, and its DataFrame API is the main tool for data manipulation and analysis.

One of the most common tasks when working with DataFrames is selecting specific columns. In this blog post, we will explore different ways to select columns in PySpark DataFrames, accompanied by example code for better understanding.

1. Selecting Columns using column names

The select function is the most straightforward way to select columns from a DataFrame. You can pass the column names as string arguments, or pass Column objects created with the ‘col’ function from the ‘pyspark.sql.functions’ module. Both forms return a new DataFrame containing only the requested columns.

python
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("SelectColumns").getOrCreate()

data = [("Alice", 34, "Female"),
        ("Bob", 45, "Male"),
        ("Charlie", 28, "Male"),
        ("Diana", 39, "Female")]

columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

# Select columns using column names
selected_df1 = df.select("Name", "Age")

# Select columns using the 'col' function
selected_df2 = df.select(col("Name"), col("Age"))

df.show()

selected_df1.show()

selected_df2.show()
python
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 34|Female|
|    Bob| 45|  Male|
|Charlie| 28|  Male|
|  Diana| 39|Female|
+-------+---+------+

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+

2. Selecting Columns using the ‘[ ]’ Operator

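You can also use the ‘[ ]’ operator to refer to columns, much as you would in pandas. Unlike pandas, indexing with a single name returns a Column object rather than the data itself, so you pass the result to select to get a DataFrame.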

python
# Reference a single column using the '[]' operator
# (this is a Column object, not a DataFrame, so it has no show() method)
name_col = df["Name"]

# Select multiple columns by passing '[]' references to select
selected_df3 = df.select(df["Name"], df["Age"])

selected_df3.show()
python
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+

3. Select Columns using index

The select function expects column names or Column objects rather than integer positions. To select by position, first look up the column names in ‘df.columns’ (a plain Python list) by index, and then select those names.

python
# Define the column indices you want to select
column_indices = [0, 2]

# Extract column names based on indices
selected_columns = [df.columns[i] for i in column_indices]

# Select columns using extracted column names
selected_df4 = df.select(selected_columns)

# Show the result DataFrame
selected_df4.show()
python
+-------+------+
|   Name|Gender|
+-------+------+
|  Alice|Female|
|    Bob|  Male|
|Charlie|  Male|
|  Diana|Female|
+-------+------+

4. Selecting Columns using the ‘withColumn’ and ‘drop’ Functions

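The ‘withColumn’ and ‘drop’ functions do not select columns by themselves, but they control which columns end up in the result. ‘withColumn’ returns a new DataFrame with a column added (or replaced if the name already exists), and ‘drop’ returns a new DataFrame with a column removed.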

python
# Add a new column 'IsAdult' and remove the 'Gender' column
selected_df5 = df.withColumn("IsAdult", col("Age") >= 18).drop("Gender")

selected_df5.show()
python
+-------+---+-------+
|   Name|Age|IsAdult|
+-------+---+-------+
|  Alice| 34|   true|
|    Bob| 45|   true|
|Charlie| 28|   true|
|  Diana| 39|   true|
+-------+---+-------+

5. Selecting Columns using SQL Expressions

You can also use SQL-like expressions to select columns using the ‘selectExpr’ function. This is useful when you want to perform operations on columns while selecting them.

python
# Select columns with an SQL expression
selected_df6 = df.selectExpr("Name", "Age", "Age >= 18 as IsAdult")

selected_df6.show()
python
+-------+---+-------+
|   Name|Age|IsAdult|
+-------+---+-------+
|  Alice| 34|   true|
|    Bob| 45|   true|
|Charlie| 28|   true|
|  Diana| 39|   true|
+-------+---+-------+

In this post, we explored different ways to select columns in PySpark DataFrames: the ‘select’ function, the ‘[ ]’ operator, column indices, the ‘withColumn’ and ‘drop’ functions, and SQL expressions with ‘selectExpr’.

Knowing how to use these techniques effectively will make your data manipulation tasks more efficient and help you unlock the full potential of PySpark.
