
PySpark Statistics Median – Calculating the Median in PySpark: A Comprehensive Guide for Everyone

Written by Jagdeesh | 4 min read

Let's explore different ways of calculating the median using PySpark, helping you become an expert.

As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python.

How to Calculate the Median?

The median is a measure of central tendency that represents the middle value in a dataset when the values are sorted in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

1) Arrange the data in ascending or descending order.

2) Determine the position of the median using the formula:

Median Position (P) = (n + 1) / 2

where 'n' is the total number of values in the dataset.

3) If the dataset has an odd number of values, the median is the value at the median position.

4) If the dataset has an even number of values, the median is the average of the values at positions (P – 0.5) and (P + 0.5).
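The steps above can be sketched in plain Python (a minimal illustration, independent of PySpark; the function name `manual_median` is just for this example):

```python
def manual_median(values):
    """Compute the median following the four steps above."""
    ordered = sorted(values)            # step 1: sort the data
    n = len(ordered)
    p = (n + 1) / 2                     # step 2: median position (1-based)
    if n % 2 == 1:                      # step 3: odd count -> middle value
        return ordered[int(p) - 1]
    # step 4: even count -> average of positions (P - 0.5) and (P + 0.5)
    lower = ordered[int(p - 0.5) - 1]
    upper = ordered[int(p + 0.5) - 1]
    return (lower + upper) / 2

print(manual_median([7, 1, 5, 3, 9]))      # odd count -> 5
print(manual_median([7, 1, 5, 3, 9, 11]))  # even count -> (5 + 7) / 2 = 6.0
```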

How to use Median in Statistics and Machine Learning

1. Descriptive Statistics: The median is used to describe the central tendency of a dataset, offering a more accurate representation of the data’s center than the mean in cases of skewed distributions or the presence of outliers.

2. Exploratory Data Analysis (EDA): The median is often used during EDA to identify trends, patterns, or potential anomalies in the data.

3. Machine Learning Algorithms: In machine learning, the median is used for data preprocessing tasks such as filling missing values or normalizing data. It can also be utilized as a robust loss function in regression tasks.

4. Non-parametric methods: In machine learning, non-parametric methods make fewer assumptions about the underlying data distribution. The median is often used in these methods, such as the k-nearest neighbors algorithm, where it can be used to predict the target variable by taking the median of the k-nearest neighbors.

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Calculating Median with PySpark") \
    .getOrCreate()
```

2. How to calculate the Median of a list

A. How to calculate the Median of a list using RDD (Resilient Distributed Dataset)

```python
sc = spark.sparkContext

data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
rdd = sc.parallelize(data)

sorted_rdd = rdd.sortBy(lambda x: x)
n = sorted_rdd.count()

# take(k) returns the first k elements in sorted order
if n % 2 == 0:
    lower = sorted_rdd.take(n // 2)[-1]      # last value of the first half
    upper = sorted_rdd.take(n // 2 + 1)[-1]  # first value of the second half
    median = (lower + upper) / 2
else:
    median = sorted_rdd.take(n // 2 + 1)[-1]

print(f"Median: {median}")
```

Output:
```
Median: 10.0
```

B. How to calculate the Median of a list using the PySpark approxQuantile() function

You first need to convert the list into a DataFrame and then use the approxQuantile() function.

```python
# Create a list
data_list = [1, 9, 3, 4, 5, 7, 11, 8, 2, 10, 6]

# Convert the list into a DataFrame
data_df = spark.createDataFrame([(value,) for value in data_list], ["values"])

# 0.5 is the median quantile; 0.01 is the allowed relative error
median = data_df.approxQuantile("values", [0.5], 0.01)

# Print the results
for value in median:
    print(f"Median: {value}")
```

Output:
```
Median: 6.0
```

3. Preparing the Sample Data

To demonstrate the different methods of calculating the median, we’ll use a sample dataset containing three columns: id, age, and income. First, let’s load the data into a DataFrame:

```python
data = [("1", 25, 40000),
        ("2", 30, 60000),
        ("3", 28, 50000),
        ("4", 35, 70000),
        ("5", 32, 55000)]

columns = ["id", "age", "income"]

df = spark.createDataFrame(data, columns)
df.show()
```

Output:
```
+---+---+------+
| id|age|income|
+---+---+------+
|  1| 25| 40000|
|  2| 30| 60000|
|  3| 28| 50000|
|  4| 35| 70000|
|  5| 32| 55000|
+---+---+------+
```

4. How to calculate the Median of a PySpark DataFrame column

There are several ways to calculate the median of a DataFrame column in PySpark. We'll explore two popular methods here.

A. How to calculate the Median of PySpark DataFrame columns using the PySpark approxQuantile() function

```python
# Calculate the median for multiple columns
def calculate_median(dataframe, columns):
    medians = {}
    for column in columns:
        # Use approxQuantile to get the median (0.5 quantile); 0.0 = exact
        median = dataframe.approxQuantile(column, [0.5], 0.0)[0]
        medians[column] = median
    return medians

columns = ["age", "income"]
median_dict = calculate_median(df, columns)

# Print the result
print("Median:")
for column, median in median_dict.items():
    print(f"{column}: {median}")
```

Output:
```
Median:
age: 30.0
income: 55000.0
```

B. How to calculate the Median of a PySpark DataFrame column using a PySpark Window function

Another method to calculate the median is the percent_rank() window function, which assigns a percentile rank to each row within a partition. This method is exact, unlike approxQuantile() with a nonzero relative error, but can be slower for large datasets because it sorts the whole column. Note that filtering on percent_rank() == 0.5 only matches when a row lands exactly at the middle rank, which happens here because the sample has an odd number of rows; for an even row count you would instead average the two rows whose ranks straddle 0.5.

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import percent_rank, col

# Calculate the median
window_spec = Window.orderBy("age")
df = df.withColumn("percent_rank", percent_rank().over(window_spec))

median = df.filter(col("percent_rank") == 0.5).collect()[0]["age"]

print("Median:", median)
```

Output:
```
Median: 30
```

Conclusion

We've explored different methods for calculating the median in PySpark. Depending on your use case and the size of your dataset, you can choose the method that best suits your needs. As you continue your journey with PySpark, understanding these techniques will undoubtedly serve you well.
