
PySpark Statistics Median – Calculating the Median in PySpark: A Comprehensive Guide for Everyone

Written by Jagdeesh | 4 min read

Let's explore different ways of calculating the median using PySpark, helping you become an expert.

As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python.

How to Calculate the Median?

The median is a measure of central tendency that represents the middle value in a dataset when the values are sorted in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

1) Arrange the data in ascending or descending order.

2) Determine the position of the median using the formula:

Median Position (P) = (n + 1) / 2

where 'n' is the total number of values in the dataset.

3) If the dataset has an odd number of values, the median is the value at the median position.

4) If the dataset has an even number of values, the median is the average of the values at positions (P – 0.5) and (P + 0.5).
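The steps above can be sketched in plain Python (a minimal illustration, independent of PySpark; the function name `manual_median` is just for this example):

```python
def manual_median(values):
    """Compute the median following the four steps above."""
    ordered = sorted(values)            # step 1: sort the data
    n = len(ordered)
    p = (n + 1) / 2                     # step 2: median position (1-based)
    if n % 2 == 1:                      # step 3: odd count -> middle value
        return ordered[int(p) - 1]
    # step 4: even count -> average of positions (P - 0.5) and (P + 0.5)
    lower = ordered[int(p - 0.5) - 1]
    upper = ordered[int(p + 0.5) - 1]
    return (lower + upper) / 2

print(manual_median([7, 1, 5, 3, 9]))      # odd count -> 5
print(manual_median([7, 1, 5, 3, 9, 11]))  # even count -> (5 + 7) / 2 = 6.0
```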

How to use Median in Statistics and Machine Learning

1. Descriptive Statistics: The median is used to describe the central tendency of a dataset, offering a more accurate representation of the data’s center than the mean in cases of skewed distributions or the presence of outliers.

2. Exploratory Data Analysis (EDA): The median is often used during EDA to identify trends, patterns, or potential anomalies in the data.

3. Machine Learning Algorithms: In machine learning, the median is used for data preprocessing tasks such as filling missing values or normalizing data. It can also be utilized as a robust loss function in regression tasks.

4. Non-parametric methods: In machine learning, non-parametric methods make fewer assumptions about the underlying data distribution. The median is often used in these methods, such as the k-nearest neighbors algorithm, where it can be used to predict the target variable by taking the median of the k-nearest neighbors.

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Calculating Median with PySpark") \
    .getOrCreate()
```

2. How to calculate the Median of a list

A. How to calculate the Median of a list using RDD (Resilient Distributed Dataset)

```python
sc = spark.sparkContext

data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
rdd = sc.parallelize(data)

sorted_rdd = rdd.sortBy(lambda x: x)
n = sorted_rdd.count()

# take(k) returns the first k elements in sorted order
if n % 2 == 0:
    lower = sorted_rdd.take(n // 2)[-1]      # last value of the first half
    upper = sorted_rdd.take(n // 2 + 1)[-1]  # first value of the second half
    median = (lower + upper) / 2
else:
    median = sorted_rdd.take(n // 2 + 1)[-1]

print(f"Median: {median}")
```

Output:
```
Median: 10.0
```

B. How to calculate the Median of a list using the PySpark approxQuantile() function

You first need to convert the list into a DataFrame and then use the approxQuantile() function.

```python
# Create a list
data_list = [1, 9, 3, 4, 5, 7, 11, 8, 2, 10, 6]

# Convert the list into a DataFrame
data_df = spark.createDataFrame([(value,) for value in data_list], ["values"])

# 0.5 is the median quantile; 0.01 is the allowed relative error
median = data_df.approxQuantile("values", [0.5], 0.01)

# Print the results
for value in median:
    print(f"Median: {value}")
```

Output:
```
Median: 6.0
```

3. Preparing the Sample Data

To demonstrate the different methods of calculating the median, we’ll use a sample dataset containing three columns: id, age, and income. First, let’s load the data into a DataFrame:

```python
data = [("1", 25, 40000),
        ("2", 30, 60000),
        ("3", 28, 50000),
        ("4", 35, 70000),
        ("5", 32, 55000)]

columns = ["id", "age", "income"]

df = spark.createDataFrame(data, columns)
df.show()
```

Output:
```
+---+---+------+
| id|age|income|
+---+---+------+
|  1| 25| 40000|
|  2| 30| 60000|
|  3| 28| 50000|
|  4| 35| 70000|
|  5| 32| 55000|
+---+---+------+
```

4. How to calculate the Median of a PySpark DataFrame column

There are several ways to calculate the median of a DataFrame column in PySpark. We'll explore two popular methods here.

A. How to calculate the Median of PySpark DataFrame columns using the PySpark approxQuantile() function

```python
# Calculate the median for multiple columns
def calculate_median(dataframe, columns):
    medians = {}
    for column in columns:
        # Use approxQuantile to get the median (0.5 quantile); 0.0 = exact
        median = dataframe.approxQuantile(column, [0.5], 0.0)[0]
        medians[column] = median
    return medians

columns = ["age", "income"]
median_dict = calculate_median(df, columns)

# Print the result
print("Median:")
for column, median in median_dict.items():
    print(f"{column}: {median}")
```

Output:
```
Median:
age: 30.0
income: 55000.0
```

B. How to calculate the Median of a PySpark DataFrame column using a PySpark Window function

Another method to calculate the median is the percent_rank() window function, which assigns a percentile rank to each row within a partition. This method is exact, unlike approxQuantile() with a nonzero relative error, but can be slower for large datasets because it sorts the whole column. Note that filtering on percent_rank() == 0.5 only matches when a row lands exactly at the middle rank, which happens here because the sample has an odd number of rows; for an even row count you would instead average the two rows whose ranks straddle 0.5.

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import percent_rank, col

# Calculate the median
window_spec = Window.orderBy("age")
df = df.withColumn("percent_rank", percent_rank().over(window_spec))

median = df.filter(col("percent_rank") == 0.5).collect()[0]["age"]

print("Median:", median)
```

Output:
```
Median: 30
```

Conclusion

We've explored different methods for calculating the median in PySpark. Depending on your use case and the size of your dataset, you can choose the method that best suits your needs. As you continue your journey with PySpark, understanding these techniques will undoubtedly serve you well.
