
PySpark Statistics Mean – Calculating the Mean Using PySpark: A Comprehensive Guide

Written by Jagdeesh | 4 min read

Let's explore different ways of calculating the mean using PySpark, helping you become proficient in no time.

As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python.

Concept of Mean:

The mean, also known as the average, is a measure of central tendency: the sum of a set of values divided by the number of values in that set. For example, for a set of five values:

Mean (µ) = (X1 + X2 + X3 + X4 + X5) / 5

In general:

Mean (µ) = Σ(xi) / N

Where:

µ represents the mean

Σ(xi) denotes the sum of all values (xi) in the dataset

N stands for the number of values in the dataset
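As a quick sanity check, the formula can be evaluated in a few lines of plain Python (the numbers here are made up for illustration):

```python
# Plain-Python illustration of the formula: mean = sum of values / count of values
data = [25, 30, 28, 35, 32]

mean = sum(data) / len(data)   # (25 + 30 + 28 + 35 + 32) / 5

print("Mean:", mean)  # Mean: 30.0
```

PySpark computes exactly this quantity, but distributes the summing and counting across the cluster.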

Now, let's dive into the different ways of calculating the mean using PySpark.

How is the Mean Used in Statistics and Machine Learning?

In both statistics and machine learning, the mean is a fundamental concept used for various purposes. Here's how the mean is used in each discipline:

Statistics:

In statistics, the mean is a measure of central tendency that helps to summarize a dataset with a single value. It is calculated by summing all the values in the dataset and dividing by the total number of values. The mean has several applications in statistics, including:

a. Descriptive Statistics: The mean provides a basic summary of the data, giving a sense of the overall central location of the values within the dataset.

b. Inferential Statistics: The mean is used in hypothesis testing, confidence intervals, and linear regression to make inferences about the population from which the sample is drawn.

c. Probability Distributions: The mean is a key parameter for many probability distributions, such as the normal, binomial, and Poisson distributions. The mean helps to characterize the center and spread of the distribution.

Machine Learning:

In machine learning, the mean plays a crucial role in various tasks, including:

a. Data Preprocessing: The mean is often used to impute missing values, normalize data, or center the data by subtracting the mean from each value.

b. Feature Engineering: The mean can be used as a feature in machine learning models. For example, the mean value of a variable within groups or over time can provide valuable information for the model.

c. Model Evaluation: In regression tasks, the mean squared error (MSE) or mean absolute error (MAE) are common metrics used to evaluate the performance of a model, both of which involve the mean.

d. Algorithm Development: The mean is used as a key component in various machine learning algorithms, such as k-means clustering, principal component analysis (PCA), and linear regression.
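To make the preprocessing use case concrete, here is a minimal sketch (plain Python, with made-up numbers) of mean-centering a feature, one of the steps mentioned above:

```python
# Mean-centering: subtract the mean so the feature is centered at zero
values = [40000, 60000, 50000, 70000, 55000]

mu = sum(values) / len(values)            # 55000.0
centered = [v - mu for v in values]

print(centered)       # [-15000.0, 5000.0, -5000.0, 15000.0, 0.0]
print(sum(centered))  # 0.0 -- a mean-centered feature always sums to zero
```

The same idea scales to a PySpark DataFrame: compute the column mean with an aggregation, then subtract it from each row.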

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

python
# findspark locates a local Spark installation (optional when pyspark is pip-installed)
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Calculating Mean with PySpark") \
    .getOrCreate()

2. How to Calculate the Mean of a List?

To calculate the mean of a Python list, you can first convert it into an RDD (Resilient Distributed Dataset) and then use the mean() function provided by PySpark.

python
# Your list of numbers
data = [1, 2, 3, 4, 5]

# Convert the list to an RDD
sc = spark.sparkContext
mean = sc.parallelize(data).mean()

print("Mean of the list is:", mean)
Output:
Mean of the list is: 3.0

3. Preparing the Sample Data

To demonstrate the different methods of calculating the mean, we’ll use a sample dataset containing three columns: id, age, and income. First, let’s load the data into a DataFrame:

python
data = [("1", 25, 40000),
        ("2", 30, 60000),
        ("3", 28, 50000),
        ("4", 35, 70000),
        ("5", 32, 55000)]

columns = ["id", "age", "income"]

df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+---+------+
| id|age|income|
+---+---+------+
|  1| 25| 40000|
|  2| 30| 60000|
|  3| 28| 50000|
|  4| 35| 70000|
|  5| 32| 55000|
+---+---+------+

4. How to Calculate the Mean of a PySpark DataFrame Column?

There are several ways to calculate the mean of a DataFrame column in PySpark. We’ll explore three popular methods here:

A. Using the agg() Function with mean()

python
from pyspark.sql.functions import mean

# Calculating the mean of a single column
mean_age = df.agg(mean("age"))

mean_age.show()
Output:
+--------+
|avg(age)|
+--------+
|    30.0|
+--------+
python
# Calculating the mean of multiple columns
result = df.agg(mean("age").alias("avg_age"), mean("income").alias("avg_income"))

# Show results
result.show()
Output:
+-------+----------+
|avg_age|avg_income|
+-------+----------+
|   30.0|   55000.0|
+-------+----------+
python
# Calculating the mean using the agg function and a dictionary
agg_dict = {"age": "mean", "income": "mean"}

result = df.agg(agg_dict)

# Show results
result.show()

Output:
+-----------+--------+
|avg(income)|avg(age)|
+-----------+--------+
|    55000.0|    30.0|
+-----------+--------+

B. Using the describe() Function

python
mean_age = float(
    df.describe("age")
      .filter("summary = 'mean'")
      .select("age")
      .collect()[0]["age"]
)

print(f"Mean Age: {mean_age}")
Output:
Mean Age: 30.0

Conclusion

We've explored three different methods for calculating the mean in PySpark. Depending on your use case and the size of your dataset, you can choose the method that best suits your needs. As you continue your journey with PySpark, understanding these techniques will undoubtedly serve you well.
