
PySpark Statistics Mean – Calculating the Mean Using PySpark: A Comprehensive Guide

Written by Jagdeesh | 4 min read

Let's explore different ways of calculating the mean using PySpark, helping you become proficient in no time.

As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python.

Concept of Mean:

The mean, also known as the average, is a measure of central tendency: the sum of a set of values divided by the number of values in that set. For example, for a set of five values:

Mean (µ) = (X1 + X2 + X3 + X4 + X5) / 5

In general:

Mean (µ) = Σ(xi) / N

Where:

µ represents the mean

Σ(xi) denotes the sum of all values (xi) in the dataset

N stands for the number of values in the dataset
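As a quick sanity check, the formula can be evaluated in a few lines of plain Python (the numbers here are made up for illustration):

```python
# Plain-Python illustration of the formula: mean = sum of values / count of values
data = [25, 30, 28, 35, 32]

mean = sum(data) / len(data)   # (25 + 30 + 28 + 35 + 32) / 5

print("Mean:", mean)  # Mean: 30.0
```

PySpark computes exactly this quantity, but distributes the summing and counting across the cluster.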

Now, let's dive into the different ways of calculating the mean using PySpark.

How is the Mean Used in Statistics and Machine Learning?

In both statistics and machine learning, the mean is a fundamental concept used for various purposes. Here's how the mean is used in each discipline:

Statistics:

In statistics, the mean is a measure of central tendency that helps to summarize a dataset with a single value. It is calculated by summing all the values in the dataset and dividing by the total number of values. The mean has several applications in statistics, including:

a. Descriptive Statistics: The mean provides a basic summary of the data, giving a sense of the overall central location of the values within the dataset.

b. Inferential Statistics: The mean is used in hypothesis testing, confidence intervals, and linear regression to make inferences about the population from which the sample is drawn.

c. Probability Distributions: The mean is a key parameter for many probability distributions, such as the normal, binomial, and Poisson distributions. The mean helps to characterize the center and spread of the distribution.

Machine Learning:

In machine learning, the mean plays a crucial role in various tasks, including:

a. Data Preprocessing: The mean is often used to impute missing values, normalize data, or center the data by subtracting the mean from each value.

b. Feature Engineering: The mean can be used as a feature in machine learning models. For example, the mean value of a variable within groups or over time can provide valuable information for the model.

c. Model Evaluation: In regression tasks, the mean squared error (MSE) or mean absolute error (MAE) are common metrics used to evaluate the performance of a model, both of which involve the mean.

d. Algorithm Development: The mean is used as a key component in various machine learning algorithms, such as k-means clustering, principal component analysis (PCA), and linear regression.
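To make the preprocessing use case concrete, here is a minimal sketch (plain Python, with made-up numbers) of mean-centering a feature, one of the steps mentioned above:

```python
# Mean-centering: subtract the mean so the feature is centered at zero
values = [40000, 60000, 50000, 70000, 55000]

mu = sum(values) / len(values)            # 55000.0
centered = [v - mu for v in values]

print(centered)       # [-15000.0, 5000.0, -5000.0, 15000.0, 0.0]
print(sum(centered))  # 0.0 -- a mean-centered feature always sums to zero
```

The same idea scales to a PySpark DataFrame: compute the column mean with an aggregation, then subtract it from each row.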

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

python
# findspark locates a local Spark installation (optional when pyspark is pip-installed)
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Calculating Mean with PySpark") \
    .getOrCreate()

2. How to Calculate the Mean of a List?

To calculate the mean of a Python list, you can first convert it into an RDD (Resilient Distributed Dataset) and then use the mean() function provided by PySpark.

python
# Your list of numbers
data = [1, 2, 3, 4, 5]

# Convert the list to an RDD
sc = spark.sparkContext
mean = sc.parallelize(data).mean()

print("Mean of the list is:", mean)
Output:
Mean of the list is: 3.0

3. Preparing the Sample Data

To demonstrate the different methods of calculating the mean, we’ll use a sample dataset containing three columns: id, age, and income. First, let’s load the data into a DataFrame:

python
data = [("1", 25, 40000),
        ("2", 30, 60000),
        ("3", 28, 50000),
        ("4", 35, 70000),
        ("5", 32, 55000)]

columns = ["id", "age", "income"]

df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+---+------+
| id|age|income|
+---+---+------+
|  1| 25| 40000|
|  2| 30| 60000|
|  3| 28| 50000|
|  4| 35| 70000|
|  5| 32| 55000|
+---+---+------+

4. How to Calculate the Mean of a PySpark DataFrame Column?

There are several ways to calculate the mean of a DataFrame column in PySpark. We’ll explore three popular methods here:

A. Using the agg() Function with mean()

python
from pyspark.sql.functions import mean

# Calculating the mean of a single column
mean_age = df.agg(mean("age"))

mean_age.show()
Output:
+--------+
|avg(age)|
+--------+
|    30.0|
+--------+
python
# Calculating the mean of multiple columns
result = df.agg(mean("age").alias("avg_age"), mean("income").alias("avg_income"))

# Show results
result.show()
Output:
+-------+----------+
|avg_age|avg_income|
+-------+----------+
|   30.0|   55000.0|
+-------+----------+
python
# Calculating the mean using the agg function and a dictionary
agg_dict = {"age": "mean", "income": "mean"}

result = df.agg(agg_dict)

# Show results
result.show()

Output:
+-----------+--------+
|avg(income)|avg(age)|
+-----------+--------+
|    55000.0|    30.0|
+-----------+--------+

B. Using the describe() Function

python
mean_age = float(
    df.describe("age")
      .filter("summary = 'mean'")
      .select("age")
      .collect()[0]["age"]
)

print(f"Mean Age: {mean_age}")
Output:
Mean Age: 30.0

Conclusion

We've explored three different methods for calculating the mean in PySpark. Depending on your use case and the size of your dataset, you can choose the method that best suits your needs. As you continue your journey with PySpark, understanding these techniques will undoubtedly serve you well.
