
PySpark Statistics – Understanding Variance: A Deep Dive with PySpark

Written by Jagdeesh | 4 min read

Let’s dive into the concept of variance, the formula used to calculate it, and how to compute it in PySpark, a powerful open-source data processing engine.

When analyzing data, it’s essential to understand the underlying concepts of variability and dispersion. Two key measures for this are variance and standard deviation; this post focuses on variance.

What is Variance?

Variance is a measure of dispersion in a dataset. It quantifies how far individual data points in a distribution are from the mean. In other words, it tells us how spread out the data points are. A high variance indicates that the data points are far from the mean, while a low variance signifies that the data points are close to the mean.

Population Variance: σ^2 = Σ (xi - μ)^2 / N

Sample Variance: s^2 = Σ (xi - x̄)^2 / (n - 1)

Where:

σ^2 is the population variance (σ is the population standard deviation)
s^2 is the sample variance (s is the sample standard deviation)
xi represents each data point in the dataset
μ is the mean (average) of the population
x̄ is the mean (average) of the sample
N is the number of data points in the population dataset
n is the number of data points in the sample dataset
Σ denotes the sum of the squared differences between each data point and the mean
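As a quick sanity check, the two formulas above can be evaluated in plain Python using the standard-library statistics module; the small dataset here is the same list used in the RDD examples later in this post:

```python
import statistics

data = [10, 20, 30, 40, 50]

mu = statistics.fmean(data)  # mean = 30.0
n = len(data)

# Population variance: divide the sum of squared deviations by N
pop_var = sum((x - mu) ** 2 for x in data) / n

# Sample variance: divide by (n - 1), i.e. Bessel's correction
samp_var = sum((x - mu) ** 2 for x in data) / (n - 1)

print(pop_var)   # 200.0
print(samp_var)  # 250.0

# The statistics module implements the same formulas
assert pop_var == statistics.pvariance(data)
assert samp_var == statistics.variance(data)
```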

Importance of Variance in statistics and machine learning:

A. Data analysis: Variance helps in identifying how much the data points deviate from the mean, providing insights into the data distribution and helping analysts make informed decisions.

B. Model performance: In machine learning, variance is used to measure the error due to the sensitivity of a model to small fluctuations in the training set. High variance indicates overfitting, while underfitting is characterized by high bias (typically accompanied by low variance).

C. Feature selection: Variance can be used as a criterion for feature selection in machine learning. Features with low variance may not contribute much to the model’s predictive power, and removing them can help in reducing the complexity of the model.
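To illustrate point C, here is a minimal pure-Python sketch (not PySpark) of variance-based feature selection; the feature names, values, and the 0.5 threshold are all illustrative assumptions:

```python
import statistics

# Hypothetical feature table: column name -> values
features = {
    "age":      [25, 32, 47, 51, 38],
    "constant": [1, 1, 1, 1, 1],      # zero variance, carries no signal
    "income":   [40, 55, 80, 95, 60],
}

threshold = 0.5  # illustrative cutoff

# Keep only features whose population variance exceeds the threshold
selected = {
    name: values
    for name, values in features.items()
    if statistics.pvariance(values) > threshold
}

print(sorted(selected))  # ['age', 'income']
```

Spark ML offers this idea at scale via pyspark.ml.feature.VarianceThresholdSelector (available since Spark 3.1).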

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

python
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Variance") \
    .getOrCreate()

2. Preparing the Sample Data

To demonstrate the different methods of calculating variance, we’ll use a sample dataset containing three columns. First, let’s create the DataFrame:

python
# Create a sample DataFrame
data = [("A", 10, 15), ("B", 20, 22), ("C", 30, 11), ("D", 40, 8), ("E", 50, 33)]
columns = ["Name", "Score_1", "Score_2"]
df = spark.createDataFrame(data, columns)

df.show()
Output:
+----+-------+-------+
|Name|Score_1|Score_2|
+----+-------+-------+
|   A|     10|     15|
|   B|     20|     22|
|   C|     30|     11|
|   D|     40|      8|
|   E|     50|     33|
+----+-------+-------+

3. How to calculate Variance of a list using PySpark RDD’s variance() function

python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)

variance = rdd.variance()
print("Population Variance:", variance)
Output:
Population Variance: 200.0

Manually calculating sample variance using RDD’s map() and reduce() functions

python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)

mean = rdd.mean()
n = rdd.count()

variance = rdd.map(lambda x: (x - mean) ** 2).reduce(lambda x, y: x + y) / (n - 1)
print("Sample Variance:", variance)
Output:
Sample Variance: 250.0
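For comparison, the same map/reduce pattern can be mirrored in plain Python with functools.reduce; this is only a local sketch of what the RDD version computes across partitions:

```python
from functools import reduce

data = [10, 20, 30, 40, 50]

mean = sum(data) / len(data)
n = len(data)

# map: square each deviation; reduce: sum them; then apply Bessel's correction
squared_devs = map(lambda x: (x - mean) ** 2, data)
sample_variance = reduce(lambda a, b: a + b, squared_devs) / (n - 1)

print("Sample Variance:", sample_variance)  # Sample Variance: 250.0
```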

4. How to calculate Variance of PySpark DataFrame columns

Here are different ways to calculate the variance of PySpark DataFrame columns:

A. Using DataFrame’s agg() function with the built-in var_pop() and var_samp() functions

python
from pyspark.sql.functions import var_pop, var_samp

variance_pop = df.agg(var_pop("Score_1").alias("Population Variance"))

variance_samp = df.agg(var_samp("Score_1").alias("Sample Variance"))

variance_pop.show()
variance_samp.show()
Output:
+-------------------+
|Population Variance|
+-------------------+
|              200.0|
+-------------------+

+---------------+
|Sample Variance|
+---------------+
|          250.0|
+---------------+

B. Using DataFrame’s describe() function and manually calculating variance

python
summary_stats = df.describe()

# Calculate mean and count from the summary statistics
mean = float(summary_stats.filter(col("summary") == "mean").select("Score_1").collect()[0][0])
count = int(summary_stats.filter(col("summary") == "count").select("Score_1").collect()[0][0])

variance = df.select(((col("Score_1") - mean) ** 2).alias("squared_difference")) \
    .agg({"squared_difference": "sum"}) \
    .collect()[0][0] / count

print("Population Variance:", variance)
Output:
Population Variance: 200.0

C. Using the selectExpr() function with SQL expressions

python
# Calculate population variance
variance_pop = df.selectExpr("var_pop(Score_1)").collect()[0][0]

# Calculate sample variance
variance_samp = df.selectExpr("var_samp(Score_1)").collect()[0][0]

# Print the result
print("Population Variance:", variance_pop)
print("Sample Variance:", variance_samp)
Output:
Population Variance: 200.0
Sample Variance: 250.0

Conclusion

Understanding variance is crucial for interpreting the variability and dispersion of data. PySpark offers a robust and scalable solution to compute these measures for large datasets. By following the steps outlined in this blog post, you can effectively analyze your data and draw meaningful insights from it.
