
PySpark Statistics – Understanding Variance: A Deep Dive with PySpark

Written by Jagdeesh | 4 min read

Let’s dive into the concept of variance, the formula used to calculate it, and how to compute it in PySpark, a powerful open-source data processing engine.

When analyzing data, it’s essential to understand the underlying concepts of variability and dispersion. Two key measures for this are variance and standard deviation; this post focuses on variance.

What is Variance?

Variance is a measure of dispersion in a dataset. It quantifies how far individual data points in a distribution are from the mean. In other words, it tells us how spread out the data points are. A high variance indicates that the data points are far from the mean, while a low variance signifies that the data points are close to the mean.

Population Variance: σ^2 = Σ (xi - μ)^2 / N

Sample Variance: s^2 = Σ (xi - x̄)^2 / (n - 1)

Where:

σ^2 is the population variance (σ is the population standard deviation)
s^2 is the sample variance (s is the sample standard deviation)
xi represents each data point in the dataset
μ is the mean (average) of the population
x̄ is the mean (average) of the sample
N is the number of data points in the population dataset
n is the number of data points in the sample dataset
Σ denotes the sum of the squared differences between each data point and the mean
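As a quick sanity check, the two formulas above can be evaluated in plain Python using the standard-library statistics module; the small dataset here is the same list used in the RDD examples later in this post:

```python
import statistics

data = [10, 20, 30, 40, 50]

mu = statistics.fmean(data)  # mean = 30.0
n = len(data)

# Population variance: divide the sum of squared deviations by N
pop_var = sum((x - mu) ** 2 for x in data) / n

# Sample variance: divide by (n - 1), i.e. Bessel's correction
samp_var = sum((x - mu) ** 2 for x in data) / (n - 1)

print(pop_var)   # 200.0
print(samp_var)  # 250.0

# The statistics module implements the same formulas
assert pop_var == statistics.pvariance(data)
assert samp_var == statistics.variance(data)
```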

Importance of Variance in statistics and machine learning:

A. Data analysis: Variance helps in identifying how much the data points deviate from the mean, providing insights into the data distribution and helping analysts make informed decisions.

B. Model performance: In machine learning, variance is used to measure the error due to the sensitivity of a model to small fluctuations in the training set. High variance indicates overfitting, while underfitting is characterized by high bias (typically accompanied by low variance).

C. Feature selection: Variance can be used as a criterion for feature selection in machine learning. Features with low variance may not contribute much to the model’s predictive power, and removing them can help in reducing the complexity of the model.
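To illustrate point C, here is a minimal pure-Python sketch (not PySpark) of variance-based feature selection; the feature names, values, and the 0.5 threshold are all illustrative assumptions:

```python
import statistics

# Hypothetical feature table: column name -> values
features = {
    "age":      [25, 32, 47, 51, 38],
    "constant": [1, 1, 1, 1, 1],      # zero variance, carries no signal
    "income":   [40, 55, 80, 95, 60],
}

threshold = 0.5  # illustrative cutoff

# Keep only features whose population variance exceeds the threshold
selected = {
    name: values
    for name, values in features.items()
    if statistics.pvariance(values) > threshold
}

print(sorted(selected))  # ['age', 'income']
```

Spark ML offers this idea at scale via pyspark.ml.feature.VarianceThresholdSelector (available since Spark 3.1).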

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

python
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Variance") \
    .getOrCreate()

2. Preparing the Sample Data

To demonstrate the different methods of calculating variance, we’ll use a sample dataset containing three columns. First, let’s create the DataFrame:

python
# Create a sample DataFrame
data = [("A", 10, 15), ("B", 20, 22), ("C", 30, 11), ("D", 40, 8), ("E", 50, 33)]
columns = ["Name", "Score_1", "Score_2"]
df = spark.createDataFrame(data, columns)

df.show()
Output:
+----+-------+-------+
|Name|Score_1|Score_2|
+----+-------+-------+
|   A|     10|     15|
|   B|     20|     22|
|   C|     30|     11|
|   D|     40|      8|
|   E|     50|     33|
+----+-------+-------+

3. How to calculate Variance of a list using PySpark RDD’s variance() function

python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)

variance = rdd.variance()
print("Population Variance:", variance)
Output:
Population Variance: 200.0

Manually calculating sample variance using RDD’s map() and reduce() functions

python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)

mean = rdd.mean()
n = rdd.count()

variance = rdd.map(lambda x: (x - mean) ** 2).reduce(lambda x, y: x + y) / (n - 1)
print("Sample Variance:", variance)
Output:
Sample Variance: 250.0
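For comparison, the same map/reduce pattern can be mirrored in plain Python with functools.reduce; this is only a local sketch of what the RDD version computes across partitions:

```python
from functools import reduce

data = [10, 20, 30, 40, 50]

mean = sum(data) / len(data)
n = len(data)

# map: square each deviation; reduce: sum them; then apply Bessel's correction
squared_devs = map(lambda x: (x - mean) ** 2, data)
sample_variance = reduce(lambda a, b: a + b, squared_devs) / (n - 1)

print("Sample Variance:", sample_variance)  # Sample Variance: 250.0
```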

4. How to calculate Variance of PySpark DataFrame columns

Here are different ways to calculate the variance of PySpark DataFrame columns:

A. Using DataFrame’s agg() function with the built-in var_pop() and var_samp() functions

python
from pyspark.sql.functions import var_pop, var_samp

variance_pop = df.agg(var_pop("Score_1").alias("Population Variance"))

variance_samp = df.agg(var_samp("Score_1").alias("Sample Variance"))

variance_pop.show()
variance_samp.show()
Output:
+-------------------+
|Population Variance|
+-------------------+
|              200.0|
+-------------------+

+---------------+
|Sample Variance|
+---------------+
|          250.0|
+---------------+

B. Using DataFrame’s describe() function and manually calculating variance

python
summary_stats = df.describe()

# Calculate mean and count from the summary statistics
mean = float(summary_stats.filter(col("summary") == "mean").select("Score_1").collect()[0][0])
count = int(summary_stats.filter(col("summary") == "count").select("Score_1").collect()[0][0])

variance = df.select(((col("Score_1") - mean) ** 2).alias("squared_difference")) \
    .agg({"squared_difference": "sum"}) \
    .collect()[0][0] / count

print("Population Variance:", variance)
Output:
Population Variance: 200.0

C. Using the selectExpr() function with SQL expressions

python
# Calculate population variance
variance_pop = df.selectExpr("var_pop(Score_1)").collect()[0][0]

# Calculate sample variance
variance_samp = df.selectExpr("var_samp(Score_1)").collect()[0][0]

# Print the result
print("Population Variance:", variance_pop)
print("Sample Variance:", variance_samp)
Output:
Population Variance: 200.0
Sample Variance: 250.0

Conclusion

Understanding variance is crucial for interpreting the variability and dispersion of data. PySpark offers a robust and scalable solution to compute these measures for large datasets. By following the steps outlined in this blog post, you can effectively analyze your data and draw meaningful insights from it.
