PySpark Statistics Variance – Understanding Variance: A Deep Dive with PySpark
Let’s dive into the concept of variance, the formula used to calculate it, and how to compute it in PySpark, a powerful open-source data processing engine.
When analyzing data, it’s essential to understand the underlying concepts of variability and dispersion. A key measure of this dispersion is variance.
What is Variance?
Variance is a measure of dispersion in a dataset. It quantifies how far individual data points in a distribution are from the mean. In other words, it tells us how spread out the data points are. A high variance indicates that the data points are far from the mean, while a low variance signifies that the data points are close to the mean.
Population Variance : σ^2 = Σ (xi – μ)^2 / N
Sample Variance : s^2 = Σ (xi – x̄)^2 / (n – 1)
Where:
σ is the standard deviation of the population (so σ² is the population variance)
s is the standard deviation of the sample (so s² is the sample variance)
xi represents each data point in the dataset
μ is the mean (average) of the population
x̄ is the mean (average) of the sample
N is the number of data points in the population dataset
n is the number of data points in the sample dataset
Σ denotes the sum of the squared differences between each data point and the mean
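Before bringing in Spark, the two formulas can be sanity-checked in plain Python. This short sketch uses the same list of values that appears in the RDD examples later in this post:

```python
# Plain-Python check of the population and sample variance formulas
data = [10, 20, 30, 40, 50]

n = len(data)
mean = sum(data) / n  # mu (population) or x-bar (sample) = 30.0

# Sum of squared differences from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_diffs) / n       # sigma^2: divide by N
sample_variance = sum(squared_diffs) / (n - 1)     # s^2: divide by n - 1

print("Population Variance:", population_variance)  # 200.0
print("Sample Variance:", sample_variance)          # 250.0
```

The only difference between the two formulas is the denominator: N for the population, n – 1 for the sample (Bessel’s correction).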
Importance of Variance in statistics and machine learning:
A. Data analysis: Variance helps in identifying how much the data points deviate from the mean, providing insights into the data distribution and helping analysts make informed decisions.
B. Model performance: In machine learning, variance is used to measure the error due to the sensitivity of a model to small fluctuations in the training set. High variance indicates overfitting, while low variance (typically paired with high bias) suggests underfitting.
C. Feature selection: Variance can be used as a criterion for feature selection in machine learning. Features with low variance may not contribute much to the model’s predictive power, and removing them can help in reducing the complexity of the model.
1. Import required libraries and initialize SparkSession
First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.
python
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, col
spark = SparkSession.builder \
.appName("Variance") \
.getOrCreate()
2. Preparing the Sample Data
To demonstrate the different methods of calculating the Variance, we’ll use a sample dataset containing three columns. First, let’s load the data into a DataFrame:
python
# Create a sample DataFrame
data = [("A", 10, 15), ("B", 20, 22), ("C", 30, 11), ("D", 40, 8), ("E", 50, 33)]
columns = ["Name", "Score_1", "Score_2"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----+-------+-------+
|Name|Score_1|Score_2|
+----+-------+-------+
| A| 10| 15|
| B| 20| 22|
| C| 30| 11|
| D| 40| 8|
| E| 50| 33|
+----+-------+-------+
3. How to calculate the Variance of a list using PySpark RDD’s variance() function
python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)
variance = rdd.variance()
print("Population Variance:", variance)
Output:
Population Variance: 200.0
Manually calculating sample variance using RDD’s map() and reduce() functions
python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)
mean_val = rdd.mean()  # avoid shadowing pyspark.sql.functions.mean
n = rdd.count()
variance = rdd.map(lambda x: (x - mean_val) ** 2).reduce(lambda x, y: x + y) / (n - 1)
print("Sample Variance:", variance)
Output:
Sample Variance: 250.0
4. How to calculate Variance of PySpark DataFrame columns
Here are a few different ways to calculate the variance of DataFrame columns in PySpark:
A. Using DataFrame’s agg() function with the built-in var_pop() and var_samp() functions
python
from pyspark.sql.functions import var_pop
from pyspark.sql.functions import var_samp
variance_pop = df.agg(var_pop("Score_1").alias("Population Variance"))
variance_samp = df.agg(var_samp("Score_1").alias("Sample Variance"))
variance_pop.show()
variance_samp.show()
Output:
+-------------------+
|Population Variance|
+-------------------+
| 200.0|
+-------------------+
+---------------+
|Sample Variance|
+---------------+
| 250.0|
+---------------+
B. Using DataFrame’s describe() function and manually calculating variance
python
summary_stats = df.describe()
# Calculate mean and count from the summary statistics
mean_val = float(summary_stats.filter(col("summary") == "mean").select("Score_1").collect()[0][0])
count = int(summary_stats.filter(col("summary") == "count").select("Score_1").collect()[0][0])
# Dividing the sum of squared differences by count (N) gives the population variance
variance = df.select(((col("Score_1") - mean_val) ** 2).alias("squared_difference")) \
    .agg({"squared_difference": "sum"}) \
    .collect()[0][0] / count
print("Population Variance:", variance)
Output:
Population Variance: 200.0
C. Using the selectExpr() function with SQL expressions
python
# Calculate variance (population)
variance_pop = df.selectExpr("var_pop(Score_1)").collect()[0][0]
# Calculate variance (sample)
variance_samp = df.selectExpr("var_samp(Score_1)").collect()[0][0]
# Print the result
print("Population Variance:", variance_pop)
print("Sample Variance:", variance_samp)
Output:
Population Variance: 200.0
Sample Variance: 250.0
Conclusion
Understanding variance is crucial for interpreting the variability and dispersion of data. PySpark offers a robust and scalable solution to compute these measures for large datasets. By following the steps outlined in this blog post, you can effectively analyze your data and draw meaningful insights from it.