F Statistic Formula – Explained

Written by Selva Prabhakaran | 6 min read

The F statistic is used in statistical hypothesis testing to determine if there are significant differences between group means. It is most commonly used in ANOVA (Analysis of Variance) but also appears in regression analysis.

Let’s understand the F statistic in the context of ANOVA first.

1. F Statistic in ANOVA (Analysis of Variance)

When you want to check if different groups (like different treatment groups in an experiment) have different average values, you use ANOVA.

What the F Statistic Does: It compares how much the group averages differ from each other (between groups) to how much variation exists within each group.

Between Groups: First, it measures how much the group averages (means) differ from the overall average of all the data.

Within Groups: Then, it measures how much each individual data point differs from its own group’s average.

F Statistic: It divides the variation between groups by the variation within groups. If the groups are truly different, this ratio will be large; if not, it will be small.

In simple terms: The F statistic tells you if the differences in group averages are big enough to be considered significant, or if they could just be due to random chance.

In ANOVA, the F statistic is used to test if there are significant differences between group means. The formula for the F statistic is:

$$F = \frac{\text{MSB}}{\text{MSW}}$$

where:
Mean Square Between (MSB) is the average variation between group means:
$$ \text{MSB} = \frac{\text{SSB}}{\text{dfB}} $$

Mean Square Within (MSW) is the average variation within groups:
$$ \text{MSW} = \frac{\text{SSW}}{\text{dfW}} $$

Let’s break these down further:

Sum of Squares Between (SSB):

SSB measures the variation due to the difference between group means. It is calculated as:
$$\text{SSB} = \sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2$$

where:
$k$ is the number of groups.
$n_i$ is the number of observations in group $i$.
$\bar{Y}_i$ is the mean of group $i$.
$\bar{Y}$ is the overall mean of all observations.

Degrees of Freedom Between (dfB):

dfB is the number of groups minus one:
$$\text{dfB} = k - 1$$

Sum of Squares Within (SSW):

SSW measures the variation within each group. It is calculated as:
$$\text{SSW} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$$

where:
$Y_{ij}$ is the $j$-th observation in group $i$.

Degrees of Freedom Within (dfW):

dfW is the total number of observations minus the number of groups:
$$\text{dfW} = N - k$$

where $N$ is the total number of observations.
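Putting the pieces above together, here is a minimal sketch in Python that computes SSB, SSW, and the F statistic directly from the formulas; the three groups of numbers are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: three groups of observations (assumed values)
groups = [
    np.array([5.1, 4.8, 6.2, 5.5]),
    np.array([6.9, 7.3, 6.5, 7.0]),
    np.array([5.9, 6.1, 5.4, 6.3]),
]

k = len(groups)                       # number of groups
N = sum(len(g) for g in groups)       # total number of observations
grand_mean = np.concatenate(groups).mean()

# SSB: size-weighted squared deviations of group means from the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# SSW: squared deviations of each observation from its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ssb / (k - 1)   # dfB = k - 1
msw = ssw / (N - k)   # dfW = N - k
F = msb / msw
print(F)
```

With these particular numbers the group means are well separated relative to the spread inside each group, so F comes out well above 1.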

Example Scenario

Scenario: Suppose you’re a researcher studying the effects of three different diets on weight loss. You have three groups of people, each following a different diet for 6 months. At the end of the study, you measure the average weight loss in each group.

Objective: You want to determine if the average weight loss differs significantly between the three diet groups.

Steps:

  1. Calculate the Mean Weight Loss for Each Group: Find the average weight loss for each diet group.
  2. Calculate the Overall Mean Weight Loss: Combine all the data and find the average weight loss across all groups.
  3. Calculate the Variability:
     - Between Groups: Measure how much the average weight loss of each group differs from the overall average.
     - Within Groups: Measure how much weight loss varies within each group itself.
  4. Compute the F Statistic: Compare the variability between the groups to the variability within the groups.
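In practice you rarely compute these sums by hand: `scipy.stats.f_oneway` runs a one-way ANOVA directly. A sketch of the steps above, with made-up weight-loss numbers (in kg) for the three diet groups:

```python
from scipy.stats import f_oneway

# Hypothetical weight loss (kg) per person in each diet group (assumed values)
diet_a = [3.2, 4.1, 2.8, 3.9, 3.5]
diet_b = [5.0, 5.6, 4.8, 5.3, 5.1]
diet_c = [3.8, 4.0, 4.4, 3.6, 4.2]

# f_oneway handles the group means, grand mean, SSB/SSW, and F in one call
result = f_oneway(diet_a, diet_b, diet_c)
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

A p-value below your chosen significance level (commonly 0.05) lets you reject the hypothesis that all three diets produce the same average weight loss.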

Interpretation:

If the F statistic is large, it suggests that the differences in average weight loss between the diet groups are significant. If it’s small, the diet groups might not be very different from each other.

2. F Statistic in Regression Analysis

When you want to see if one or more variables (like hours studied) can predict another variable (like test scores), you use regression analysis.

What the F Statistic Does: It checks if the overall model (the combination of predictors) explains a significant amount of the variation in the outcome.

Steps:

Explained Variation: It looks at how much of the variation in the outcome can be explained by the predictors.

Unexplained Variation: It also looks at how much variation is left unexplained by the model.

F Statistic: It divides the explained variation by the unexplained variation. If the predictors explain a lot of the outcome, this number will be large. If not, it will be small.

In simple terms: The F statistic tells you whether your model (the predictors) does a good job of explaining the outcome, or whether the apparent improvement could just be due to random chance.

In regression analysis, the F statistic is used to test if the overall regression model is significant. The formula for the F statistic is:

$$F = \frac{\text{MSR}}{\text{MSRes}}$$

where:
Mean Square Regression (MSR) is the average variation explained by the regression model:
$$\text{MSR} = \frac{\text{SSR}}{\text{dfR}}$$

Mean Square Residual (MSRes) is the average variation not explained by the model:
$$\text{MSRes} = \frac{\text{SSE}}{\text{dfE}}$$

Let’s break these down further:

Sum of Squares Regression (SSR):

SSR measures the variation explained by the regression model. It is calculated as:
$$\text{SSR} = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2$$

where:
$\hat{Y}_i$ is the predicted value for observation $i$.
$\bar{Y}$ is the mean of the observed values.

Degrees of Freedom Regression (dfR):

dfR is the number of predictors in the model, not counting the intercept:
$$\text{dfR} = p$$

where $p$ is the number of predictors in the model.

Sum of Squares Residual (SSE):

SSE measures the variation not explained by the model. It is calculated as:
$$\text{SSE} = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$$

where:
$Y_i$ is the observed value for observation $i$.

Degrees of Freedom Residual (dfE):

dfE is the total number of observations minus the number of predictors:
$$\text{dfE} = N - p - 1$$

where N is the number of observations and p is the number of predictors.
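Here is a minimal sketch tying these formulas together for a simple regression with one predictor. The advertising/sales numbers are made up, and the line is fit with `numpy.polyfit`:

```python
import numpy as np

# Hypothetical data: advertising spend vs. sales (assumed values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # ad spend
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])  # sales

# Fit y = b0 + b1*x by least squares (polyfit returns highest degree first)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

N, p = len(y), 1                          # observations, predictors
ssr = ((y_hat - y.mean()) ** 2).sum()     # explained variation (SSR)
sse = ((y - y_hat) ** 2).sum()            # residual variation (SSE)

msr = ssr / p                 # dfR = p
msres = sse / (N - p - 1)     # dfE = N - p - 1
F = msr / msres
print(F)
```

Because this made-up data is nearly perfectly linear, SSE is tiny relative to SSR and F is very large.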

In both contexts, the F statistic tests a null hypothesis: in ANOVA, that all group means are equal; in regression, that all coefficients (other than the intercept) are zero. The alternative is that at least one group mean differs, or that at least one coefficient is nonzero.
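A large F by itself is not a verdict; you compare it against the F distribution with the appropriate degrees of freedom, (dfB, dfW) for ANOVA or (dfR, dfE) for regression. A sketch using `scipy.stats.f` with hypothetical numbers:

```python
from scipy.stats import f

# Hypothetical values: F = 5.39 from an ANOVA with dfB = 2 and dfW = 27
F_stat, df1, df2 = 5.39, 2, 27

# Survival function = P(F >= F_stat) under the null hypothesis
p_value = f.sf(F_stat, df1, df2)
print(round(p_value, 4))
```

If this p-value falls below your significance level (say 0.05), you reject the null hypothesis.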

Example Scenario

Scenario: Imagine you’re an analyst trying to predict a company’s sales based on their advertising expenditure. You collect data on monthly sales and advertising spend for the past year.

Objective: You want to see if advertising expenditure significantly predicts sales.

Steps:

  1. Fit a Regression Model: Use the data to create a regression model where advertising spend is the predictor and sales is the outcome.
  2. Calculate the Explained Variation: Measure how much of the variation in sales is explained by advertising expenditure.
  3. Calculate the Unexplained Variation: Measure how much variation in sales is not explained by the model.
  4. Compute the F Statistic: Compare the explained variation to the unexplained variation.
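The steps above can be sketched with `scipy.stats.linregress` and made-up monthly figures. With a single predictor, dfR = 1 and dfE = N - 2, so the model F statistic can be written as $F = R^2 \,/\, [(1 - R^2)/(N - 2)]$:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical monthly data: ad spend vs. sales (assumed values)
ad_spend = np.array([10, 12, 9, 15, 14, 11, 16, 13, 17, 12, 18, 15], dtype=float)
sales = np.array([22, 25, 20, 31, 28, 24, 33, 26, 35, 23, 36, 30], dtype=float)

fit = linregress(ad_spend, sales)
N = len(sales)
r2 = fit.rvalue ** 2

# F = (R^2 / dfR) / ((1 - R^2) / dfE), with dfR = 1 and dfE = N - 2
F = (r2 / 1) / ((1 - r2) / (N - 2))
print(f"R^2 = {r2:.3f}, F = {F:.1f}, p = {fit.pvalue:.2e}")
```

For one predictor, the p-value reported by `linregress` for the slope coincides with the p-value of this model F test.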

Interpretation:

If the F statistic is large, it suggests that advertising expenditure significantly improves the prediction of sales. If it’s small, advertising spend might not be a strong predictor of sales.

In both cases, the F statistic helps you understand if your results are likely due to a real effect or if they could be due to random chance.
