Understand the key differences between mutual information and cross-entropy. Learn how each measures relationships between probability distributions in ML and information theory.
Cross-entropy is a measure of error, while mutual information measures the shared information between two variables. Both concepts come from information theory, but they serve different purposes and are applied in different contexts.
Let’s understand both in complete detail.
Cross-entropy measures the difference between two probability distributions. Specifically, it quantifies the additional information needed to encode samples from a true distribution $P$ using a code optimized for an estimated distribution $Q$.
Cross-entropy tells you how “off” your predictions are compared to the true outcomes. For example: If you’re trying to guess the weather (sunny or rainy), and you predict sunny with 90% confidence but it’s actually rainy, cross-entropy would give you a score that shows how far off your prediction was.
It’s often used as a loss function in machine learning models to help improve predictions by showing how wrong they are.
$$ H(P, Q) = -\sum_x P(x) \log Q(x) $$
Here, $P(x)$ is the true distribution, and $Q(x)$ is the estimated distribution.
Cross-entropy is commonly used as a loss function in machine learning, particularly in classification tasks where it measures how well the predicted probability distribution (from the model) matches the true distribution (ground truth).
Interpretation: A lower cross-entropy value indicates that the predicted distribution $Q$ is closer to the true distribution $P$.
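The definition above can be sketched in a few lines of Python. The distributions below are made-up examples; the natural logarithm is used, so the result is in nats.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log Q(x), using the natural log (nats)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical 3-class example: one-hot ground truth vs. two model outputs.
p_true = [1.0, 0.0, 0.0]
q_good = [0.8, 0.1, 0.1]   # confident, correct prediction
q_bad  = [0.1, 0.8, 0.1]   # confident, wrong prediction

print(cross_entropy(p_true, q_good))  # lower: prediction close to the truth
print(cross_entropy(p_true, q_bad))   # higher: prediction far from the truth
```

Note the asymmetry: swapping `p` and `q` generally changes the value, which is one way cross-entropy differs from mutual information.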
Mutual information measures the amount of information shared between two random variables. It quantifies the reduction in uncertainty about one variable given knowledge of the other.
It tells you how much knowing one thing helps you know another.
For example: If you know it’s raining, mutual information tells you how much that knowledge helps you predict if people are carrying umbrellas. If everyone carries an umbrella when it rains, the mutual information is high.
So, in simple words, it is used to see how strongly two things are related or to pick out important features in data.
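The rain/umbrella intuition can be made numerical. The observation counts below are invented for illustration: rain and umbrellas usually co-occur, so the estimated mutual information comes out well above zero.

```python
import math
from collections import Counter

# Hypothetical observations: (rain, umbrella) pairs, 1 = yes, 0 = no.
obs = [(1, 1)] * 40 + [(1, 0)] * 5 + [(0, 1)] * 10 + [(0, 0)] * 45

n = len(obs)
joint = Counter(obs)                     # counts for each (rain, umbrella) pair
rain = Counter(r for r, _ in obs)        # marginal counts for rain
umb = Counter(u for _, u in obs)         # marginal counts for umbrella

# I(X; Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) ), in bits
mi = sum(
    (c / n) * math.log2((c / n) / ((rain[r] / n) * (umb[u] / n)))
    for (r, u), c in joint.items()
)
print(round(mi, 3))  # well above 0: knowing rain helps predict umbrellas
```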
$$
I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
$$
Here, $P(x, y)$ is the joint distribution of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal distributions.
Mutual information is used in feature selection, image registration, clustering, and other areas where understanding the dependence or relationship between variables is crucial.
Mutual information is non-negative, with a higher value indicating a stronger relationship between the variables. If $X$ and $Y$ are independent, their mutual information is zero.
Mutual Information: Measures the shared information between two variables. It is used to quantify the dependency between variables and is often employed in feature selection and information retrieval. It is symmetric: $I(X; Y) = I(Y; X)$.
Let’s explore both concepts with simple mathematical examples.
Imagine you have a binary classification problem where you’re predicting whether an email is spam (1) or not spam (0). Suppose the true label is $y = 1$ (spam), so $P(y) = 1$, and your model predicts spam with probability $Q(y) = 0.8$.
The cross-entropy for a binary classification problem is given by:
$$
H(P, Q) = -[P(y) \log Q(y) + (1 - P(y)) \log (1 - Q(y))]
$$
Substituting the values:
$$
H(P, Q) = -[1 \cdot \log(0.8) + 0 \cdot \log(0.2)]
$$
$$
H(P, Q) = -[\log(0.8)]
$$
$$
H(P, Q) \approx -[-0.223] \approx 0.223 \text{ nats}
$$
So, the cross-entropy is about 0.223 nats (the natural logarithm was used), which quantifies the penalty for the model’s imperfect confidence in the correct class.
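The arithmetic is easy to check in Python, using the natural log so the result is in nats:

```python
import math

p = 1.0   # true label: spam
q = 0.8   # model's predicted probability of spam

# Binary cross-entropy: H(P, Q) = -[p*log(q) + (1 - p)*log(1 - q)]
h = -(p * math.log(q) + (1 - p) * math.log(1 - q))
print(round(h, 3))  # 0.223
```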
Now, let’s consider two binary random variables $X$ and $Y$ with the joint distribution $P(X=1, Y=1) = 0.4$, $P(X=1, Y=0) = 0.1$, $P(X=0, Y=1) = 0.1$, and $P(X=0, Y=0) = 0.4$. The marginals are then $P(X=1) = P(X=0) = 0.5$ and $P(Y=1) = P(Y=0) = 0.5$.
Mutual information $I(X; Y)$ is calculated as (using base-2 logarithms, so the result is in bits):
$$
I(X; Y) = \sum_{x,y} P(x, y) \log_2 \frac{P(x, y)}{P(x)P(y)}
$$
Let’s compute it step by step. For $(X=1, Y=1)$:
$$
0.4 \log_2 \frac{0.4}{0.5 \cdot 0.5} = 0.4 \log_2(1.6) \approx 0.271
$$
Similarly, the term for $(X=1, Y=0)$ is $0.1 \log_2(0.4) \approx -0.132$, and by symmetry $(X=0, Y=1)$ contributes $\approx -0.132$ and $(X=0, Y=0)$ contributes $\approx 0.271$.
Summing these up gives the total mutual information:
$$
I(X; Y) \approx 0.271 - 0.132 - 0.132 + 0.271 = 0.278 \text{ bits}
$$
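A joint-table computation like this can be checked in Python. The joint distribution below is an illustrative example; base-2 logs are used, so the result is in bits.

```python
import math

# Example joint distribution P(x, y) for binary X and Y
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

# Marginals P(x) and P(y) derived from the joint distribution
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X; Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) )
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(round(mi, 3))  # 0.278
```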
The cross-entropy example shows how much “wrongness” there is in predicting the value of a given variable.
The mutual information example shows how much knowing one variable helps in predicting the other.
In short, cross-entropy is a measure of error, while mutual information measures the shared information between two variables.