Understand the key differences between mutual information and cross-entropy. Learn how each measures relationships between probability distributions in ML and information theory.
Cross-entropy is a measure of error, while mutual information measures the shared information between two variables. Both concepts come from information theory, but they serve different purposes and are applied in different contexts.
Let’s understand both in complete detail.
Cross-entropy measures the difference between two probability distributions. Specifically, it quantifies the additional information needed to encode samples from a true distribution $P$ using a code optimized for an estimated distribution $Q$.
Cross-entropy tells you how “off” your predictions are compared to the true outcomes. For example: If you’re trying to guess the weather (sunny or rainy), and you predict sunny with 90% confidence but it’s actually rainy, cross-entropy would give you a score that shows how far off your prediction was.
It’s often used as a loss function in machine learning models to help improve predictions by showing how wrong they are.
$$ H(P, Q) = -\sum_x P(x) \log Q(x) $$
Here, $P(x)$ is the true distribution, and $Q(x)$ is the estimated distribution.
Cross-entropy is commonly used as a loss function in machine learning, particularly in classification tasks where it measures how well the predicted probability distribution (from the model) matches the true distribution (ground truth).
Interpretation: A lower cross-entropy value indicates that the predicted distribution $Q$ is closer to the true distribution $P$.
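The definition above can be sketched in a few lines of Python. The distributions below are made-up examples; the natural logarithm is used, so the result is in nats.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log Q(x), using the natural log (nats)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical 3-class example: one-hot ground truth vs. two model outputs.
p_true = [1.0, 0.0, 0.0]
q_good = [0.8, 0.1, 0.1]   # confident, correct prediction
q_bad  = [0.1, 0.8, 0.1]   # confident, wrong prediction

print(cross_entropy(p_true, q_good))  # lower: prediction close to the truth
print(cross_entropy(p_true, q_bad))   # higher: prediction far from the truth
```

Note the asymmetry: swapping `p` and `q` generally changes the value, which is one way cross-entropy differs from mutual information.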
Mutual information measures the amount of information shared between two random variables. It quantifies the reduction in uncertainty about one variable given knowledge of the other.
It tells you how much knowing one thing helps you know another.
For example: If you know it’s raining, mutual information tells you how much that knowledge helps you predict if people are carrying umbrellas. If everyone carries an umbrella when it rains, the mutual information is high.
So, in simple words, it is used to see how strongly two things are related or to pick out important features in data.
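The rain/umbrella intuition can be made numerical. The observation counts below are invented for illustration: rain and umbrellas usually co-occur, so the estimated mutual information comes out well above zero.

```python
import math
from collections import Counter

# Hypothetical observations: (rain, umbrella) pairs, 1 = yes, 0 = no.
obs = [(1, 1)] * 40 + [(1, 0)] * 5 + [(0, 1)] * 10 + [(0, 0)] * 45

n = len(obs)
joint = Counter(obs)                     # counts for each (rain, umbrella) pair
rain = Counter(r for r, _ in obs)        # marginal counts for rain
umb = Counter(u for _, u in obs)         # marginal counts for umbrella

# I(X; Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) ), in bits
mi = sum(
    (c / n) * math.log2((c / n) / ((rain[r] / n) * (umb[u] / n)))
    for (r, u), c in joint.items()
)
print(round(mi, 3))  # well above 0: knowing rain helps predict umbrellas
```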
$$
I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
$$
Here, $P(x, y)$ is the joint distribution of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal distributions.
Mutual information is used in feature selection, image registration, clustering, and other areas where understanding the dependence or relationship between variables is crucial.
Mutual information is non-negative, with a higher value indicating a stronger relationship between the variables. If $X$ and $Y$ are independent, their mutual information is zero.
Mutual Information: Measures the shared information between two variables. It is used to quantify the dependency between variables and is often employed in feature selection and information retrieval. It is symmetric: $I(X; Y) = I(Y; X)$.
Let’s explore both concepts with simple mathematical examples.
Imagine you have a binary classification problem where you’re predicting whether an email is spam (1) or not spam (0). Suppose the true label is $y = 1$ (spam), so $P(y) = 1$, and your model predicts spam with probability $Q(y) = 0.8$.
The cross-entropy for a binary classification problem is given by:
$$
H(P, Q) = -[P(y) \log Q(y) + (1 - P(y)) \log (1 - Q(y))]
$$
Substituting the values:
$$
H(P, Q) = -[1 \cdot \log(0.8) + 0 \cdot \log(0.2)]
$$
$$
H(P, Q) = -[\log(0.8)]
$$
$$
H(P, Q) \approx -[-0.223] \approx 0.223 \text{ nats}
$$
So, the cross-entropy is about 0.223 nats (the natural logarithm was used), which quantifies the penalty for the model’s imperfect confidence in the correct class.
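The arithmetic is easy to check in Python, using the natural log so the result is in nats:

```python
import math

p = 1.0   # true label: spam
q = 0.8   # model's predicted probability of spam

# Binary cross-entropy: H(P, Q) = -[p*log(q) + (1 - p)*log(1 - q)]
h = -(p * math.log(q) + (1 - p) * math.log(1 - q))
print(round(h, 3))  # 0.223
```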
Now, let’s consider two binary random variables $X$ and $Y$ with the joint distribution $P(X=1, Y=1) = 0.4$, $P(X=1, Y=0) = 0.1$, $P(X=0, Y=1) = 0.1$, and $P(X=0, Y=0) = 0.4$. The marginals are then $P(X=1) = P(X=0) = 0.5$ and $P(Y=1) = P(Y=0) = 0.5$.
Mutual information $I(X; Y)$ is calculated as (using base-2 logarithms, so the result is in bits):
$$
I(X; Y) = \sum_{x,y} P(x, y) \log_2 \frac{P(x, y)}{P(x)P(y)}
$$
Let’s compute it step by step. For $(X=1, Y=1)$:
$$
0.4 \log_2 \frac{0.4}{0.5 \cdot 0.5} = 0.4 \log_2(1.6) \approx 0.271
$$
Similarly, the term for $(X=1, Y=0)$ is $0.1 \log_2(0.4) \approx -0.132$, and by symmetry $(X=0, Y=1)$ contributes $\approx -0.132$ and $(X=0, Y=0)$ contributes $\approx 0.271$.
Summing these up gives the total mutual information:
$$
I(X; Y) \approx 0.271 - 0.132 - 0.132 + 0.271 = 0.278 \text{ bits}
$$
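A joint-table computation like this can be checked in Python. The joint distribution below is an illustrative example; base-2 logs are used, so the result is in bits.

```python
import math

# Example joint distribution P(x, y) for binary X and Y
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

# Marginals P(x) and P(y) derived from the joint distribution
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X; Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) )
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(round(mi, 3))  # 0.278
```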
The cross-entropy example shows how much “wrongness” there is in predicting the value of a given variable.
The mutual information example shows how much knowing one variable helps in predicting the other.
In short, cross-entropy is a measure of error, while mutual information measures the shared information between two variables.