Negative Log-Likelihood (NLL) Loss
Going through Kevin Murphy’s Probabilistic Machine Learning, one of the first formulae I stumbled upon was this loss function and I wanted to understand it more.
For another walkthrough of the same loss, see Sebastian Raschka's FAQ: https://sebastianraschka.com/faq/docs/negative-log-likelihood-logistic-loss.html
Explanation
\[\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \log p(y_i \mid \mathbf{x}_i, \theta)\]
Where:
- $\mathcal{L}(\theta)$ is the NLL loss
- $N$ is the number of samples
- $y_i$ is the true label for sample $i$
- $\mathbf{x}_i$ is the input features for sample $i$
- $\theta$ represents the model parameters
- $p(y_i \mid \mathbf{x}_i, \theta)$ is the predicted probability of the true label given the input and parameters
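To make the formula concrete, here is a minimal sketch that evaluates it by hand. The three probabilities are made-up values standing in for what a model might assign to the true label of each sample; they do not come from any real model.

```python
import math

# Hypothetical values of p(y_i | x_i, theta): the probability the model
# assigned to the TRUE label of each of N = 3 samples.
probs_true_label = [0.9, 0.6, 0.3]

# NLL: the average of -log p over the N samples.
n = len(probs_true_label)
nll = -sum(math.log(p) for p in probs_true_label) / n

print(f"NLL = {nll:.4f}")
```

The sample with probability 0.3 contributes the most to the loss, since $-\log p$ grows as $p$ shrinks.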
For classification problems, this is often written in cross-entropy form:
\[\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C \mathbb{I}(y_i = c) \log p(y_i = c \mid \mathbf{x}_i, \theta)\]
Where:
- $C$ is the number of classes
- $\mathbb{I}(y_i = c)$ is the indicator function: it equals 1 if the true label $y_i$ matches the current class $c$, and 0 otherwise.
Likelihood and Log-Likelihood
NLL loss comes from maximum likelihood estimation (MLE).
For a model predicting probabilities, we want to maximize the likelihood of the correct label:
\[p(y_i \mid x_i, \theta)\]
Working in log-space (log-likelihood) is easier numerically:
\[\log p(y_i \mid x_i, \theta)\]
Taking the log turns products into sums, avoids underflow, and, because the log is monotonic, leaves the maximizing $\theta$ unchanged.
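The underflow point is easy to see with a small experiment. This sketch assumes a hypothetical dataset of 1000 independent samples, each with likelihood 0.01, and compares the raw product against the sum of logs.

```python
import math

# Hypothetical setup: 1000 independent samples, each with likelihood 0.01.
probs = [0.01] * 1000

# Raw likelihood: the product 0.01**1000 = 1e-2000 is far below the
# smallest positive float64, so the running product collapses to 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Log-likelihood: the same information as a sum, well within float range.
log_likelihood = sum(math.log(p) for p in probs)

print("likelihood:    ", likelihood)
print("log-likelihood:", log_likelihood)
```

Once the product hits zero, every parameter setting looks equally bad, while the log-likelihood still distinguishes between them.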
Negative Log-Likelihood (NLL)
Since optimizers usually minimize, we take the negative:
\[L(\theta) = -\log p(y_i \mid x_i, \theta)\]
For a batch of $N$ samples:
\[L(\theta) = -\frac{1}{N} \sum_{i=1}^N \log p(y_i \mid x_i, \theta)\]
Connection to Cross-Entropy
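In classification, $p(y_i \mid x_i, \theta)$ is the predicted probability for the true class.
Using a one-hot label $y_i$ and $C$ classes:
\[L(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C \mathbb{I}(y_i = c) \log p(y_i = c \mid x_i, \theta)\]
Here $\mathbb{I}(y_i = c)$ is 1 if $y_i = c$, else 0, so only the true-class term of the inner sum survives.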
This is exactly the cross-entropy loss between true and predicted distributions.
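A quick numerical check of that equivalence, as a sketch: the probabilities and targets below are invented for illustration. The one-hot weighted sum over classes and the direct lookup of the true-class log-probability should give the same number.

```python
import numpy as np

# Hypothetical predicted class probabilities for N = 2 samples, C = 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
targets = np.array([0, 2])  # true class index for each sample

# Cross-entropy form: sum over all classes, weighted by the one-hot indicator.
one_hot = np.eye(probs.shape[1])[targets]
ce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# NLL form: index the log-probability of the true class directly.
nll = -np.mean(np.log(probs[np.arange(len(targets)), targets]))

print("cross-entropy:", ce)
print("NLL:          ", nll)
```

The indexing version does less work, which is why implementations usually take integer class labels rather than one-hot vectors.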
Practical Example (Log-Softmax + NLL)
In PyTorch, `NLLLoss` expects log-probabilities, so we apply `LogSoftmax` to the raw scores first.
Then NLL picks the log-probability of the correct class and negates it:
\[\text{loss} = -\log p(y_i = \text{true class} \mid x_i)\]
Why Use NLL?
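- Probabilistic interpretation: minimizing NLL maximizes the likelihood of the correct labels.
- Numerical stability: working in log-space avoids the underflow that comes from multiplying many small probabilities.
- Gradient behavior: confident wrong predictions are penalized heavily, because $-\log p$ grows without bound as $p \to 0$.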
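To get a feel for how steeply confident wrong predictions are penalized, this small sketch tabulates the per-sample loss $-\log p$ for a few hand-picked probabilities assigned to the true class (the values are illustrative only).

```python
import math

# Hypothetical probabilities a model assigns to the TRUE class.
# Lower values mean the model is more confidently wrong.
p_values = [0.99, 0.5, 0.1, 0.001]

# Per-sample NLL for each case.
losses = [(p, -math.log(p)) for p in p_values]

for p, loss in losses:
    print(f"p(true class) = {p:<6} loss = {loss:.3f}")
```

Going from 0.99 to 0.5 costs well under one unit of loss, while going from 0.1 to 0.001 adds several more: the penalty is logarithmic in the probability, not linear.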
Code implementation
All is well and good with this explanation but I wanted to see it implemented in code.
import numpy as np
import torch
# For reproducibility
np.random.seed(42)

def log_softmax(x):
    """Compute log-softmax along the last dimension."""
    max_x = np.max(x, axis=-1, keepdims=True)  # Subtract the row max for numerical stability
    exp_x = np.exp(x - max_x)
    sum_exp_x = np.sum(exp_x, axis=-1, keepdims=True)
    return (x - max_x) - np.log(sum_exp_x)

def nll_loss(log_probs, targets):
    """Compute mean negative log-likelihood loss."""
    n_samples = log_probs.shape[0]
    # Pick the log-probability of the true class for each sample
    log_probs_target = log_probs[np.arange(n_samples), targets]
    return -np.mean(log_probs_target)

# Example input (N=3, C=5)
inputs = np.random.randn(3, 5)
targets = np.array([1, 0, 4])
# NumPy implementation
log_probs_np = log_softmax(inputs)
loss_np = nll_loss(log_probs_np, targets)
print("Input:\n", inputs)
print("Log-Softmax:\n", log_probs_np)
print("Targets:", targets)
print("NumPy Loss:", loss_np)
# Just to be extra sure, we wanted to test it against the real implementation
torch_inputs = torch.tensor(inputs, dtype=torch.float32)
torch_targets = torch.tensor(targets, dtype=torch.long)
log_softmax_torch = torch.nn.LogSoftmax(dim=1)
loss_fn_torch = torch.nn.NLLLoss()
loss_torch = loss_fn_torch(log_softmax_torch(torch_inputs), torch_targets)
print("PyTorch Loss:", loss_torch.item())
Output:
Input:
[[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337]
[-0.23413696 1.57921282 0.76743473 -0.46947439 0.54256004]
[-0.46341769 -0.46572975 0.24196227 -1.91328024 -1.72491783]]
Log-Softmax:
[[-1.78593751 -2.42091597 -1.63496313 -0.75962181 -2.51680504]
[-2.55085751 -0.73750774 -1.54928582 -2.78619494 -1.77416051]
[-1.51295736 -1.51526942 -0.8075774 -2.96281991 -2.7744575 ]]
Targets: [1 0 4]
NumPy Loss: 2.5820769923174507
PyTorch Loss: 2.5820770263671875
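As a follow-up to the PyTorch comparison above: PyTorch also provides `torch.nn.functional.cross_entropy`, which takes raw scores and applies log-softmax and NLL in one call. The sketch below uses randomly generated scores (not the inputs from the example above) to check that the two routes agree.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)          # hypothetical raw scores (N=3, C=5)
targets = torch.tensor([1, 0, 4])   # true class index per sample

# Two-step route: log-softmax, then NLL.
two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# One-step route: cross_entropy consumes raw scores directly.
one_step = F.cross_entropy(logits, targets)

print("log_softmax + nll_loss:", two_step.item())
print("cross_entropy:         ", one_step.item())
```

In practice this means a model trained with `cross_entropy` should output raw scores, whereas one trained with `nll_loss` needs a log-softmax as its last step.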
Why subtracting the max improves numerical stability
When computing softmax or log-softmax:
\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
Large values of $x$ can cause `exp()` to overflow.
For example, $e^{1000}$ is far larger than what floating-point numbers can store, leading to `inf` or `nan`.
Key trick: subtract the maximum value $m = \max(x)$ from all elements:
\[\frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}\]
This does not change the result (since multiplying numerator and denominator by $e^{-m}$ cancels out) but ensures:
- The largest exponent becomes $e^0 = 1$
- All other exponentials are ≤ 1
- No overflow, better floating-point precision
For log-softmax:
\[\log \text{softmax}(x_i) = (x_i - m) - \log \sum_j e^{x_j - m}\]
Subtracting $m$ ensures all `exp()` calls stay within a safe range, avoiding numerical instability.
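The effect is easy to reproduce. In this sketch, `naive_log_softmax` and `stable_log_softmax` are illustrative helper names, and the input values around 1000 are chosen specifically to trigger overflow in float64.

```python
import numpy as np

def naive_log_softmax(x):
    """Log-softmax computed directly from the definition (overflows for large x)."""
    exp_x = np.exp(x)
    return np.log(exp_x / np.sum(exp_x, axis=-1, keepdims=True))

def stable_log_softmax(x):
    """Log-softmax with the max subtracted before exponentiating."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

# Hypothetical scores large enough that exp() overflows float64.
x = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow/invalid warnings the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_log_softmax(x)
stable = stable_log_softmax(x)

print("naive: ", naive)    # exp() gives inf, inf / inf gives nan
print("stable:", stable)   # finite values
```

The stable version returns the same values it would for the inputs `[0, 1, 2]`, which reflects the fact that softmax is unchanged when a constant is added to every score.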