August 12, 2025

Negative Log-Likelihood (NLL) Loss

While working through Kevin Murphy’s Probabilistic Machine Learning, one of the first formulas I came across was this loss function, and I wanted to understand it better.

For another take on the same topic, see: https://sebastianraschka.com/faq/docs/negative-log-likelihood-logistic-loss.html

Explanation

\[\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \log p(y_i \mid \mathbf{x}_i, \theta)\]

Where:

$\mathcal{L}(\theta)$ is the NLL loss
$N$ is the number of samples
$y_i$ is the true label for sample $i$
$\mathbf{x}_i$ is the input features for sample $i$
$\theta$ represents the model parameters
$p(y_i \mid \mathbf{x}_i, \theta)$ is the predicted probability of the true label given the input and parameters

For classification problems, this is often written in cross-entropy form:

\[\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C \mathbb{I}(y_i = c) \log p(y_i = c \mid \mathbf{x}_i, \theta)\]

Where:

$C$ is the number of classes
$\mathbb{I}(y_i = c)$ is the indicator function: it equals 1 if the true label $y_i$ matches the current class $c$, and 0 otherwise.
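
To convince myself that the two forms really are the same number, here is a minimal sketch using only the standard library. The probabilities and labels are made up for illustration; `probs`, `labels`, `nll_direct` and `nll_indicator` are names I chose for this example.

```python
import math

# Hypothetical predicted class probabilities for N=3 samples and C=3 classes.
# Each row sums to 1; the numbers are invented for illustration.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # true class index y_i for each sample

N = len(labels)
C = len(probs[0])

# First form: average of -log p(y_i | x_i, theta) over the samples.
nll_direct = -sum(math.log(probs[i][labels[i]]) for i in range(N)) / N

# Cross-entropy form: sum over every class, masked by the indicator I(y_i = c).
nll_indicator = -sum(
    (1 if labels[i] == c else 0) * math.log(probs[i][c])
    for i in range(N)
    for c in range(C)
) / N

print(nll_direct, nll_indicator)
```

The indicator zeroes out every term except the one for the true class, so both expressions should print the same value.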

Likelihood and Log-Likelihood

NLL loss comes from maximum likelihood estimation (MLE).
For a model predicting probabilities, we want to maximize the likelihood of the correct label:

\[p(y_i \mid x_i, \theta)\]

Working in log-space (log-likelihood) is easier numerically:

\[\log p(y_i \mid x_i, \theta)\]

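Taking the log turns products of probabilities into sums and avoids floating-point underflow. Because the log is monotonically increasing, the parameters that maximize the log-likelihood are the same ones that maximize the likelihood.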
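
The underflow point is easy to see with a small experiment. This is a sketch under an assumed setup: 200 samples, each given probability 0.01 by a hypothetical model.

```python
import math

# 200 hypothetical samples, each assigned probability 0.01 by the model.
probs = [0.01] * 200

# Likelihood of the whole dataset: a product of many small numbers.
likelihood = math.prod(probs)

# Log-likelihood: the same quantity as a sum of logs.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # finite, roughly -921
```

The true product is 1e-400, which is smaller than anything a 64-bit float can represent, while the sum of logs is an ordinary number.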

Negative Log-Likelihood (NLL)

Since optimizers usually minimize, we take the negative:

\[L(\theta) = -\log p(y_i \mid x_i, \theta)\]

For a batch of $N$ samples:

\[L(\theta) = -\frac{1}{N} \sum_{i=1}^N \log p(y_i \mid x_i, \theta)\]

Connection to Cross-Entropy

In classification, $p(y_i \mid x_i, \theta)$ is the predicted probability for the true class.
Using a one-hot label $y_i$ and $C$ classes:

\[L(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C \mathbb{I}(y_i = c) \log p(y_i = c \mid x_i, \theta)\]

Here $\mathbb{I}(y_i = c)$ is 1 if $y_i = c$, else 0, so only the true-class term survives the inner sum.
This is exactly the cross-entropy between the one-hot true distribution and the predicted distribution.

Practical Example (Log-Softmax + NLL)

In PyTorch, NLLLoss expects log-probabilities, so we apply LogSoftmax to the raw scores (logits) $z$ first:

\[\log p(y_i = c \mid x_i) = z_c - \log\left(\sum_{j=1}^C e^{z_j}\right)\]

Then NLL picks the log-probability of the correct class and negates it:

\[\text{loss} = -\log p(y_i = \text{true class} \mid x_i)\]
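
For a single sample, the two steps can be traced by hand. This is a minimal sketch with made-up logits and a made-up true class, written in plain Python rather than PyTorch so every operation is visible.

```python
import math

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores z for C=3 classes
true_class = 0            # hypothetical true label

# Log-softmax: z_c - log(sum_j exp(z_j))
log_sum_exp = math.log(sum(math.exp(z) for z in logits))
log_probs = [z - log_sum_exp for z in logits]

# NLL: pick the log-probability of the true class and negate it.
loss = -log_probs[true_class]

print(log_probs)
print(loss)
```

Exponentiating `log_probs` should give values that sum to 1. Note that this version skips the max-subtraction trick, which is safe only because these logits are small.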

Why Use NLL?

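Probabilistic interpretation: minimizing it maximizes the likelihood of the correct labels.
Numerical stability: working in log-space avoids underflow from multiplying many small probabilities.
Gradient behavior: confident wrong predictions are penalized heavily, since $-\log p$ grows without bound as $p \to 0$.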
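
To get a feel for how steeply the penalty grows, here is a small sketch that evaluates the per-sample loss for a few hypothetical probabilities assigned to the true class.

```python
import math

# Probability the model assigned to the true class (hypothetical values).
true_class_probs = [0.9, 0.5, 0.1, 0.01]

# Per-sample NLL is simply -log(p).
losses = [-math.log(p) for p in true_class_probs]

for p, loss in zip(true_class_probs, losses):
    print(f"p = {p:<5} loss = {loss:.3f}")
```

A confident correct prediction (p = 0.9) costs about 0.105, while a confident wrong one (p = 0.01 on the true class) costs about 4.605.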

Code implementation

The explanation is all well and good, but I wanted to see it implemented in code.

import numpy as np
import torch

# For reproducibility
np.random.seed(42)

def log_softmax(x):
    """Compute log-softmax along axis 1 (the class dimension) of a 2D array."""
    max_x = np.max(x, axis=1, keepdims=True)  # Numerical stability
    exp_x = np.exp(x - max_x)
    sum_exp_x = np.sum(exp_x, axis=1, keepdims=True)
    return (x - max_x) - np.log(sum_exp_x)

def nll_loss(log_probs, targets):
    """Compute mean negative log-likelihood loss."""
    n_samples = log_probs.shape[0]
    log_probs_target = log_probs[np.arange(n_samples), targets]
    return -np.mean(log_probs_target)

# Example input (N=3, C=5)
inputs = np.random.randn(3, 5)
targets = np.array([1, 0, 4])

# NumPy implementation
log_probs_np = log_softmax(inputs)
loss_np = nll_loss(log_probs_np, targets)

print("Input:\n", inputs)
print("Log-Softmax:\n", log_probs_np)
print("Targets:", targets)
print("NumPy Loss:", loss_np)

# Sanity check: compare against PyTorch's built-in LogSoftmax + NLLLoss
torch_inputs = torch.tensor(inputs, dtype=torch.float32)
torch_targets = torch.tensor(targets, dtype=torch.long)

log_softmax_torch = torch.nn.LogSoftmax(dim=1)
loss_fn_torch = torch.nn.NLLLoss()
loss_torch = loss_fn_torch(log_softmax_torch(torch_inputs), torch_targets)

print("PyTorch Loss:", loss_torch.item())

Output:

Input:
 [[ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]
 [-0.23413696  1.57921282  0.76743473 -0.46947439  0.54256004]
 [-0.46341769 -0.46572975  0.24196227 -1.91328024 -1.72491783]]
Log-Softmax:
 [[-1.78593751 -2.42091597 -1.63496313 -0.75962181 -2.51680504]
 [-2.55085751 -0.73750774 -1.54928582 -2.78619494 -1.77416051]
 [-1.51295736 -1.51526942 -0.8075774  -2.96281991 -2.7744575 ]]
Targets: [1 0 4]
NumPy Loss: 2.5820769923174507
PyTorch Loss: 2.5820770263671875

The two losses agree to about seven significant digits. The small difference comes from the PyTorch tensors being float32 while the NumPy arrays are float64.

Why subtracting the max improves numerical stability

When computing softmax or log-softmax:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Large values of $x$ can cause exp() to overflow.
For example, $e^{1000}$ is far larger than what floating-point numbers can store, leading to inf or nan.

Key trick: subtract the maximum value $m = \max(x)$ from all elements:

\[\frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}\]

This does not change the result (since multiplying numerator and denominator by $e^{-m}$ cancels out) but ensures:

The largest exponent becomes 0, so the largest term is $e^0 = 1$
All other exponentials are ≤ 1
No overflow, and the sum is at least 1, so its log is always finite

For log-softmax:

\[\log \text{softmax}(x_i) = (x_i - m) - \log \sum_j e^{x_j - m}\]

Subtracting $m$ ensures all exp() calls stay within a safe range, avoiding numerical instability.
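
Here is a minimal sketch of the difference, using only the standard library. `naive_log_softmax` and `stable_log_softmax` are names I made up for this example. One caveat: Python's `math.exp` raises `OverflowError` on overflow, whereas NumPy's `np.exp` returns inf (and the division then produces nan).

```python
import math

def naive_log_softmax(x):
    """Log-softmax computed directly from the definition (no max shift)."""
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(x):
    """Log-softmax with the maximum subtracted before exponentiating."""
    m = max(x)
    shifted = [v - m for v in x]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same values shifted by 999

# On small inputs both versions agree.
print(naive_log_softmax(small))
print(stable_log_softmax(small))

# On large inputs the naive version overflows inside exp().
try:
    naive_log_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print("naive overflowed:", naive_overflowed)

# The stable version is unaffected by the shift.
stable_large = stable_log_softmax(large)
print(stable_large)
```

Because log-softmax is invariant to adding a constant to every input, the stable result for `large` should match the result for `small`.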