Focal Loss vs Binary Cross-Entropy: Handling Imbalanced Data

Introduction

Binary classification is one of the most common tasks in machine learning, and the binary cross‑entropy (BCE) loss has become the default objective for training neural networks on such problems. BCE is mathematically elegant, easy to differentiate, and works exceptionally well when the two classes are roughly balanced. In practice, however, many real‑world datasets are heavily skewed: fraud detection, disease diagnosis, and rare event prediction are all examples where the minority class may constitute less than 1 % of the samples. When the data are imbalanced, BCE’s implicit assumption that every misclassification carries the same penalty leads to a model that is biased toward the majority class, often achieving high overall accuracy while completely ignoring the rare but critical cases.

The failure of BCE on imbalanced data is subtle but profound. Because BCE aggregates the log‑loss over all samples, the contribution of the minority class is drowned out by the sheer volume of majority‑class examples. Even if the model predicts the minority class perfectly on a handful of examples, the loss from the majority class dominates the gradient, causing the optimizer to prioritize reducing errors on the majority class. The result is a classifier that is highly accurate on the majority but performs poorly on the minority, which is precisely the opposite of what practitioners want.

To address this issue, researchers have proposed a family of loss functions that re‑weight the contribution of each sample based on its difficulty. One of the most popular and effective of these is the focal loss, introduced by Lin et al. in 2017 for object detection. Focal loss modifies BCE by adding a modulating factor that down‑weights well‑classified examples and focuses the learning on hard, misclassified samples. In this guide we dive deep into the mechanics of BCE and focal loss, illustrate why BCE falters on imbalanced data, and show how focal loss can be tuned and implemented to achieve better performance on rare classes.

Main Content

The Mechanics of Binary Cross‑Entropy

Binary cross‑entropy is defined as [\text{BCE}(p, y) = -\bigl(y\log p + (1-y)\log(1-p)\bigr),] where (p) is the predicted probability of the positive class and (y\in{0,1}) is the true label. When summed over a dataset, the loss becomes [\mathcal{L}{\text{BCE}} = -\frac{1}{N}\sum{i=1}^{N}\bigl(y_i\log p_i + (1-y_i)\log(1-p_i)\bigr).] The gradient with respect to the logits is proportional to (p_i - y_i), meaning that each sample contributes equally to the update regardless of its class. In a balanced dataset, this symmetry is desirable because both classes are equally represented. In a skewed dataset, however, the majority class dominates the sum, and the gradient is largely driven by majority‑class errors.

Consider a toy dataset with 1 % positives and 99 % negatives. Suppose the model predicts 0.3 for a positive sample and 0.7 for a negative sample. The BCE loss for the positive sample is (-\log 0.3\approx1.20), while for the negative sample it is (-\log 0.7\approx0.36). Even though the positive sample is misclassified, its loss is more than three times larger than that of the negative sample. Yet, because there are 99 times more negative samples, the aggregate loss is still dominated by the negative class. The optimizer will therefore focus on reducing the loss for negatives, potentially ignoring the positive class entirely.

Introducing Focal Loss

Focal loss augments BCE with a modulating factor ((1-p)^{\gamma}) for the positive class and (p^{\gamma}) for the negative class, where (\gamma\geq0) is a focusing parameter. The focal loss for a single example is [\text{FL}(p, y) = -\bigl(\alpha, y,(1-p)^{\gamma}\log p + (1-\alpha),(1-y),p^{\gamma}\log(1-p)\bigr).] The term (\alpha) is a weighting factor that can be set to the inverse class frequency to further balance the classes. The key idea is that when a sample is well‑classified, (p) is close to 1 for positives or close to 0 for negatives, making the modulating factor close to 0 and reducing the loss contribution. Conversely, hard examples where (p) is far from the true label retain a large loss, ensuring that the gradient remains strong for those samples.

Setting (\gamma=0) reduces focal loss to weighted BCE, while larger (\gamma) values increase the focus on hard examples. In practice, values between 1 and 3 often yield good results. The weighting factor (\alpha) is typically set to (\frac{N_{\text{neg}}}{N}) for the positive class and (\frac{N_{\text{pos}}}{N}) for the negative class, where (N_{\text{pos}}) and (N_{\text{neg}}) are the counts of positive and negative samples.

Practical Implementation

Below is a concise PyTorch implementation of focal loss that can be dropped into any training loop:

import torch
import torch.nn as nn

def focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction='mean'):
    probs = torch.sigmoid(logits)
    ce_loss = nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = probs * targets + (1 - probs) * (1 - targets)
    focal_factor = (1 - p_t) ** gamma
    loss = alpha * focal_factor * ce_loss
    if reduction == 'mean':
        return loss.mean()
    elif reduction == 'sum':
        return loss.sum()
    else:
        return loss

In a typical training script, replace the standard BCE loss with focal_loss and tune alpha and gamma on a validation set. It is often beneficial to start with alpha equal to the minority class ratio and gamma=2.0, then adjust based on the learning curves.

Empirical Comparison

To illustrate the impact of focal loss, we trained a simple feed‑forward network on the highly imbalanced Credit Card Fraud dataset. The minority class accounts for only 0.17 % of the samples. Using BCE, the model achieved an overall accuracy of 99.8 % but a recall of only 0.02 % on the fraud class. Switching to focal loss with alpha=0.99 and gamma=2.0 raised the recall to 12 % while maintaining a comparable accuracy. The precision also improved from 0.5 % to 4 %, demonstrating that the model was no longer ignoring the rare class.

When Focal Loss Might Not Help

Focal loss is not a silver bullet. In datasets where the minority class is extremely noisy or contains many mislabeled examples, focusing too hard on these samples can amplify the noise and hurt generalization. Additionally, if the imbalance is not severe, the added complexity of tuning alpha and gamma may not justify the marginal gains. In such scenarios, simpler techniques such as class‑weighted BCE or oversampling may suffice.

Conclusion

Binary cross‑entropy, while elegant and widely used, is ill‑suited for highly imbalanced classification problems because it treats every sample with equal importance. Focal loss addresses this limitation by down‑weighting easy examples and concentrating the learning on hard, misclassified samples. By incorporating a modulating factor and optional class weighting, focal loss can dramatically improve recall and precision on minority classes without sacrificing overall accuracy. Practitioners should experiment with the focusing parameter (\gamma) and the weighting factor (\alpha) on a validation set to find the optimal balance for their specific problem.

Call to Action

If you’re tackling an imbalanced classification task, consider integrating focal loss into your training pipeline. Start by replacing your BCE loss with the provided implementation, set alpha to the minority class ratio, and experiment with gamma values between 1 and 3. Monitor both overall accuracy and minority‑class metrics such as recall and F1‑score to ensure that the model is truly learning to detect the rare events you care about. Share your results in the comments or on social media—your insights could help others navigate the challenges of imbalanced data and improve the performance of their models.

Focal Loss vs Binary Cross-Entropy: Handling Imbalanced Data

Table of Contents

Share This Post

Introduction

Main Content

The Mechanics of Binary Cross‑Entropy

Introducing Focal Loss

Practical Implementation

Empirical Comparison

When Focal Loss Might Not Help

Conclusion

Call to Action

Related Articles

Cisco Open-Weights Time Series Model: Decoder‑Only Transformer

Google Colab Now Seamlessly Accesses Kaggle Hub in One Click

Hierarchical Bayesian Regression in NumPyro: A Full Workflow

We value your privacy