
Baidu ERNIE‑4.5‑VL‑28B‑A3B‑Thinking: Compact Multimodal Model


ThinkTools Team

AI Research Lead

Introduction

Baidu’s ERNIE family has long been a benchmark for large‑scale language understanding, with successive releases pushing the envelope in parameter count, training data, and downstream capabilities. The latest addition, ERNIE‑4.5‑VL‑28B‑A3B‑Thinking, marks a deliberate shift toward a more pragmatic vision‑language paradigm. While the industry’s flagship multimodal models often boast tens of billions of parameters and require expensive GPU clusters for inference, Baidu’s new offering demonstrates that high‑level reasoning over documents, charts, and video can be achieved while activating only about 3 billion parameters per inference step (the “A3B” in its name). This is not merely a technical curiosity; it directly addresses a pressing operational need for enterprises that must deploy multimodal AI in production environments with limited compute resources, stringent latency requirements, and strict cost constraints.

The challenge of multimodal reasoning is twofold. First, the model must understand the semantics of diverse media types—text embedded in PDFs, tabular data in spreadsheets, visual cues in charts, and temporal patterns in video. Second, it must synthesize these modalities to answer complex queries that demand inference, comparison, and temporal reasoning. Historically, achieving both goals required either a massive vision‑language backbone or a separate reasoning module, each adding to the computational footprint. ERNIE‑4.5‑VL‑28B‑A3B‑Thinking tackles this by integrating an active parameter strategy that selectively engages portions of the network based on the input modality, thereby keeping the overall parameter count low without sacrificing expressiveness.

Beyond the technical merits, Baidu’s decision to open‑source the model is a strategic move that invites the research community to experiment, benchmark, and extend the architecture. By releasing the code, training scripts, and a curated dataset, Baidu lowers the barrier to entry for smaller labs and startups that would otherwise be excluded from cutting‑edge multimodal research. The following sections delve into the architecture, training regimen, performance, deployment advantages, and the broader impact of this open‑source initiative.

Main Content

The ERNIE‑4.5‑VL‑28B‑A3B‑Thinking Architecture

The backbone of ERNIE‑4.5‑VL‑28B‑A3B‑Thinking is a hybrid transformer that blends a vision encoder, a language encoder, and a cross‑modal fusion module. The vision encoder is a lightweight convolutional network followed by a set of vision‑specific transformer layers that process image patches. For textual inputs, the language encoder is a standard transformer stack with positional embeddings tailored for document structure. The cross‑modal fusion module employs a gated attention mechanism that dynamically weights the contribution of each modality. Crucially, the model incorporates an “active parameter” mask that activates only a subset of the transformer layers for a given input. For example, when processing a static PDF, the vision encoder may be bypassed entirely, while for a video clip the temporal transformer layers are selectively engaged. This selective activation is guided by a lightweight meta‑controller that predicts the optimal layer subset based on the input’s modality profile.
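The gating idea can be sketched in a few lines. The snippet below is a hypothetical illustration of a modality‑profile meta‑controller, not Baidu’s released implementation; the class name, profile encoding, and layer‑group granularity are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class ModalityMetaController(nn.Module):
    """Hypothetical sketch of the meta-controller described above: it maps a
    coarse modality profile to a soft mask over groups of transformer layers."""

    def __init__(self, profile_dim: int = 3, num_layer_groups: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(profile_dim, 32),
            nn.ReLU(),
            nn.Linear(32, num_layer_groups),
        )

    def forward(self, modality_profile: torch.Tensor) -> torch.Tensor:
        # Soft mask in [0, 1]; at inference, groups below a threshold can be skipped.
        return torch.sigmoid(self.gate(modality_profile))

# A text-only PDF: profile = [has_text, has_image, has_video]
controller = ModalityMetaController()
mask = controller(torch.tensor([[1.0, 0.0, 0.0]]))
skip = mask < 0.5          # which layer groups to bypass entirely for this input
print(mask, skip)
```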

The result is a network that activates roughly 3 billion parameters per input while retaining much of the representational power of larger models. The architecture also benefits from a hierarchical tokenization scheme that preserves document layout, enabling the model to understand spatial relationships between text blocks, tables, and figures. By embedding layout cues directly into the token embeddings, the model can perform layout‑aware reasoning, a feature that is particularly valuable for chart interpretation and document summarization.
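As a simplified illustration of layout‑aware tokenization, the sketch below adds bucketed x/y position embeddings to ordinary token embeddings, in the spirit of layout‑aware document models; the vocabulary size, hidden width, and bucket count are assumptions.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Simplified sketch: token embeddings enriched with coarse layout cues
    (bucketed x/y page coordinates). Sizes are illustrative assumptions."""

    def __init__(self, vocab_size: int = 50000, hidden: int = 768, buckets: int = 64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.x = nn.Embedding(buckets, hidden)   # horizontal position bucket
        self.y = nn.Embedding(buckets, hidden)   # vertical position bucket

    def forward(self, token_ids, x_buckets, y_buckets):
        # Summing the three embeddings lets attention see spatial relationships
        # between text blocks, table cells, and figure captions.
        return self.tok(token_ids) + self.x(x_buckets) + self.y(y_buckets)

emb = LayoutAwareEmbedding()
tokens = torch.tensor([[101, 2054, 102]])          # three tokens from one table cell
x_pos, y_pos = torch.tensor([[2, 3, 4]]), torch.tensor([[1, 1, 1]])
print(emb(tokens, x_pos, y_pos).shape)             # torch.Size([1, 3, 768])
```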

Training Regimen and Data Sources

Training ERNIE‑4.5‑VL‑28B‑A3B‑Thinking required a carefully curated multimodal corpus that spans documents, charts, and videos. Baidu assembled a dataset comprising over 200 million documents from public repositories, 10 million chart images extracted from scientific papers and financial reports, and 5 million short video clips sourced from news broadcasts and educational content. Each data modality was paired with a set of reasoning tasks, such as question answering, table extraction, chart trend analysis, and video event detection.

The training objective is a multi‑task loss that combines masked language modeling, masked image modeling, and a cross‑modal contrastive loss. The contrastive loss encourages the model to align representations of related text and image segments, while the masked modeling objectives promote robust feature learning within each modality. To further reduce overfitting and enhance generalization, Baidu employed a curriculum learning schedule that gradually increases the difficulty of reasoning prompts, starting with simple fact‑retrieval questions and progressing to multi‑step inference problems.
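A minimal sketch of how such a multi‑task objective can be wired together is shown below; the loss weights, temperature, and tensor shapes are illustrative assumptions, not the values Baidu used.

```python
import torch
import torch.nn.functional as F

def multimodal_loss(text_logits, text_labels,
                    img_pred, img_target,
                    text_repr, image_repr,
                    w_mlm=1.0, w_mim=1.0, w_con=0.5, temperature=0.07):
    """Sketch of the combined objective: masked language modeling, masked
    image modeling, and a symmetric cross-modal contrastive term."""
    # Masked language modeling: cross-entropy over masked positions (-100 = ignore)
    mlm = F.cross_entropy(text_logits.view(-1, text_logits.size(-1)),
                          text_labels.view(-1), ignore_index=-100)
    # Masked image modeling: regress the features of masked image patches
    mim = F.mse_loss(img_pred, img_target)
    # Contrastive alignment: paired text/image embeddings should match (InfoNCE)
    t = F.normalize(text_repr, dim=-1)
    v = F.normalize(image_repr, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(t.size(0), device=t.device)
    con = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    return w_mlm * mlm + w_mim * mim + w_con * con
```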

An important aspect of the training pipeline is the use of knowledge distillation from a larger, 28‑billion‑parameter teacher model. The teacher provides soft targets for the student network, guiding it toward a more accurate decision boundary while keeping the student’s size manageable. This distillation step is critical for preserving the nuanced reasoning capabilities that would otherwise be lost in a smaller network.
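The distillation step can be summarized by a standard soft‑target loss; the temperature and mixing weight below are illustrative defaults, not published hyperparameters.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Sketch of soft-target distillation: the student mimics the teacher's
    softened output distribution while still fitting the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients for temperature T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```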

Performance on Document, Chart, and Video Tasks

Evaluations on standard multimodal benchmarks demonstrate that ERNIE‑4.5‑VL‑28B‑A3B‑Thinking achieves competitive performance relative to larger models. On the DocVQA dataset, the model attains an accuracy of 78.4 %, surpassing the 3‑billion‑parameter baseline by 3 % and matching the performance of a 12‑billion‑parameter counterpart. In chart‑based reasoning tasks, the model scores 81.7 % on the ChartQA benchmark, outperforming the 5‑billion‑parameter competitor by 2.5 %. For video understanding, the model achieves a mean average precision of 73.2 % on the ActivityNet‑VQA dataset, a notable improvement over the 3‑billion‑parameter baseline.

Beyond raw accuracy, the model’s inference speed is a standout feature. On a single NVIDIA A100 GPU, ERNIE‑4.5‑VL‑28B‑A3B‑Thinking processes a 10‑page PDF in under 200 ms, a 5‑minute video clip in 1.2 s, and a complex chart in 150 ms. These timings translate to a throughput of roughly 5 documents per second, 0.8 videos per second, and 6 charts per second—figures that are well within the operational budgets of many enterprise applications.

Compactness and Deployment Advantages

The active parameter strategy not only reduces the overall parameter count but also yields a modular deployment pipeline. Because the model can selectively activate only the layers relevant to a given input, inference can be performed on edge devices with limited memory. For instance, a mobile application that needs to answer questions about a PDF can load only the language encoder and a minimal vision encoder, keeping the memory footprint below 1 GB. This flexibility is a game‑changer for industries such as finance, legal, and healthcare, where data privacy concerns often preclude cloud‑based inference.
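In practice, selective deployment can be as simple as loading only the parameter groups a workload needs. The sketch below is hypothetical: the checkpoint path and parameter‑name prefixes are assumptions, since the real names depend on the released checkpoint.

```python
import torch

# Hypothetical prefixes; inspect the released checkpoint's state_dict keys first.
TEXT_ONLY_PREFIXES = ("language_encoder.", "fusion.", "lm_head.")

def load_text_only_state(checkpoint_path: str) -> dict:
    """Keep only the tensors a text/PDF workload needs, leaving video and
    temporal parameters on disk to shrink the memory footprint."""
    full_state = torch.load(checkpoint_path, map_location="cpu")
    return {k: v for k, v in full_state.items() if k.startswith(TEXT_ONLY_PREFIXES)}

# model.load_state_dict(load_text_only_state("ernie_vl_checkpoint.pt"), strict=False)
```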

Moreover, the model’s compactness simplifies continuous learning. Baidu has released a lightweight fine‑tuning script that allows practitioners to adapt the model to domain‑specific terminology or new visual styles with only a few hours of GPU time. This rapid adaptation is essential for sectors where regulatory changes or emerging data formats demand swift model updates.
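For domain adaptation, a parameter‑efficient recipe like the one below is a reasonable starting point. It is a sketch, not Baidu’s official fine‑tuning script: the Hugging Face repository ID and LoRA target‑module names are assumptions that should be checked against the repository README.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"   # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Train small low-rank adapters instead of the full network; module names are assumptions.
lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only a tiny fraction of weights will be updated
```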

Open‑Source Impact and Community Adoption

By open‑sourcing ERNIE‑4.5‑VL‑28B‑A3B‑Thinking, Baidu invites the global research community to explore, benchmark, and extend the architecture. The repository includes pre‑trained checkpoints, a detailed README, and a set of evaluation scripts that mirror the official benchmarks. Early adopters have already begun to fine‑tune the model for specialized tasks such as medical chart summarization and financial statement analysis. Community contributions have introduced new loss functions, data augmentation techniques, and even a lightweight quantization pipeline that reduces the model size to 2.5 GB without noticeable performance loss.
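Quantized loading can be sketched with the widely used bitsandbytes integration in transformers; this illustrates the general technique rather than the community pipeline mentioned above, and the repository ID is again an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",   # assumed repository ID
    quantization_config=bnb_cfg,
    trust_remote_code=True,
    device_map="auto",
)
```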

The open‑source nature also fosters transparency. Researchers can inspect the attention maps and gating decisions to understand how the model balances visual and textual cues, providing insights that are valuable for both academic studies and regulatory compliance. In an era where explainability is as important as performance, this level of openness sets a new standard for multimodal AI deployments.
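Continuing from the fine‑tuning sketch above, attention maps can be pulled out in the usual transformers way, assuming the remote code exposes output_attentions like standard models (an assumption worth verifying against the repository).

```python
import torch

question = "Which quarter shows the steepest revenue decline in this chart?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One tensor per layer, each [batch, heads, query_len, key_len];
# averaging over heads gives a coarse view of where the model is attending.
saliency = outputs.attentions[-1].mean(dim=1)
print(saliency.shape)
```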

Conclusion

ERNIE‑4.5‑VL‑28B‑A3B‑Thinking represents a significant stride toward practical multimodal reasoning. By marrying an active parameter strategy with a robust cross‑modal architecture, Baidu has produced a model that activates only about 3 billion parameters per inference step yet rivals larger counterparts on document, chart, and video tasks while offering superior inference speed and deployment flexibility. The open‑source release further democratizes access to cutting‑edge multimodal AI, encouraging collaboration and accelerating innovation across industries. As enterprises grapple with the need to extract actionable insights from heterogeneous data streams, models like ERNIE‑4.5‑VL‑28B‑A3B‑Thinking provide a viable, cost‑effective solution that does not compromise on intelligence.

Call to Action

If you’re a data scientist, engineer, or researcher interested in multimodal AI, we invite you to dive into the ERNIE‑4.5‑VL‑28B‑A3B‑Thinking repository. Experiment with fine‑tuning on your own datasets, benchmark against your existing pipelines, or contribute new features to the community. Whether you’re building a document summarization tool, a chart‑analysis engine, or a video question‑answering system, this compact yet powerful model offers a solid foundation. Join the conversation on GitHub, share your results, and help shape the next wave of multimodal reasoning. Together, we can turn complex visual and textual data into actionable intelligence, all while keeping compute costs in check.
