Baidu’s ERNIE‑4.5‑VL: Open‑Source AI That Claims to Beat GPT‑5

ThinkTools Team

AI Research Lead

Introduction

In the rapidly evolving arena of multimodal artificial intelligence, a new contender has emerged from China’s leading search‑engine conglomerate, Baidu. On Monday, the company unveiled ERNIE‑4.5‑VL‑28B‑A3B‑Thinking, a model that purports to outshine the likes of Google’s Gemini 2.5 Pro and OpenAI’s GPT‑5‑High on a suite of document‑ and chart‑understanding benchmarks. What makes this announcement noteworthy is not merely the performance headline but the combination of architectural ingenuity, resource efficiency, and an open‑source release under the permissive Apache 2.0 license. For enterprises that are moving beyond experimental chatbots toward production systems that ingest documents, analyze visual data, and automate complex workflows, the promise of a high‑performance, cost‑effective vision‑language model could be a game‑changer.

Baidu's new model is part of a broader ERNIE 4.5 family that spans ten variants, from a 424-billion-parameter mixture-of-experts model down to a lightweight 0.3-billion-parameter dense version. The latest addition, ERNIE-4.5-VL-28B-A3B-Thinking, uses a mixture-of-experts (MoE) architecture that activates only about 3 billion of its 28 billion total parameters for any given input. This selective activation, coupled with a sophisticated routing mechanism, allows the system to maintain high accuracy while operating on a single 80 GB GPU, a configuration far more accessible to mid-market organizations than the multi-GPU setups required by many competing models.

The model’s creators claim that the combination of a dynamic image‑analysis feature called “Thinking with Images,” advanced reinforcement‑learning training, and a heterogeneous modality structure gives ERNIE‑4.5‑VL a distinct edge in tasks that require both broad context and fine‑grained detail. The release has sparked excitement on social media and among developers, but independent verification of the performance claims remains pending.

Main Content

The Architecture Behind the Claim

At the heart of ERNIE‑4.5‑VL‑28B‑A3B‑Thinking is the MoE design, a pattern that has become increasingly popular for scaling large language models without proportionally increasing inference cost. In a traditional dense model, every parameter is engaged for every input, which quickly becomes untenable as the parameter count climbs into the tens of billions. Baidu’s MoE approach sidesteps this bottleneck by routing each token or image patch to a small subset of expert sub‑networks. The routing decision is made by a lightweight gating network that evaluates the relevance of each expert to the current input. As a result, the model can maintain a global capacity of 28 billion parameters while only activating 3 billion at any given time.
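To make the routing idea concrete, here is a minimal sketch of top-k expert gating in PyTorch. Every size here (model width, expert count, k) is illustrative rather than ERNIE's actual configuration, and production MoE layers add refinements, such as load-balancing losses and shared experts, that are omitted for brevity.

```python
# Minimal sketch of mixture-of-experts routing with top-k gating.
# All sizes are illustrative, not ERNIE's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # lightweight gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, d_model)
        scores = self.gate(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so active parameters
        # stay a small fraction of total model capacity.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)   # torch.Size([16, 512])
```

The key property is visible in the inner loop: each token touches only top_k of the expert feed-forward blocks, which is how a 28-billion-parameter model can spend only about 3 billion parameters' worth of compute per token.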

This selective activation is not merely a theoretical trick; it translates into tangible deployment advantages. The documentation states that the model can run on a single 80 GB GPU, such as an NVIDIA A100 or H100, hardware that many enterprises already operate or can rent from any major cloud provider. In contrast, larger vision-language models often require clusters of high-end accelerators, driving up both capital and operational expenditures. By keeping the active parameter count low, Baidu has effectively lowered the barrier to entry for organizations that need robust multimodal capabilities but lack the budget for massive GPU farms.

Dynamic Image Analysis and Human‑Like Reasoning

One of the most striking claims in Baidu’s technical brief is the “Thinking with Images” feature, which allows the model to zoom in and out of images dynamically. Traditional vision‑language models process an image at a fixed resolution, which can limit their ability to capture both global context and fine‑grained details simultaneously. By contrast, ERNIE‑4.5‑VL can iteratively focus on different regions of an image, much like a human analyst would when inspecting a complex diagram or a quality‑control photograph.

This capability is especially relevant for enterprise scenarios such as invoice parsing, contract analysis, and manufacturing inspection. In these contexts, the model must first understand the overall layout, identifying tables, charts, and text blocks, and then drill down to individual cells or regions to extract precise values. The dynamic zoom feature is what enables this, supporting what the documentation describes as "multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks."
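The control flow below sketches how such an iterative crop-and-reason loop could work. Baidu has not published this interface; model.reason, the region format, and the stopping signal are hypothetical stand-ins meant only to illustrate the idea.

```python
# Illustrative control flow for iterative visual "zooming".
# `model.reason`, `step.region`, and `step.answer` are hypothetical;
# Baidu has not published the actual "Thinking with Images" interface.
from PIL import Image

def think_with_images(model, image: Image.Image, question: str, max_steps: int = 4):
    context = []          # accumulated reasoning across zoom steps
    view = image          # start from the full image for global layout
    for _ in range(max_steps):
        step = model.reason(view, question, context)
        context.append(step.thought)
        if step.answer is not None:          # confident enough: stop zooming
            return step.answer
        # Otherwise the model proposes a region of interest, e.g. one table
        # cell in an invoice, which we crop and hand back at full resolution.
        view = image.crop(step.region)       # region = (left, top, right, bottom)
    return model.summarize(context)          # hypothetical fallback answer
```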

The ability to integrate with external tools, such as image search or OCR engines, further amplifies the model’s utility. By invoking these tools during inference, the system can augment its internal knowledge base with up‑to‑date information, thereby reducing hallucinations and improving factual accuracy.
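Under the assumption of a simple request-and-respond protocol (the release notes do not document the exact mechanism), a tool-augmented inference step might look like this sketch:

```python
# Hedged sketch of tool-augmented inference. `generate` and the "ocr"
# request flag are assumptions about the protocol; any OCR engine
# (e.g. pytesseract) could back run_ocr.
def answer_with_tools(model, image, question, run_ocr):
    draft = model.generate(image, question)
    if draft.tool_request == "ocr":
        extracted = run_ocr(image)           # ground the answer in tool output
        draft = model.generate(image, question, tool_result=extracted)
    return draft.text
```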

Efficiency and Enterprise‑Friendly Design

Beyond the architectural innovations, Baidu has emphasized the model's efficiency profile. The MoE routing mechanism, combined with a carefully engineered training pipeline that incorporates the GSPO and IcePop reinforcement-learning strategies, allows the system to achieve high performance with a relatively modest compute budget. The documentation reports 47% Model FLOPs Utilization (MFU) during pre-training, meaning nearly half of the hardware's theoretical peak throughput went into useful model computation, a strong figure at this scale.
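For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a training run actually performs to the hardware's theoretical peak. A back-of-the-envelope version, using the common approximation of roughly 6 FLOPs per active parameter per training token, is shown below; the throughput and hardware numbers are invented for illustration and are not Baidu's published setup.

```python
# Back-of-the-envelope MFU: achieved training FLOPs over hardware peak FLOPs.
# The ~6 FLOPs/param/token rule of thumb covers forward plus backward passes;
# throughput and peak numbers below are illustrative, not Baidu's.
def mfu(tokens_per_sec: float, active_params: float, peak_flops_per_sec: float) -> float:
    achieved = 6 * active_params * tokens_per_sec
    return achieved / peak_flops_per_sec

# 3e9 active parameters, a hypothetical 26k tokens/s per GPU,
# and an H100's ~989 TFLOP/s dense BF16 peak:
print(f"{mfu(2.6e4, 3e9, 989e12):.0%}")   # ~47% with these example numbers
```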

From a deployment standpoint, the model’s compatibility with popular open‑source frameworks—Hugging Face Transformers, vLLM, and Baidu’s own FastDeploy—means that organizations can integrate ERNIE‑4.5‑VL into existing pipelines without a complete platform overhaul. The FastDeploy toolkit, in particular, offers production‑ready inference solutions that support a range of quantization schemes, allowing enterprises to trade off precision for speed and memory usage.
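As a starting point, loading the checkpoint through Transformers might look like the sketch below. The repository id and the trust_remote_code requirement are assumptions based on how Baidu typically publishes ERNIE checkpoints; consult the model card for the officially supported invocation.

```python
# Hedged sketch: loading the model with Hugging Face Transformers.
# The repo id is assumed; check the model card for the exact usage.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"   # assumed repository id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 weights fit the 28B model in ~56 GB,
    device_map="auto",            # leaving headroom on a single 80 GB GPU
    trust_remote_code=True,       # the release ships custom multimodal code
)
```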

The Apache 2.0 license is another strategic advantage. Unlike many closed‑source models that impose revenue‑sharing or usage restrictions, Baidu’s permissive license permits unrestricted commercial use. This openness is likely to accelerate adoption among mid‑market companies that are wary of vendor lock‑in.

Performance Claims and Independent Validation

Baidu’s performance claims—outperforming Gemini 2.5 Pro and GPT‑5‑High on document and chart understanding benchmarks—have generated both enthusiasm and skepticism. While the company cites extensive mid‑training phases and a diverse corpus of premium visual‑language reasoning data, independent benchmarks are still pending. Analysts caution that benchmark performance often fails to capture real‑world behavior, especially in domains where data distributions shift or where models must handle adversarial inputs.

Nevertheless, if the claims hold up under scrutiny, they would signal a significant shift in the multimodal AI landscape. A compact, openly available model that can match or exceed the performance of larger, proprietary systems would challenge the prevailing narrative that only massive compute budgets can yield cutting‑edge capabilities.

Conclusion

Baidu’s ERNIE‑4.5‑VL‑28B‑A3B‑Thinking represents a bold step forward in the quest for efficient, high‑performance multimodal AI. By marrying a mixture‑of‑experts architecture with dynamic image analysis and a human‑like reasoning paradigm, the model promises to deliver enterprise‑grade document understanding, chart analysis, and visual grounding without the prohibitive hardware requirements that have traditionally accompanied large vision‑language systems. The open‑source release under Apache 2.0, coupled with a robust tooling ecosystem, positions Baidu as a serious contender in the global AI infrastructure market.

While independent validation is still pending, the model's design choices (efficient routing, dynamic zoom, and tool integration) address many of the pain points that have historically limited the deployment of multimodal AI in production settings. For organizations looking to unlock the value of visual data, ERNIE-4.5-VL offers a compelling blend of performance, cost-effectiveness, and flexibility.

Call to Action

If you’re an enterprise data scientist, AI engineer, or product manager exploring multimodal solutions, it’s time to dive into Baidu’s ERNIE‑4.5‑VL. Start by cloning the Hugging Face repository, experiment with the “Thinking with Images” feature on your own document‑heavy workloads, and evaluate the model’s performance against your internal benchmarks. Leverage the FastDeploy toolkit to prototype a low‑latency inference pipeline that can run on a single 80 GB GPU. By engaging with the open‑source community, you can help shape the next generation of multimodal AI while positioning your organization at the forefront of visual‑language innovation.
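As a concrete first experiment, a minimal vLLM prototype could look like the sketch below. The repository id is again an assumption, and a real document workload would attach images through vLLM's multimodal input format rather than the text-only prompt shown here.

```python
# Hedged sketch: a minimal vLLM prototype on a single GPU.
# Repo id assumed; a real document workload would attach images via
# vLLM's multi-modal inputs instead of this text-only prompt.
from vllm import LLM, SamplingParams

llm = LLM(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",  # assumed repository id
    trust_remote_code=True,                        # custom multimodal code
    max_model_len=8192,
)
params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the key figures in this invoice."], params)
print(outputs[0].outputs[0].text)
```

From there, swapping in FastDeploy with one of its quantization schemes is a natural next step when latency or memory budgets tighten.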
