Introduction
Warner Bros. Discovery (WBD) is a global media powerhouse that delivers an expansive catalog of television, film, and streaming content. In a landscape where audience attention is fragmented across countless platforms, the company relies heavily on real‑time recommendation engines to keep viewers engaged and to drive subscription growth. These recommendation systems are powered by sophisticated machine‑learning (ML) models that ingest vast amounts of user data, content metadata, and contextual signals to generate personalized suggestions within milliseconds. The sheer scale of WBD’s operations—serving millions of users worldwide—means that the underlying inference infrastructure must be both highly performant and cost‑efficient.
To meet these demands, WBD turned to Amazon Web Services (AWS) and specifically to the Graviton family of Arm‑based processors. By moving its Amazon SageMaker inference endpoints to Graviton‑based instances, the company achieved a remarkable 60% reduction in inference costs while simultaneously improving latency by 7% to 60% across a range of models. This post delves into the challenges WBD faced, the architectural choices that led to these savings, and the practical lessons that other enterprises can draw from this success story.
The Scale of WBD’s AI Demands
WBD’s recommendation pipeline is a multi‑stage architecture that begins with data ingestion, moves through feature engineering, and culminates in model inference. The pipeline processes terabytes of data each day, and the inference stage must handle millions of requests per second during peak viewing hours. Traditional x86‑based instances, while powerful, became a constraint on both raw throughput and operational cost. The company’s data scientists and infrastructure engineers needed a solution that could deliver the same computational power at a fraction of the price, without compromising the low‑latency requirements of real‑time personalization.
Choosing the Right Hardware: AWS Graviton
AWS Graviton processors are built on the Arm architecture and are designed for high performance per watt. They offer a compelling blend of compute density, memory bandwidth, and energy efficiency. For WBD, the decision to adopt Graviton was driven by several factors:
- Cost Efficiency – Graviton instances are priced lower than their x86 counterparts, which directly translates to savings in the inference tier.
- Performance Parity – Benchmarks from AWS and third‑party studies show that Arm‑based instances can match or exceed x86 performance for many ML workloads, especially those that are highly parallel.
- Ecosystem Support – Amazon SageMaker, the managed ML service used by WBD, provides native support for Graviton instances, simplifying deployment and scaling.
By aligning the hardware choice with the specific workload characteristics—namely, the need for high throughput and low latency—WBD positioned itself to reap significant operational benefits.
Implementation with SageMaker
SageMaker offers a fully managed environment for training, deploying, and monitoring ML models. WBD leveraged SageMaker’s inference endpoints, which can be configured to run on Graviton instances. The deployment process involved several key steps:
- Model Packaging – Models were containerized using Docker, ensuring that all dependencies were encapsulated and that the inference code could run consistently across environments. (Note that images destined for Graviton instances must be built for the linux/arm64 platform.)
- Endpoint Configuration – Each model was deployed to a dedicated endpoint with a specified instance type (e.g., ml.m5.4xlarge for x86 and ml.m6g.4xlarge for Graviton). The endpoints were configured with autoscaling policies to handle traffic spikes; a minimal deployment sketch follows this list.
- Performance Tuning – WBD’s data scientists worked closely with AWS support to fine‑tune the inference code, optimizing for the Arm instruction set and leveraging vectorized operations where possible (a toy illustration appears after this list).
- Monitoring and Feedback – SageMaker’s built‑in metrics and logs were used to track latency, error rates, and resource utilization. This data informed iterative improvements to both the models and the deployment configuration.
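The post doesn’t reproduce WBD’s deployment code, but the steps above map directly onto the SageMaker API. The following is a minimal sketch using boto3; every name in it (image URI, S3 path, IAM role, endpoint names) is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")

# Register the model. The container image must be built for linux/arm64
# to run on Graviton (hypothetical URIs and names throughout).
sm.create_model(
    ModelName="recsys-graviton",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/recsys-inference:arm64",
        "ModelDataUrl": "s3://example-bucket/models/recsys/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)

# The instance type is the only Graviton-specific setting:
# ml.m5.4xlarge (x86) versus ml.m6g.4xlarge (Graviton).
sm.create_endpoint_config(
    EndpointConfigName="recsys-graviton-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "recsys-graviton",
            "InstanceType": "ml.m6g.4xlarge",
            "InitialInstanceCount": 2,
        }
    ],
)

# Create the real-time endpoint behind which the model serves traffic.
sm.create_endpoint(
    EndpointName="recsys-graviton",
    EndpointConfigName="recsys-graviton-config",
)
```

Because the instance type lives only in the endpoint configuration, comparing architectures amounts to creating a second configuration that differs in that one field and pointing a separate endpoint at it.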
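The post doesn’t detail which optimizations WBD applied, but one broadly applicable example of leveraging vectorized operations is replacing per‑item Python loops with array operations, which NumPy dispatches to a BLAS library that uses the CPU’s SIMD units (NEON on Graviton). A toy recommendation‑scoring illustration with made‑up embedding sizes:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
user_vec = rng.standard_normal(256).astype(np.float32)              # one user embedding
item_mat = rng.standard_normal((100_000, 256)).astype(np.float32)   # catalog embeddings

# Naive per-item scoring: an interpreted Python loop.
t0 = time.perf_counter()
slow = [float(user_vec @ item_mat[i]) for i in range(len(item_mat))]
t1 = time.perf_counter()

# Vectorized matrix-vector product: one BLAS call that exploits
# SIMD instructions (NEON on Arm, AVX on x86).
fast = item_mat @ user_vec
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.3f}s  vectorized: {t2 - t1:.4f}s")
assert np.allclose(slow, fast, rtol=1e-3, atol=1e-3)
```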
The result was a robust inference pipeline that could scale horizontally to handle thousands of requests per second while maintaining strict latency guarantees.
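For completeness, here is what a client‑side call against such an endpoint looks like; the request payload is hypothetical, since its shape depends entirely on the model’s serving code:

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

# Hypothetical payload for a recommender: a user ID plus viewing context.
response = smr.invoke_endpoint(
    EndpointName="recsys-graviton",
    ContentType="application/json",
    Body=json.dumps({"user_id": "u-12345", "context": {"device": "tv"}}),
)
recommendations = json.loads(response["Body"].read())
print(recommendations)
```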
Cost and Performance Gains
After migrating to Graviton‑powered SageMaker endpoints, WBD reported a 60% reduction in inference costs. This figure reflects a combination of lower instance pricing and improved resource utilization. Because Graviton instances deliver comparable performance to x86 instances, WBD could reduce the number of instances needed to handle peak traffic, further cutting operational expenses.
Latency improvements were equally impressive. Depending on the model, WBD observed reductions ranging from 7% to 60%. The most significant gains came from models that are heavily parallelizable, such as collaborative filtering and neural collaborative filtering architectures. These workloads benefit from Graviton’s high memory bandwidth and from the fact that every vCPU is a full physical core rather than a hyperthread, and their inference times dropped dramatically as a result.
Beyond raw numbers, the migration also yielded qualitative benefits. The team reported smoother scaling during unexpected traffic surges, reduced cold‑start times for new endpoints, and a more predictable cost structure that aligned with quarterly budgets.
Operational Insights and Best Practices
WBD’s journey offers several actionable insights for organizations considering a similar transition:
- Start with a Pilot – Deploy a single model to a Graviton endpoint and compare performance and cost against the existing x86 baseline. This controlled experiment validates assumptions before a full rollout; the latency‑comparison sketch after this list shows one way to pull the numbers.
- Leverage Managed Services – Using SageMaker’s managed inference endpoints abstracts away many of the operational complexities, allowing teams to focus on model quality rather than infrastructure maintenance.
- Optimize for Arm – Even if the underlying code is already efficient, small adjustments—such as using Arm‑specific libraries or vectorized operations—can unlock additional performance gains.
- Monitor Continuously – Real‑time metrics are essential for detecting regressions or bottlenecks early. Integrate monitoring dashboards with alerting to maintain service level objectives.
- Plan for Scale – Autoscaling policies should be tuned to the traffic patterns of the media industry, where viewership can spike during major releases or live events; the second sketch below shows a target‑tracking policy along those lines.
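Neither the pilot methodology nor WBD’s dashboards are described in the post, but both the baseline comparison and ongoing monitoring can start from the metrics SageMaker publishes to CloudWatch. A sketch that pulls p99 model latency for two hypothetically named endpoints (an x86 baseline and a Graviton pilot):

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

def p99_model_latency(endpoint_name: str, hours: int = 24) -> list[float]:
    """Hourly p99 ModelLatency (reported in microseconds) for an endpoint."""
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600,
        ExtendedStatistics=["p99"],
    )
    points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return [p["ExtendedStatistics"]["p99"] for p in points]

# Hypothetical endpoint names: compare the pilot against the baseline.
for name in ("recsys-x86-baseline", "recsys-graviton-pilot"):
    series = p99_model_latency(name)
    if series:
        print(f"{name}: mean hourly p99 = {sum(series) / len(series):,.0f} µs")
```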
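The post likewise doesn’t show WBD’s scaling policies, but SageMaker endpoint variants scale through Application Auto Scaling. A minimal target‑tracking sketch follows; the capacity bounds, invocation target, and cooldowns are illustrative assumptions, with scale‑out deliberately faster than scale‑in to absorb sudden viewership surges:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant names; capacity bounds are assumptions.
resource_id = "endpoint/recsys-graviton/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=20,
)

aas.put_scaling_policy(
    PolicyName="recsys-invocations-target",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Average invocations per instance per minute to aim for;
        # the right value comes from load testing, not from this sketch.
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # react quickly to surges
        "ScaleInCooldown": 300,   # release capacity conservatively
    },
)
```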
By following these practices, teams can replicate WBD’s success and achieve similar cost and performance improvements.
Future Directions
Looking ahead, WBD plans to expand its use of Arm‑based infrastructure beyond inference. Training workloads, which are often even more compute‑intensive, are a natural next step. AWS has announced new Graviton‑based instances optimized for training, and WBD is evaluating these for large‑scale model training pipelines. Additionally, the company is exploring hybrid architectures that combine on‑premises edge devices with cloud‑based inference to reduce latency for geographically dispersed audiences.
Conclusion
Warner Bros. Discovery’s migration to AWS Graviton‑powered SageMaker endpoints demonstrates that strategic hardware choices can deliver substantial cost savings and performance gains for large‑scale ML inference workloads. By aligning the Arm architecture with the specific demands of real‑time recommendation systems, WBD achieved a 60% reduction in inference costs and significant latency improvements across a diverse set of models. The experience underscores the importance of a data‑driven approach to infrastructure selection, the value of managed services in simplifying operations, and the tangible business benefits that can arise from thoughtful optimization.
Call to Action
If your organization relies on real‑time ML inference—whether for media recommendation, e‑commerce personalization, or any latency‑sensitive application—consider evaluating Arm‑based infrastructure. Start by benchmarking your current workloads against AWS Graviton instances, and explore SageMaker’s managed inference endpoints to reduce operational overhead. Reach out to AWS or a trusted partner to discuss how a migration could unlock cost efficiencies and performance improvements similar to those achieved by Warner Bros. Discovery. The next step could be the most impactful decision you make for your ML strategy this year.