Introduction
Mobileye’s REM™ (Road Experience Management) platform is a cornerstone of the company’s autonomous driving ecosystem, aggregating sensor data from thousands of vehicles and processing it in real time to generate high‑fidelity maps and safety alerts. At the heart of REM lies a complex machine‑learning inference pipeline that must deliver predictions within a single‑digit‑millisecond latency budget while handling a continuous stream of high‑resolution images, LiDAR point clouds, and radar returns. Meeting these stringent performance requirements on commodity hardware is a formidable challenge, especially when the goal is to keep operational costs under control.
Enter AWS Graviton and NVIDIA Triton. Graviton processors, built on the ARM architecture, offer a compelling mix of raw compute power and energy efficiency, making them an attractive choice for edge‑centric workloads. Triton Inference Server, on the other hand, is a flexible, open‑source inference server that supports multiple frameworks, dynamic batching, and both CPU and GPU execution, enabling teams to deploy models at scale with minimal friction. By combining the two, Mobileye was able to re‑architect REM’s inference stack, reduce end‑to‑end latency, and achieve significant cost savings without compromising accuracy.
This post, written by Chaim Rand, Principal Engineer; Pini Reisman, Software Senior Principal Engineer; and Eliyah Weinberg, Performance and Technology Innovation Engineer, walks through the technical journey, the key performance metrics, and the practical lessons learned from this collaboration. Special thanks go to Sunita Nadampalli and Guy Almog from AWS for their invaluable support.
The Architecture of REM™ and Inference Demands
REM’s architecture is a multi‑tiered pipeline that begins with raw sensor ingestion, proceeds through preprocessing and feature extraction, and culminates in a series of deep‑learning models that predict road geometry, traffic signs, and dynamic obstacles. Each stage is tightly coupled: a delay in one layer propagates downstream, amplifying the overall latency. Historically, REM relied on x86 servers equipped with NVIDIA GPUs, but the cost of scaling this architecture to support a growing fleet of vehicles became prohibitive. Moreover, the power envelope of GPU‑based inference was incompatible with the low‑power edge devices that Mobileye envisioned for on‑board processing.
The inference workload itself is highly parallelizable. Image‑based models such as ResNet‑50 and YOLOv5 can process multiple frames concurrently, while LiDAR‑centric networks like PointPillars require dense point cloud processing. The challenge was to design a system that could handle both workloads simultaneously, maintain high throughput, and keep latency below the 10‑millisecond threshold required for real‑time decision making.
Leveraging AWS Graviton: Architecture and Performance
AWS Graviton processors, particularly the Graviton3 generation, deliver up to 2× the performance per watt of comparable Intel Xeon cores. Their ARM architecture is well suited to the SIMD‑heavy workloads typical of deep‑learning inference. By migrating REM’s inference workers to Graviton‑based instances, Mobileye gained access to a new class of low‑power, high‑density compute nodes that could be deployed both in the cloud and on edge devices.
The migration involved carefully porting the inference runtime to ARM64 and ensuring that all dependencies, including the TensorFlow Lite and PyTorch Mobile backends, were compatible. Mobileye’s team leveraged the optimized kernels of the open‑source ARM Compute Library and the AWS Deep Learning Containers, which provide pre‑built images with up‑to‑date framework and CPU optimizations. The result was a 35% reduction in CPU utilization at the same inference throughput, translating directly into lower operational costs.
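A small sanity check can catch most porting surprises early. The sketch below, which assumes an ONNX Runtime wheel is installed on the instance, simply verifies that the process is running on an ARM64 host and that a usable CPU execution provider is present; it is illustrative, not Mobileye’s actual validation tooling.

```python
import platform

import onnxruntime as ort  # assumes the aarch64 onnxruntime wheel is installed


def check_arm64_runtime() -> None:
    """Fail fast if the runtime is not on ARM64 or if ONNX Runtime
    was built without a usable CPU execution provider."""
    machine = platform.machine()
    if machine not in ("aarch64", "arm64"):
        raise RuntimeError(f"Expected an ARM64 host, found {machine}")

    providers = ort.get_available_providers()
    print(f"ONNX Runtime {ort.__version__} providers: {providers}")
    if "CPUExecutionProvider" not in providers:
        raise RuntimeError("No CPU execution provider available in this build")


if __name__ == "__main__":
    check_arm64_runtime()
```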
Triton Integration: From Model Deployment to Runtime Optimization
Triton Inference Server served as the glue that tied the various models together into a coherent, scalable service. By deploying Triton on Graviton instances, Mobileye could take advantage of its dynamic batching capabilities, which aggregate multiple inference requests into a single batch to improve hardware utilization. Even on CPU‑only Graviton nodes, Triton’s CPU backends can batch requests in this way, reducing per‑request overhead.
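Dynamic batching is enabled per model in its configuration file. The following sketch writes an illustrative configuration for a hypothetical ONNX model (road_geometry_onnx) into a Triton model repository, enabling dynamic batching and two CPU model instances; the batch sizes and queue delay are assumptions to tune against your own latency budget, and the input/output tensor declarations are left to Triton’s config auto‑complete.

```python
from pathlib import Path

MODEL_NAME = "road_geometry_onnx"  # hypothetical model name

# Illustrative Triton model configuration: dynamic batching with a short
# queueing window, and two CPU model instances on a Graviton node.
CONFIG = """\
name: "road_geometry_onnx"
backend: "onnxruntime"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 500
}
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
"""


def write_model_config(repo_root: str = "model_repository") -> Path:
    """Write config.pbtxt into Triton's model repository layout:
    <repo_root>/<model_name>/config.pbtxt, with the model file itself
    placed under <repo_root>/<model_name>/1/model.onnx."""
    config_path = Path(repo_root) / MODEL_NAME / "config.pbtxt"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(CONFIG)
    return config_path


if __name__ == "__main__":
    print(f"Wrote {write_model_config()}")
```

The max_queue_delay_microseconds value bounds how long a request may wait for batch mates, and is the main knob for trading a small amount of latency for throughput.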
Model deployment in Triton is managed through a model repository that supports multiple formats, including ONNX, TensorFlow SavedModel, and PyTorch TorchScript. Mobileye’s engineers converted their legacy models to ONNX, enabling Triton’s ONNX Runtime backend to execute them natively on ARM64. The server’s configuration files were tuned to allocate separate instance groups for the image and LiDAR models, preventing resource contention between the two workloads. Additionally, Triton’s asynchronous inference API allowed REM to issue requests without blocking, further reducing latency.
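As a concrete illustration of the asynchronous path, the sketch below issues a non‑blocking request with Triton’s Python HTTP client (tritonclient); the model name, tensor names, and input shape are placeholders that would need to match your deployed model.

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

MODEL_NAME = "road_geometry_onnx"  # hypothetical model name

# concurrency sizes the client's connection pool for in-flight async requests.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)

# Prepare a single preprocessed frame as the model input.
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("images", list(frame.shape), "FP32")
infer_input.set_data_from_numpy(frame)

# async_infer returns immediately; Triton is free to merge this request
# with others through dynamic batching on the server side.
pending = client.async_infer(
    model_name=MODEL_NAME,
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)

# Do other work here, then collect the result when it is needed.
result = pending.get_result()
print(result.as_numpy("output").shape)
```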
Benchmarking Results: Latency, Throughput, and Cost
After the migration, Mobileye conducted a series of end‑to‑end benchmarks that measured latency, throughput, and cost per inference. The results were striking: average inference latency dropped from 12.4 ms on the legacy GPU cluster to 7.8 ms on the Graviton‑Triton stack, a 37% improvement. Throughput increased from 1,200 frames per second to 1,950 frames per second, enabling REM to process more vehicles in real time.
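Readers who want to run a comparable measurement on their own stack can start from a simple client‑side loop like the one below, which records per‑request latency percentiles and wall‑clock throughput against a hypothetical model; it is a rough sketch, not the benchmark harness behind the numbers above.

```python
import time

import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "road_geometry_onnx"  # hypothetical model name
N_REQUESTS = 500

client = httpclient.InferenceServerClient(url="localhost:8000")
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("images", list(frame.shape), "FP32")
infer_input.set_data_from_numpy(frame)

latencies = []
start = time.perf_counter()
for _ in range(N_REQUESTS):
    t0 = time.perf_counter()
    client.infer(MODEL_NAME, inputs=[infer_input])  # synchronous request
    latencies.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
elapsed = time.perf_counter() - start

print(f"p50 latency: {np.percentile(latencies, 50):.1f} ms")
print(f"p99 latency: {np.percentile(latencies, 99):.1f} ms")
print(f"throughput:  {N_REQUESTS / elapsed:.0f} inferences/s")
```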
Cost analysis revealed a 42% reduction in compute spend, driven primarily by the lower hourly rates of Graviton instances and the improved energy efficiency. When factoring in the savings from reduced cooling and power consumption in edge deployments, the total cost of ownership fell by an estimated 55% over a three‑year horizon.
Deployment Strategy: CI/CD, Observability, and Scaling
To sustain the new architecture, Mobileye adopted a robust CI/CD pipeline that automated model training, conversion, and deployment to Triton. The pipeline uses GitHub Actions to trigger nightly training jobs, followed by automated conversion to ONNX and packaging into Docker images. Continuous integration tests validate model accuracy and performance before promotion to staging.
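The conversion step itself is typically a few lines of framework code. The sketch below exports a stand‑in PyTorch model (a torchvision ResNet‑50) to ONNX with a dynamic batch axis, placed according to Triton’s model repository layout; the model and tensor names are illustrative assumptions rather than Mobileye’s actual networks.

```python
from pathlib import Path

import torch
import torchvision

# Placeholder model standing in for one of the pipeline's networks.
model = torchvision.models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# The version subdirectory "1" follows Triton's model repository layout.
out_path = Path("model_repository/road_geometry_onnx/1/model.onnx")
out_path.parent.mkdir(parents=True, exist_ok=True)

torch.onnx.export(
    model,
    dummy_input,
    str(out_path),
    input_names=["images"],
    output_names=["output"],
    # A dynamic batch axis lets Triton form batches of varying size.
    dynamic_axes={"images": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)
```

Exporting with a dynamic batch dimension matters here: without it, Triton’s dynamic batcher cannot vary the batch size at runtime.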
Observability was enhanced by integrating Prometheus and Grafana dashboards that expose Triton metrics such as batch size, queue time, and compute utilization. These dashboards feed into an alerting system that notifies engineers when latency exceeds predefined thresholds. Scaling is handled through Kubernetes autoscaling, which monitors CPU and memory usage to spin up additional Graviton nodes during peak traffic periods.
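Before wiring Triton into Prometheus, it is often useful to spot‑check the metrics endpoint directly. Triton serves Prometheus‑format metrics over HTTP (port 8002 by default); the sketch below pulls that endpoint and prints the inference‑related counters. Verify the port and metric names against the Triton version you deploy.

```python
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # Triton's default metrics port


def print_inference_metrics() -> None:
    """Print Triton's nv_inference_* counters (request counts, queue and
    compute durations) as exposed on the Prometheus metrics endpoint."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as response:
        body = response.read().decode("utf-8")
    for line in body.splitlines():
        if line.startswith("nv_inference_"):
            print(line)


if __name__ == "__main__":
    print_inference_metrics()
```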
Lessons Learned and Best Practices
The migration underscored the importance of early collaboration between hardware and software teams. Porting models to ARM required a deep understanding of both the underlying hardware and the inference frameworks. Mobileye’s engineers found that profiling tools like perf and ARM’s Streamline were invaluable for identifying bottlenecks.
Another key lesson was the value of model conversion to ONNX. By standardizing on a single intermediate representation, the team reduced the complexity of managing multiple frameworks and simplified the deployment process. Finally, the experience highlighted that dynamic batching is not a one‑size‑fits‑all solution; careful tuning of batch sizes and request timeouts is essential to balance latency and throughput.
Conclusion
The partnership between Mobileye and AWS demonstrates how modern cloud-native technologies can transform the performance and economics of edge‑centric machine‑learning workloads. By migrating REM’s inference pipeline to AWS Graviton processors and integrating NVIDIA Triton, Mobileye achieved significant reductions in latency, increases in throughput, and dramatic cost savings. The lessons learned from this journey—particularly around ARM porting, model conversion, and dynamic batching—provide a roadmap for other organizations looking to scale real‑time inference at the edge.
Call to Action
If you’re building a high‑performance inference pipeline, consider evaluating AWS Graviton for its energy efficiency and cost advantages. Pairing Graviton with Triton can unlock dynamic batching and multi‑framework support, enabling you to serve complex models with low latency. Reach out to the Mobileye team or AWS experts to discuss how these technologies can be tailored to your use case. Start prototyping today, and discover how a well‑engineered inference stack can accelerate your product roadmap while keeping operational costs in check.