Physical AI Foundations: Fueling Human-Machine Interaction

ThinkTools Team

AI Research Lead

Introduction

Physical artificial intelligence—often called physical AI—refers to systems that perceive, reason, and act within the tangible world. Unlike purely virtual agents that operate on text or images, physical AI must interpret noisy sensor streams, understand spatial relationships, and execute precise motor commands while maintaining safety and reliability. The promise of such systems is especially evident in healthcare, where robots can deliver supplies, assist in surgeries, or monitor patients, thereby freeing clinicians to focus on care that requires human empathy and judgment. Yet the journey from a conceptual idea to a robot that reliably navigates a busy hospital corridor is complex, involving a tightly coupled cycle of data acquisition, model development, edge deployment, and ongoing adaptation.

In this post we dissect that cycle in depth, using Diligent Robotics’ Moxi as a concrete example. Moxi is a mobile manipulation robot that has completed more than 1.2 million deliveries across hospitals worldwide, saving nearly 600,000 hours of clinical staff time. By following Moxi’s development path we can see how each stage of the pipeline contributes to the robot’s ability to understand its environment, reason about tasks, and interact safely with humans.

The discussion is framed around four core themes: grounding perception in reality, training models that can generalize to unseen situations, deploying intelligence at the edge where latency matters, and establishing continuous feedback loops that allow the robot to learn from its own experiences. Together, these themes form a blueprint that can be adapted to a wide range of physical AI applications, from warehouse automation to autonomous vehicles.

Main Content

Data Collection: Grounding Perception in Reality

The first step in any physical AI system is to build a dataset that captures the diversity of the robot’s operating environment. For Moxi, this meant recording thousands of hours of video, depth maps, and inertial measurements as the robot navigated hospital corridors, entered operating rooms, and interacted with nurses and patients. Importantly, the data collection process was not a one‑off event; it was an iterative loop where the robot’s early deployments revealed blind spots—such as cluttered medication carts or unexpected door swings—that required additional data.

To ensure that the perception models could handle the variability of real‑world lighting, occlusions, and human movement, the data was annotated with rich semantic labels: objects, people, and affordances. Rather than relying on hand‑crafted rules, the team employed semi‑automatic labeling tools that leveraged pre‑trained networks to propose bounding boxes, which human annotators then refined. This hybrid approach accelerated the labeling process while maintaining high quality.
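The triage step of such a semi-automatic pipeline can be sketched as follows. This is a minimal illustration, not Diligent Robotics’ actual tooling: the `Proposal` class, the class names, and the 0.9 auto-accept threshold are all assumptions made for the example. Confident model proposals pass straight into the dataset, while ambiguous ones are queued for human refinement.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Proposal:
    label: str          # proposed class, e.g. "medication_tray"
    box: tuple          # (x, y, w, h) in pixels
    confidence: float   # model confidence in [0, 1]

def triage_proposals(proposals: List[Proposal],
                     auto_accept: float = 0.9) -> Tuple[list, list]:
    """Split model proposals into auto-accepted labels and a human review queue."""
    accepted, review = [], []
    for p in proposals:
        (accepted if p.confidence >= auto_accept else review).append(p)
    return accepted, review

# Two confident detections pass through; one ambiguous box is
# routed to a human annotator for refinement.
props = [
    Proposal("iv_bag", (10, 20, 40, 80), 0.97),
    Proposal("medication_tray", (100, 50, 120, 60), 0.93),
    Proposal("medication_tray", (300, 40, 110, 55), 0.55),
]
accepted, review = triage_proposals(props)
```

The threshold trades labeling cost against label quality: lowering it sends less work to annotators but risks letting wrong boxes into the training set.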

The result was a multi‑modal dataset that fed into a perception pipeline capable of fusing RGB, depth, and LiDAR streams. By training convolutional neural networks on this data, Moxi learned to detect and localize items such as medication trays, IV bags, and surgical instruments with an accuracy that surpassed human performance in controlled tests.
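One common way to combine such modalities is early fusion, where aligned sensor channels are stacked before being fed to a network. The sketch below shows the idea for RGB plus depth only; the normalization scheme and shapes are illustrative assumptions, not details of Moxi’s actual pipeline.

```python
import numpy as np

def early_fuse(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Early fusion: stack a normalized depth map as a fourth channel
    alongside RGB, producing one (H, W, 4) input tensor.

    rgb:   (H, W, 3) uint8 image
    depth: (H, W) depth map in metres
    """
    rgb_n = rgb.astype(np.float32) / 255.0
    depth_n = (depth / depth.max()).astype(np.float32)[..., None]
    return np.concatenate([rgb_n, depth_n], axis=-1)

fused = early_fuse(np.full((480, 640, 3), 128, np.uint8),
                   np.full((480, 640), 2.5))
```

Late fusion, where each modality gets its own encoder and features are merged deeper in the network, is the usual alternative when sensors are not pixel-aligned.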

Model Training: From Raw Signals to Decision Policies

Once perception was robust, the next challenge was to translate sensory input into actionable decisions. Moxi’s control architecture is built around a hierarchical policy network that first decides what to do—e.g., pick up a medication tray—and then how to do it—e.g., plan a collision‑free path and grasp trajectory.

The what layer is a high‑level planner that uses a graph‑based representation of the hospital layout. Nodes in the graph correspond to rooms, corridors, and workstations, while edges encode navigable paths. The planner receives a goal state (e.g., deliver a medication tray to the oncology ward) and computes a route that respects constraints such as avoiding restricted zones or minimizing travel time. This route is then translated into a sequence of waypoints that the how layer follows.
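A planner of this shape can be sketched with a standard shortest-path search over a weighted graph. The floor layout and travel-time weights below are invented for illustration; restricted zones are handled here simply by omitting their edges from the graph.

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm over a weighted adjacency dict
    {node: {neighbor: cost}}; returns (total_cost, waypoint_list)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in visited:
                heapq.heappush(frontier, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

# Hypothetical floor graph: nodes are rooms and corridors,
# edge weights are travel times in minutes.
floor = {
    "pharmacy":   {"corridor_a": 1.0},
    "corridor_a": {"pharmacy": 1.0, "corridor_b": 2.0, "oncology": 4.0},
    "corridor_b": {"corridor_a": 2.0, "oncology": 1.5},
    "oncology":   {},
}
cost, route = shortest_route(floor, "pharmacy", "oncology")
```

Here the planner prefers the longer-looking detour through `corridor_b` (total cost 4.5) over the direct but slower edge (5.0), and the resulting waypoint list is what the lower layer would execute.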

The how layer is a reinforcement learning (RL) agent trained in simulation before being fine‑tuned on the real robot. The simulation environment replicates the physics of the hospital floor, including friction coefficients, door mechanics, and the dynamics of the robot’s arm. By training in a sandbox, the RL agent learns to balance speed and safety, adjusting joint torques to pick up objects without dropping them or colliding with nearby staff.

A key innovation in Moxi’s training pipeline is the use of domain randomization. During simulation, the agent is exposed to a wide range of visual and physical variations—different lighting conditions, object textures, and even random perturbations to the robot’s joints. This exposure forces the policy to rely on robust features rather than overfitting to a narrow set of conditions, thereby improving transfer to the real world.
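In practice, domain randomization amounts to sampling a fresh set of simulator parameters at the start of each training episode. The parameter names and ranges below are illustrative guesses, not Diligent Robotics’ actual values; the point is the structure, not the numbers.

```python
import random

def randomize_episode(rng: random.Random) -> dict:
    """Sample one set of simulation parameters for a training episode.

    All ranges are illustrative assumptions.
    """
    return {
        "floor_friction":  rng.uniform(0.4, 1.0),   # waxed vs. carpeted floor
        "light_intensity": rng.uniform(0.3, 1.5),   # dim night shift to bright OR
        "texture_id":      rng.randrange(200),      # swap object textures
        "joint_noise_std": rng.uniform(0.0, 0.05),  # rad, perturbs arm joints
        "door_spring_k":   rng.uniform(5.0, 25.0),  # door stiffness
    }

rng = random.Random(42)
episodes = [randomize_episode(rng) for _ in range(3)]
```

Because the policy never sees the same world twice, it cannot memorize a particular friction coefficient or lighting setup; it has to learn features that hold across the whole distribution, which is what makes transfer to the real hospital floor plausible.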

Edge Deployment: Bridging Cloud Intelligence and Physical Actuation

Deploying a sophisticated AI stack on a mobile robot presents unique constraints. Latency, bandwidth, and power consumption must be balanced against the need for real‑time decision making. Moxi’s architecture addresses these constraints by partitioning the workload between a lightweight on‑board computer and a remote cloud server.

On the robot, a compact GPU‑enabled module runs the perception pipeline and the high‑level planner. These components are latency‑critical: the robot must react within milliseconds to avoid bumping into a passing nurse. The lower‑level control loops—joint torque commands, motor velocity updates—run on a dedicated real‑time controller that guarantees deterministic timing.

Meanwhile, the cloud server hosts the RL policy and performs heavy‑weight inference that would be infeasible on the robot’s limited hardware. The robot streams compressed sensor data to the cloud, receives high‑level action recommendations, and then executes them locally. This split allows Moxi to benefit from the latest model updates without requiring a hardware overhaul.

Security and reliability are also paramount. All communication between the robot and the cloud is encrypted, and the system includes fail‑safe mechanisms that revert to a safe state if connectivity is lost. This design ensures that Moxi can continue operating safely even in environments with spotty network coverage.
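The decision logic for such an edge-cloud split can be sketched as a simple priority chain: try the cloud within a latency budget, fall back to the on-board policy, and otherwise hold a safe state. The class and function names here are hypothetical; real systems would layer authentication, encryption, and watchdog timers on top.

```python
SAFE_STOP = {"action": "hold_position"}

def decide(local_policy, cloud_client, obs, timeout_s=0.05):
    """Prefer the cloud recommendation, but fall back to the on-board
    policy (or a safe stop) if the link is slow or down."""
    try:
        return cloud_client.recommend(obs, timeout=timeout_s), "cloud"
    except (TimeoutError, ConnectionError):
        if local_policy is not None:
            return local_policy(obs), "edge"
        return SAFE_STOP, "failsafe"

# Stub client that simulates a dropped connection.
class OfflineClient:
    def recommend(self, obs, timeout):
        raise ConnectionError("link down")

action, source = decide(lambda obs: {"action": "follow_waypoints"},
                        OfflineClient(), obs={"lidar": []})
```

When the cloud is unreachable the robot keeps executing its locally cached plan, and only if that too is unavailable does it degrade to the safe stop, matching the fail-safe behavior described above.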

Continuous Feedback Loops: Learning in the Field

A physical AI system that only learns once and then remains static is unlikely to keep pace with the dynamic nature of a hospital. Moxi incorporates a continuous learning loop that captures data from every interaction, evaluates performance, and updates models accordingly.

After each delivery, the robot logs the trajectory, sensor readings, and any anomalies such as dropped items or near‑misses. These logs feed into an automated pipeline that retrains the perception model on new data and fine‑tunes the RL policy to address observed shortcomings. Importantly, the retraining process is performed offline on the cloud, and only the updated parameters are pushed to the robot during scheduled maintenance windows.
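A minimal version of the "evaluate, then retrain" gate might look like the following. The log fields and the 5% anomaly-rate threshold are assumptions for the sake of the example, not Moxi’s actual criteria.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DeliveryLog:
    trajectory_len: int   # number of recorded waypoints
    dropped_item: bool
    near_miss: bool

def flag_for_retraining(logs: List[DeliveryLog],
                        threshold: float = 0.05) -> Tuple[bool, float]:
    """Queue an offline retraining run when the anomaly rate across
    recent deliveries exceeds a threshold (value is illustrative)."""
    anomalies = sum(1 for l in logs if l.dropped_item or l.near_miss)
    rate = anomalies / max(len(logs), 1)
    return rate > threshold, rate

# 94 clean deliveries plus 6 with anomalies -> 6% anomaly rate.
logs = [DeliveryLog(120, False, False) for _ in range(94)]
logs += [DeliveryLog(150, True, False) for _ in range(6)]
retrain, rate = flag_for_retraining(logs)
```

Gating retraining on an aggregate rate, rather than on every single anomaly, keeps the update cadence aligned with the scheduled maintenance windows during which new parameters are pushed to the robot.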

Human operators also play a role in the feedback loop. When a nurse reports a problem—say, the robot misidentified a medication tray—the issue is logged and prioritized in the next training cycle. This human‑in‑the‑loop approach ensures that the system remains aligned with clinical workflows and safety standards.

The result is a robot that improves over time, adapting to new equipment, changes in corridor layout, or evolving staff routines. In practice, Moxi’s performance metrics—delivery time, error rate, and staff satisfaction—have shown measurable gains after each update cycle.

Conclusion

Physical AI is no longer a speculative concept; it is a mature technology that is reshaping industries where precision, safety, and adaptability are critical. By dissecting the lifecycle of Diligent Robotics’ Moxi, we have highlighted how data collection, model training, edge deployment, and continuous learning intertwine to create a system that can reliably navigate complex environments and augment human work.

The lessons from Moxi extend beyond healthcare. Any domain that requires robots to interact with humans—whether it be logistics, manufacturing, or domestic assistance—can benefit from the same principles: robust perception grounded in diverse data, hierarchical decision policies that balance high‑level goals with low‑level safety, efficient edge‑cloud architectures that respect latency constraints, and a culture of continuous improvement driven by real‑world feedback.

As the field of physical AI matures, we can expect to see more sophisticated agents that not only perform tasks but also learn to anticipate human needs, adapt to new contexts, and collaborate seamlessly with people. The future of human‑robot interaction hinges on the ability of these systems to learn from the world in real time, and the roadmap laid out by Moxi provides a practical blueprint for achieving that vision.

Call to Action

If you are a researcher, engineer, or business leader interested in deploying physical AI, start by investing in a robust data pipeline that captures the richness of your environment. Leverage simulation and domain randomization to accelerate model training, and design your system with a clear separation between latency‑critical on‑board processing and heavier cloud‑based inference. Finally, embed continuous learning into your operational workflow so that your robots can evolve alongside the people they serve. By following these principles, you can build physical AI solutions that are not only functional but also resilient, safe, and truly transformative.
