AsyncVLA is an asynchronous control framework that enables real-time deployment of large robotic foundation models while handling network and inference latency.
Robotic foundation models generalize well via internet-scale vision-language representations, but their high computational cost causes latency that breaks the control loop, making them unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation for high-level guidance, while a lightweight onboard Edge Adapter continuously refines actions. To align these asynchronous streams, we introduce end-to-end finetuning and a trajectory re-weighting strategy that prioritizes dynamic interactions. On real-world vision-based navigation tasks with delays of up to 6 seconds, AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, bridging semantic intelligence and edge reactivity.
VLAs inherit internet-scale knowledge from VLMs, giving them far superior vision and language understanding compared to models trained from scratch. However, scaling up robot policies comes at the cost of inference speed: longer inference times slow the control loop, so the robot acts on increasingly stale observations and overall performance degrades. This paper addresses the question: how can large robotic foundation models be deployed on the edge without being limited by their computational cost?
These challenges are especially pronounced for mobile robots. With a limited ego-centric field of view, a robot cannot fully perceive its surroundings from a single observation, and dynamic obstacles such as pedestrians exacerbate this issue. Mobile robots must therefore continuously update their observations and generate actions from the latest sensory inputs. Scaling up the foundation model increases the computational load, which directly reduces performance in real-world navigation tasks.
We propose AsyncVLA, an asynchronous navigation system that combines the rich language-visual understanding of robotic foundation models with fast, reactive control. To achieve reactive behavior without losing semantic understanding, AsyncVLA integrates three key components: (1) a lightweight onboard Edge Adapter for action refinement, (2) automatic data re-balancing to encourage responsiveness, and (3) end-to-end training to better align the base VLA with the Edge Adapter. During inference, AsyncVLA splits computation across two tiers: the base VLA runs on a high-performance workstation, while the Edge Adapter runs on the robot’s onboard controller, processing the latest observations with minimal latency for rapid reactions, as sketched below.
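The separation can be pictured as two loops that share only a small guidance buffer. The following Python sketch is a minimal illustration of this asynchronous decoupling under our own assumptions; the function names (`base_vla_inference`, `edge_adapter`, `send_action`) and the in-process buffer are hypothetical stand-ins for the real networked system.

```python
import threading
import time

# Shared guidance buffer between the slow (workstation) and fast (onboard) loops.
# In the real system the guidance crosses the network; here a lock-guarded
# variable stands in for that channel.
_latest_guidance = None
_guidance_lock = threading.Lock()

def base_vla_loop(get_observation, base_vla_inference, period_s=2.0):
    """Slow loop: query the large base VLA for high-level guidance."""
    global _latest_guidance
    while True:
        obs = get_observation()               # observation at request time
        guidance = base_vla_inference(obs)    # e.g. compressed action-token embeddings
        with _guidance_lock:
            _latest_guidance = guidance       # may arrive seconds after obs was taken
        time.sleep(period_s)

def edge_adapter_loop(get_observation, edge_adapter, send_action, rate_hz=20.0):
    """Fast loop: refine possibly stale guidance using the newest observation."""
    while True:
        obs = get_observation()               # always the latest sensor reading
        with _guidance_lock:
            guidance = _latest_guidance
        if guidance is not None:
            send_action(edge_adapter(obs, guidance))
        time.sleep(1.0 / rate_hz)
```

Because only the fast loop drives the actuators, a late or dropped guidance update degrades gracefully instead of stalling the robot.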
We build AsyncVLA on OmniVLA, a vision-based robotic foundation model for navigation, which serves as our base VLA. OmniVLA generates actions conditioned on multiple goal modalities, including 2D goal poses, language instructions, and egocentric goal images. It interprets these modalities using large-scale visual encoders (SigLIP and DINOv2) and a language model (LLaMA2 7B).
Since the Edge Adapter handles action generation, we feed it OmniVLA’s action token embeddings, which encode both action features and the semantic-visual information interpreted by OmniVLA. These embeddings have a high dimensionality of 8×4×4096 (action chunk size × action dimension × embedding size). Feeding them directly into the Edge Adapter would increase onboard network size and computational cost. To address this, we apply a **token projector** with two MLP ResNet blocks to compress each token from 4×4096 to 1024 dimensions, enabling efficient transmission from the workstation to the robot during inference.
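For concreteness, the projector could be implemented roughly as below. This PyTorch sketch only matches the stated shapes (two MLP ResNet blocks, 4×4096 → 1024 per token); the hidden width, normalization, and activation are our assumptions, not OmniVLA's exact configuration.

```python
import torch
import torch.nn as nn

class MLPResNetBlock(nn.Module):
    """Residual MLP block; layer sizes and norms are assumptions, not the exact config."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection

class TokenProjector(nn.Module):
    """Compress each action token from 4 x 4096 = 16384 dims to 1024 dims."""
    def __init__(self, in_dim=4 * 4096, out_dim=1024, hidden=2048):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, out_dim)
        self.blocks = nn.Sequential(
            MLPResNetBlock(out_dim, hidden),
            MLPResNetBlock(out_dim, hidden),
        )

    def forward(self, tokens):
        # tokens: (batch, 8, 4, 4096) -> (batch, 8, 1024)
        b, chunk, act_dim, emb = tokens.shape
        x = tokens.reshape(b, chunk, act_dim * emb)
        return self.blocks(self.proj_in(x))
```

With this compression, each guidance message carries 8×1024 values instead of 8×16384, which matters when it has to be streamed to the robot over weak WiFi.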
We evaluate AsyncVLA on 2D goal pose– and language-conditioned navigation tasks in challenging, cluttered indoor and outdoor environments, including settings with weak WiFi connectivity.
Because AsyncVLA reacts to the latest observation without delay, it allows the robot to yield to the pedestrian, pass safely without collision, and then proceed toward the goal. In contrast, OmniVLA, which runs inference on a workstation, suffers from delayed action updates and consequently collides with the pedestrian, failing to reach the goal.
We conducted experiments involving various interactions with the robot. The following videos present representative samples from these experiments.
We evaluate policy robustness under artificially introduced workstation delays of 0.2 s, 2.0 s, and 5.0 s, while the robot’s onboard controller computes actions at its maximum rate. As shown in the following graph, AsyncVLA outperforms OmniVLA even at 0.2 s latency, with the performance gap widening as delays increase. The onboard Edge Adapter adapts the base VLA’s guidance based on current observations, whereas OmniVLA generates actions from stale data, causing navigation performance to degrade under higher latencies.
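To make the protocol concrete, the artificial delay can be emulated by holding the workstation's guidance in a timed queue before the onboard loop may read it; the helper below is a hedged sketch of such a latency injector (the class name and queue-based design are ours, not from the paper).

```python
import time
from collections import deque

class DelayedChannel:
    """Deliver messages only after a fixed artificial delay (e.g. 0.2 s, 2.0 s, 5.0 s)."""
    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self.queue = deque()  # entries are (ready_time, message)

    def send(self, message):
        self.queue.append((time.monotonic() + self.delay_s, message))

    def receive_latest(self):
        """Return the newest message whose delay has elapsed, or None if none is ready."""
        now = time.monotonic()
        latest = None
        while self.queue and self.queue[0][0] <= now:
            _, latest = self.queue.popleft()
        return latest
```

The onboard loop would call `receive_latest()` once per control step and keep using the previous guidance whenever nothing new has cleared the delay.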
The videos compare AsyncVLA with the strong baseline, OmniVLA. At 5 Hz with no delay, both perform well, but when artificial delay reduces the base VLA's update rate to 0.2 Hz, OmniVLA overshoots and fails to navigate. In contrast, AsyncVLA remains reliable, as the onboard Edge Adapter quickly adjusts actions and guides the robot toward the goal.
We conducted quantitative experiments in cluttered environments, with and without pedestrians, navigating the robot to goals up to 30 m away. The following videos show representative samples from experiments without pedestrians.
Since AsyncVLA builds on OmniVLA, it inherits the ability to translate high-level language and visual understanding into actions, enabling effective instruction-following. In our evaluations, AsyncVLA remains robust to out-of-distribution language instructions, achieving performance comparable to OmniVLA.
@misc{hirose2026asyncvla,
      title={AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge},
      author={Noriaki Hirose and Catherine Glossop and Dhruv Shah and Sergey Levine},
      year={2026},
      eprint={2602.13476},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.13476},
}