63. Vision-Language-Action (VLA) Models
Mini-Project: VLA Pipeline: Warehouse Robot Sorting Items Into Bins
A perceive-plan-act pipeline that detects unsorted items on a conveyor, plans a pick-and-place sequence with greedy nearest-first routing, executes the actions, and verifies correct placement, replanning dynamically when a path is blocked.
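A minimal Python sketch of this loop, assuming hypothetical perception and control callbacks (`detect_items`, `pick_and_place`, `path_blocked`, `verify_bin`) in place of real hardware APIs:

```python
import math

def nearest_first_plan(start_xy, items):
    """Greedy nearest-first route: always pick the item closest to the gripper."""
    plan, pos, remaining = [], start_xy, list(items)
    while remaining:
        nxt = min(remaining, key=lambda it: math.dist(pos, it["xy"]))
        plan.append(nxt)
        pos = nxt["xy"]
        remaining.remove(nxt)
    return plan

def run_sort_cycle(gripper_xy, detect_items, pick_and_place, path_blocked, verify_bin):
    """One perceive-plan-act-verify cycle over the conveyor."""
    items = detect_items()                          # perceive: camera -> item poses + labels
    plan = nearest_first_plan(gripper_xy, items)    # plan: nearest-first pick order
    while plan:
        item = plan.pop(0)
        if path_blocked(gripper_xy, item["xy"]):    # blocked path -> dynamic replanning
            plan = nearest_first_plan(gripper_xy, detect_items())
            continue
        pick_and_place(item, bin_id=item["label"])  # act: move the item into its bin
        gripper_xy = item["xy"]
        if not verify_bin(item):                    # verify: re-perceive and replan on failure
            plan = nearest_first_plan(gripper_xy, detect_items())
    return gripper_xy
```

Nearest-first routing is a greedy heuristic rather than a globally optimal tour, but it keeps replanning cheap whenever the scene changes.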
Description
VLA models integrate vision (camera and sensor input), language (instructions and reasoning), and action (motor commands or API calls) into a unified model. They process visual observations, interpret language goals, and output physical or digital actions, bridging the gap between text-only LLMs and embodied agents that must perceive and act in the physical world.
Key examples: Google RT-2, OpenAI's vision-enabled agents, and various robotics foundation models.
Architecture Diagram
```mermaid
flowchart TD
    A[Camera / Sensor Input] --> B[Vision Encoder]
    C[Language Instruction] --> D[Language Encoder]
    B --> E[Multimodal Fusion]
    D --> E
    E --> F[Action Decoder]
    F --> G[Robot / API Actions]
    style B fill:#2196F3,color:#fff
    style D fill:#9C27B0,color:#fff
    style E fill:#FF5722,color:#fff
    style F fill:#FF9800,color:#fff
```
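The diagram maps directly onto a model skeleton. Below is a toy PyTorch version; every dimension and module choice (a small CNN vision encoder, mean-pooled embeddings for language, concatenation fusion, an MLP action head predicting a 7-DoF command) is illustrative, not taken from any published VLA:

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy VLA skeleton mirroring the diagram: two encoders, fusion, action head."""
    def __init__(self, vocab_size=1000, d_model=256, action_dim=7):
        super().__init__()
        # Vision encoder: image -> d_model feature vector
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Language encoder: token ids -> mean-pooled d_model vector
        self.embed = nn.Embedding(vocab_size, d_model)
        # Multimodal fusion: concatenate both modalities and project
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Action decoder: fused features -> continuous action (e.g. 7-DoF arm command)
        self.action_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, action_dim),
        )

    def forward(self, image, tokens):
        v = self.vision(image)              # (B, d_model)
        l = self.embed(tokens).mean(dim=1)  # (B, d_model)
        fused = torch.relu(self.fuse(torch.cat([v, l], dim=-1)))
        return self.action_head(fused)      # (B, action_dim)

model = TinyVLA()
img = torch.randn(2, 3, 64, 64)             # batch of camera frames
cmd = torch.randint(0, 1000, (2, 8))        # tokenized instruction
actions = model(img, cmd)                   # -> (2, 7) action vectors
```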
Key Models/Systems
| Model | Creator | Description |
|---|---|---|
| RT-2 | Google DeepMind | Vision-Language-Action model for robot manipulation |
| Octo | UC Berkeley | Open-source generalist robot policy |
| OpenVLA | Stanford/Berkeley | Open-source VLA for robotics |
| PaLM-E | Google | Embodied multimodal language model |
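Of these, OpenVLA is the simplest to try locally. A sketch of loading it from the Hugging Face Hub, following the usage pattern on the OpenVLA model card; treat the prompt format, dtype, and `unnorm_key` value as assumptions to verify against the current card:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# trust_remote_code pulls in the model-specific prediction head; details
# follow the model card's example and may change between releases.
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("conveyor_frame.png")  # camera observation (hypothetical file)
prompt = "In: What action should the robot take to pick up the red box?\nOut:"

inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
# predict_action returns a 7-DoF end-effector delta; unnorm_key selects the
# dataset statistics used to un-normalize it (assumption: bridge_orig).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```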