Visual Grounding
Locate the most relevant object or region in an image based on a natural-language query. The query can be a phrase, a sentence, or an instruction.
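Concretely, a grounding call takes an image plus a query and returns a region. The sketch below is a minimal illustration of that interface in Python; the function name, response schema, and values are hypothetical placeholders for illustration, not a documented Orby API.

```python
# A minimal sketch of a visual-grounding interface. The function,
# schema, and returned values are illustrative assumptions, not a
# documented Orby API.
from dataclasses import dataclass

@dataclass
class GroundingResult:
    # Bounding box in pixel coordinates: (left, top, right, bottom).
    box: tuple[int, int, int, int]
    # Model confidence in [0, 1].
    score: float

def ground(image_path: str, query: str) -> GroundingResult:
    """Return the region of the image that best matches the query.

    Stubbed here for illustration; a real grounder (ActIO or any
    multimodal model) would run inference on the image.
    """
    # Placeholder: a fixed box standing in for model output.
    return GroundingResult(box=(120, 340, 310, 372), score=0.97)

# The query can be a phrase, a sentence, or an instruction:
result = ground("checkout_page.png", "the 'Place order' button")
print(result.box, result.score)
```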
You’ve heard of large language models; now meet Orby’s Large Action Model (LAM).
Unlike language models, which generate words, Orby’s LAM generates actions. It is designed not just to understand your workflows but to execute them, automating complex processes from start to finish.
Think of it as AI that doesn’t just talk; it gets things done, transforming how teams work with speed, precision, and scale.
Accurately identifying the right visual element for interaction is crucial for GUI agents to perform tasks effectively in complex environments like enterprise applications. Orby’s proprietary Large Action Model, ActIO, excels in visual grounding and task execution, outperforming industry leaders.
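To see why grounding accuracy matters for agents, here is a hedged sketch of the final step: converting a grounded bounding box into a click. The box coordinates are made up, and pyautogui serves only as a stand-in for whatever action executor an agent might use; this is not Orby’s implementation.

```python
import pyautogui

# Suppose the grounder returned this box (left, top, right, bottom)
# for the query "the 'Place order' button". Values are hypothetical.
left, top, right, bottom = 120, 340, 310, 372

# Click the center of the grounded region. A misgrounded box means the
# agent clicks the wrong element, which is why grounding accuracy
# drives end-to-end task success.
pyautogui.click((left + right) // 2, (top + bottom) // 2)
```

Clicking the center of the box is a common heuristic; agents in production would typically verify the grounded element before acting on it.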
ActIO achieves state-of-the-art performance across leading GUI agent benchmarks, outperforming existing multimodal models. These benchmarks span web, desktop, and mobile scenarios in both online and offline settings.
On the VisualWebBench benchmark, ActIO-7B outperforms top models such as GPT-4o, Gemini 1.5 Pro, and LLaVA-1.6-34B.
ActIO also demonstrates state-of-the-art effectiveness in supporting GUI agents end to end.
Detailed evaluation results for the Large Action Model are available on LAMB.