Elevating Automation: Orby AI’s Generic Agent Framework and Self-Adaptive Interface Learning Technique

Orby AI’s Pioneering Generic Agent Framework and Self-Adaptive Interface Learning (SAIL) Techniques Set New Standards, Achieving SOTA Performance on AI Agent Benchmarks

Today marks a significant milestone: the release of Orby AI's Generic Agent Framework and its improved Large Action Model (LAM), ActIO, along with evaluation results on public AI agent benchmarks. Orby AI’s end-to-end agent framework combines a versatile, hierarchical design with powerful new modeling techniques to deliver state-of-the-art (SOTA) performance on complex graphical user interface (GUI) tasks.

Unlike agent solutions that depend on hand-crafted instructions or domain-specific customization, Orby’s framework remains fully generic: users can choose whichever foundation models or providers are best suited to their needs without sacrificing task success rates. By splitting responsibilities between a higher-level “planner” and a dedicated “grounder,” our approach simplifies complex interactions, avoiding confusion when planned steps cannot be fully executed and letting each agent excel in its specialized role. 

AI agent performance benchmarking is the process of evaluating and comparing the capabilities of autonomous agents across standardized tasks or environments, using predefined metrics to assess their effectiveness, adaptability, and success in completing complex challenges. Orby AI benchmarks its models against these industry-standard evaluations to ensure our agents achieve state-of-the-art performance, generalize across diverse real-world scenarios, and outperform existing solutions in both accuracy and efficiency.

On the challenging MiniWoB benchmark—a suite of 125 diverse web tasks—Orby AI’s Generic Agent Framework outperforms existing systems, achieving a 74.9% success rate with Claude-3.5-Sonnet, surpassing ServiceNow’s best result under the same evaluation protocol. 

Beyond MiniWoB, our new large action model, ActIO, leverages Self-Adaptive Interface Learning (SAIL) to automatically learn website interfaces at scale, eliminating reliance on human-curated guidance or domain-specific prompts. 

This capability is demonstrated on the more expansive and varied BrowserGym / WebArena benchmark, where Orby AI’s ActIO-72B (Preview) delivers the highest success rate of 37.5% — outperforming models from open-source and closed-source competitors — highlighting our framework’s flexibility and robust performance across real-world, unfamiliar web environments.

Generic Agent Framework

Orby’s agent framework is designed to be generic, meaning that it is not customized to any specific website, benchmark, or underlying model. Users can always select the best models for their use case, websites, applications, and provider preferences without making changes to the system to maintain performance. Generalizability is the primary way we ensure our customers can easily use our platform for complex enterprise workflows, and it differentiates Orby from other agent providers.

Orby’s agent framework adopts a hierarchical design that employs multiple agents to complete complex graphical user interface (GUI) tasks. When the user provides the goal of a task, our higher-level planner agent takes in the goal, the current state of the application, and the actions the agent has performed previously (if any), and describes the next action in natural language. A dedicated grounder agent then converts this description into executable code. In this way, the planner agent can focus on reasoning about the next step without having to worry about details of the execution environment, e.g., the available action types, the output format, or grounding the output to a specific interactive element or location on the screen.
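To make this division of labor concrete, here is a minimal sketch of such a planner–grounder loop. Everything below (the function names, signatures, and environment interface) is our own illustration of the idea, not Orby's actual API:

```python
def run_episode(goal: str, env, planner, grounder, max_steps: int = 50):
    """Hierarchical planner-grounder loop (illustrative sketch).

    The planner reasons purely in natural language; the grounder owns
    every execution-environment detail (action types, output format,
    element grounding).
    """
    history = []  # natural-language record of completed actions
    for _ in range(max_steps):
        state = env.observe()  # e.g., screenshot + accessibility tree
        # The planner sees the goal, current state, and past actions,
        # and proposes the next step in plain language (or None if done).
        step = planner.next_step(goal, state, history)
        if step is None:
            break
        # The grounder maps that description to concrete executable code.
        action = grounder.ground(step, state)
        env.execute(action)
        history.append(step)
    return history
```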

One challenge of this approach, which we uncovered through extensive experimentation, is that the limits of the action space can sometimes prevent the grounder from fully completing a step suggested by the planner. Partially executed steps can confuse the planner and lead to suboptimal execution performance. To address this issue, in Orby’s agent framework the grounder is responsible not only for producing a grounded action but also for producing a precise description of the generated action in the execution context. This lets the planner measure progress precisely even when planned actions cannot be fully executed, and allows the planner and grounder to work together seamlessly. This unique design has also enabled us to improve our models’ planning and grounding capabilities.
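As a hypothetical sketch of that contract (again, the names are ours, not Orby's), the grounder could return the executed action paired with an account of what was actually done, and the loop above would record that account rather than the planner's intended step:

```python
from dataclasses import dataclass

@dataclass
class GroundedAction:
    code: str         # executable action, e.g. "click('submit-btn')"
    description: str  # what the action actually did in the execution context

# Inside the loop, the planner then tracks real progress, not intended progress:
#   grounded = grounder.ground(step, state)
#   env.execute(grounded.code)
#   history.append(grounded.description)  # may differ from `step` if the
#                                         # step was only partially executable
```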

MiniWoB (short for Mini World of Bits) is an established benchmark focusing on fundamental GUI interactions in a simplified web setting. It offers a collection of 125 small web tasks that range from simple operations like clicking a button or checking a box to using a basic text editor and placing orders. Most tasks require the agent to explore the environment, understand past actions, and intelligently choose the action types and elements needed to generate actions. We believe it is one of the best benchmarks for evaluating a GUI agent’s capability to interact with diverse, dynamic GUI widgets. Orby AI’s GenericAgent framework showed significant improvement over the prior state of the art, ServiceNow’s GenericAgent, when evaluated with the same BrowserGym benchmark framework. ServiceNow’s BrowserGym is an open-source framework designed to facilitate the development, testing, and evaluation of web agents.
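For reference, a single MiniWoB episode under BrowserGym looks roughly like the following. This is a minimal sketch assuming BrowserGym and its MiniWoB task suite are installed and configured; the placeholder noop() action stands in for whatever an agent would produce from the observation:

```python
import gymnasium as gym
import browsergym.miniwob  # registers the browsergym/miniwob.* tasks

env = gym.make("browsergym/miniwob.click-button")
obs, info = env.reset()
for _ in range(10):  # small step budget for this toy task
    # BrowserGym actions are strings of code from its action set; a real
    # agent derives them from `obs` (goal, DOM/AXTree, screenshot, ...).
    action = "noop()"  # placeholder; substitute your agent's output
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```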

Orby’s Large Action Model, ActIO

Recent advances in reasoning models have shown that it is possible to develop models that excel at math and coding. Unlike math or coding, completing tasks via UI interactions requires not only strong reasoning skills but, more importantly, grounded knowledge of the specific websites the digital agent operates on. Many teams have found human-curated instructions and documentation useful for getting LLMs to interact with specific websites and applications. At Orby AI, we believe AI agents should learn to adapt to new websites and applications without human intervention. We have recently developed a new technique called Self-Adaptive Interface Learning (SAIL) for automatically navigating websites to improve the agent’s task success rate. After applying this technique across diverse websites, we are able to collect tens of thousands of trajectories to train our large action model, which demonstrates strong out-of-the-box performance across multiple benchmarks.
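Orby has not yet published SAIL's internals, so purely as an illustration of the data flow just described (explore a site autonomously, keep useful trajectories, train on them), a collection loop might look like the sketch below. Every name here is hypothetical:

```python
def collect_interface_learning_data(sites, agent, judge, episodes_per_site=100):
    """Hypothetical outline of SAIL-style data collection: the agent
    explores each website on its own, and trajectories judged successful
    are kept as training data for the large action model. This is our
    speculative illustration, not Orby's actual algorithm."""
    dataset = []
    for site in sites:
        for _ in range(episodes_per_site):
            trajectory = agent.explore(site)     # observations + actions
            if judge.is_successful(trajectory):  # keep only useful runs
                dataset.append(trajectory)
    return dataset  # later used to fine-tune the large action model
```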

WebArena is a recently introduced benchmark that provides a highly realistic web environment for autonomous agents. It comprises 812 diverse tasks based on fully functional websites across multiple real-world domains. However, many of those websites are not commonly seen by general web users; for instance, the shopping admin website is only used by business users who manage an e-commerce site. These uncommon websites and tasks make it extremely difficult for models trained on general-domain data to perform well on this benchmark. Many previous systems (including OpenAI Operator) achieved good performance on WebArena by using website-specific prompts, in which extensive human-curated instructions and hints keep the agents on track while completing tasks.

By applying Self-Adaptive Interface Learning (SAIL) to the WebArena environments, we have enabled our Large Action Model, ActIO, to adapt to those environments automatically and to outperform larger open-source LLMs as well as the best proprietary models in the generic agent setting. This result was achieved without any prompt or agent design changes to the Orby AI Generic Agent, using the same benchmark framework implemented by BrowserGym.
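Since the evaluation reuses BrowserGym's WebArena integration, the aggregate success rate can be computed in the standard way. The sketch below assumes a configured WebArena deployment and again uses a placeholder noop() action where a real agent would act:

```python
import gymnasium as gym
import browsergym.webarena  # registers the browsergym/webarena.* tasks

NUM_TASKS = 812  # WebArena's full task set
successes = 0
for task_id in range(NUM_TASKS):
    env = gym.make(f"browsergym/webarena.{task_id}")
    obs, info = env.reset()
    reward = 0.0
    for _ in range(30):  # per-task step budget
        action = "noop()"  # placeholder; a real agent maps obs -> action
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            break
    successes += reward > 0  # WebArena grants reward only on task success
    env.close()
print(f"success rate: {successes / NUM_TASKS:.1%}")
```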

Reference: https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard
*This result was achieved with a manually corrected evaluation config.

We have observed further improvements by applying SAIL on even more websites and applications to obtain training data at a larger volume and with greater diversity. We will share more details about this technique in our upcoming announcements.

Orby AI Agent Platform

Orby AI’s Generic Agent Framework and Large Action Models form the cornerstone of our Orby AI Agent Platform. This platform — a suite of tools and infrastructure — facilitates the rapid development, deployment, and management of complex AI agents, even when tasks require hundreds of steps and the agent must interpret unstructured data. At the core of our platform lies a unique neuro-symbolic approach, blending the reliability and efficiency of symbolic agents with the flexibility of neural models. This design philosophy empowers Orby to build workflow automations that are both controllable and reliable for our enterprise customers. We look forward to sharing more insights and updates on our Agent Platform and evolving agent methodologies in future announcements.
