Introduction
The rapid evolution of large language models (LLMs) has spurred a wave of research into how these models can interact with the web in a more intelligent, efficient, and reusable manner. Traditional browser automation, which relies on long chains of clicks, keystrokes, and page navigation, is brittle and difficult to scale across diverse websites. Salesforce AI researchers have responded to this challenge with WALT (Web Agents that Learn Tools), a novel framework that reframes web interaction as calls to a set of reusable tools rather than a sequence of UI events. By reverse-engineering the latent functionality of any website into a set of reusable operations, WALT enables LLM agents to discover, learn, and invoke these tools automatically, dramatically reducing the need for large, monolithic prompts or hand-crafted scripts.
WALT’s core insight is that most web pages expose a set of underlying actions—searching, filtering, sorting, posting comments, creating listings—that can be abstracted as discrete, invocable functions. Instead of teaching an LLM to click the right button at the right time, WALT teaches it to call the right function with the right parameters. This shift from UI‑centric to API‑centric thinking aligns web automation more closely with how developers build software, opening the door to higher‑level reasoning, composability, and generalization across sites.
The implications of this approach are far‑reaching. For enterprises, it means that internal bots can be deployed across a wide range of legacy and modern web applications without bespoke code for each. For researchers, it provides a clean, modular way to study agent behavior and tool learning. And for the broader AI community, it offers a concrete path toward more robust, interpretable, and maintainable web‑enabled agents.
In the sections that follow, we unpack the architecture of WALT, illustrate how it discovers and invokes tools, explore real‑world use cases, and discuss the broader impact on AI development.
Reimagining Browser Automation
Conventional browser automation tools such as Selenium or Puppeteer require a developer to write explicit scripts that navigate the Document Object Model (DOM), locate elements by CSS selectors or XPath, and perform actions in a predetermined order. While powerful, this approach is fragile: a single change in the UI can break the entire script. Moreover, it forces the agent to learn the low‑level details of each site, which is inefficient when the same high‑level task—say, searching for a product—needs to be performed on dozens of e‑commerce platforms.
WALT sidesteps these pitfalls by treating the web page as a black box that exposes a set of callable operations. The agent first observes the page, identifies interactive elements, and maps them to semantic actions such as search(query), filter(criteria), or post_comment(content). Once this mapping is established, the agent can invoke these operations directly, without navigating the UI. This abstraction mirrors how developers use APIs: they call a function with parameters and receive a response, rather than manipulating the UI manually.
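The API-style abstraction described above can be sketched in a few lines. The sketch below is illustrative, not WALT's actual interface: the `Tool` and `ToolRegistry` names, and the toy `search` implementation, are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    """A callable operation discovered on a page (names are illustrative)."""
    name: str
    description: str
    fn: Callable[..., Any]


@dataclass
class ToolRegistry:
    """Maps semantic action names to invocable operations, hiding the UI."""
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def invoke(self, name: str, *args, **kwargs) -> Any:
        # The agent calls a function with parameters, as with any API.
        return self.tools[name].fn(*args, **kwargs)


# A stand-in "site" whose search is exposed as a function, not clicks.
registry = ToolRegistry()
registry.register(Tool("search", "Search the catalog",
                       lambda query: [p for p in ["laptop", "lamp"] if query in p]))
print(registry.invoke("search", "lap"))  # ['laptop']
```

Once a site's operations live in such a registry, the agent never needs to know which button triggers the search; it only needs the function name and parameters.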
The WALT Architecture
At the heart of WALT is a two‑stage pipeline. The first stage, Tool Discovery, scans the page’s DOM, event listeners, and network traffic to infer a set of potential tools. It leverages heuristics and lightweight machine learning models to cluster similar elements and assign them semantic labels. For example, a search bar paired with a submit button is labeled as a search tool, while a dropdown menu that filters results becomes a filter tool.
The second stage, Tool Invocation, integrates with the LLM’s reasoning loop. When the agent receives a user instruction—such as “Find the cheapest laptop under $800”—the LLM generates a plan that includes calls to the discovered tools. The plan is then executed: the agent calls search("laptop", max_price=800), receives the results, and may further invoke filter("brand: Dell"), sort("price asc"), and post_comment("Found a great deal!"). The response is then parsed and the next step is planned.
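The laptop example can be traced end to end with mock tools. Everything below (the catalog, the tool implementations, the argument formats) is an assumption made to illustrate the plan-then-execute flow, not WALT's actual calling convention.

```python
# Mock catalog standing in for a live e-commerce site.
CATALOG = [
    {"name": "Dell XPS 13", "brand": "Dell", "price": 750},
    {"name": "MacBook Air", "brand": "Apple", "price": 999},
    {"name": "Dell Inspiron", "brand": "Dell", "price": 600},
]


def search(query, max_price=None):
    """Stand-in for a discovered search tool; query is ignored by the mock."""
    hits = list(CATALOG)
    if max_price is not None:
        hits = [p for p in hits if p["price"] <= max_price]
    return hits


def filter_by(items, criteria):
    """Apply a 'key: value' criterion, e.g. 'brand: Dell'."""
    key, value = criteria.split(": ")
    return [p for p in items if p[key] == value]


def sort_by(items, spec):
    """Sort by a 'key order' spec, e.g. 'price asc'."""
    key, order = spec.split()
    return sorted(items, key=lambda p: p[key], reverse=(order == "desc"))


# The plan an LLM might emit for "cheapest Dell laptop under $800":
results = search("laptop", max_price=800)
results = filter_by(results, "brand: Dell")
results = sort_by(results, "price asc")
print(results[0]["name"])  # Dell Inspiron
```

Note that the agent's plan is just a short sequence of function calls; no selector, click coordinate, or page-load wait appears anywhere in it.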
This architecture offers several advantages. First, it decouples the agent’s reasoning from the specifics of the UI, making it resilient to layout changes. Second, it encourages modularity: new tools can be added to a site without retraining the entire model. Third, it aligns with the broader trend of tool‑augmented LLMs, where external APIs are leveraged to extend the model’s capabilities.
Tool Discovery and Invocation
The discovery process is largely unsupervised. WALT monitors user interactions and network requests to infer the purpose of each element. For instance, a button that triggers a POST request to /api/comments is likely a post_comment tool. The system then generates a concise signature for the tool, such as post_comment(content: string) -> bool, and stores it in a local registry.
When the agent receives a task, the LLM first decomposes it into sub‑tasks that map to available tools. It then generates a tool‑call plan: a sequence of function calls with arguments. The plan is executed in a sandboxed environment that captures the tool’s output and any side effects. If a tool call fails—perhaps due to a network error—the agent can retry or fall back to an alternative tool.
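The retry-then-fall-back policy can be sketched as a small executor loop. The policy details below (retry count, `ConnectionError` as the transient failure, a per-tool fallback map) are illustrative assumptions, not WALT's documented behavior.

```python
def execute_plan(plan, registry, retries=2, fallbacks=None):
    """Run (tool_name, kwargs) pairs in order.

    Retries transient failures, then tries a fallback tool if one is
    registered; policy is an illustrative sketch.
    """
    fallbacks = fallbacks or {}
    outputs = []
    for name, kwargs in plan:
        for attempt in range(retries + 1):
            try:
                outputs.append(registry[name](**kwargs))
                break
            except ConnectionError:
                if attempt == retries:  # retries exhausted
                    alt = fallbacks.get(name)
                    if alt is None:
                        raise
                    outputs.append(registry[alt](**kwargs))
    return outputs


# A mock tool that fails once, then succeeds.
calls = {"count": 0}
def flaky_search(query):
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionError("transient network error")
    return [query.upper()]

result = execute_plan([("search", {"query": "laptop"})], {"search": flaky_search})
print(result)  # [['LAPTOP']]
```

Because failures surface as exceptions at the tool boundary rather than as silent UI misclicks, the agent can apply uniform recovery logic across every site it operates on.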
Because the tools are reusable, the agent can learn from experience. If a particular tool consistently yields better results, the agent can prioritize it in future plans. Conversely, if a tool is unreliable, the agent can flag it for review or remove it from the registry.
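One simple way to realize this feedback loop is per-tool success tracking. The thresholds and ranking scheme below are illustrative assumptions; WALT's actual prioritization may differ.

```python
from collections import defaultdict


class ToolStats:
    """Track per-tool reliability so the planner can prefer proven tools
    and flag unreliable ones (thresholds are illustrative)."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.successes = defaultdict(int)

    def record(self, name: str, ok: bool) -> None:
        self.calls[name] += 1
        self.successes[name] += ok

    def success_rate(self, name: str) -> float:
        return self.successes[name] / self.calls[name] if self.calls[name] else 0.0

    def ranked(self, names):
        """Order candidate tools by observed reliability, best first."""
        return sorted(names, key=self.success_rate, reverse=True)

    def flagged(self, names, threshold=0.5, min_calls=5):
        """Tools with enough history and a poor success rate, for review."""
        return [n for n in names
                if self.calls[n] >= min_calls and self.success_rate(n) < threshold]


stats = ToolStats()
for ok in [True, True, True, False, True]:   # search: 80% success
    stats.record("search", ok)
for ok in [False, False, True, False, False]:  # filter: 20% success
    stats.record("filter", ok)
print(stats.ranked(["filter", "search"]))   # ['search', 'filter']
print(stats.flagged(["search", "filter"]))  # ['filter']
```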
Practical Use Cases
WALT’s framework is already proving useful across a range of scenarios. In e‑commerce, an agent can automatically compare prices across multiple marketplaces by invoking the search and filter tools on each site, aggregating the results, and presenting a consolidated list to the user. In customer support, a bot can navigate a knowledge‑base portal, call the search tool with the user’s query, and then use post_comment to submit a ticket if no answer is found.
Another compelling use case is data extraction. Traditional scraping requires writing site‑specific parsers. With WALT, an agent can discover the export_csv tool on a data portal, call it with the desired parameters, and retrieve structured data without parsing HTML. This dramatically reduces the time to deploy new scrapers and improves maintainability.
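The contrast with HTML scraping can be made concrete. In the sketch below, the `export_csv` tool name comes from the text, but the function body is a mock that serializes invented records; a real discovered tool would hit the portal's export endpoint instead.

```python
import csv
import io


def export_csv(dataset: str, columns: list[str]) -> str:
    """Mock of a discovered export tool: returns CSV text directly,
    so no HTML parsing is ever needed (records are invented)."""
    records = {"sales": [{"region": "EMEA", "total": 1200},
                         {"region": "APAC", "total": 950}]}[dataset]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    for row in records:
        writer.writerow({c: row[c] for c in columns})
    return buf.getvalue()


# The agent calls the tool with parameters and parses structured output.
rows = list(csv.DictReader(io.StringIO(export_csv("sales", ["region", "total"]))))
print(rows[0])  # {'region': 'EMEA', 'total': '1200'}
```

The agent's side of the exchange is identical across portals that expose a comparable export tool, which is exactly what makes such "scrapers" cheap to deploy and maintain.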
The framework also shines in dynamic environments such as social media. An agent can learn the post_status tool on a platform, use search to find trending topics, and then compose and post a status automatically. Because the tools are abstracted, the same agent can be deployed on multiple platforms—Twitter, LinkedIn, Reddit—by simply adding the corresponding tool signatures.
Implications for AI Development
WALT represents a significant step toward more generalizable, interpretable, and robust web agents. By treating web interactions as tool calls, developers can reason about agent behavior in a modular way, trace execution paths, and debug failures more efficiently. The approach also dovetails with the emerging field of tool‑augmented LLMs, where external APIs and services are seamlessly integrated into the model’s reasoning loop.
From a research perspective, WALT opens new avenues for studying how agents learn to discover and use tools. Researchers can analyze the discovery process, evaluate the quality of inferred tool signatures, and develop algorithms that improve tool coverage and reliability. Moreover, the framework encourages the creation of shared tool registries, fostering collaboration and standardization across the AI community.
Finally, WALT’s abstraction layer has practical business implications. Enterprises can deploy intelligent agents that adapt to changing web interfaces without costly rewrites. The reduced reliance on large, monolithic prompts also lowers the computational overhead, making real‑time web automation more feasible.
Conclusion
Salesforce’s WALT framework redefines how large language models interact with the web by shifting from brittle UI automation to reusable, callable tools. This paradigm not only enhances robustness and scalability but also aligns web automation with software engineering best practices. As the AI ecosystem continues to evolve, frameworks like WALT will play a pivotal role in unlocking the full potential of web‑enabled agents, enabling them to perform complex tasks across diverse platforms with minimal human intervention.
Call to Action
If you’re intrigued by the promise of tool‑based web agents, start experimenting with WALT today. Explore the open‑source repository, try out the demo on a few websites, and share your findings with the community. By contributing to the tool registry and refining the discovery algorithms, you can help shape the next generation of intelligent, adaptable web automation. Join the conversation on GitHub, Twitter, or the Salesforce AI Research forum, and let’s build a future where LLM agents can seamlessly navigate the web, one reusable tool at a time.