Building AI-Powered Web Agents: Identifying Actions with Vision and HTML Embeddings

QA.tech aims to develop an AI-powered web agent capable of evaluating websites in a manner similar to human testers. Given a natural language instruction describing a test case, the AI agent autonomously navigates through websites, simultaneously identifying and documenting potential bugs and issues it encounters.

To train an agent to complete tasks on a website, we first need a standardized format to represent the websites. We record each website as a graph, similar to a sitemap, consisting of pages and the interactable elements (actions) available in each page state. A key challenge lies in creating unique and consistent representations for these actions so that we do not record duplicate actions in the graph. For example, the “log in” button on Site X should be identifiable as the same action even if the page changes, such as when dark mode is enabled or a dropdown menu is opened.
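To make the graph idea concrete, here is a minimal sketch of such a representation in Python. The class and field names are illustrative assumptions, not QA.tech's actual schema; the key point is that a stable action identifier is what lets us deduplicate actions across page states.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    action_id: str    # unique, stable identifier for the interactable element
    description: str  # e.g. 'click the "Log in" button'

@dataclass
class PageState:
    url: str
    actions: set[Action] = field(default_factory=set)

@dataclass
class SiteGraph:
    pages: dict[str, PageState] = field(default_factory=dict)
    # edges map (page_url, action_id) -> destination page_url
    edges: dict[tuple[str, str], str] = field(default_factory=dict)

    def record_action(self, url: str, action: Action, destination: str) -> None:
        page = self.pages.setdefault(url, PageState(url=url))
        # Because action_id is stable across cosmetic page changes,
        # re-adding the same action is a no-op rather than a duplicate.
        page.actions.add(action)
        self.edges[(url, action.action_id)] = destination
```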

As an initial approach for creating unique action identifiers, we explored a non-Machine Learning method based on page screenshots and hashing. For each interactable element on the page, we defined an action context—the element and some neighborhood surrounding it—and captured a screenshot of this portion of the page. The unique identifier was generated by encoding and concatenating two components: the screenshot of the action context and the element’s relative position within that context, both hashed using MD5.
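The sketch below illustrates that hashing scheme, assuming the action context has already been cropped to a PNG byte string and the element's position is expressed as coordinates relative to that crop; the function name, crop handling, and coordinate formatting are assumptions for illustration, not QA.tech's implementation.

```python
import hashlib

def action_identifier(context_png: bytes, rel_x: float, rel_y: float) -> str:
    """Build an action ID from a screenshot of the element's neighborhood
    (the action context) and the element's relative position within it."""
    # MD5-hash each component, then concatenate the digests.
    screenshot_hash = hashlib.md5(context_png).hexdigest()
    position_hash = hashlib.md5(f"{rel_x:.4f},{rel_y:.4f}".encode()).hexdigest()
    return f"{screenshot_hash}-{position_hash}"
```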
