The ability to control computers through natural language (NL) instructions harbors enormous potential, as it allows for the automation of various tasks. This blog post delves into the intricacies of agents, which execute actions on computers based on NL commands.
We define the task of computer control as follows: The agent operates within a computer environment, such as a personal computer, notebook, or smartphone. We denote the state space of the computer environment as \(\mathcal{S}\), with a single state of this environment at time \(t\) represented as \(s_t \in \mathcal{S}\). Typically, the agent only observes a partial aspect of this state, denoted as \(o_t \in \mathcal{O}\), where \(\mathcal{O}\) is the observation space. This partial observation, \(o_t\), may lack certain details present in \(s_t\). For instance, while \(s_t\) could represent a personal computer with multiple running processes, the agent’s observation \(o_t\) might be limited to a pixel-based screenshot captured from the user interface.
The agent can perform various actions denoted as \(a_t \in \mathcal{A}\), within the action space \(\mathcal{A}\). The action space might include a low-level interface, such as the computer’s keyboard or mouse, or a higher-level interface, such as an API that can be called programmatically and then performs actions within the environment. Typically, a stochastic policy \(\pi\) determines the action to take based on the agent’s observation and a given natural language instruction denoted as \(i\):
\begin{equation} a_t \sim \pi(\cdot | o_t, i) \end{equation}
Note that we do not use a subscript \(t\) for the instruction \(i\), as it is typically provided at the beginning of a trajectory sequence \(\tau = (o_0,a_0, o_1, a_1, ...)\). Consequently, after receiving an instruction \(i\), such as “Open the calculator app,” the agent captures a representation of the system’s state \(o_0\) (e.g., a screenshot) and then samples and executes an action \(a_0 \sim \pi (\cdot \mid o_0, i)\) (e.g., clicking on a symbol). This process of observing and executing actions iteratively repeats until a stop criterion is met.
This blogpost does not aim to provide an exhaustive overview of existing approaches in computer control. Nonetheless, it is essential to highlight the two primary paradigms that have been employed to tackle this task: Reinforcement learning (RL) and large language models (LLMs).
Reinforcement learning methodologies typically operate within two types of observation spaces \(\mathcal{O}\): These spaces consist of either raw pixel data representing the user interface, a more abstract representation such as the Document Object Model (DOM) in the context of web browsers, or a combination of both. The action space \(\mathcal{A}\) is commonly defined by actions such as mouse movements, clicks, or keyboard inputs. The agent’s policy \(\pi\) is usually acquired through iterative interaction with the environment, during which the agent receives feedback in the form of a reward signal based on the efficacy of its actions. This reward signal is often sparse and delayed, presenting a significant challenge in the learning process.
In contrast, LLMs harness the capabilities of natural language processing (NLP) to comprehend and carry out instructions conveyed in natural language. These models can either directly process textual descriptions of the user interface or utilize an additional model to convert pixel-level observations into textual representations. The set of actions, denoted as \(\mathcal{A}\), is typically delineated by high-level commands such as XML patterns or API calls, specifying how to engage with the environment. State-of-the-art (SOTA) LLMs surpass reinforcement learning (RL) agents in terms of both sample efficiency and adaptability to novel environments. This superiority stems from their pretraining on extensive corpora, endowing them with certain “reasoning” abilities that enable the formulation of step-by-step plans for interacting with the environment.
On personal computers, the majority of applications are navigated by users through a pixel-based UIs (as \(\mathcal{O}\)), alongside input devices such as keyboards and mice (\(\mathcal{A}\)). The goal of this research is the development of an autonomous agent proficient in comprehending natural language instructions and effectively controlling the computer system akin to human interaction. Specifically, the agent should utilize the pixel-level screen as its observation space \(\mathcal{O}\) and the keyboard and mouse as its action space \(\mathcal{A}\). Using these interfaces, the agent should be able to execute a wide range of tasks on various (business) applications and not be limited to a few apps that provide specific interfaces (such as APIs, HTML source code, etc.). This agent should possess the capability to interpret natural language instructions and execute tasks delineated by human operators. Although this endeavor is more complex compared to using higher-level interfaces, it offers unparalleled versatility with potential applicability across diverse computing applications.
Current research focuses especially on creating policies \(\pi\) that allows agents to act on limited observation spaces \(\mathcal{O}\) such as HTML from web browsers, the view hierarchy from Android, etc. Hence, SOTA models typically require textual descriptions of the user interface, which are for most applications not available. Other approaches, claiming that they can work with pixel-based observations, are often pre-trained on screenshot to text translation, allowing to process text instead of pixel observations. However, these applications typically do not scale well as the pre-training focuses on specific domains such as web browsers and does not consider, for example, older CRM systems. Approaches that directly work with pixel-based observations are still in their infancy and perform surprisingly bad.
I argue that the missing ingredient for the success of computer control agents is the ability to understand UI elements and that it is key to derive actionable representations. Having actionable representation allows a model to understand the consequences of actions, i.e., it can predict the next state of the environment given an action. Such representations are fundamental for planning and reasoning, combining these with existing models would allow to act on various applications.
In the following, I outline the main limitations of current approaches.
Directly prediction actions from pixel-based observations is highly inefficient and does not work at all. For example, for simple web-tasks, most SOTA approaches use more than 25k demonstrations per task (Humphreys et al. 2022), whereby the tasks are highly simplified (to a degree where they can be considered unrealistic). Since we want to control the entire computer, the number of samples required would be enormous and we cannot ask users to execute simple tasks 25k times before we can reproduce them. Also, the argument of Yann LeCun that building generative models for sensory inputs is often intractable is likely to hold here as well.
“If your goal is to train a world model for recognition or planning, using pixel-level prediction is a terrible idea. Generation happens to work for text because text is discrete with a finite number of symbols. Dealing with uncertainty in the prediction is easy in such settings. Dealing with prediction uncertainty in high-dimension continuous sensory inputs is simply intractable. That’s why generative models for sensory inputs are doomed to failure.”
– Yann LeCun
This does not mean that we cannot use pixel-based observations to predict actions. However, we might need to incorporate constraints (restrict the latent space so that it becomes discrete and finite) to do so more efficiently.
From preliminary investigations, I believe that mapping UIs to hierarchical descriptions of its element and learning actions on this level is highly promising This is also supported by other work (Ma, Tsao, and Shum 2022) arguing that predictions should be made in the latent space. Hierarchical descriptions can be considered a much lower-dimensional space and allows to predict the consequences of actions in a more sample-efficient manner (towards the parsimony principle described by (Ma, Tsao, and Shum 2022)). Furthermore, encoding a screenshot to a hierarchical description and predicting how an action will change this description fulfills the self-consistency principle described by (Ma, Tsao, and Shum 2022). This approach is closely related to the homomorphic autoencoder described by (Keurti et al. 2023) who learns to perform action in a latent space. Furthermore, having a hierarchical description of the UI allows to align the elements with language instructions.
I identify two types of binding problems that have to be solved: First, the binding between UI elements. Second, the binding of pixels to UI elements.
The first binding problem is the problem of understanding how UI elements are related to each other. This problem could be solved by learning a hierarchical description of the UI. When we know which button and text belong together, we can derive the meaning behind the button. This is a problem that is solved by humans instantly when looking at a UI as we have this understanding between relations. Furthermore, the observation space becomes much smaller as we move from pixel-based observations to a hierarchical description of the UI. I believe that these are the main reasons for the success of LLMs in computer control.
The second binding problem is the problem of understanding which pixel belongs to which UI element. As outlined above, SOTA approaches require huge amounts of training data, e.g. 25k demonstrations for solving simple web tasks (Humphreys et al. 2022). I believe that observing actions on pixel level is highly inefficient: There are countless ways how one can move the mouse to a button, on which pixel of the button is clicked, etc. Other datasets provide actions as textual descriptions (e.g. button “xy” is clicked). However, creating such annotations is costly. Therefore, I argue that we should not only predict hierarchical descriptions of the UI but also the bounding boxes of the UI elements. This allows to automatically convert humans actions conducted on pixel level into descriptive actions that can be linked to the hierarchical description of the UI. So far, I am not aware of any work that predicts the bounding boxes of UI elements from pixel-based observations.
It is evident that LLMs perform much better than RL agents for computer control. Especially providing them HTML descriptions of the UI and actions allows LLMs to understand the structure of the UI. I hypothesize that LLMs are more efficient because they are working on this symbolic level and providing HTML solves the aforementioned binding-problems. When we only describe the UI elements without their relations, the LLMs are cannot perform well and there is a hughe drop in performance. The same goes if they are provided with action detailed action trajectories instead of high-level actions (such as button was clicked). This is weak evidence as it is only based on my observations and not on a systematic study. However, it shows the importance of not just knowing the UI elements but also their relations.
There exist some research mapping screenshots to structured text. However, these model perform surprisingly bad and completely fail to describe UIs on a meaningful level or to generalize. OCR models seem to work ok to detect text in screenshots, but sometime struggle with the diverse text arrangements and styles. Models for layout recognition do not work at all when optimized for documents as computer UIs are completely different structured. It could be worth to investigate why these models fail and how they can be improved (or if exactly the same models can be used but must be tweaked on different datasets).
A lot of research is focusing on planning action sequences for computer control. Thereby, mainly RL, LLMs, large language-vision models, and search algorithms are used. This is a highly dynamic field with a lot of research output. So far, I couldn’t really identify a gap to contribute to this field without having the risk of being too slow and not able to compete with the current state of the art. However, we could focus on other areas such as “unsupervised transfer-learning” to run these models on unseen applications as well.
Surprisingly little research is done on how to use humans in the loop despite it seems extremely important. Current systems are optimized to execute actions and, thereby, focus on doing the best possible action for a given instruction. However, in a real-world scenario, we humans tend to forget specifying things, make mistakes, etc. If you have an assistant or employees, think about how often they ask back to you for clarification, confirmation, etc. Imaging the following cases:
I believe that solving these questions is highly important and that it is a highly under-researched field. Some research, such as chain-of-thought reasoning, show that agents are well capable of identifying weaknesses in their action plans. They try to address these weaknesses by asking questions to themselves. However, as long as humans provide imperfect instructions, they will always have to ask back. And since we humans almost always ask back to our superiors, showing that we are not able to provide the perfect instruction, I believe that this is a highly important field to research. Furthermore, it would allow agents to learn much more about us, providing them additional knowledge that could be used in future tasks.
Based on these insights and the current state of the art, I believe that the following questions will be answered by others:
Maybe, instead of LLMs, novel research will use hybrid vision-language models to control computers. However, the principle remains the same.
I believe that others will not focus on the following question:
Based on this insights, I identify three main research questions:
The first question deals with the problem of understanding the UI screen and the consequences of actions. These are considered rather understanding atomic elements (e.g., clicking button “xy” creates a new document). As previously outlined, this task includes building hierarchical descriptions of UI elements to capture their relations. The second question deals with the problem of breaking down a high-level instruction into multiple atomic elements (e.g. to write a letter: Open a new document, select font, …). This can be done, for example, by leveraging human demonstrations or allowing the agent to click around and observe the consequences of actions.
These questions include the following sub-questions:
We will conduct a comprehensive survey of existing work in computer control.
Subsequently to the survey, there exist various directions to proceed. My current impression is as follows:
From a research perspective, I believe that the first and third question are the most promising to contribute to. They will both have an big impact on the field and are highly under-researched. I see solving question 1 as the most fundamental one to get the system running and question 3 as the low hanging fruit that can lead to a solid publication with reasonable effort. Therefore, it might be beneficial to work on both questions in parallel.
I plan in the following steps:
Most of the current research is about pushing the scores on a few existing benchmarks. The knowledge is typically derived from predicting HTML from screenshots and/or interacting with a virtual environment. However, from a research viewpoint, computers provide a highly interesting environment to study intelligence.
Therefore, analyzing the binding problem and Gestalt phenomena seems to be easier in this environment than in the real world.
Humans have internal models of the external world that represents all behavior-relevant information. These models are acquired by interacting with the world and are used to predict the consequences of actions. Neuroscientific findings suggest that humans, when acting, integrate efference copies of motor signals with incoming sensory observations to predict future sensory inputs (Keller et al., 2012). Hypotheses in psychology (Piaget, 1964) suggest that perceiving an object is not creating a mental copy of it but rather understanding of how this object behaves under different interventions.
Existing research in computer vision uses these insights and focuses on building actionable representations for objects within 3D worlds (Keurti et al. 2023). This work encodes images and actions to embeddings and builds consistency between transformed images in the image space and latent space. The principle is highly similar to the proposed approach: For a computer agent, we first transform the high-dimensional UI to a lower-dimensional hierarchical description and predict action consequences in this space.
Current (unpublished) work of INI UZH/ETH focuses on learning relation between objects. Again, this is highly similar as we build representations of UI elements using a hierarchical description of the UI that captures these relationships.