A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

Zurich University of Applied Sciences | University of Zurich | ETH Zurich | AlpineAI AG
ArXiv 2025

*Indicates Equal Contribution
Overview

Overview: (a) An example of a task for an agent for computer use (ACU): A user specifies a task ("propose meeting dates by email") and the agent executes it. (b) We structure the literature on ACUs based on three key perspectives corresponding to the main differentiating aspects: (1) The shared domain properties across computer environments (e.g., Web, Android). (2) The means of interaction between the agent and the environment as manifested in the observation and action spaces. (3) The agent components: how an agent acts through a policy while tracking the past in memory and how an agent learns to act.
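The agent perspective described above (a policy that acts while memory tracks the past, interacting through observation and action spaces) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names `Observation`, `Action`, `Memory`, and `ToyEmailEnvironment`, the hand-written `scripted_policy`, and the toy email task are hypothetical and do not come from the survey or any surveyed system. A real ACU would replace the scripted policy with a learned or foundation-model-based one and the text observation with screenshots or HTML.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Observation:
    text: str  # hypothetical stand-in for a screenshot or HTML snapshot


@dataclass
class Action:
    kind: str           # e.g. "type", "click", or "done"
    argument: str = ""  # text to type or name of the element to click


@dataclass
class Memory:
    # Ordered record of (observation, action) pairs seen so far.
    history: List[Tuple[Observation, Action]] = field(default_factory=list)

    def remember(self, obs: Observation, act: Action) -> None:
        self.history.append((obs, act))


class ToyEmailEnvironment:
    """Toy environment: the task is complete once a draft is typed and sent."""

    def __init__(self) -> None:
        self.draft = ""
        self.sent = False

    def observe(self) -> Observation:
        if self.sent:
            return Observation("sent")
        return Observation("draft ready" if self.draft else "empty draft")

    def step(self, action: Action) -> None:
        if action.kind == "type":
            self.draft = action.argument
        elif action.kind == "click" and action.argument == "send" and self.draft:
            self.sent = True


def scripted_policy(instruction: str, obs: Observation, memory: Memory) -> Action:
    # Hand-written rules standing in for a learned policy.
    if obs.text == "empty draft":
        return Action("type", f"Re: {instruction}")
    if obs.text == "draft ready":
        return Action("click", "send")
    return Action("done")


Policy = Callable[[str, Observation, Memory], Action]


def run_episode(instruction: str, env: ToyEmailEnvironment,
                policy: Policy, max_steps: int = 10) -> Memory:
    memory = Memory()
    for _ in range(max_steps):
        obs = env.observe()                        # observation space
        action = policy(instruction, obs, memory)  # policy decides
        memory.remember(obs, action)               # memory tracks the past
        if action.kind == "done":
            break
        env.step(action)                           # action space
    return memory
```

Calling `run_episode("propose meeting dates by email", ToyEmailEnvironment(), scripted_policy)` runs the loop until the policy emits a `done` action or the step limit is reached. The separation between environment, policy, and memory mirrors the three perspectives only loosely and is one of many possible decompositions.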

Abstract

Background: Agents for computer use (ACUs) are systems that execute complex tasks on digital devices -- such as personal computers or mobile phones -- given instructions in natural language. These agents automate tasks by controlling software through low-level actions like mouse clicks and touchscreen gestures. However, despite rapid progress, ACUs are not yet mature for everyday use.

Objectives: This survey examines the current state of the art, identifies trends, and points out research gaps in the development of practical ACUs. The goal is to provide a comprehensive review and analysis that helps advance general-purpose, robust, and scalable agents for real-world computer use.

Methods: We introduce a multifaceted taxonomy of ACUs across three dimensions: (I) the domain perspective, characterizing the contexts in which agents operate; (II) the interaction perspective, describing observation modalities (e.g., screenshots, HTML) and action modalities (e.g., mouse, keyboard, code execution); and (III) the agent perspective, detailing how agents perceive, reason, and learn. We review 87 original research papers about ACUs and 33 relevant datasets, covering both foundation-model-based and specialized approaches.

Results: Our taxonomy comprehensively structures state-of-the-art approaches and establishes the groundwork for guiding future ACU research. We identify three trends in the field: a transition from specialized agents toward foundation-model-based agents, a shift from text-based to image-based observation spaces, and an increasing adoption of behavior cloning methodologies. Furthermore, we identify six key research gaps: insufficient generalization, inefficient learning, limited planning, low task complexity in benchmarks, non-standardized evaluation, and a disconnect between research and practical conditions.

Conclusions: To continue rapid improvements in the field, we recommend focusing on: (a) vision-based observations and low-level control to enhance generalization; (b) adaptive learning beyond static prompting; (c) effective planning and reasoning capabilities; (d) realistic, high-complexity benchmarks; (e) standardized evaluation criteria based on task success; and (f) aligning agent design with real-world deployment constraints. Collectively, our findings and proposed directions help develop more general-purpose agents for everyday digital tasks.
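Recommendation (e), evaluation based on task success, can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names `success_rate` and `success_rate_by_domain` are hypothetical, and treating each task as a single pass/fail outcome is a simplification that the survey does not prescribe in this form.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed successfully (True = task succeeded)."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(outcomes) / len(outcomes)


def success_rate_by_domain(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-domain success rates, e.g. keyed by "web" or "android"."""
    return {domain: success_rate(outcomes) for domain, outcomes in results.items()}
```

Reporting the same binary metric per domain keeps numbers comparable across benchmarks, although it hides partial progress on tasks that were not completed.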

Sankey Diagram

Agent Overview

Datasets Overview

BibTeX

@misc{sager_acu_2025,
      title={A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions},
      author={Pascal J. Sager and Benjamin Meyer and Peng Yan and Rebekka von Wartburg-Kottler and Layan Etaiwi and Aref Enayati and Gabriel Nobel and Ahmed Abdulkadir and Benjamin F. Grewe and Thilo Stadelmann},
      year={2025},
      eprint={2501.16150},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2501.16150},
}