π-BENCH is a benchmark for proactive personal assistant agents in
long-horizon workflows, where users start with underspecified requests and
important requirements emerge across interaction. It contains 100 multi-turn
tasks across 5 domain-specific personas (researcher, marketer,
pharmacist, law_trainee, financier) and organizes them as multi-session
episodes in persistent workspaces.
The benchmark jointly measures Proactivity (PROC) and Completeness (COMP). PROC evaluates whether an agent resolves hidden intents early (through inference or focused elicitation) to reduce avoidable user burden, while COMP evaluates whether final deliverables satisfy checklist requirements and artifact-level obligations. Scoring combines rubric-based hidden-intent judgment and checklist validation, and audit results show low judge disagreement (<4%), which supports evaluation reliability.
Compared with benchmarks focused mainly on short-horizon tasks, GUI/mobile
interactions, or memory retrieval alone, π-BENCH emphasizes persistent,
artifact-centric workflows with hidden intents, inter-task
dependencies, and cross-session continuity, enabling clearer separation
between reactive task completion and proactive assistance quality.
This directory contains the public code release for running the π-BENCH benchmark. It includes the benchmark runner, task data, runtime configuration, model configuration templates, and the locally modified runtime source trees mounted into the prebuilt container:
code/
  src/           # benchmark Python package
  data/          # benchmark users, profiles, episodes, tools, skills, and tasks
  config/        # benchmark YAML and nanobot model configs
  scripts/       # launchers, container entrypoint, image loader, test server
  third_party/
    appworld/    # locally modified AppWorld source tree
    nanobot/     # locally modified nanobot source tree
The Docker launcher mounts these directories into the prebuilt image
localhost/bench:v1. The launcher does not build the image. It expects the
image to have been loaded from the release archive before a benchmark run.
third_party/appworld/ and third_party/nanobot/ are public assets used by
this benchmark with local modifications. The paper cites these assets, and this
release preserves and follows their corresponding licenses.
Before running the benchmark, prepare the following (a quick host check sketch is shown after the list):
- Docker or Podman on the host machine.
- Python and pip for the one-time AppWorld data setup.
- Network access to the LLM provider endpoint used by the selected model configuration.
- API keys for the model provider, benchmark evaluation LLM endpoint, and Brave Search.
- The prebuilt image archive docker-image.tar.gz from the anonymous OSF project linked below.
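A quick host check of these prerequisites can look like the following. This is an illustrative sketch, not part of the release scripts; the LLM endpoint and API keys still need to be verified against your chosen model configuration.

```bash
# Illustrative prerequisite check for the host machine.
command -v docker >/dev/null 2>&1 || command -v podman >/dev/null 2>&1 \
  || echo "missing: Docker or Podman"
command -v python3 >/dev/null 2>&1 || echo "missing: Python"
python3 -m pip --version >/dev/null 2>&1 || echo "missing: pip"
```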
If your network requires a proxy, set the usual host proxy variables:
export HTTP_PROXY="http://<proxy-host>:<proxy-port>"
export HTTPS_PROXY="http://<proxy-host>:<proxy-port>"
export NO_PROXY="127.0.0.1,localhost"

The launcher forwards these values into the container and prints a warning when any are missing.
Prepare AppWorld data once from this code/ directory:
cd third_party/appworld
pip install -e .
appworld install
appworld download data
appworld install --repo
appworld download data
cd ../..

Download the prebuilt Docker image archive docker-image.tar.gz from the
anonymous OSF project:
https://osf.io/dnb5k/overview?view_only=5ff4eeb8c32a417fad136a54735c44db
Then load the benchmark image from this code/ directory:
bash load_bench_image.sh /path/to/docker-image.tar.gz

If the image archive is copied to /tmp/localhost-bench.tar.gz, the path can be
omitted:
bash load_bench_image.sh

The loader delegates to scripts/load_image.sh, checks for Docker or Podman,
loads the archive, and verifies that localhost/bench:v1 is available.
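To double-check the result of the loader yourself, the image can also be listed directly with standard Docker or Podman commands:

```bash
# Confirm that the benchmark image is present after loading.
docker images localhost/bench:v1
# or, with Podman:
podman images localhost/bench:v1
```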
The main benchmark config is:
config/bench/nanobot.yaml
Evaluation trace-history config lives under:
config/bench/evaluation/
Nanobot model configs live under:
config/nanobot/models/<model-id>/config.json
Available user ids are the directory names under data/:
researcher, marketer, pharmacist, law_trainee, and financier.
Available model ids are the directory names under config/nanobot/models/.
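Because user ids and model ids are just directory names, the valid values can be listed directly from this code/ directory:

```bash
# Show the valid --user-id and --model-id values.
ls -1 data/
ls -1 config/nanobot/models/
```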
Before running Docker jobs, edit the user-config section near the top of
scripts/run_parallel_models_docker.sh. The script intentionally keeps routine
configuration in this section instead of reading benchmark credentials from
environment variables.
Required values:
- BENCH_LLM_BASE_URL
- NANOBOT_PROVIDER_API_KEY
- NANOBOT_BRAVE_SEARCH_API_KEY
Optional values:
- OPENAI_API_KEY_FOR_BENCH, used by the benchmark evaluation LLM. If left empty, it falls back to NANOBOT_PROVIDER_API_KEY.
- NANOBOT_PROVIDER_API_BASE; when empty, the container keeps the selected model config's providers.custom.apiBase.
- DEFAULT_BENCH_USER_ID, currently law_trainee.
- BENCH_TASK_IDS, left empty to run all tasks for the selected user.
To run only a subset of tasks, set BENCH_TASK_IDS in the script to a
comma-separated list of task ids. Task ids are the task directory names under
data/<user-id>/tasks/.
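The user-config section is a short block of plain shell variable assignments near the top of the script. A hedged sketch of what it might look like is shown below; the variable names come from the lists above, the placeholder values are illustrative, and the real block in scripts/run_parallel_models_docker.sh may differ in layout and comments.

```bash
# --- user config (illustrative sketch; edit the actual block in run_parallel_models_docker.sh) ---
BENCH_LLM_BASE_URL="https://<provider-host>/v1"         # required: LLM provider endpoint
NANOBOT_PROVIDER_API_KEY="<provider-api-key>"           # required
NANOBOT_BRAVE_SEARCH_API_KEY="<brave-search-api-key>"   # required
OPENAI_API_KEY_FOR_BENCH=""    # optional: evaluation LLM key; empty falls back to NANOBOT_PROVIDER_API_KEY
NANOBOT_PROVIDER_API_BASE=""   # optional: empty keeps providers.custom.apiBase from the model config
DEFAULT_BENCH_USER_ID="law_trainee"
BENCH_TASK_IDS=""              # optional: comma-separated task ids from data/<user-id>/tasks/; empty runs all
```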
Overall results for Proc / Comp (%). Results are averaged over three runs,
with subscripts denoting standard deviation. Per-persona columns report
Proc / Comp.

| Model | Average Proc | Average Comp | Researcher | Marketer | Pharmacist | Law Trainee | Financier |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 67.0<sub>2.1</sub> | 65.6<sub>1.8</sub> | 46.0 / 66.4 | 78.2 / 67.1 | 75.9 / 71.5 | 56.9 / 61.9 | 78.1 / 61.2 |
| Gemini 3.1 Pro | 57.1<sub>0.9</sub> | 60.0<sub>0.8</sub> | 41.1 / 59.2 | 65.0 / 62.1 | 71.0 / 72.1 | 50.0 / 55.3 | 58.6 / 51.1 |
| Claude Opus 4.6 | 65.5<sub>1.4</sub> | 67.6<sub>1.5</sub> | 50.3 / 74.5 | 75.0 / 74.6 | 82.8 / 68.6 | 45.7 / 57.2 | 73.8 / 63.2 |
| DeepSeek V3.2 | 53.3<sub>1.9</sub> | 57.8<sub>3.0</sub> | 29.0 / 66.9 | 69.1 / 59.4 | 75.9 / 62.6 | 33.2 / 51.1 | 59.1 / 48.9 |
| MiniMax M2.7 | 55.6<sub>3.2</sub> | 60.0<sub>1.8</sub> | 33.4 / 63.9 | 71.9 / 61.9 | 77.1 / 63.6 | 38.6 / 52.5 | 57.2 / 58.1 |
| Kimi K2.5 | 43.1<sub>0.2</sub> | 61.6<sub>1.9</sub> | 28.9 / 63.5 | 41.2 / 62.3 | 70.1 / 74.8 | 34.8 / 54.4 | 40.4 / 52.9 |
| Seed2.0 Pro | 58.4<sub>0.9</sub> | 52.1<sub>3.8</sub> | 38.9 / 59.6 | 71.4 / 44.2 | 77.0 / 67.6 | 46.0 / 44.7 | 58.7 / 44.5 |
| GLM-5.1 | 58.4<sub>0.8</sub> | 63.6<sub>2.9</sub> | 41.8 / 61.6 | 62.6 / 69.1 | 75.2 / 70.3 | 45.5 / 57.3 | 66.7 / 59.8 |
| Qwen3.6 Plus | 64.0<sub>1.1</sub> | 64.1<sub>0.6</sub> | 40.1 / 70.0 | 77.5 / 66.6 | 79.7 / 70.2 | 45.7 / 60.2 | 77.1 / 53.6 |
Run from this code/ directory:
bash scripts/run_parallel_models_docker.sh --model-id deepseek-v3.2

Run multiple models:
bash scripts/run_parallel_models_docker.sh \
--model-id deepseek-v3.2,MiniMax-M2.5

Run a specific user:
bash scripts/run_parallel_models_docker.sh \
--user-id law_trainee \
--model-id deepseek-v3.2

Run multiple users and models in one command:
bash scripts/run_parallel_models_docker.sh \
--user-id researcher,law_trainee \
--model-id deepseek-v3.2,MiniMax-M2.5

Repeated runs of the same model can be requested by repeating the model id. The
launcher writes each repeated run to a distinct output directory with a
__runNN suffix:
bash scripts/run_parallel_models_docker.sh \
--user-id law_trainee \
--model-id deepseek-v3.2,deepseek-v3.2,deepseek-v3.2

For a direct local import check or non-Docker invocation, run the benchmark
package from the repository root so relative data/ paths resolve correctly:
python -m src.main \
--config config/bench/nanobot.yaml \
--history-config-path config/bench/evaluation/trace_history.yaml

scripts/run_parallel_models_docker.sh mounts host paths into the prebuilt
container as follows:
src/ -> /opt/proactive/src
data/ -> /opt/proactive/data
config/ -> /opt/proactive/config
scripts/ -> /opt/proactive/scripts
third_party/appworld/ -> /opt/proactive/appworld
third_party/nanobot/ -> /opt/proactive/nanobot
It also overrides the container entrypoint with:
/opt/proactive/scripts/entrypoint.sh
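Taken together, the mounts and entrypoint override amount to a container invocation roughly like the sketch below. This is an illustration only; the actual script also forwards API keys and proxy settings, adds per-model and per-user arguments, and supports Podman as well as Docker.

```bash
# Simplified sketch of the container run assembled by the launcher (illustrative only).
docker run \
  -v "$PWD/src:/opt/proactive/src" \
  -v "$PWD/data:/opt/proactive/data" \
  -v "$PWD/config:/opt/proactive/config" \
  -v "$PWD/scripts:/opt/proactive/scripts" \
  -v "$PWD/third_party/appworld:/opt/proactive/appworld" \
  -v "$PWD/third_party/nanobot:/opt/proactive/nanobot" \
  --entrypoint /opt/proactive/scripts/entrypoint.sh \
  localhost/bench:v1
```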
When AppWorld is enabled, the entrypoint starts the AppWorld APIs and MCP, starts the local test channel server from scripts/test_server.py, starts the nanobot gateway, runs the benchmark with --mode run, normalizes trace-log model directories when needed, and finally runs the benchmark with --mode eval.
AppWorld MCP is launched inside the container with user-specific tools from:
/opt/proactive/data/<user-id>/tools.yaml
Results and logs are written under:
outputs/<model-id>/<user-id>/
Per-task evaluation JSON files are written below each task's eval/results/
directory. The run also generates a trace viewer HTML file under the user output
directory when evaluation completes.
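Once evaluation completes, the per-task evaluation JSON files for one model/user combination can be listed with a plain find over the documented layout (substitute your own model and user ids):

```bash
# List every per-task evaluation JSON under a given model/user output tree.
MODEL_ID=deepseek-v3.2
USER_ID=law_trainee
find "outputs/$MODEL_ID/$USER_ID" -type f -path "*/eval/results/*.json"
```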
Runtime logs for each container run are under:
outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
Important runtime files include:
- container.log, the benchmark process log for the container.
- inspect.before.log and inspect.after.log, container inspection snapshots.
- inspect.summary.log, a compact container status and exit-code summary.
- service-logs/, nanobot service logs mounted from inside the container.
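A quick way to inspect the most recent run is to locate the newest runtime directory and tail its container log. The sketch below only uses the paths documented above; substitute your own model and user ids.

```bash
# Tail the newest container log for one model/user combination.
MODEL_ID=deepseek-v3.2
USER_ID=law_trainee
RUN_DIR="$(ls -dt "outputs/$MODEL_ID/$USER_ID/run/"*-runtime | head -n 1)"
tail -n 100 "$RUN_DIR/container.log"
ls "$RUN_DIR/service-logs/"
```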
The launcher leaves containers available for inspection by default. To remove
containers automatically after completion, set REMOVE_CONTAINER_ON_EXIT="true"
near the top of scripts/run_parallel_models_docker.sh.
Do not commit generated outputs/ files.
If the launcher exits before starting a run, check the printed validation message first. Common causes are:
- localhost/bench:v1 has not been loaded.
- Docker or Podman is not available.
- Required API keys or base URLs are still blank in the script.
- The requested --user-id is not present under data/.
- The requested --model-id has no matching config/nanobot/models/<model-id>/config.json.
If a container starts but the run fails, inspect
outputs/<model-id>/<user-id>/run/<timestamp>-runtime/container.log first,
then check service-logs/ for nanobot-specific logs.
To inspect a running container:
docker ps
docker exec -it <container_id_or_name> /bin/bash

If the image does not include bash, use /bin/sh. For exited containers, use
docker ps -a to find the container and docker cp to copy files out, or start
the container again before docker exec.
Container-side nanobot runtime files are under /root/.nanobot/, especially:
- /root/.nanobot/workspace
- /root/.nanobot/logs
- /root/.nanobot/config.json
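For an exited container, these files can be copied to the host for inspection with standard Docker commands (Podman accepts the same syntax):

```bash
# Find the exited container, then copy nanobot runtime files out of it.
docker ps -a --filter status=exited
docker cp <container_id_or_name>:/root/.nanobot/logs ./nanobot-logs
docker cp <container_id_or_name>:/root/.nanobot/config.json ./nanobot-config.json
```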
