Ido Levy

Title

AI Research Scientist

Bio

Ido Levy is an AI Research Scientist at IBM Research–Haifa, where he designs and builds generalist computer-use agents that reason, plan, and act autonomously. He co-created IBM CUGA, the first enterprise-ready agent to outperform OpenAI Operator on standard web-navigation benchmarks, and created ST-WebAgentBench, the field’s reference suite for safety and trust evaluation.

Before IBM, he was an NLP data scientist at GE Healthcare, developing drift-detection models and MLOps pipelines for clinical text. Ido is also a graduate student in Data Science (M.Sc., Technion, advisers Yonatan Belinkov & Ron Meir) and holds a fast-track B.Sc. in Data Science & Engineering.

Research interests: generative AI · multi-agent orchestration · emergent communication · trustworthy AI · large-language-model tooling.

Publications

Governance by Construction for Generalist Agents
- Segev Shlomov, Iftach Shoham, et al.
- 2026
- ACM CAIS 2026
- Demo paper

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- Ido Levy, Ben Wiesel, et al.
- 2026
- ICLR 2026
- Conference paper

AgentFixer: From Failure Detection to Fix Recommendations in Agentic Systems
- Hadar Mulian, Sergey Zeltyn, et al.
- 2026
- ICSE 2026
- Workshop paper

From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production
- Segev Shlomov, Alon Oved, et al.
- 2026
- IAAI 2026
- Conference paper

From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
- Segev Shlomov, Ben Wiesel, et al.
- 2025
- ECAI 2025
- Conference paper

ST-WEBAGENTBENCH: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- Ido Levy, Ben Wiesel, et al.
- 2025
- ICML 2025
- Workshop paper

Top collaborators

Sami Marreed

Asaf Adi

Roy Abitbol

Hadar Mulian