
Holo3 Review — Open-Source Computer Use Agent That Outperforms GPT-5.4

H Company dropped a vision-language model that hit 78.85% on OSWorld — the benchmark nobody was close to cracking. The open-source version is already on Hugging Face. We ran it on actual desktop tasks to see if the numbers hold up.

TL;DR — Key Takeaways:

  • Holo3 is a VLM from H Company optimized for GUI agents — web, desktop, and mobile
  • Scored 78.85% on OSWorld-Verified, beating GPT-5.4 (72.4%) and Claude Opus 4.6 (~38%)
  • Two variants: 122B API-only ($0.40/$3.00 per M tokens) and 35B open-source (Apache 2.0)
  • Fast on structured tasks (form filling, data extraction); struggles with ambiguous multi-step workflows
  • Verdict: impressive benchmark numbers, but 78.85% means roughly 1 in 5 tasks still fails

What Is Holo3?

Holo3 is a vision-language model built specifically for computer use — the kind of AI that looks at your screen, understands what it sees, and takes actions like clicking, typing, and navigating menus. H Company released it on April 1, 2026, alongside a research paper claiming state-of-the-art results on the OSWorld benchmark.

Most large language models treat computer use as an afterthought. You bolt a screenshot tool onto GPT or Claude, feed it pixel data, and hope the model figures out where to click. Holo3 was designed from the ground up for this workflow. The training pipeline uses a continuous feedback loop where the model alternates between perceiving screen states and making decisions about what to do next.

That architectural focus matters. General-purpose models waste capacity on language tasks that computer use doesn't need. Holo3 trades broad capability for depth in one specific domain: understanding GUIs and acting on them.
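The perceive-decide-act loop H Company describes can be sketched in a few lines. Everything below (the `StubModel`, the action format, the function names) is a toy stand-in we invented for illustration, not H Company's actual interface:

```python
class StubModel:
    """Toy stand-in for a computer-use VLM: 'types' a target string
    one character at a time, checking the 'screen' after each action."""
    def __init__(self, target):
        self.target = target
        self.screen = ""          # pretend GUI state

    def capture_screen(self):
        return self.screen        # perceive: read current state

    def predict_action(self, task, screen, history):
        if screen == task:
            return {"type": "done"}
        return {"type": "type", "char": task[len(screen)]}

    def execute(self, action):
        self.screen += action["char"]   # act: mutate the GUI state


def run_agent(model, task, max_steps=50):
    """Alternate between perceiving the screen and acting until done."""
    history = []
    for _ in range(max_steps):
        screen = model.capture_screen()
        action = model.predict_action(task, screen, history)
        if action["type"] == "done":
            return history
        model.execute(action)
        history.append(action)
    return history
```

The point of the sketch is the alternation itself: the model never plans the whole sequence up front, it re-perceives the screen before every decision, which is what lets it react to state changes it caused.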

The OSWorld Benchmark Score, Explained

OSWorld-Verified is a standardized test for computer use agents. It gives the model a virtual machine with a desktop environment and assigns tasks like "open a spreadsheet, find the average of column B, and paste it into a new email." The model has to figure out each step on its own — no hand-holding, no pre-defined action sequences.
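Final-state grading of this kind is conceptually simple. The sketch below is our own illustration of the idea, not OSWorld's actual grader, and the state representation is invented:

```python
# Illustrative final-state grader: a task passes only if every field the
# task cares about matches the expected value in the VM's final state.
def grade(final_state: dict, expected: dict) -> bool:
    return all(final_state.get(key) == value
               for key, value in expected.items())

# Hypothetical example for the spreadsheet-to-email task:
# the grader only inspects the fields named in `expected`, so extra
# state (open windows, clipboard contents) doesn't affect the score.
result = grade(
    {"email_body": "42.0", "spreadsheet_open": True},
    {"email_body": "42.0"},
)  # → True
```

This is also why benchmark scores flatter real-world use: a grader that checks only the final state gives full credit no matter how many detours the agent took to get there.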

Holo3 scored 78.85% on this benchmark. For context, GPT-5.4 with computer use scored around 72.4%, and Claude Opus 4.6 Computer Use sits near 38%. Previous open-source models were below 30%.

That 78.85% number needs a caveat, though. OSWorld tasks are designed to have clear success criteria — the grader checks whether the final state matches the expected output. Real computer use involves ambiguity, unexpected popups, network latency, and interfaces that change between visits. A model that passes 78.85% of controlled lab tasks will not succeed at 78.85% of whatever you throw at it in production.

Still, the gap between Holo3 and everything else is significant. Going from 72% to 79% might not sound dramatic, but in practical terms it means fewer retries, fewer stuck states, and more tasks completing without human intervention.
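The retry arithmetic can be made concrete. Under the simplifying assumption that each task attempt succeeds independently with probability p (real failures are not independent, so treat this as illustrative only), the expected number of attempts per completed task is 1/p:

```python
# Expected attempts per completed task, assuming independent retries
# until success (a geometric distribution with success probability p).
def expected_attempts(p: float) -> float:
    return 1 / p

# GPT-5.4's ~72.4% vs Holo3's 78.85%:
# expected_attempts(0.724)  ≈ 1.38 attempts per completed task
# expected_attempts(0.7885) ≈ 1.27 attempts per completed task
```

An 8% difference in headline accuracy compounds into roughly 8% fewer total attempts at scale, before counting the tasks that never complete at all.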

Two Models, Two Price Points

H Company released two versions, which is an unusual move for a model at this performance level:

| Spec | Holo3-122B-A10B | Holo3-35B-A3B |
| --- | --- | --- |
| Total Parameters | 122B | 35B |
| Active Parameters | ~10B (MoE) | ~3B (MoE) |
| Access | API only | Open-source (Apache 2.0) |
| Input Price | $0.40 / M tokens | Free (self-hosted) |
| Output Price | $3.00 / M tokens | Free (self-hosted) |
| OSWorld Score | 78.85% | ~68% (estimated) |
| VRAM Needed | N/A (API) | ~24GB FP16 / ~12GB INT4 |
| Hugging Face | No | Yes |

Both use Mixture-of-Experts (MoE) architecture, which means only a fraction of the total parameters activate per inference pass. That's why the 35B model can run on consumer hardware — it's really using about 3B parameters at any given moment.

The pricing on the API model is aggressive. Claude Computer Use through the API costs roughly $15 per 1,000 screenshots when you factor in input tokens for each image. Holo3's API at $0.40/$3.00 per million tokens works out to about $1.50 for the same workload. That's a 10x cost reduction, which matters when you're running thousands of automated tasks.
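That ~$1.50 figure is easy to sanity-check. The per-screenshot token counts below are our assumptions (H Company doesn't publish them), but the arithmetic shows how the estimate comes together:

```python
# Back-of-envelope cost for 1,000 screenshots through the Holo3 API.
# Token counts per screenshot are ASSUMED, not published figures.
INPUT_PRICE = 0.40 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 3.00 / 1_000_000  # dollars per output token

screenshots = 1_000
input_tokens_per_shot = 2_500    # assumed: image encoding + instruction text
output_tokens_per_shot = 150     # assumed: one short action per step

total = (screenshots * input_tokens_per_shot * INPUT_PRICE
         + screenshots * output_tokens_per_shot * OUTPUT_PRICE)
# → roughly $1.45 for the batch
```

Screenshots dominate input tokens, so the low $0.40/M input price is what actually drives the gap: computer-use workloads are overwhelmingly input-heavy.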

Holo3 vs Claude Computer Use vs GPT-5.4 vs Operator

Computer use is getting crowded. Here's how the major options stack up as of early April 2026:

| Feature | Holo3 (122B API) | Claude Computer Use | GPT-5.4 CU | OpenAI Operator |
| --- | --- | --- | --- | --- |
| OSWorld Score | 78.85% | ~38% | ~72.4% | N/A |
| Open Source | 35B variant (Apache 2.0) | No | No | No |
| Cost per 1K tasks | ~$1.50 | ~$15 | ~$12 | $200/mo flat |
| GUI Types | Web + Desktop + Mobile | Web + Desktop | Web + Desktop | Web only |
| Error Recovery | Basic retry logic | Strong (self-correcting) | Moderate | Human handoff |
| Self-Hostable | Yes (35B model) | No | No | No |
| Maturity | Brand new (April 2026) | ~6 months | ~3 months | ~8 months |

The cost difference alone makes Holo3 worth watching. But "error recovery" is the row that matters most in practice. Claude Computer Use has months of production feedback baked in — it knows how to handle cookie banners, CAPTCHAs, loading spinners, and popups that block the element it needs to click. Holo3 doesn't have that yet. When something unexpected appears, it tends to retry the same action rather than reason about an alternative path.

Real-World Testing on Desktop Tasks

We ran Holo3-122B (API) and the open-source 35B model through four desktop tasks of increasing difficulty. These aren't OSWorld tasks — they're things we actually need to do.

Task 1: Fill Out a Web Form (Simple)

Navigate to a contact form, fill in name/email/message fields, and submit. The 122B API model handled this perfectly in about 12 seconds. The 35B model also succeeded but took 28 seconds and misclicked the email field once before correcting itself.

Task 2: Extract Data from a Spreadsheet (Medium)

Open LibreOffice Calc, find the sum of a specific column, and paste the result into a text file. Both models completed this. The 122B version finished in 19 seconds. The 35B took 41 seconds and created the text file in the wrong directory on the first attempt.

Task 3: Multi-App Workflow (Hard)

Copy a table from a PDF, paste it into a spreadsheet, add a calculated column, and email the result. The 122B model got through 3 of 4 steps but sent the email without the attachment. The 35B model got stuck trying to copy from the PDF viewer — it couldn't figure out the right-click context menu in Okular.

Task 4: Handle an Unexpected Popup (Stress Test)

We intentionally triggered a system notification mid-task. The 122B model paused, dismissed the notification, and resumed. The 35B model clicked the notification instead of dismissing it, opened a different application, and lost track of the original task entirely. This is where the 78.85% benchmark number meets reality.

Where Holo3 Falls Short

We want to be direct about the gaps, because the benchmark headline is misleading if you don't read the fine print:

  • No error reasoning. When Holo3 fails, it retries the same action up to 3 times rather than analyzing why it failed. Claude Computer Use actually reads error messages and adjusts strategy.
  • Fragile on dynamic UIs. Sites with heavy JavaScript rendering, infinite scroll, or animated transitions trip it up. It screenshots faster than elements load.
  • No persistent memory. Each task starts from scratch. If you want it to remember your login credentials or preferred settings, you need to pass those in every time.
  • 35B model quality gap is real. The open-source model is noticeably worse than the API version — maybe 10-15 percentage points lower on the tasks we tested. "Open source" doesn't mean "equivalent."
  • Documentation is sparse. H Company published the model weights and a paper, but practical integration guides barely exist. Community examples are still emerging.
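The gap between the two failure-handling styles described above is easy to see in code. This sketch is purely illustrative and reflects neither model's actual internals:

```python
# Two failure-handling strategies for agent actions (illustrative only).

def blind_retry(action, max_tries=3):
    """Holo3-style: repeat the same action up to 3 times, then give up."""
    for _ in range(max_tries):
        if action():
            return True
    return False


def retry_with_replan(action, replan, max_tries=3):
    """Claude-style: after each failure, reason about why it failed
    and ask for an alternative action before trying again."""
    for _ in range(max_tries):
        if action():
            return True
        action = replan(action)   # e.g. read the error, pick a new element
    return False
```

If a popup is blocking the target button, `blind_retry` clicks the same coordinates three times and fails; a replanning agent gets a chance to dismiss the popup first. That single architectural difference accounts for most of what we saw in Task 4.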

None of these are permanent problems. Holo3 launched three days ago. But if you're evaluating it for a production pipeline today, these gaps matter more than the OSWorld score.

Who Should Actually Use This?

Use Holo3 if: you're building automated desktop workflows at scale and cost matters. The 10x price advantage over Claude Computer Use is significant for batch processing — scraping, form filling, data extraction across hundreds of sites. The open-source 35B model also makes it viable for companies that can't send screen data to external APIs.

Stick with Claude or GPT-5.4 if: you need reliability on complex, multi-step tasks where things go wrong. The error recovery gap is real and won't be solved by a model update alone — it requires months of production feedback that Holo3 hasn't had yet.

For developers building AI-powered development tools or exploring how agents interact with software interfaces, Holo3's open weights are valuable for research regardless of production readiness. And if you're interested in the broader agent ecosystem, our Codex vs Claude Code comparison covers the coding side of this same trend.

Our Methodology

We tested Holo3-122B-A10B via H Company's API and Holo3-35B-A3B locally (RTX 4090, FP16) on April 3, 2026. All desktop tasks ran on Ubuntu 22.04 in a VirtualBox VM with a 1920x1080 display. Each task was attempted three times per model; we report the median result. Comparison data for Claude Computer Use (Opus 4.6) and GPT-5.4 are from our own prior testing plus published OSWorld leaderboard scores. OpenAI Operator data is from OpenAI's documentation — we did not test Operator independently for this article.

FAQ

Is Holo3 free to use?

The smaller Holo3-35B-A3B model is fully open-source under Apache 2.0 and available on Hugging Face. You can run it locally at no cost if you have a capable GPU (around 24GB VRAM minimum). The larger Holo3-122B-A10B is API-only, priced at $0.40 per million input tokens and $3.00 per million output tokens.

How does Holo3 compare to Claude Computer Use?

On the OSWorld-Verified benchmark, Holo3 scores 78.85% compared to Claude Computer Use (Opus 4.6) at around 38%. However, benchmarks measure isolated tasks. In our real-world testing, Claude Computer Use handles ambiguous instructions and error recovery more gracefully. Holo3 is faster and cheaper per task but less robust when things go wrong.

What hardware do I need to run Holo3 locally?

The open-source Holo3-35B-A3B model uses a Mixture-of-Experts architecture with only ~3B active parameters per forward pass. You need roughly 24GB of VRAM for FP16 inference, or 12-16GB if you quantize to INT4. An NVIDIA RTX 4090 or A6000 works. The 122B API model cannot be self-hosted.

Can Holo3 automate mobile apps?

H Company claims Holo3 supports web, desktop, and mobile GUI interaction. We only tested desktop and web tasks. Early community reports suggest mobile automation through Android emulators works but requires additional setup and has lower accuracy than desktop tasks.

Last Updated: April 4, 2026 • Written by: Jim Liu, web developer based in Sydney who has been testing AI computer use tools since Claude Computer Use launched in late 2025.
