Login
Back to Blog

youtu-vita OCR Benchmark 2026: Live Test Results on Documents, Receipts, UI Screens, and Small Text

We ran a live OCR benchmark for youtu-vita on eight image-understanding tasks, including documents, receipts, UI screenshots, rotated pages, scene text, and low-resolution small text. Here are the actual results, latency numbers, weak spots, and what they mean for production OCR workflows.

C
Crazyrouter Team
June 24, 2026 / 6 views
Share:

youtu-vita OCR Benchmark 2026: Live Test Results on Documents, Receipts, UI Screens, and Small Text#

If you are evaluating OCR-capable multimodal models for production work, the question is not just whether a model can read text in an image. The real question is whether it can do it consistently, with structured outputs, across the kinds of inputs teams actually send in production: screenshots, receipts, tables, rotated documents, scene text, and low-resolution UI captures.

We ran a live benchmark for youtu-vita through our OpenAI-compatible API path and scored it on a controlled OCR test set. This article shares the actual test data, what the model passed, where it struggled, and what kind of OCR workloads it looks good at right now.

Test setup#

This benchmark was run on June 24, 2026.

We used a local generated OCR benchmark set with these eight cases:

  1. document_basic
  2. receipt_total
  3. ui_settings
  4. table_statement
  5. scene_text_signboard
  6. rotated_document
  7. low_res_small_text
  8. chart_with_legend

The benchmark images were generated locally so the test would be stable and reproducible, rather than depending on external image hosts. Each request used the same OpenAI-compatible image input shape and the same JSON-only output instruction. The model was asked to:

  • transcribe visible text
  • answer structured OCR questions
  • return machine-readable JSON

Scoring method#

Each case was scored on four dimensions:

  1. Format stability Did the model return valid structured JSON?
  2. OCR text match Did it correctly capture the required visible text?
  3. Regex-sensitive fields Did it preserve exact formats for fields like IDs or totals?
  4. Structured answers Did it answer the requested key fields correctly?

The final score per case is a weighted total. A score of 1.000 means the model fully passed that case under this benchmark.

youtu-vita live benchmark results#

The full 8-case run completed successfully.

Headline metrics#

  • Success rate: 100%
  • Average total score: 0.875
  • p50 latency: 3794 ms
  • p90 latency: 4088 ms
  • Slowest case: 15284 ms

Case-by-case results#

Case IDCategoryStatusScoreLatency
document_basicDocument OCR2001.0003852 ms
receipt_totalReceipt OCR2001.0003616 ms
ui_settingsUI screenshot OCR2001.0003794 ms
table_statementTable OCR2001.0003434 ms
scene_text_signboardScene text2001.0001661 ms
rotated_documentRotated document2001.0002908 ms
low_res_small_textSmall text / low resolution2001.0004088 ms
chart_with_legendChart reasoning2000.00015284 ms

What youtu-vita did well#

For this run, youtu-vita was strong on the OCR tasks most teams care about first:

1. Clean document OCR#

It correctly extracted:

  • title
  • date
  • document ID
  • paragraph text

On document_basic, it returned a full structured transcription and correctly answered:

  • title = Quarterly Operations Summary
  • date = 2026-06-24
  • document_id = AB-123456

2. Receipt reading#

It handled receipt-style layout correctly on receipt_total, including:

  • merchant name
  • total amount
  • receipt number

That matters because receipt OCR often breaks on spacing, alignment, or repeated numeric fields. In this run, youtu-vita passed the receipt case with a full score.

3. UI screenshot OCR#

On ui_settings, it correctly captured:

  • page title
  • button labels
  • error code
  • supporting text

It also returned the structured answers we asked for:

  • primary_cta = Continue
  • error_code = E102

That makes it promising for support automation, QA workflows, screen parsing, and screenshot-based extraction tasks.

4. Table OCR#

On table_statement, the model passed the table case with a full 1.000 score in this run.

That is important because table OCR is often where vision models look good at plain text but fail at row-column alignment. In this benchmark, youtu-vita handled the table extraction cleanly enough to pass both the visible text requirements and the structured answer checks.

5. Rotated documents#

On rotated_document, the model also scored 1.000.

That suggests it is not limited to perfectly upright scanned pages. If your OCR workflow includes phone photos, skewed uploads, or documents captured in the wild, this is a meaningful result.

6. Low-resolution small text#

One of the most practically useful passes in this run was low_res_small_text, which also scored 1.000.

That case is closer to real dashboard and UI OCR than a clean printed PDF. If you need to read release notes, settings screens, logs, or admin panels from screenshots, this is a positive signal.

Where youtu-vita was weak#

The weak spot in this run was not standard OCR. It was chart reasoning.

On chart_with_legend, youtu-vita returned HTTP 200 but scored 0.000. It also took much longer than the rest of the test set at 15284 ms.

That tells us two things:

  1. The model can complete the request, but this benchmark did not show reliable performance on chart interpretation.
  2. OCR and chart understanding should be treated as separate capabilities.

This matters because many teams group all “image understanding” into one bucket. That is too coarse. A model can be strong at:

  • OCR
  • receipt parsing
  • screenshot reading
  • text extraction

and still be weak at:

  • chart reasoning
  • visual analytics
  • higher-order graph interpretation

Practical interpretation#

Based on this live run, youtu-vita looks strongest for these workloads:

Good fit#

  • document OCR
  • receipt OCR
  • UI screenshot extraction
  • rotated page reading
  • small-text screenshot parsing
  • sign and scene text extraction

Not yet a strong conclusion#

  • chart understanding
  • graph interpretation
  • analytics-style visual reasoning

If your workload is mostly “read the text, extract the fields, give me clean JSON,” this benchmark suggests youtu-vita is already useful.

If your workload is “understand a chart, infer trends, compare the latest month, and reason visually,” this benchmark does not support calling it strong there yet.

Why this matters for production teams#

Many OCR evaluations are too soft. They say a model is “good at image understanding” after a single logo or document test. That does not help if you need to decide whether to route:

  • support screenshots
  • invoices
  • receipts
  • internal ops dashboards
  • phone photos of documents

into a production OCR pipeline.

This benchmark is more useful because it separates:

  • OCR stability
  • structured extraction
  • small-text robustness
  • rotated-input handling
  • chart reasoning

For this run, youtu-vita was clearly stable across the OCR-heavy categories.

Final verdict#

For this live benchmark, youtu-vita was the most stable OCR model we tested in this run.

The actual results were:

  • 100% request success across the full 8-case OCR set
  • 0.875 average total score
  • 1.000 on 7 of 8 OCR-oriented cases
  • failure only on the chart reasoning case

That makes youtu-vita a strong candidate if your main need is text extraction from images, especially for:

  • business documents
  • receipts
  • UI screenshots
  • low-resolution text
  • rotated pages

It does not mean it is the best choice for every kind of vision workload. But if your problem is OCR, not visual analytics, this is one of the clearest positive live results we have seen in this benchmark.

Implementation Guides

Topics

Related Posts

AI Prompt Engineering Best Practices: The Developer's Guide for 2026Tutorial

AI Prompt Engineering Best Practices: The Developer's Guide for 2026

"Master prompt engineering for GPT, Claude, and Gemini. Learn proven techniques, templates, and best practices to get better results from any AI model."

Feb 27
MCP (Model Context Protocol) Complete Guide: The New Standard for AI Tool IntegrationTutorial

MCP (Model Context Protocol) Complete Guide: The New Standard for AI Tool Integration

Everything developers need to know about MCP (Model Context Protocol). Covers what it is, how it works, how to build MCP servers, and why it matters for AI application development.

Feb 23
Claude Code Pricing Guide 2026: Complete BreakdownGuide

Claude Code Pricing Guide 2026: Complete Breakdown

"Complete breakdown of Claude Code pricing in 2026, including Max plan costs, API token usage, and how to save up to 50% with Crazyrouter."

Feb 15
Best OpenRouter Alternative in 2026: A Real Unified AI API Gateway TestComparison

Best OpenRouter Alternative in 2026: A Real Unified AI API Gateway Test

We tested https://cn.crazyrouter.com/v1 as an OpenRouter alternative using /v1/models and six real chat completions across GPT, Gemini, Qwen and OpenAI-compatible routes. Here are the practical migration findings for developers.

Jun 12
Background Coding Agents Without Lock-In: Recreating the Cursor Pattern with Git WorktreesAI Coding

Background Coding Agents Without Lock-In: Recreating the Cursor Pattern with Git Worktrees

Cursor background agents are useful because they move long-running AI coding work out of your local single-branch loop. This guide rebuilds the same workflow pattern with plain git worktrees, task packets, trace logs, and Crazyrouter model routing.

Jun 4
17|Claude Code Integration with Crazyrouter, Part 17: From Idea to AI ProductClaude Code

17|Claude Code Integration with Crazyrouter, Part 17: From Idea to AI Product

17|Claude Code Integration with Crazyrouter, Part 17: From Idea to AI Product. This article covers unified access for Claude Code and Crazyrouter, configuration checks, and hands-on workflows to help readers follow the site docs and build a reusable development workflow.

Jun 10