Research

Benchmark Results

Vitreon Legal's retrieval pipeline is evaluated on independent, public benchmarks. All results are reproducible. We report scores on the same test sets and evaluation protocols as the original benchmark papers.

The pipeline that achieved these scores now powers the production platform.


1. Performance Summary

| Benchmark | Vitreon Score | Published SOTA | Improvement / Placement |
|---|---|---|---|
| GaRAGe (ACL 2025) | 0.824 | 0.607 | +36% |
| LEXam Open EN (ICLR 2026) | 0.691 | 0.572 | +21% |
| Legal RAG Bench | 0.860 | | |
| ARLC 2026 Warmup | 0.958 | | 1st / 80 teams |
| ARLC 2026 Finals | 0.719 | | 4th / 80 teams |
| Citation Coverage | 100% | | |

2. GaRAGe Benchmark

RAF Score

0.824

+36% above published SOTA of 0.607

GaRAGe (General-purpose RAG evaluation) is a retrieval-augmented generation benchmark published at ACL 2025 by Amazon Science. It evaluates end-to-end RAG pipelines on their ability to retrieve relevant passages and generate accurate, grounded answers from a heterogeneous document corpus.

The primary metric is the Retrieval Accuracy Factor (RAF), which measures both retrieval precision and answer fidelity. The previously published state-of-the-art score was 0.607. Vitreon Legal's pipeline achieves 0.824 — a 36% improvement.

This score was achieved using Vitreon's production retrieval pipeline without benchmark-specific tuning: hybrid BM25 + vector search with Reciprocal Rank Fusion, asymmetric embedding, and cross-encoder reranking.


3. LEXam Open EN

Score

0.691

+21% above the Claude 3.7 Sonnet baseline of 0.572

LEXam is a legal examination benchmark published at ICLR 2026 that evaluates AI systems on their ability to answer legal questions drawn from bar exams and professional legal assessments. The “Open EN” variant tests English-language open-ended legal reasoning.

The baseline score of 0.572 was set by Claude 3.7 Sonnet in a direct prompting configuration (no retrieval). Vitreon Legal's retrieval-augmented pipeline achieves 0.691 — demonstrating that source-grounded retrieval significantly improves legal reasoning accuracy compared to pure LLM approaches.


4. Legal RAG Bench

Retrieval Accuracy

0.860

Domain-specific legal retrieval benchmark

Legal RAG Bench is a domain-specific benchmark designed to evaluate retrieval-augmented generation systems on legal document corpora. It tests passage retrieval accuracy, answer grounding fidelity, and citation correctness across a range of legal question types.

Vitreon Legal scores 0.860 on the retrieval accuracy metric, reflecting the effectiveness of the hybrid search architecture and cross-encoder reranking stage when applied to legal-domain documents.


5. ARLC 2026 Competition

Overall Placement

4th / 80 teams

$32K prize pool, Dubai AI Week

The Agentic RAG Legal Challenge (ARLC 2026) was an international legal AI competition organized during Dubai AI Week. 80 teams competed to build the most accurate legal question-answering system over 300+ DIFC (Dubai International Financial Centre) legal documents.

Vitreon Legal competed under the team name “Neon Team”. The competition consisted of a warmup round and a finals round:

  • Warmup round: 0.958 — 1st place out of 80 teams.
  • Finals round: 0.719 — 4th place overall.

The same retrieval and reasoning pipeline that achieved these competition scores now powers the production Vitreon Legal platform. For the full story of the competition, see our blog post: How Vitreon Placed 4th in ARLC 2026.


6. 100% Citation Coverage

Citation Coverage

100%

Every answer cites exact page and clause

Every answer generated by Vitreon Legal includes citations to the exact page, clause, and source document from which the information was retrieved. This is not a statistical average — the system architecturally guarantees that every claim in an answer is traceable to a specific passage in the legal corpus.

Source grounding is enforced at the retrieval stage: the LLM only generates answers from passages that have been explicitly retrieved and verified by the reranking pipeline. No external knowledge or training data is used in the answer generation step.
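The grounding guarantee described above can be sketched as a post-generation check: every citation marker in an answer must resolve to a passage that was actually retrieved. This is an illustrative sketch of the idea, not the production implementation; the `[n]` citation format and function names are assumptions.

```python
import re

def validate_citations(answer: str, retrieved_ids: set[str]) -> bool:
    """Return True only if the answer cites at least one passage and
    every cited passage id was actually retrieved by the pipeline."""
    cited = set(re.findall(r"\[(\d+)\]", answer))
    return bool(cited) and cited <= retrieved_ids

# An answer citing only retrieved passages passes; an uncited or
# hallucinated citation fails.
validate_citations("Notice is 90 days [1].", {"1", "2"})   # passes
validate_citations("Notice is 90 days.", {"1", "2"})       # fails: no citation
validate_citations("Notice is 90 days [7].", {"1", "2"})   # fails: [7] not retrieved
```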


7. Retrieval Pipeline Methodology

Vitreon Legal's retrieval architecture is a multi-stage pipeline designed for high-precision legal document retrieval:

Stage 1: Hybrid Search

Queries are processed through both BM25 (lexical) and vector search (semantic) in parallel. Results are merged using Reciprocal Rank Fusion (RRF), which combines the strengths of exact keyword matching with semantic similarity.
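The RRF merge step can be sketched in a few lines: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in, with k conventionally set to 60. A minimal sketch (the document ids are invented for illustration):

```python
def rrf_merge(bm25_ranked, vector_ranked, k=60):
    """Merge two ranked lists of doc ids with Reciprocal Rank Fusion.

    score(d) = sum over lists of 1 / (k + rank_of_d_in_list),
    so documents ranked highly in either list float to the top.
    """
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["clause_12", "clause_7", "clause_3"]   # lexical ranking
vec = ["clause_7", "clause_9", "clause_12"]    # semantic ranking
fused = rrf_merge(bm25, vec)
# clause_7 and clause_12 appear in both lists and rank highest
```

The constant k dampens the influence of any single list's top ranks, which is why RRF needs no score normalization between BM25 and cosine similarity.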

Stage 2: Asymmetric Embedding

Document passages are embedded using Qwen3-Embedding-8B with asymmetric encoding — queries and documents use different embedding prefixes optimized for retrieval rather than similarity. This model was selected after extensive evaluation against multilingual legal corpora.
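Asymmetric encoding means the query side carries an instruction prefix while passages are embedded as plain text. The sketch below illustrates the convention; the exact prefix wording is an assumption, not the production string.

```python
# Query side: an instruction prefix steers the embedding model toward
# retrieval rather than plain sentence similarity. (Prefix wording is
# illustrative, not the exact production prompt.)
QUERY_PREFIX = (
    "Instruct: Retrieve legal passages relevant to the query.\nQuery: "
)

def format_query(text: str) -> str:
    return QUERY_PREFIX + text

def format_passage(text: str) -> str:
    # Passage side: embedded as-is, with no instruction prefix.
    return text

q = format_query("What is the notice period for lease termination?")
p = format_passage("Clause 14.2: Either party may terminate with 90 days' notice.")
```

Both strings would then be passed to the same embedding model; the asymmetry lives entirely in the input formatting.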

Stage 3: Cross-Encoder Reranking

Top candidate passages are reranked using a cross-encoder model that jointly encodes the query and each passage. This stage provides the final precision boost that distinguishes relevant passages from merely similar ones.
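The reranking stage reduces to scoring each (query, passage) pair jointly and sorting. In the sketch below, `score_pair` is a toy lexical-overlap stand-in for a real cross-encoder forward pass; only the surrounding rerank logic reflects the stage described above.

```python
def score_pair(query: str, passage: str) -> float:
    """Placeholder for a cross-encoder forward pass: a real model would
    jointly encode query + passage; this toy scores token overlap."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / max(len(q_tokens), 1)

def rerank(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    """Score every candidate against the query, keep the top_k."""
    scored = sorted(passages, key=lambda p: score_pair(query, p), reverse=True)
    return scored[:top_k]

candidates = [
    "Governing law shall be the law of the DIFC.",
    "The termination notice period is 90 days.",
]
top = rerank("termination notice period", candidates, top_k=1)
# the termination clause outscores the merely nearby governing-law clause
```

Because the cross-encoder sees query and passage together, it can distinguish genuinely relevant passages from ones that are only topically similar — the precision boost the paragraph above describes.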

Stage 4: Grounded Answer Generation

The reranked passages are provided to the LLM with strict instructions to generate answers only from the retrieved context. Every claim must cite the source passage, page number, and clause reference.
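A prompt for this stage might pack the reranked passages as numbered, fully-attributed context blocks ahead of the question. The field names and instruction wording below are illustrative, not the production prompt.

```python
def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a context-only prompt.

    Each passage dict carries 'doc', 'page', 'clause', and 'text'
    (field names are an assumption for this sketch)."""
    context = "\n".join(
        f"[{i}] {p['doc']}, p. {p['page']}, clause {p['clause']}: {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using ONLY the numbered passages below. "
        "Cite every claim as [n] with document, page, and clause. "
        "If the passages do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is the termination notice period?",
    [{"doc": "Lease Agreement", "page": 12, "clause": "14.2",
      "text": "Either party may terminate with 90 days' notice."}],
)
```

Keeping page and clause metadata inline with each numbered passage is what lets the model emit the exact citations the previous section guarantees.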


8. Legal Corpus

Vitreon Legal indexes legal documents across four jurisdictions:

| Jurisdiction | Court Decisions | Statutes |
|---|---|---|
| Czech Republic | 295,000+ | 6,800+ |
| DIFC (Dubai) | 300+ | Full legislation library |
| United Kingdom | Coverage expanding | Key statutes |
| Australia | Coverage expanding | Key statutes |

9. Competitor Comparison

How Vitreon Legal compares to established Czech legal research platforms:

| Feature | Vitreon Legal | Beck-online | ASPI |
|---|---|---|---|
| AI-powered answers | Yes | No | No |
| Source-grounded citations | 100% coverage | Manual lookup | Manual lookup |
| GaRAGe benchmark | 0.824 (+36% vs SOTA) | Not tested | Not tested |
| Competition placement | 4th / 80 (ARLC 2026) | Not entered | Not entered |
| Self-serve pricing | From $0/mo | Enterprise sales | Enterprise/institutional |
| Czech court decisions | 295,000+ | Extensive | Extensive |
| Multi-jurisdiction | 4 jurisdictions | Czech/German focus | Czech focus |
| Document upload | Yes (custom corpus) | No | No |

10. Academic References

The benchmarks referenced on this page are from the following peer-reviewed publications:

  • GaRAGe: General-purpose RAG Evaluation — ACL 2025 Findings, Amazon Science.
  • LEXam: Legal Examination Benchmark — ICLR 2026, OpenReview.
  • ARLC 2026: Agentic RAG Legal Challenge — Dubai AI Week 2026, machinescansee.com leaderboard.

Ready to try Vitreon Legal?

See the retrieval pipeline in action: 100% citation coverage, free to start.