Benchmark Results
Vitreon Legal's retrieval pipeline is evaluated on independent, public benchmarks. All results are reproducible: we report scores on the same test sets, using the same evaluation protocols as the original benchmark papers.
The pipeline that achieved these scores now powers the production platform.
1. Performance Summary
| Benchmark | Vitreon Score | Published SOTA | Improvement |
|---|---|---|---|
| GaRAGe (ACL 2025) | 0.824 | 0.607 | +36% |
| LEXam Open EN (ICLR 2026) | 0.691 | 0.572 | +21% |
| Legal RAG Bench | 0.860 | — | — |
| ARLC 2026 Warmup | 0.958 | — | 1st / 80 teams |
| ARLC 2026 Finals | 0.719 | — | 4th / 80 teams |
| Citation Coverage | 100% | — | — |
2. GaRAGe Benchmark
RAF Score
0.824
+36% above published SOTA of 0.607
GaRAGe (General-purpose RAG evaluation) is a retrieval-augmented generation benchmark published at ACL 2025 by Amazon Science. It evaluates end-to-end RAG pipelines on their ability to retrieve relevant passages and generate accurate, grounded answers from a heterogeneous document corpus.
The primary metric is the Retrieval Accuracy Factor (RAF), which measures both retrieval precision and answer fidelity. The previously published state-of-the-art score was 0.607. Vitreon Legal's pipeline achieves 0.824 — a 36% improvement.
This score was achieved using Vitreon's production retrieval pipeline without benchmark-specific tuning: hybrid BM25 + vector search with Reciprocal Rank Fusion, asymmetric embedding, and cross-encoder reranking.
3. LEXam Open EN
Score
0.691
+21% above Claude 3.7-S baseline of 0.572
LEXam is a legal examination benchmark published at ICLR 2026 that evaluates AI systems on their ability to answer legal questions drawn from bar exams and professional legal assessments. The “Open EN” variant tests English-language open-ended legal reasoning.
The baseline score of 0.572 was set by Claude 3.7 Sonnet in a direct prompting configuration (no retrieval). Vitreon Legal's retrieval-augmented pipeline achieves 0.691 — demonstrating that source-grounded retrieval significantly improves legal reasoning accuracy compared to pure LLM approaches.
4. Legal RAG Bench
Retrieval Accuracy
0.860
Domain-specific legal retrieval benchmark
Legal RAG Bench is a domain-specific benchmark designed to evaluate retrieval-augmented generation systems on legal document corpora. It tests passage retrieval accuracy, answer grounding fidelity, and citation correctness across a range of legal question types.
Vitreon Legal scores 0.860 on the retrieval accuracy metric, reflecting the effectiveness of the hybrid search architecture and cross-encoder reranking stage when applied to legal-domain documents.
5. ARLC 2026 Competition
Overall Placement
4th / 80 teams
$32K prize pool, Dubai AI Week
The Agentic RAG Legal Challenge (ARLC 2026) was an international legal AI competition organized during Dubai AI Week. 80 teams competed to build the most accurate legal question-answering system over 300+ DIFC (Dubai International Financial Centre) legal documents.
Vitreon Legal competed under the team name “Neon Team”. The competition consisted of a warmup round and a finals round:
- Warmup round: 0.958 — 1st place out of 80 teams.
- Finals round: 0.719 — 4th place overall.
The same retrieval and reasoning pipeline that achieved these competition scores now powers the production Vitreon Legal platform. For the full story of the competition, see our blog post: How Vitreon Placed 4th in ARLC 2026.
6. 100% Citation Coverage
Citation Coverage
100%
Every answer cites exact page and clause
Every answer generated by Vitreon Legal includes citations to the exact page, clause, and source document from which the information was retrieved. This is not a statistical average — the system architecturally guarantees that every claim in an answer is traceable to a specific passage in the legal corpus.
Source grounding is enforced at the retrieval stage: the LLM only generates answers from passages that have been explicitly retrieved and verified by the reranking pipeline. No external knowledge or training data is used in the answer generation step.
7. Retrieval Pipeline Methodology
Vitreon Legal's retrieval architecture is a multi-stage pipeline designed for high-precision legal document retrieval:
Stage 1: Hybrid Search
Queries are processed through both BM25 (lexical) and vector search (semantic) in parallel. Results are merged using Reciprocal Rank Fusion (RRF), which combines the strengths of exact keyword matching with semantic similarity.
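The merge step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not Vitreon's production code; the document IDs and the smoothing constant `k=60` (a common default in the RRF literature) are illustrative assumptions.

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch.
def rrf_merge(rankings, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document's fused score is the sum over input lists of
    1 / (k + rank), where rank is 1-based. Documents ranked highly
    by either the lexical (BM25) or the semantic (vector) retriever
    rise to the top of the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers partially disagree; RRF balances both signals.
bm25_hits   = ["clause-12", "clause-7", "clause-3"]
vector_hits = ["clause-7", "clause-9", "clause-12"]
fused = rrf_merge([bm25_hits, vector_hits])
# "clause-7" wins: it is ranked near the top of both lists.
```

Because RRF operates only on ranks, it needs no score normalization between the BM25 and vector retrievers, which is why it is a popular fusion choice for hybrid search.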
Stage 2: Asymmetric Embedding
Document passages are embedded using Qwen3-Embedding-8B with asymmetric encoding — queries and documents use different embedding prefixes optimized for retrieval rather than similarity. This model was selected after extensive evaluation against multilingual legal corpora.
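Asymmetric encoding typically means only the query side receives an instruction prefix before embedding. The sketch below shows that shape; the template wording is a placeholder in the instruction style used by several recent embedding models, and the exact format for Qwen3-Embedding-8B should be taken from its model card.

```python
# Sketch of asymmetric query/document text preparation before embedding.
# The instruction string is an illustrative placeholder, not the exact
# Qwen3-Embedding-8B template.
QUERY_TEMPLATE = (
    "Instruct: Given a legal question, retrieve passages that answer it\n"
    "Query: {text}"
)

def encode_query(text: str) -> str:
    # Queries carry a task instruction so short questions map into the
    # same vector space as long statutory passages.
    return QUERY_TEMPLATE.format(text=text)

def encode_document(text: str) -> str:
    # Documents are embedded as-is; only the query side is prefixed.
    return text
```

The asymmetry matters because a two-word query and a multi-paragraph clause are linguistically very different objects; instructing the model on the retrieval task at query time narrows that gap.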
Stage 3: Cross-Encoder Reranking
Top candidate passages are reranked using a cross-encoder model that jointly encodes the query and each passage. This stage provides the final precision boost that distinguishes relevant passages from merely similar ones.
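The reranking control flow can be sketched as follows. A real deployment would score each (query, passage) pair with a trained cross-encoder; here `score_pair` is a deliberately crude token-overlap stand-in so the example stays self-contained, and the candidate passages are invented.

```python
# Cross-encoder reranking sketch with a placeholder scorer.
def score_pair(query: str, passage: str) -> float:
    # Stand-in relevance score: fraction of query tokens found in the
    # passage. A real cross-encoder instead feeds the concatenated
    # "query [SEP] passage" through one transformer and outputs a logit,
    # letting the two texts attend to each other jointly.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query, passages, top_k=3):
    # Score every pair, then keep the top_k highest-scoring passages.
    scored = sorted(passages, key=lambda p: score_pair(query, p), reverse=True)
    return scored[:top_k]

candidates = [
    "Clause 4: termination requires 30 days prior approval.",
    "Schedule B lists the applicable filing fees.",
    "Notice of termination must be written and delivered in person.",
]
best = rerank("written notice of termination", candidates, top_k=2)
```

Joint encoding is what makes this stage expensive but precise: unlike the embedding stage, the model sees query and passage together, so it can reject passages that are topically similar yet do not actually answer the question.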
Stage 4: Grounded Answer Generation
The reranked passages are provided to the LLM with strict instructions to generate answers only from the retrieved context. Every claim must cite the source passage, page number, and clause reference.
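One common way to enforce this is in the prompt itself: each passage is rendered with its citation metadata, and the instructions forbid answering outside the provided context. The sketch below assumes hypothetical metadata fields (`doc`, `page`, `clause`) and illustrative instruction wording, not Vitreon's actual prompt.

```python
# Sketch of grounded prompt assembly with per-passage citation tags.
def build_grounded_prompt(question, passages):
    """Render retrieved passages with citation tags and instruct the
    model to answer only from them, citing each claim."""
    context = "\n".join(
        f"[{i}] ({p['doc']}, p.{p['page']}, cl.{p['clause']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using ONLY the passages below. Cite every claim with its "
        "bracketed passage number. If the passages are insufficient, "
        "say so instead of guessing.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    "What notice period applies to termination?",
    [{"doc": "DIFC Employment Law", "page": 14, "clause": "62(1)",
      "text": "Termination requires 30 days written notice."}],
)
```

Tagging passages with stable identifiers also makes post-hoc verification possible: every bracketed citation in the generated answer can be checked against the passage list before the answer is shown to the user.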
8. Legal Corpus
Vitreon Legal indexes legal documents across four jurisdictions:
| Jurisdiction | Court Decisions | Statutes |
|---|---|---|
| Czech Republic | 295,000+ | 6,800+ |
| DIFC (Dubai) | 300+ | Full legislation library |
| United Kingdom | Coverage expanding | Key statutes |
| Australia | Coverage expanding | Key statutes |
9. Competitor Comparison
How Vitreon Legal compares to established Czech legal research platforms:
| Feature | Vitreon Legal | Beck-online | ASPI |
|---|---|---|---|
| AI-powered answers | Yes | No | No |
| Source-grounded citations | 100% coverage | Manual lookup | Manual lookup |
| GaRAGe benchmark | 0.824 (+36% SOTA) | Not tested | Not tested |
| Competition placement | 4th / 80 (ARLC 2026) | Not entered | Not entered |
| Self-serve pricing | From $0/mo | Enterprise sales | Enterprise/institutional |
| Czech court decisions | 295,000+ | Extensive | Extensive |
| Multi-jurisdiction | 4 jurisdictions | Czech/German focus | Czech focus |
| Document upload | Yes (custom corpus) | No | No |
10. Academic References
The benchmarks referenced on this page are from the following peer-reviewed publications:
- GaRAGe: General-purpose RAG Evaluation — ACL 2025 Findings, Amazon Science.
- LEXam: Legal Examination Benchmark — ICLR 2026, OpenReview.
- ARLC 2026: Agentic RAG Legal Challenge — Dubai AI Week 2026, machinescansee.com leaderboard.
Ready to try Vitreon Legal?
See the retrieval pipeline in action, with 100% citation coverage. Start free.