Benchmark Results
Vitreon Legal's retrieval pipeline is evaluated on independent, public benchmarks. All results are reproducible: we report scores on the same test sets, using the same evaluation protocols as the original benchmark papers.
The pipeline that achieved these scores now powers the production platform.
1. Performance Summary
| Benchmark | Vitreon Score | Published SOTA | Improvement |
|---|---|---|---|
| GaRAGe (ACL 2025) | 0.824 | 0.607 | +36% |
| LEXam Open EN (ICLR 2026) | 0.691 | 0.572 | +21% |
| Legal RAG Bench | 0.860 | — | — |
| ARLC 2026 Warmup | 0.958 | — | 1st / 80 teams |
| ARLC 2026 Finals | 0.719 | — | 4th / 80 teams |
| Citation Coverage | 100% | — | — |
2. GaRAGe Benchmark
RAF Score
0.824
+36% above published SOTA of 0.607
GaRAGe (General-purpose RAG evaluation) is a retrieval-augmented generation benchmark published at ACL 2025 by Amazon Science. It evaluates end-to-end RAG pipelines on their ability to retrieve relevant passages and generate accurate, grounded answers from a heterogeneous document corpus.
The primary metric is the Retrieval Accuracy Factor (RAF), which measures both retrieval precision and answer fidelity. The previously published state-of-the-art score was 0.607. Vitreon Legal's pipeline achieves 0.824 — a 36% improvement.
This score was achieved using Vitreon's production retrieval pipeline without benchmark-specific tuning: hybrid BM25 + vector search with Reciprocal Rank Fusion, asymmetric embedding, and cross-encoder reranking.
3. LEXam Open EN
Score
0.691
+21% above Claude 3.7-S baseline of 0.572
LEXam is a legal examination benchmark published at ICLR 2026 that evaluates AI systems on their ability to answer legal questions drawn from bar exams and professional legal assessments. The “Open EN” variant tests English-language open-ended legal reasoning.
The baseline score of 0.572 was set by Claude 3.7 Sonnet in a direct prompting configuration (no retrieval). Vitreon Legal's retrieval-augmented pipeline achieves 0.691 — demonstrating that source-grounded retrieval significantly improves legal reasoning accuracy compared to pure LLM approaches.
4. Legal RAG Bench
Retrieval Accuracy
0.860
Domain-specific legal retrieval benchmark
Legal RAG Bench is a domain-specific benchmark designed to evaluate retrieval-augmented generation systems on legal document corpora. It tests passage retrieval accuracy, answer grounding fidelity, and citation correctness across a range of legal question types.
Vitreon Legal scores 0.860 on the retrieval accuracy metric, reflecting the effectiveness of the hybrid search architecture and cross-encoder reranking stage when applied to legal-domain documents.
5. ARLC 2026 Competition
Overall Placement
4th / 80 teams
$32K prize pool, Dubai AI Week
The Agentic RAG Legal Challenge (ARLC 2026) was an international legal AI competition organized during Dubai AI Week. 80 teams competed to build the most accurate legal question-answering system over 300+ DIFC (Dubai International Financial Centre) legal documents.
Vitreon Legal competed under the team name “Neon Team”. The competition consisted of a warmup round and a finals round:
- Warmup round: 0.958 — 1st place out of 80 teams.
- Finals round: 0.719 — 4th place overall.
The same retrieval and reasoning pipeline that achieved these competition scores now powers the production Vitreon Legal platform. For the full story of the competition, see our blog post: How Vitreon Placed 4th in ARLC 2026.
6. 100% Citation Coverage
Citation Coverage
100%
Every answer cites exact page and clause
Every answer generated by Vitreon Legal includes citations to the exact page, clause, and source document from which the information was retrieved. This is not a statistical average — the system architecturally guarantees that every claim in an answer is traceable to a specific passage in the legal corpus.
Source grounding is enforced at the retrieval stage: the LLM only generates answers from passages that have been explicitly retrieved and verified by the reranking pipeline. No external knowledge or training data is used in the answer generation step.
7. Retrieval Pipeline Methodology
Vitreon Legal's retrieval architecture is a multi-stage pipeline designed for high-precision legal document retrieval:
Stage 1: Hybrid Search
Queries are processed through both BM25 (lexical) and vector search (semantic) in parallel. Results are merged using Reciprocal Rank Fusion (RRF), which combines the strengths of exact keyword matching with semantic similarity.
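The merge step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not Vitreon's production code; the document IDs and the smoothing constant `k=60` (a common default in the RRF literature) are illustrative assumptions.

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch.
def rrf_merge(rankings, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document's fused score is the sum over input lists of
    1 / (k + rank), where rank is 1-based. Documents ranked highly
    by either the lexical (BM25) or the semantic (vector) retriever
    rise to the top of the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers partially disagree; RRF balances both signals.
bm25_hits   = ["clause-12", "clause-7", "clause-3"]
vector_hits = ["clause-7", "clause-9", "clause-12"]
fused = rrf_merge([bm25_hits, vector_hits])
# "clause-7" wins: it is ranked near the top of both lists.
```

Because RRF operates only on ranks, it needs no score normalization between the BM25 and vector retrievers, which is why it is a popular fusion choice for hybrid search.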
Stage 2: Asymmetric Embedding
Document passages are embedded using Qwen3-Embedding-8B with asymmetric encoding — queries and documents use different embedding prefixes optimized for retrieval rather than similarity. This model was selected after extensive evaluation against multilingual legal corpora.
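Asymmetric encoding typically means only the query side receives an instruction prefix before embedding. The sketch below shows that shape; the template wording is a placeholder in the instruction style used by several recent embedding models, and the exact format for Qwen3-Embedding-8B should be taken from its model card.

```python
# Sketch of asymmetric query/document text preparation before embedding.
# The instruction string is an illustrative placeholder, not the exact
# Qwen3-Embedding-8B template.
QUERY_TEMPLATE = (
    "Instruct: Given a legal question, retrieve passages that answer it\n"
    "Query: {text}"
)

def encode_query(text: str) -> str:
    # Queries carry a task instruction so short questions map into the
    # same vector space as long statutory passages.
    return QUERY_TEMPLATE.format(text=text)

def encode_document(text: str) -> str:
    # Documents are embedded as-is; only the query side is prefixed.
    return text
```

The asymmetry matters because a two-word query and a multi-paragraph clause are linguistically very different objects; instructing the model on the retrieval task at query time narrows that gap.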
Stage 3: Cross-Encoder Reranking
Top candidate passages are reranked using a cross-encoder model that jointly encodes the query and each passage. This stage provides the final precision boost that distinguishes relevant passages from merely similar ones.
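The reranking control flow can be sketched as follows. A real deployment would score each (query, passage) pair with a trained cross-encoder; here `score_pair` is a deliberately crude token-overlap stand-in so the example stays self-contained, and the candidate passages are invented.

```python
# Cross-encoder reranking sketch with a placeholder scorer.
def score_pair(query: str, passage: str) -> float:
    # Stand-in relevance score: fraction of query tokens found in the
    # passage. A real cross-encoder instead feeds the concatenated
    # "query [SEP] passage" through one transformer and outputs a logit,
    # letting the two texts attend to each other jointly.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query, passages, top_k=3):
    # Score every pair, then keep the top_k highest-scoring passages.
    scored = sorted(passages, key=lambda p: score_pair(query, p), reverse=True)
    return scored[:top_k]

candidates = [
    "Clause 4: termination requires 30 days prior approval.",
    "Schedule B lists the applicable filing fees.",
    "Notice of termination must be written and delivered in person.",
]
best = rerank("written notice of termination", candidates, top_k=2)
```

Joint encoding is what makes this stage expensive but precise: unlike the embedding stage, the model sees query and passage together, so it can reject passages that are topically similar yet do not actually answer the question.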
Stage 4: Grounded Answer Generation
The reranked passages are provided to the LLM with strict instructions to generate answers only from the retrieved context. Every claim must cite the source passage, page number, and clause reference.
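One common way to enforce this is in the prompt itself: each passage is rendered with its citation metadata, and the instructions forbid answering outside the provided context. The sketch below assumes hypothetical metadata fields (`doc`, `page`, `clause`) and illustrative instruction wording, not Vitreon's actual prompt.

```python
# Sketch of grounded prompt assembly with per-passage citation tags.
def build_grounded_prompt(question, passages):
    """Render retrieved passages with citation tags and instruct the
    model to answer only from them, citing each claim."""
    context = "\n".join(
        f"[{i}] ({p['doc']}, p.{p['page']}, cl.{p['clause']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using ONLY the passages below. Cite every claim with its "
        "bracketed passage number. If the passages are insufficient, "
        "say so instead of guessing.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    "What notice period applies to termination?",
    [{"doc": "DIFC Employment Law", "page": 14, "clause": "62(1)",
      "text": "Termination requires 30 days written notice."}],
)
```

Tagging passages with stable identifiers also makes post-hoc verification possible: every bracketed citation in the generated answer can be checked against the passage list before the answer is shown to the user.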
8. Legal Corpus
Vitreon Legal indexes legal documents across four jurisdictions:
| Jurisdiction | Court Decisions | Statutes |
|---|---|---|
| Czech Republic | 295,000+ | 6,800+ |
| DIFC (Dubai) | 300+ | Full legislation library |
| United Kingdom | Coverage expanding | Key statutes |
| Australia | Coverage expanding | Key statutes |
9. Competitor Comparison
How Vitreon Legal compares to established Czech legal research platforms:
| Feature | Vitreon Legal | Beck-online | ASPI |
|---|---|---|---|
| AI-powered answers | Yes | No | No |
| Source-grounded citations | 100% coverage | Manual lookup | Manual lookup |
| GaRAGe benchmark | 0.824 (+36% SOTA) | Not tested | Not tested |
| Competition placement | 4th / 80 (ARLC 2026) | Not entered | Not entered |
| Self-serve pricing | From $0/mo | Enterprise sales | Enterprise/institutional |
| Czech court decisions | 295,000+ | Extensive | Extensive |
| Multi-jurisdiction | 4 jurisdictions | Czech/German focus | Czech focus |
| Document upload | Yes (custom corpus) | No | No |
10. Academic References
The benchmarks referenced on this page are from the following peer-reviewed publications:
- GaRAGe: General-purpose RAG Evaluation — ACL 2025 Findings, Amazon Science.
- LEXam: Legal Examination Benchmark — ICLR 2026, OpenReview.
- ARLC 2026: Agentic RAG Legal Challenge — Dubai AI Week 2026, machinescansee.com leaderboard.
Ready to try Vitreon Legal?
See the retrieval pipeline in action, with 100% citation coverage. Start free.