In February 2026, the Agentic RAG Legal Challenge (ARLC) was announced as part of Dubai AI Week — an international competition challenging teams to build AI systems capable of answering complex legal questions from over 300 DIFC (Dubai International Financial Centre) legal documents. The prize pool was $32,000. 340 teams from around the world registered.

We entered as team “Neon Team”. This is the story of how we built our system, what worked, what we learned, and how the competition shaped what Vitreon Legal is today.

The Challenge

ARLC 2026 tested something specific: given a corpus of 300+ DIFC legal documents — legislation, regulations, court judgments, and practice directions — build an AI system that can answer legal questions accurately and cite its sources. The evaluation was automated: answers were scored on factual accuracy, completeness, and citation correctness.

This wasn't a general-knowledge quiz. The questions required understanding legal hierarchy (primary vs. secondary legislation), temporal reasoning (which version of a law applies), and cross-reference resolution (one regulation referencing another). Simple RAG wasn't going to cut it.

Our Approach

The core of our system was a multi-stage retrieval pipeline that we had been developing for Czech legal research. We adapted it for the DIFC corpus:

1. Hybrid Search with Reciprocal Rank Fusion

Every query was processed through both BM25 (lexical search) and dense vector search simultaneously. BM25 catches exact legal terminology — section numbers, specific defined terms, case names — that embedding models sometimes miss. Vector search catches semantic similarity when the question uses different words than the source text. Reciprocal Rank Fusion merged the two result sets, giving us the best of both.

2. Asymmetric Embedding

We used asymmetric embedding with different prefixes for queries and documents. This is critical for legal retrieval: a question like “What is the notice period for terminating a tenancy?” should match a passage that says “The landlord shall provide not less than 30 days written notice” — even though the surface-level wording is completely different. Asymmetric encoding optimizes for this query-to-passage matching rather than passage-to-passage similarity.

3. Cross-Encoder Reranking

After the initial retrieval stage returned candidate passages, we ran a cross-encoder model that jointly encoded each (query, passage) pair. This is computationally expensive but dramatically improves precision. In legal retrieval, the difference between a relevant passage and a merely similar one can be a single qualifying clause. The cross-encoder catches these distinctions.

4. Grounded Answer Generation

The top-ranked passages were provided to the LLM with strict instructions: generate answers only from the provided context, cite every claim with the source document and section, and explicitly state when the available documents don't contain an answer. This source-grounding constraint is what gives us 100% citation coverage in production.

Results

The competition had two rounds:

Round	Score	Ranking
Warmup	0.920	9th / 340 teams
Finals	0.719	4th / 80 teams

We scored 0.920 on the warmup round — 9th of 340 teams (G sub-score 0.957). The top 80 teams advanced to the finals. The warmup round tested the core retrieval and answer quality on a set of representative questions with known answers.

In the finals, the questions were significantly harder — requiring multi-document reasoning, temporal analysis across legislative amendments, and synthesis of conflicting provisions. We scored 0.719, placing 4th overall. The gap between warmup and finals performance is instructive: the most challenging legal questions require reasoning across documents, not just better retrieval of single passages.

What We Learned

The competition confirmed several things we had hypothesized but couldn't prove on internal benchmarks alone:

Hybrid search is non-negotiable for legal. Pure vector search misses exact statutory references. Pure BM25 misses semantic intent. You need both, fused properly. Our warmup score of 0.920 would not have been possible with either approach alone.

Cross-encoder reranking is the precision multiplier. The difference between top-10 and top-3 retrieval accuracy is where cross-encoders earn their compute cost. In legal research, returning the second-most-relevant provision instead of the correct one can change the answer entirely.

Source grounding prevents hallucination at scale. By architecturally constraining the LLM to only cite retrieved passages, we achieved 100% citation coverage across all competition submissions. This isn't a prompt trick — it's a system design decision.

Multi-document reasoning is the next frontier. Our warmup-to-finals drop shows where single-hop retrieval hits its limits. The hardest legal questions require chaining information across multiple documents. This is what we're investing in for the next generation of the platform.

From Competition to Product

After ARLC, we took the competition pipeline and scaled it. The DIFC corpus of 300+ documents became a starting point; we added Czech law (295,000+ court decisions, 6,800+ statutes), UK law, and Australian law. The core architecture remained the same: hybrid search, cross-encoder reranking, grounded generation.

The result is Vitreon Legal — a production platform that delivers the same quality of legal research that placed 4th in an international competition, accessible to any legal professional at vitreon.app.

For detailed benchmark numbers across all evaluations, see our Benchmarks page.

Note:Team “Neon Team” on the official ARLC 2026 leaderboard is Vitreon Legal. The team name was chosen before the product name was finalized.

How Vitreon Placed 4th in ARLC 2026