FD-NL2SQL | ACL 2026 Demo

Benchmark

Interactive metric view across strategies and models

Use the metric selector to focus on any reported benchmark metric. The table styling preserves the paper's green, red, and underline emphasis so the strongest highlighted values remain easy to scan.

Metric selector

Switch between the reported metrics or view the full table at once.

Full comparison table

Values reproduced from the paper. Metric emphasis mirrors the original coloring and underline markup.

Strategy	Models	chrF	eEM	eF1	AST	Conf	HM
Zero-Shot	Qwen3	81.82	26.90	46.51	72.54	39.01	35.58
	Gemma3	84.07	29.48	39.95	59.41	65.69	40.44
	SQL-R1	63.78	03.32	01.75	41.78	10.39	03.10
	gpt-5n	89.13	11.60	18.87	52.58	--	--
	gpt-5m	94.64	38.33	47.36	51.21	--	--
Few-Shot	Qwen3	83.02	25.60	52.95	80.43	34.70	34.58
	Gemma3	91.32	30.85	45.72	85.47	53.16	41.04
	gpt-5n	89.58	15.80	31.97	75.53	--	--
	gpt-5m	94.52	31.27	46.85	83.03	--	--
CoT	Qwen3	77.96	21.65	29.73	85.21	00.00	00.00
	Gemma3	77.89	13.47	34.88	00.00	00.25	00.73
	SQL-R1	50.24	02.85	03.35	49.15	01.69	02.42
	gpt-5n	86.93	13.67	33.47	77.56	--	--
	gpt-5m	86.28	10.33	19.87	82.02	--	--
Ours	Qwen3	92.02	40.93	55.57	86.36	66.30	52.16
	Gemma3	92.62	32.60	44.48	86.21	70.48	44.55
	gpt-5n	90.89	32.53	50.25	86.32	--	--
	gpt-5m	92.68	39.20	55.85	87.14	--	--

Overview

Helping clinicians query oncology trial databases without writing SQL

The paper frames FD-NL2SQL as a bridge between oncology expertise and structured data access. Instead of forcing clinicians to learn database schemas, the system accepts natural-language questions and returns executable SQL plus the exact rows that were retrieved.

Why this problem matters

Clinical evidence review often depends on ad hoc, multi-constraint questions over cancer type, checkpoint inhibitor class, biomarkers, follow-up windows, endpoints, treatment combinations, and trial phase. Such questions are difficult to express reliably through keyword search and tedious to encode in SQL.

FD-NL2SQL targets this gap with a domain-aware pipeline built specifically for oncology trial data, where brittle or clinically implausible queries are costly and transparency matters.

Main contributions

Schema-aware, predicate-level decomposition for complex clinical questions.
Retrieval-guided SQL synthesis grounded in expert-verified NL2SQL exemplars.
A feedback loop that grows the exemplar bank through expert approval and safe SQL augmentation.

The interface exposes the decomposition, retrieved exemplars, synthesized SQL, and execution results so users can inspect and refine the whole trace.

Workflow

A five-step pipeline designed to improve with use

The system combines LLM reasoning, retrieval over an evolving exemplar bank, and lightweight execution guards. Each stage is visible in the demo so experts can audit how the final SQL was formed.

Figure: Introductory mechanism diagram from the paper, showing how FD-NL2SQL decomposes clinician questions, retrieves expert-approved exemplars, synthesizes final SQL, and grows the exemplar bank through expert verification and logic-based SQL augmentation.

01

Schema grounding

FD-NL2SQL first inspects the SQLite schema to collect tables, columns, types, and join keys. That schema context is injected into prompting and reused for post-generation validation.
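
The schema inspection step can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the paper's implementation: the function names `collect_schema` and `schema_to_prompt`, the prompt layout, and the toy `trials`/`outcomes` tables are all assumptions made for the example.

```python
import sqlite3

def collect_schema(conn: sqlite3.Connection) -> dict:
    """Map each table to its (column, type) pairs and foreign-key join keys."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = [(r[1], r[2]) for r in conn.execute(f"PRAGMA table_info({quoted})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = [(r[3], r[2], r[4]) for r in conn.execute(f"PRAGMA foreign_key_list({quoted})")]
        schema[table] = {"columns": cols, "foreign_keys": fks}
    return schema

def schema_to_prompt(schema: dict) -> str:
    """Render the schema as compact text (hypothetical layout) for an LLM prompt."""
    lines = []
    for table, info in schema.items():
        cols = ", ".join(f"{name} {ctype}".strip() for name, ctype in info["columns"])
        lines.append(f"{table}({cols})")
        for col, ref_table, ref_col in info["foreign_keys"]:
            lines.append(f"  {table}.{col} -> {ref_table}.{ref_col}")
    return "\n".join(lines)

# Toy database standing in for the oncology trial schema (assumed, not from the paper).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trials (trial_id INTEGER PRIMARY KEY, cancer_type TEXT, phase TEXT);
    CREATE TABLE outcomes (
        outcome_id INTEGER PRIMARY KEY,
        trial_id INTEGER REFERENCES trials(trial_id),
        endpoint TEXT,
        hazard_ratio REAL
    );
""")
schema = collect_schema(conn)
print(schema_to_prompt(schema))
```

Keeping the schema as a plain dictionary lets the same structure feed both the prompt text and any later check that generated SQL only references known tables and columns.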

02

Predicate decomposition and retrieval

A first LLM breaks the user question into WHERE-oriented sub-questions, each representing one atomic predicate. Sentence-BERT retrieval then finds semantically similar expert-approved exemplars for each predicate.
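
Per-predicate retrieval can be sketched as a nearest-neighbor search over exemplar questions. The page names Sentence-BERT as the encoder; to keep this sketch dependency-free, a toy bag-of-words vector (`toy_embed`) stands in for it, so the similarity scores here are lexical rather than semantic. The exemplar bank, the sub-questions, and the `retrieve` function are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a sentence encoder: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(sub_question, bank, k=2, embed=toy_embed):
    """Return up to k exemplars whose question is most similar to one sub-question."""
    query = embed(sub_question)
    scored = [(cosine(query, embed(ex["question"])), ex) for ex in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for score, ex in scored[:k] if score > 0]

# Hypothetical expert-approved exemplars (question/SQL pairs).
bank = [
    {"question": "trials in phase 3",
     "sql": "SELECT * FROM trials WHERE phase = '3'"},
    {"question": "trials for melanoma",
     "sql": "SELECT * FROM trials WHERE cancer_type = 'melanoma'"},
    {"question": "trials reporting overall survival",
     "sql": "SELECT t.* FROM trials t JOIN outcomes o ON o.trial_id = t.trial_id "
            "WHERE o.endpoint = 'OS'"},
]

# Hypothetical output of the decomposition step: one sub-question per atomic predicate.
sub_questions = ["Is the cancer type melanoma?", "Is the trial in phase 3?"]
bundle = {sq: retrieve(sq, bank, k=1) for sq in sub_questions}
for sq, hits in bundle.items():
    print(sq, "->", [h["question"] for h in hits])
```

Because `embed` is a parameter, a real sentence encoder could be swapped in without changing the retrieval logic, provided `cosine` is adapted to dense vectors.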

03

Retrieval-guided SQL synthesis

A second LLM synthesizes the final SQL using the original question, the predicate decomposition, the retrieved exemplar bundle, and the schema. The output is constrained to a single executable SELECT or WITH query.

04

Expert approval in the loop

Users can accept, reject, or directly edit the generated SQL. Approved or corrected SQL is written back to the exemplar bank so future retrieval neighborhoods become more clinically aligned.

05

Safe exemplar bank expansion

The paper also grows the exemplar bank automatically through single-step SQL mutations such as operator changes, value edits, or type-compatible column substitutions. A mutation is kept only when it executes successfully and returns non-empty results; surviving mutations are then back-translated into new question-SQL pairs.
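
The execute-and-keep filter for mutated queries can be sketched as follows. This is an illustration under assumptions: the `trials` table, the seed query, and the three candidate mutations are invented for the example, and the back-translation step (which needs an LLM) is not shown.

```python
import sqlite3

def keep_mutation(conn: sqlite3.Connection, sql: str) -> bool:
    """Keep a mutated query only if it executes and returns at least one row."""
    try:
        row = conn.execute(sql).fetchone()
    except sqlite3.Error:
        return False  # failed to compile or run
    return row is not None

# Toy data (assumed): three trials with different enrollment sizes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (trial_id INTEGER PRIMARY KEY, enrollment INTEGER)")
conn.executemany("INSERT INTO trials VALUES (?, ?)", [(1, 120), (2, 500), (3, 900)])

seed = "SELECT trial_id FROM trials WHERE enrollment >= 500"
candidates = [
    seed.replace(">=", ">", 1),                 # operator change, still returns rows
    seed.replace("500", "5000", 1),             # value edit that returns no rows
    seed.replace("enrollment", "enrolment", 1)  # unknown column, fails to execute
]
kept = [sql for sql in candidates if keep_mutation(conn, sql)]
print(kept)
```

Only the operator change survives here: the value edit executes but returns nothing, and the misspelled column raises an error, so both are dropped.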

What the demo interface shows

The paper highlights a transparent clinician workflow: natural-language input, decomposition traces, retrieved exemplar evidence, editable SQL, and executed table results in one interface. That trace accounts for much of the project's practical value, because users can tighten, drop, or revise constraints without starting over.

Results

Benchmark setup and quantitative takeaways

The evaluation starts from 500 Mayo Clinic expert-authored seed questions and expands them with single atomic edits, retaining only read-only SQL that executes successfully and returns non-empty results. After validation and natural-language generation for augmented queries, the paper reports a 1,500-sample NL2SQL benchmark.

Evaluation setup

Seed bank: 500 expert-authored oncology evidence-review questions paired with verified SQLite SQL.
Benchmark growth: projection edits, predicate dropping, and value edits, kept only when the resulting SQL executes successfully and returns non-empty results.
Baselines: zero-shot, few-shot, and chain-of-thought prompting across Qwen3, Gemma3, GPT-5 nano, and GPT-5 mini, plus SQL-R1 (reported in the table under zero-shot and chain-of-thought).
Metrics: exact execution match (eEM), execution F1 (eF1), chrF, AST similarity (AST), and confidence (Conf) when available.
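
The HM column in the comparison table is not defined on this page. The reported values are numerically consistent with the harmonic mean of eEM, eF1, and Conf, which would also explain why HM is missing wherever Conf is missing. Treat that reading as an inference from the numbers rather than a statement from the paper. The check below recomputes HM for a few rows copied from the table; the zero-handling rule in `harmonic_mean` is an assumption.

```python
def harmonic_mean(values):
    """Harmonic mean; taken to be 0 when any component is 0 (assumption)."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

# (eEM, eF1, Conf, reported HM), copied from the comparison table.
rows = {
    "Zero-Shot Qwen3": (26.90, 46.51, 39.01, 35.58),
    "Zero-Shot Gemma3": (29.48, 39.95, 65.69, 40.44),
    "CoT Qwen3": (21.65, 29.73, 0.00, 0.00),
    "Ours Qwen3": (40.93, 55.57, 66.30, 52.16),
    "Ours Gemma3": (32.60, 44.48, 70.48, 44.55),
}

computed = {}
for name, (eem, ef1, conf, reported) in rows.items():
    computed[name] = harmonic_mean([eem, ef1, conf])
    print(f"{name}: computed {computed[name]:.2f}, reported {reported:.2f}")
```

A harmonic mean is dominated by its smallest component, so a method has to do well on all three metrics to score well under this reading.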

Key findings

The paper reports that FD-NL2SQL consistently improves execution-based correctness over prompting-only baselines, with the largest exact-execution gains on Qwen3 and GPT-5 nano.

Qwen3 eEM rises from 26.90 zero-shot and 25.60 few-shot to 40.93 with FD-NL2SQL.
GPT-5 nano eEM rises from 11.60 zero-shot and 15.80 few-shot to 32.53 with FD-NL2SQL.
Gemma3 reaches 92.62 chrF and 86.21 AST under the proposed method.
The paper argues that decomposition plus exemplar retrieval narrows reliability gaps between backbones.

Reliability

Designed for auditability, safety, and human oversight

The paper emphasizes that reliability in medical-domain NL2SQL is not only about average accuracy. It is also about avoiding silent failures, surfacing evidence, and keeping experts in control of execution.

Execution safeguards

Read-only query policy restricted to executable SELECT or WITH statements.
Schema validation and post-generation checks before execution.
Timeout protection during database execution.
Diagnostic flags for non-read-only SQL, multiple statements, and LIMIT without ORDER BY.
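
The safeguards listed above can be approximated with a few string checks and SQLite's progress handler. This is a rough sketch, not the system's actual guard code: `diagnose` uses regular-expression heuristics rather than a SQL parser (so it can misfire on unusual queries), and the flag names, keyword list, and `run_with_timeout` helper are assumptions.

```python
import re
import sqlite3
import time

WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH|DETACH|VACUUM|PRAGMA)\b",
    re.IGNORECASE,
)

def diagnose(sql: str) -> list:
    """Return diagnostic flags for a generated query (heuristic string checks)."""
    flags = []
    # Blank out string literals so keywords or semicolons inside values are ignored.
    bare = re.sub(r"'(?:[^']|'')*'", "''", sql).strip()
    body = bare.rstrip(";").strip()
    if ";" in body:
        flags.append("multiple_statements")
    starts_ok = re.match(r"(SELECT|WITH)\b", body, re.IGNORECASE)
    if not starts_ok or WRITE_KEYWORDS.search(body):
        flags.append("not_read_only")
    has_limit = re.search(r"\bLIMIT\b", body, re.IGNORECASE)
    has_order = re.search(r"\bORDER\s+BY\b", body, re.IGNORECASE)
    if has_limit and not has_order:
        flags.append("limit_without_order_by")
    return flags

def run_with_timeout(conn: sqlite3.Connection, sql: str, seconds: float = 2.0):
    """Execute a query, interrupting it once the time budget is spent."""
    deadline = time.monotonic() + seconds
    # A truthy return value from the progress handler makes SQLite abort the query.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 1000)

print(diagnose("SELECT * FROM trials LIMIT 5"))
print(diagnose("SELECT 1; DROP TABLE trials"))

conn = sqlite3.connect(":memory:")
runaway = ("WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
           "SELECT count(*) FROM c")
try:
    run_with_timeout(conn, runaway, seconds=0.05)
    timed_out = False
except sqlite3.OperationalError:
    timed_out = True  # SQLite reports the interrupted query as an error
print("timed out:", timed_out)
```

The keyword scan matters because SQLite accepts a `WITH` clause in front of `INSERT`, `UPDATE`, and `DELETE`, so checking only the first keyword would not be enough to enforce a read-only policy.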

Human approval loop

Experts can accept, edit, or reject generated SQL directly in the interface.
Approved edits are embedded back into the exemplar bank for future retrieval.
Low-confidence or borderline generations can be reviewed with the full decomposition trace.

Ethics and scope

Intended for oncology evidence review and research, not patient-specific diagnosis or treatment decisions.
Described as operating on public Mayo Clinic clinical trial data without patient-level PHI or PII.
Built to augment expert judgment rather than replace it.