The clinical research industry stands at a decision point with large language models. The technology to extract structured, CDISC-compliant data from unstructured medical records at machine scale now exists, with peer-reviewed validation demonstrating human-level accuracy across oncology, neurology, and ophthalmology. But most deployments fail not because the models are inadequate, but because sponsors treat chart review as a model selection problem rather than a data infrastructure problem. A single off-the-shelf LLM cannot satisfy regulatory requirements for traceability, deterministic output, and human oversight. A validated pipeline combining multi-model consensus, strict extraction guardrails, and mandatory human-in-the-loop review can. The question is not whether to trust LLMs with chart review. The question is whether the deployment framework earns that trust.
Manual chart review scales linearly with cost. A time-motion study (Arndt et al., 2017) and subsequent EHR workload assessments confirm that a highly trained clinical reviewer takes an average of 39 minutes to fully review a complex patient chart.[1] The industry is reaching a breaking point in clinical informatics. Hospitals and trial sites sit on mountains of data, but accessing it requires a small army of specialized abstractors.
This is not a data generation problem. It is an extraction problem. Unstructured clinical notes, discharge summaries, and surgical pathology reports hold the richest evidence. Legacy natural language processing could not parse this clinical context. Older systems could spot a keyword like “metastasis” but failed to understand the patient timeline. They could not distinguish a historical family condition from a hypothetical differential diagnosis or a current, active progression.[2]
The technological landscape has shifted. Large language models are emerging as viable technologies to automate and structure this information. The goal is machine scale: processing 10,000 patient charts in the time it takes a human to review 10, while maintaining a consistent mapping to an electronic data capture system.
The industry is no longer testing whether an AI can read a chart. Researchers are rigorously evaluating whether complex AI pipelines can match the reliability of human experts across massive populations. The data shows they can, provided sponsors use the right framework.
Tools that work flawlessly in a clinical setting do not automatically translate to research. There are ambient scribing solutions that do an excellent job summarizing physician notes for billing. However, clinical research requires strict, deterministic extraction mapped to the Clinical Data Interchange Standards Consortium (CDISC) and its Study Data Tabulation Model (SDTM). A summary does not help a biostatistician. They need structured variables.
Recent frameworks, such as the LLM-powered cross-study harmonization model detailed by Garg et al. (2026), demonstrate that LLMs can map heterogeneous, unstructured clinical trial data into strict CDISC SDTM formats when paired with a rules-based engine.[3]
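To make the "rules-based engine" half of that pairing concrete, here is a minimal sketch of how extracted values might be routed into SDTM-style records. The mapping table, variable subset, and function names are illustrative assumptions for this article, not the harmonization model from the cited study or an official CDISC implementation.

```python
# Minimal sketch: routing LLM-extracted values into SDTM-style rows.
# The mapping table below is an illustrative subset, not an official
# CDISC controlled-terminology implementation.

SDTM_RULES = {
    "primary_diagnosis": {"domain": "MH", "variable": "MHTERM"},
    "diagnosis_date":    {"domain": "MH", "variable": "MHSTDTC"},
    "tumor_stage":       {"domain": "TU", "variable": "TUORRES"},
}

def to_sdtm_rows(study_id: str, subject_id: str, extracted: dict) -> list[dict]:
    """Convert a flat dict of extracted values into SDTM-like records."""
    rows = []
    for field, value in extracted.items():
        rule = SDTM_RULES.get(field)
        if rule is None or value in (None, ""):
            continue  # unknown or empty fields never reach the database
        rows.append({
            "STUDYID": study_id,
            "USUBJID": subject_id,
            "DOMAIN": rule["domain"],
            rule["variable"]: value,
        })
    return rows

print(to_sdtm_rows("STUDY-01", "SUBJ-001", {
    "primary_diagnosis": "Non-small cell lung cancer",
    "tumor_stage": "Stage III",
}))
```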
Historically, cost and scale acted as massive barriers. Legacy machine learning required hundreds of manually labeled training examples per variable. It was cost-prohibitive to train a custom ML model for a small rare disease registry or a narrow observational study. Modern foundation models bypass this training bottleneck, but they still require strict infrastructure to force outputs into compliant formats.
The industry is navigating a dangerous illusion in clinical AI. Getting a model to extract a primary diagnosis is simple. Getting a model to track complex longitudinal disease progression without hallucinating is exceptionally hard. This phenomenon is known as the 80/20 trap.
The easy 80 percent
Modern foundation models handle explicit diagnoses with ease. They possess massive context windows. They can ingest hundreds of pages of raw medical text in seconds. If a clinician clearly writes “Patient has Stage III non-small cell lung cancer,” these models will extract that data point perfectly.[4]
The difficult 20 percent
The true return on investment lies in implicit data. Real clinical notes are messy. Doctors use non-standard abbreviations. They copy and paste old notes. A patient’s true status might be buried in a conflicting surgical pathology report. This final 20 percent requires deep clinical reasoning to resolve chronological conflicts and infer missing context.
The proprietary consensus solution
Relying on a single, off-the-shelf LLM is a massive regulatory risk. A strong system uses a multi-model consensus approach within a proprietary pipeline. It relies heavily on Retrieval-Augmented Generation (RAG) to anchor the model’s outputs in factual medical coding standards.[5]
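As a rough illustration of the consensus idea, the sketch below takes several independent extraction runs over the same note and only accepts a value when enough of them agree. The agreement threshold and the decision to leave disputed fields blank are assumptions for the example, not a published specification of any vendor's pipeline.

```python
from collections import Counter

# Sketch of a consensus vote across independent extraction runs.
# The threshold and the "leave blank on disagreement" rule are
# illustrative assumptions, not a documented product behavior.

def consensus(extractions: list[str | None], min_agreement: int = 2):
    """Return a value only if enough independent runs agree on it."""
    votes = Counter(v for v in extractions if v is not None)
    if not votes:
        return None, "no_extraction"
    value, count = votes.most_common(1)[0]
    if count >= min_agreement:
        return value, "consensus"
    return None, "flag_for_human_review"  # disagreement goes to a reviewer

# Three independent runs over the same note; two agree, one differs.
print(consensus(["Stage III", "Stage III", "Stage IIIB"]))
# -> ('Stage III', 'consensus')
```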
Data extraction must be forced into strict, machine-readable formats. The pipeline must generate a validated JSON or XML output that maps perfectly to an electronic data capture system.[6] This structured approach prevents the model from injecting creative but factually incorrect assumptions into the clinical database.
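One common way to enforce that constraint is schema validation on every model response before it touches the database. The sketch below assumes the open-source jsonschema package; the field names, enum values, and date pattern are illustrative, not a Castor schema.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Sketch: reject any model output that does not match the expected schema.
# Field names and allowed values are illustrative assumptions.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "diagnosis": {"type": ["string", "null"]},
        "stage": {"enum": ["I", "II", "III", "IV", None]},
        "diagnosis_date": {
            "type": ["string", "null"],
            "pattern": r"^\d{4}-\d{2}-\d{2}$",  # ISO 8601 dates only
        },
    },
    "required": ["diagnosis", "stage", "diagnosis_date"],
    "additionalProperties": False,  # anything the model invents is rejected
}

def validate_output(model_output: dict) -> bool:
    """Return True only for outputs that conform exactly to the schema."""
    try:
        validate(instance=model_output, schema=EXTRACTION_SCHEMA)
        return True
    except ValidationError:
        return False  # route back for re-extraction or human review

print(validate_output({"diagnosis": "NSCLC", "stage": "III", "diagnosis_date": "2023-01-28"}))
```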
Sponsors cannot rely on vendor marketing claims to evaluate AI in healthcare. They must look at peer-reviewed validation. The academic literature published between 2025 and 2026 provides strong evidence that proprietary LLM pipelines can match human expertise.
Flatiron Health and the VALID framework
Oncology extraction is famously complex because cancer progression is rarely defined by a single structured code. Flatiron Health developed the VALID framework to benchmark AI extractions directly against expert human reviewers. As detailed by Estevez et al. (2026), this authoritative framework rests on three pillars: variable-level metrics against human abstractors, automated verification checks to flag logical inconsistencies, and rigorous replication analyses to validate cohort-level clinical findings.[7] Their models achieved F1 scores comparable to human experts across 14 different cancer types.
Surgical pathology automation
A 2025 study published in JMIR Formative Research demonstrated how LLMs automate data extraction from highly complex surgical pathology reports.[2] The researchers showed that, when placed inside proper regulatory safeguards, modern models successfully parsed dense pathological narratives into structured variables.
Multiple sclerosis structured extraction
Complex neurological conditions present unique timeline challenges. A 2026 study in Frontiers in AI tackled the lack of standardization in multiple sclerosis outpatient reports.[5] The researchers developed a consensus approach across multiple models. By cross-validating extractions, they achieved human-level reliability in converting messy retrospective data into a highly structured research format.
Verana Health’s invisible cohort expansion
Geographic atrophy is a late-stage eye disease that historically lacked a specific billing code. Because it was not coded properly, patients were invisible in standard databases. Verana Health deployed multimodal machine learning to analyze both unstructured clinical notes and ophthalmic imaging. By looking for implicit textual signals and visual markers, they expanded their analyzable patient cohort from 330,000 to over 810,000, as presented at the Association for Research in Vision and Ophthalmology (ARVO) annual meeting.[8] They effectively uncovered nearly half a million hidden patients.
Castor Catalyst is a general-purpose, AI-powered clinical data extraction engine. Its capabilities make it an exceptionally good fit for real-world evidence and direct-to-patient workflows, but at its core it is a high-throughput data engine. It supports site-based workflows, allowing sites to upload PDFs directly or use auto-document uploads for faster data extraction at the site level.
The process starts by pulling unstructured data directly from electronic medical records. This happens via secure, HIPAA-compliant FHIR APIs and direct site document uploads. All processing occurs within a secure boundary that complies with GDPR and 21 CFR Part 11. No protected health information ever leaks into public model training sets.
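For readers unfamiliar with FHIR, the sketch below shows the generic R4 pattern for pulling a patient's clinical documents (DocumentReference resources). The base URL and bearer token are placeholders and this is not Castor's actual integration code; a production pipeline would run inside the vendor's compliant boundary rather than a standalone script.

```python
import requests

# Generic FHIR R4 sketch: fetch clinical documents for one patient.
# FHIR_BASE and the bearer token are placeholders, not real endpoints.
FHIR_BASE = "https://fhir.example-hospital.org/R4"   # placeholder
HEADERS = {"Authorization": "Bearer <access-token>"}  # placeholder

def fetch_clinical_notes(patient_id: str) -> list[dict]:
    """Return current DocumentReference resources (e.g. discharge summaries)."""
    resp = requests.get(
        f"{FHIR_BASE}/DocumentReference",
        params={"patient": patient_id, "status": "current"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    bundle = resp.json()
    return [entry["resource"] for entry in bundle.get("entry", [])]
```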
The engine parses unstructured narrative text and begins mapping the implicit and explicit clinical signals. It extracts these variables and structures them for downstream analysis.
Castor Catalyst intelligently maps each extracted data point to a structured Case Report Form (CRF) field. Best practices call for rich annotations at the CRF level, such as embedding specific clinical definitions, inclusion constraints, or expected value ranges in the metadata. This context maximizes the speed and accuracy of the AI extraction.
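The sketch below shows what that kind of CRF-level annotation can look like in practice. The field name, definition, and value range are illustrative examples only, not Castor's internal data model.

```python
from dataclasses import dataclass

# Sketch of CRF-level metadata: each field carries a clinical definition
# and constraints that travel with the form. Values are illustrative.

@dataclass
class CRFField:
    name: str
    definition: str            # clinical definition given to the extraction engine
    unit: str | None = None
    min_value: float | None = None
    max_value: float | None = None

ldh_field = CRFField(
    name="LDH",
    definition="Most recent serum lactate dehydrogenase prior to first treatment",
    unit="U/L",
    min_value=50,
    max_value=5000,
)
```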
The proprietary pipeline scores extraction confidence. It flags edge cases and low-confidence extractions for a mandatory human-in-the-loop review. This guarantees that complex clinical reasoning is always verified by an expert.
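In pseudocode terms, that routing step can be as simple as a confidence gate. The threshold value below is an assumption for illustration, not a documented Castor setting.

```python
# Sketch of the confidence gate described above; the threshold is assumed.
REVIEW_THRESHOLD = 0.90

def route_extraction(value, confidence: float) -> dict:
    """Accept high-confidence values; send everything else to a human reviewer."""
    if value is None or confidence < REVIEW_THRESHOLD:
        return {"value": None, "status": "needs_human_review"}
    return {"value": value, "status": "auto_accepted", "confidence": confidence}

print(route_extraction("Stage III", 0.97))  # auto-accepted
print(route_extraction("Stage III", 0.62))  # flagged for review
```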
The final step forces the data into strict formats. These structured data points flow natively into the Castor electronic data capture (EDC) system, ready for immediate biostatistical analysis.
Traditional human review is slow and expensive. A time-motion study (Arndt et al., 2017) and subsequent EHR workload assessments confirm that a highly trained clinical reviewer takes an average of 39 minutes to fully review a complex patient chart.[1] When factoring in specialized labor rates (roughly $1.25 per minute), the cost averages out to $48.75 per individual chart.
The Catalyst pipeline alters this equation. By automating the extraction of unstructured notes, the manual processing time drops from 39 minutes to roughly 6 minutes per chart. This 6-minute window represents the time required for a human expert to quickly verify the AI’s flagged edge cases.
This hybrid approach reduces the per-chart cost from $48.75 down to $9.75. This breaks down to $7.50 for the 6 minutes of human review, plus a $2.25 AI/software processing fee. That is an 80 percent reduction in extraction costs.
This changes financial modeling. Instead of limiting a study to 500 patients due to budget constraints ($24,375), sponsors can analyze 2,500 patients for the exact same cost. They gain statistical power without sacrificing data integrity.
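The arithmetic behind those figures is simple enough to capture in a small cost model, using only the per-minute rate, review times, and AI fee stated above.

```python
# Cost model built from the figures stated in the text.
MANUAL_MIN_PER_CHART = 39
HYBRID_REVIEW_MIN_PER_CHART = 6
LABOR_RATE_PER_MIN = 1.25
AI_FEE_PER_CHART = 2.25

def per_chart_cost(hybrid: bool) -> float:
    if hybrid:
        return HYBRID_REVIEW_MIN_PER_CHART * LABOR_RATE_PER_MIN + AI_FEE_PER_CHART
    return MANUAL_MIN_PER_CHART * LABOR_RATE_PER_MIN

budget = 500 * per_chart_cost(hybrid=False)        # $24,375 manual budget
print(per_chart_cost(hybrid=False))                # 48.75
print(per_chart_cost(hybrid=True))                 # 9.75
print(int(budget / per_chart_cost(hybrid=True)))   # 2500 charts for the same spend
```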
Evaluating AI vendors requires looking past the user interface and understanding the underlying data architecture.
| Approach | The promise | Data handling | The reality |
|---|---|---|---|
| Legacy EDC transcription | Familiar manual workflows | Sites manually type data from the EMR into the EDC | Creates massive site burden. Expensive, slow, and prone to human transcription errors. |
| Off-the-shelf LLMs | Incredible reasoning and instant setup | Users paste text into a browser | Massive privacy violations. Models hallucinate data when confused and return inconsistent text formats that require manual cleaning before database entry. |
| The Castor Catalyst pipeline | Regulatory-grade evidence generated at machine scale | Secure ingestion mapped through proprietary extraction pipelines and intelligent CRF mapping | Structured data drops directly into a compliant EDC. Sponsors get human-level accuracy with clear audit trails for every extracted variable. |
The pipeline approach wins because it treats AI as an infrastructure problem. It combines raw reasoning power with the rigid data constraints required by clinical researchers.
Choosing the right technology partner is a critical decision. Use these specific criteria when evaluating any AI extraction platform. Focus on the workflow, not open-source model claims.
1. Is there a mandatory human in the loop? Vendors claiming complete autonomous accuracy misrepresent the probabilistic nature of LLMs. The FDA’s Draft Guidance on AI in Drug Development relies on a Risk-Based Credibility Assessment Framework. A mandatory human-in-the-loop workflow directly mitigates what the FDA defines as “decision consequence risk” by ensuring final verification rests with a qualified expert.[9] Castor Catalyst approach: The pipeline uses confidence scoring. If the model is not highly confident in an extraction, it leaves the field blank and flags it for human review.
2. Can the vendor provide objective benchmark data? Organizations cannot submit black-box AI data to a regulator. They must demand proof of accuracy. Castor Catalyst approach: Extractions are benchmarked directly against human gold standards. Every extracted variable includes an audit trail linking back to the source document.
3. Does the system natively connect to an EDC? Extracting data into a spreadsheet creates a data management headache. Analysts still have to map it, clean it, and import it into a compliant database. Castor Catalyst approach: Catalyst is natively integrated with the Castor EDC. The AI pushes structured data directly into the electronic case report forms. It will soon include support for ODM exports.
Pre-sales engagements with sponsors have exposed the operational realities of implementing AI extraction. Any vendor offering this technology must have clear answers to these implementation hurdles.
Contracting and data sovereignty
Navigating Business Associate Agreements (BAAs) is complex. Sponsors require strict guarantees about where their data lives. Castor EDC utilizes a multi-region architecture. Technical controls pin the data to a specific geography (EU, US, AUS, or UK), ensuring medical research data never crosses unintended borders. Private instances of the technology are also available.
Redaction and privacy
Before unstructured notes enter an extraction pipeline, they require de-identification. Patient privacy must be protected at the ingestion layer, stripping out names, exact addresses, and non-essential identifiers while preserving the clinical narrative.
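A highly simplified sketch of ingestion-layer redaction follows. A regex pass like this is only a stand-in for a validated de-identification service; the patterns and replacement tokens are illustrative assumptions.

```python
import re

# Simplified redaction sketch: strip direct identifiers, keep the narrative.
# These patterns are a toy stand-in for a validated de-identification service.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),               # US SSNs
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),          # exact dates
    (re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE), "[MRN]"),   # record numbers
]

def redact(note: str) -> str:
    """Replace direct identifiers while leaving the clinical text intact."""
    for pattern, token in PHI_PATTERNS:
        note = pattern.sub(token, note)
    return note

print(redact("MRN: 884213. Seen on 03/14/2024 for progressive dyspnea."))
```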
Site contracts and payments
Shifting to automated extraction changes how data entry tasks are distributed across a study. When chart review is automated, the manual transcription volume at the site decreases. This is worth addressing in site agreement conversations, but the framing matters. Sites remain essential research partners throughout every study. Chart review, specifically, is rarely a high-value activity for sites, as it is time-consuming and not scientifically engaging for clinical teams. A productive approach is to revisit how agreements account for the changing mix of activities, with the goal of aligning site compensation with the work where their expertise and relationships add the most value. Sponsors that approach this conversation collaboratively tend to see stronger site relationships and more consistent study performance.
Moving to an AI-driven pipeline requires collaborative change management as part of broader clinical trial solutions adoption. Customers cannot do this alone. Here is the Castor onboarding process for maximizing extraction quality.
First, the onboarding team works with the sponsor to assess what data can be extracted. The team identifies the available sources and determines if a site-based workflow (e.g., sites uploading source PDFs) or direct data acquisition is the optimal path.
Example files are secured from the sponsor. The team uses these to produce synthetic training files or build the initial test dataset. This ensures the engine is calibrated to the exact documentation style of the specific therapeutic area.
Baseline benchmark data is generated. The Castor team works alongside the customer to compare the automated extractions against a human gold standard, establishing the F1 scores and precision metrics before going live.
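A minimal version of that variable-level benchmark looks like the sketch below: align automated extractions against gold-standard values for one variable and report precision, recall, and F1. The counting rules here are a generic simplification, not the exact metric definitions used in any cited study.

```python
# Generic variable-level benchmark against a human gold standard.
def benchmark(gold: list[str | None], predicted: list[str | None]) -> dict:
    tp = sum(1 for g, p in zip(gold, predicted) if p is not None and p == g)
    fp = sum(1 for g, p in zip(gold, predicted) if p is not None and p != g)
    fn = sum(1 for g, p in zip(gold, predicted) if g is not None and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gold      = ["Stage III", "Stage I", None,       "Stage IV"]
predicted = ["Stage III", "Stage I", "Stage II", None]
print(benchmark(gold, predicted))  # precision, recall, and F1 all 0.67 here
```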
Finally, the workflow is optimized to get the highest possible data extraction quality. Specific input documents are mapped to designated case report forms (CRFs) and the visit structure is refined. This maximizes both quality and overall throughput.
The clinical AI landscape is moving fast. Bookmark these external resources to stay current on regulatory guidelines and scientific benchmarks.
Stop wasting months on manual data entry. Castor Catalyst combines the raw power of modern AI engines with the strict regulatory compliance of a purpose-built EDC.
Is patient data used to train the AI models?
No. Enterprise AI pipelines operate within secure, isolated environments. While synthetic training files or test datasets may be used temporarily for in-context learning (few-shot prompting) to guide the model’s immediate formatting, the data is never used to train the base-model weights. Patient data is processed in memory to answer a specific prompt and is then immediately discarded, satisfying stringent InfoSec requirements.
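For readers unfamiliar with few-shot prompting, the sketch below shows the idea: prior examples sit inside the prompt to steer output formatting, and nothing updates any model weights. The example pairs and the send_to_model() call are placeholders, not part of any real system.

```python
# Sketch of in-context (few-shot) learning: examples live in the prompt only.
# The example pairs and send_to_model() are illustrative placeholders.
FEW_SHOT_EXAMPLES = [
    ("Pt dx'd w/ NSCLC, stage IIIA, 2023.",
     '{"diagnosis": "Non-small cell lung cancer", "stage": "IIIA"}'),
    ("No evidence of malignancy on biopsy.",
     '{"diagnosis": null, "stage": null}'),
]

def build_prompt(new_note: str) -> str:
    parts = ["Extract diagnosis and stage as JSON. Return null when not stated.\n"]
    for note, expected in FEW_SHOT_EXAMPLES:
        parts.append(f"Note: {note}\nJSON: {expected}\n")
    parts.append(f"Note: {new_note}\nJSON:")
    return "\n".join(parts)

prompt = build_prompt("Surgical pathology confirms stage II adenocarcinoma of the colon.")
# response = send_to_model(prompt)  # placeholder: any hosted or private LLM endpoint
print(prompt)
```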
How quickly can a new study go live?
Unlike legacy NLP systems that require months of custom rule engineering, modern AI pipelines can be configured quickly. Once your data dictionaries and prompts are mapped, a new study can typically launch in under two weeks.
Can the pipeline read scanned documents and handwritten notes?
Yes. Modern multimodal models natively process high-resolution images. They can read scanned faxes, messy handwritten physician notes, and complex laboratory printouts with high accuracy.
What happens when source documents contradict each other?
Clinical records frequently contradict themselves. The pipeline handles this through strict prompting instructions. You can instruct the model to always prioritize a surgical pathology report over a general practitioner’s discharge summary when evaluating a specific diagnosis date.
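One simple way to express that rule outside the prompt is an explicit source-priority ordering, as in the sketch below. The document types, ranking, and field names are illustrative assumptions.

```python
# Sketch of a source-priority rule: the most authoritative document type wins.
# The ordering and field names are illustrative assumptions.
SOURCE_PRIORITY = {"surgical_pathology": 3, "specialist_note": 2, "discharge_summary": 1}

def resolve_conflict(candidates: list[dict]) -> dict | None:
    """Pick the candidate value coming from the highest-priority document type."""
    ranked = sorted(candidates, key=lambda c: SOURCE_PRIORITY.get(c["source"], 0), reverse=True)
    return ranked[0] if ranked else None

print(resolve_conflict([
    {"source": "discharge_summary", "diagnosis_date": "2023-02-10"},
    {"source": "surgical_pathology", "diagnosis_date": "2023-01-28"},
]))
```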
Chart review: The systematic process of extracting structured clinical data from patient medical records, including physician notes, discharge summaries, surgical pathology reports, and laboratory results, for use in structured research and real-world evidence databases. Chart review underpins retrospective studies, registry-based trials, and post-market surveillance programs where prospective data collection is not feasible.
F1 score: A statistical measure of a model’s accuracy that combines both precision (how many extracted items were correct) and recall (how many of the total correct items were actually found).
Foundation model: A massive artificial intelligence model trained on a vast quantity of unlabeled data at scale. These models can be adapted to a wide range of downstream tasks like reading clinical notes.
Retrieval-Augmented Generation (RAG): An AI framework that improves the quality of an LLM’s response by forcing it to retrieve facts from an external, verified database (like a medical coding dictionary) before answering a prompt.
Unstructured data: Information that either does not have a pre-defined data model or is not organized in a pre-defined manner. In healthcare, this primarily refers to the free-text narrative notes written by doctors and nurses.