Automated Evidence Generation for Regulatory-Grade Real-World Data
Executive Summary
The global regulatory landscape for Real-World Evidence (RWE) has reached an inflection point. The FDA’s July 2024 final guidance on electronic health records (EHR) and medical claims data establishes clear expectations for data quality and provenance when using RWD for regulatory submissions.[1] In parallel, the European Medicines Agency (EMA) reported a 47.5% increase in RWD studies conducted through the DARWIN EU network between February 2024 and February 2025.[2] Furthermore, agencies like Japan’s PMDA are increasingly utilizing RWE, particularly for external controls in orphan drug approvals where traditional randomized controlled trials (RCTs) are challenging.[3]
These developments signal a fundamental shift in how regulatory agencies evaluate evidence for drug development and post-market surveillance.
Major Regulatory Developments (2024-2025)
- FDA (United States): Released comprehensive RWD guidance establishing standards for data extraction, validation, and provenance.[1]
- EMA (Europe): DARWIN EU network expanded to 30 data partners across 16 countries, covering approximately 180 million patients.[2]
- PMDA (Japan): Increased acceptance of RWE for external control arms in specific contexts, notably orphan diseases.[3]
Current approaches to RWE generation struggle to meet these evolving requirements. Manual chart abstraction, the dominant method for retrospective and many prospective RWE studies, is associated with significant variability and a pooled error rate of 6.57%.[4] Data aggregators provide scale but often lack the patient-level traceability and depth required by regulators for efficacy endpoints. Point solutions create fragmented workflows that complicate compliance across multiple regulatory frameworks.
This white paper examines the emergence of automated evidence generation platforms, the regulatory frameworks governing their use, and the critical methodologies, including advanced Artificial Intelligence (AI) balanced with human oversight, required to produce regulatory-grade RWE.
The State of Evidence Generation in 2025
The 21st Century Cures Act (2016) established the initial framework for RWE utilization in the United States. Nine years later, pharmaceutical companies still grapple with fundamental challenges in data quality, efficiency, and scalability. While the increasing digitization of healthcare data offers promise, extracting research-relevant information remains a significant bottleneck.
The Cost of Manual Transcription
A major inefficiency in current workflows is the reliance on manual transcription from EHRs or paper sources into Electronic Data Capture (EDC) systems. Studies indicate that up to 70% of data is duplicated between EHR and EDC systems.[5] This duplication is not only time-consuming but also introduces significant errors.
An analysis of data changes in an EDC database found that 71.1% of all modifications were designated as “data entry errors,” primarily simple transcription mistakes.[6] These errors necessitate extensive Source Data Verification (SDV) and query resolution, significantly driving up study costs. On-site monitoring and SDV can account for up to 25% of the total clinical trial budget.[7]
The cost implications are staggering. While the average cost of Phase IV studies varies widely based on scope and complexity (ranging from $5 million to over $20 million), activities related to site management, data management, and monitoring collectively represent approximately 55-60% of the total budget.[8][9]
The Current Landscape: Three Approaches, Three Compromises
| Approach | 1. Traditional CROs/Sites | 2. Data Aggregators | 3. Point Solutions |
|---|---|---|---|
| Promise | Established site relationships and regulatory experience | 330M+ patient records, rapid cohort sizing | Best-in-class features for specific tasks (e.g., ePRO) |
| Data Sources | Site-based EHR, manual abstraction | Tokenized claims, licensed EHR data | Limited to specialty or specific data types |
| Reality | Manual abstraction introduces error and variability; monitoring and SDV drive up costs | Scale without the patient-level traceability or depth regulators require for efficacy endpoints | Fragmented workflows that complicate compliance across regulatory frameworks |
Introducing Castor Catalyst: A Unified Approach
Unlike the three traditional approaches that force compromises between depth, scale, and integration, Castor Catalyst provides a unified platform that combines automated evidence generation with complete data access and regulatory-grade traceability.
Castor Catalyst: Key Differentiators
- Complete Data Access: Medical + Pharmacy claims and EHR via patient consent (no tokenization costs)
- Automated Extraction: Average 6 minutes per chart vs. 30 minutes manual[10]
- Direct Access: Patient-consented claims/EHR access, not licensed data
- Visual Audit Trail: Every data point traceable to source document
- Rapid Setup: 4-6 week activation vs. 3-6 months traditional
Patient-Consented Claims Access: A Strategic Shift
Direct Patient Access Advantage
Unlike data aggregators that require expensive tokenization (often adding $200-500K to study costs), Castor Catalyst accesses both medical AND pharmacy claims data, as well as EHR data, directly through patient consent (HIPAA Right of Access). This eliminates tokenization costs and provides complete, traceable data lineage.
The Emergence of Automated Evidence Generation in 2025
The convergence of regulatory acceptance, technological capability (including interoperability standards and AI), and economic pressure has catalyzed the development of automated evidence generation platforms. These systems aim to fundamentally reimagine how clinical evidence is created, validated, and submitted.
Technological Foundations: Interoperability and Data Access
The foundation of automated evidence generation is the ability to access source data programmatically or comprehensively. In the US, this is achieved through two primary pathways: FHIR APIs and HIPAA Release Authorization.
HL7 FHIR and the USCDI
HL7 FHIR (Fast Healthcare Interoperability Resources) has seen rapid adoption, driven by the 21st Century Cures Act. As of 2024, over 80% of US hospitals utilize APIs for patient access in inpatient settings, and 74% use FHIR-based APIs in outpatient settings.[11] FHIR enables near real-time access to data defined by the U.S. Core Data for Interoperability (USCDI).
While FHIR provides speed (retrieval requires only a patient portal login), the data is often limited to structured USCDI elements and may lack the depth (e.g., unstructured notes, detailed encounter context) required for complex RWE studies.
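As a minimal illustration of the FHIR pathway, the sketch below assumes a hypothetical SMART-on-FHIR patient-access endpoint and a patient-issued token, and pages through laboratory Observation resources using standard FHIR R4 search syntax:

```python
# Minimal sketch of patient-mediated FHIR retrieval; endpoint and token are hypothetical.
import requests

FHIR_BASE = "https://ehr.example.org/fhir/R4"  # hypothetical patient-access endpoint
TOKEN = "patient-issued-access-token"          # obtained via a SMART on FHIR consent flow

def fetch_lab_observations(patient_id: str) -> list[dict]:
    """Page through laboratory Observation resources for one consented patient."""
    url = f"{FHIR_BASE}/Observation"
    params = {"patient": patient_id, "category": "laboratory", "_count": 100}
    headers = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/fhir+json"}
    observations = []
    while url:
        bundle = requests.get(url, params=params, headers=headers, timeout=30).json()
        observations.extend(entry["resource"] for entry in bundle.get("entry", []))
        # Follow the Bundle's "next" link for pagination; query params only apply to the first call.
        url = next((link["url"] for link in bundle.get("link", []) if link["relation"] == "next"), None)
        params = None
    return observations
```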
HIPAA Release Authorization
The traditional HIPAA release pathway involves obtaining patient consent and requesting records directly from facilities. This method yields a complete dataset, including unstructured documents (PDFs) and imaging reports, providing the necessary depth for complex studies (e.g., oncology progression, adverse event timelines). While slower (often taking weeks) and incurring higher retrieval costs, it provides superior traceability and audit readiness, as the native source documents are obtained.
Hybrid Data Access Strategy
Castor Catalyst employs a hybrid strategy, leveraging the speed of FHIR for longitudinal registries and less complex disease spaces, while utilizing HIPAA release for studies requiring deep phenotypic data and comprehensive source documentation.
| Capability | HIPAA Release Pathway | FHIR Pathway |
|---|---|---|
| Speed | Slower (~2 weeks) | Fast (Minutes – 24 hrs) |
| Patient Effort | Low (Consent only) | Moderate (Portal login required) |
| Record Depth | Complete (Structured + Unstructured) | Moderate (Primarily structured USCDI) |
| Traceability | Strong (Native files obtained) | Limited (Relies on EMR output) |
| FDA Audit Readiness | Strong | Partial |
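The trade-offs in the table above can be expressed as a simple decision rule. The sketch below is illustrative only; the study attributes are assumptions, not platform configuration options:

```python
# Illustrative decision rule for the hybrid data-access strategy summarized above.
from dataclasses import dataclass

@dataclass
class StudyDataNeeds:
    needs_unstructured_notes: bool        # e.g., oncology progression, adverse event timelines
    needs_native_source_documents: bool   # audit readiness requires the original files

def choose_pathway(needs: StudyDataNeeds) -> str:
    """Route deep-phenotype studies to HIPAA release; default to FHIR for speed."""
    if needs.needs_unstructured_notes or needs.needs_native_source_documents:
        return "HIPAA release"
    return "FHIR"

print(choose_pathway(StudyDataNeeds(True, True)))    # -> HIPAA release
print(choose_pathway(StudyDataNeeds(False, False)))  # -> FHIR
```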
The Role of Artificial Intelligence and NLP
Once data is retrieved (either as structured FHIR resources or unstructured PDFs via HIPAA release), Artificial Intelligence, particularly Natural Language Processing (NLP) and Large Language Models (LLMs), is essential for structuring the data for analysis.
AI Performance in Clinical Data Extraction
The performance of AI in extracting clinical data varies significantly depending on the complexity of the task. It is crucial to evaluate performance using metrics beyond simple accuracy, such as Precision, Recall, and F1-score (the harmonic mean of precision and recall).
- Structured Entity Extraction: For well-defined, discrete data points (e.g., age, medication names, specific lab values), modern NLP systems can achieve high performance, with F1-scores often ranging from 0.85 to 0.95.[12][13]
- Complex Concept Extraction and Relation Identification: For complex tasks such as identifying adverse drug events (ADEs), mapping symptoms to standardized terminologies (e.g., MedDRA), or establishing relationships between medications and indications, performance is significantly lower. F1-scores typically range from 0.60 to 0.80.[12][14][15] This highlights the necessity of human oversight for complex data points.
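A short worked example of the metrics referenced above (the counts are invented for illustration):

```python
# Worked example of precision, recall, and F1; the counts are illustrative only.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return precision, recall, f1

# An extractor that finds 90 true medication mentions, adds 10 spurious ones,
# and misses 15 scores roughly P=0.90, R=0.86, F1=0.88.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=15)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```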
AI Benchmarks, Limitations, and the "Benchmark Fallacy"
The Benchmark Fallacy in Medical AI
The rapid advancement of LLMs has led to impressive performance on standardized medical benchmarks. For example, recent models achieve over 90% accuracy on MedQA (a benchmark based on US Medical Licensing Examination questions).[16]
However, relying solely on these benchmarks to assess suitability for RWE generation is a critical error: the “Benchmark Fallacy.” MedQA and similar tests primarily evaluate knowledge recall and basic reasoning in a multiple-choice format. They do not reflect the complexity of real-world clinical data extraction.[17]
Limitations of Standard Benchmarks for RWE
- Lack of Temporal Reasoning: RWE studies require synthesizing longitudinal data and understanding the sequence of events (e.g., disease progression, treatment switching), which benchmarks do not test.
- Inability to Handle Ambiguity and Conflict: Clinical notes often contain conflicting information, speculation, and nuanced language that multiple-choice tests cannot replicate.
- Metacognition Failure: Studies show that while LLMs perform well on accuracy metrics, they exhibit poor metacognition: they fail to recognize their knowledge limitations and provide confident answers even when information is missing or incorrect.[18]
- Insufficient Difficulty: Many benchmarks are now too easy for advanced models, leading to ceiling effects that mask deficiencies in complex reasoning.[17]
The Relevance of Benchmarks for Data Extraction
Despite these limitations, benchmarks still hold value for specific use cases within the RWE workflow. Automated evidence generation does not require the AI to diagnose patients or recommend treatments. Instead, the primary task is information retrieval and structuring: identifying relevant concepts within unstructured text and mapping them to a standardized format.
In this controlled context, the strong performance of LLMs on knowledge-based benchmarks indicates a robust foundational ability to recognize medical terminology, understand clinical context, and extract discrete data points. While benchmarks are not sufficient proof of regulatory-grade extraction capability, they are a necessary prerequisite.
The Path Forward: Beyond Benchmarks
To truly evaluate AI for RWE, standardized benchmarks must be supplemented with rigorous, task-specific validation against expert-curated clinical datasets that emphasize temporal reasoning, relation extraction, and adverse event identification (e.g., MADE 1.0, n2c2 challenges).[15]
Validating AI and Ensuring Data Provenance
The transition from promising AI technology to regulatory-grade evidence requires rigorous validation, transparent data provenance, and mechanisms to mitigate inherent AI limitations such as hallucination.
Mitigating Hallucination with a Visual Audit Trail
A major concern with LLMs is “hallucination” (confabulation), where the model generates plausible but incorrect information. In RWE, this risk is unacceptable.
Castor Catalyst mitigates this risk through a mandatory Visual Audit Trail. Every data point extracted by the AI is directly linked to its source:
- For Unstructured Data (PDFs): The system highlights the exact bounding box in the source document from which the data was extracted.
- For Structured Data (FHIR): The data point is linked to the specific FHIR resource ID, including retrieval timestamps.
Figure 1: Visual Audit Trail – Every extracted data point is directly linked to its source location in the original document.
This direct, visual linkage ensures complete traceability from the source to the target dataset, allowing for efficient verification and satisfying the data lineage requirements of 21 CFR Part 11.
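The sketch below illustrates one possible shape for such a provenance record; the field and variable names are hypothetical, and the point is simply that every extracted value carries a pointer to either a bounding box in a source PDF or a FHIR resource ID:

```python
# Illustrative shape of a provenance record behind a visual audit trail; names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceAnchor:
    document_id: Optional[str] = None       # PDF obtained via HIPAA release
    page: Optional[int] = None
    bounding_box: Optional[tuple[float, float, float, float]] = None  # x0, y0, x1, y1
    fhir_resource_id: Optional[str] = None  # e.g., "Observation/12345"
    retrieved_at: Optional[str] = None      # ISO 8601 retrieval timestamp

@dataclass
class ExtractedDataPoint:
    cdash_variable: str   # e.g., "LBORRES" (lab result as originally collected)
    value: str
    confidence: float     # model-reported confidence; drives review priority
    source: SourceAnchor

hemoglobin = ExtractedDataPoint(
    cdash_variable="LBORRES",
    value="13.2",
    confidence=0.97,
    source=SourceAnchor(document_id="records_0042.pdf", page=7,
                        bounding_box=(102.0, 415.5, 198.0, 431.0)),
)
print(hemoglobin.source.document_id, hemoglobin.source.page)
```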
Human-in-the-Loop (HITL) Review
Given the limitations of current AI in complex temporal reasoning and relation extraction (discussed above), a purely automated approach is not sufficient for regulatory-grade RWE.
Castor employs a Human-in-the-Loop (HITL) workflow where medically trained staff (e.g., Registered Nurses, Clinical Data Managers) review the AI-extracted data and the Visual Audit Trail. This review focuses on:
- Verification of Complex Items: Ensuring accuracy for nuanced endpoints like disease progression, adverse events, and treatment rationale.
- Completeness Check: Confirming that all relevant data from the source has been processed.
- Confidence Score Review: Prioritizing review of items where the AI reports lower confidence.
Data is only committed to the EDC after passing this expert review, ensuring the sponsor receives validated, source-verifiable study data.
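A minimal sketch of confidence-based review prioritization, assuming a hypothetical record shape and an arbitrary 0.90 threshold:

```python
# Sketch of confidence-based review prioritization; threshold and record shape are assumptions.
REVIEW_THRESHOLD = 0.90  # arbitrary cut-off for this sketch

def build_review_queue(extracted: list[dict]) -> list[dict]:
    """Order items so that complex or low-confidence extractions are reviewed first."""
    complex_domains = {"AE", "DS"}  # adverse events and disposition need closer scrutiny
    def priority(item: dict) -> tuple[bool, float]:
        needs_attention = item["confidence"] < REVIEW_THRESHOLD or item["domain"] in complex_domains
        return (not needs_attention, item["confidence"])  # flagged items first, lowest confidence first
    return sorted(extracted, key=priority)

queue = build_review_queue([
    {"domain": "LB", "field": "LBORRES", "confidence": 0.98},
    {"domain": "AE", "field": "AETERM", "confidence": 0.72},
])
print(queue[0]["field"])  # -> AETERM: the adverse event term is reviewed first
```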
Data Harmonization and CDISC Standards
To ensure consistency and facilitate analysis, extracted data must be harmonized. Castor Catalyst maps both unstructured and structured source data to the Clinical Data Interchange Standards Consortium (CDISC) standards, specifically CDASH (Clinical Data Acquisition Standards Harmonization).
The platform utilizes distinct processing pipelines for each CDISC domain (e.g., Adverse Events (AE), Concomitant Medications (CM), Laboratory Tests (LB)). This structured approach ensures higher quality output and accelerates downstream processing and analysis.
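Conceptually, the per-domain pipelines can be pictured as a simple dispatch table; the handler functions below are hypothetical placeholders, with domain codes following CDISC conventions:

```python
# Conceptual sketch of per-domain processing pipelines; handlers are hypothetical placeholders.
def process_adverse_event(item: dict) -> dict:
    return {"DOMAIN": "AE", "AETERM": item["text"], "AESTDTC": item.get("onset_date")}

def process_conmed(item: dict) -> dict:
    return {"DOMAIN": "CM", "CMTRT": item["text"], "CMDOSE": item.get("dose")}

def process_lab(item: dict) -> dict:
    return {"DOMAIN": "LB", "LBTESTCD": item["code"], "LBORRES": item["value"]}

DOMAIN_PIPELINES = {"AE": process_adverse_event, "CM": process_conmed, "LB": process_lab}

def harmonize(extracted_items: list[dict]) -> list[dict]:
    """Route each extracted concept to its domain-specific pipeline."""
    return [DOMAIN_PIPELINES[item["domain"]](item)
            for item in extracted_items if item["domain"] in DOMAIN_PIPELINES]

print(harmonize([{"domain": "AE", "text": "nausea", "onset_date": "2024-03-02"}]))
```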
International Regulatory Frameworks: A Global Convergence
The harmonization of RWE standards across major regulatory agencies represents a watershed moment for evidence generation. While jurisdictions maintain unique requirements, the convergence on core principles (data quality, traceability, and validation) creates opportunities for unified approaches to global studies.
United States: FDA's Comprehensive Framework
The FDA’s July 2024 final guidance, *Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making*, establishes the most detailed framework for automated evidence generation.[1] The guidance specifically addresses the need for rigorous data capture and validation processes.
It emphasizes that sponsors must ensure the reliability and relevance of the data, including maintaining a clear audit trail documenting data provenance from the source (EHR or claims) through extraction and transformation.
Key FDA Requirements for RWD[1]
- Data Lineage and Provenance: Complete documentation of the origin of the data and the audit trail of data transformations.
- Validation Protocols: Pre-specified processes for data extraction and validation, including handling of automated extraction methods.
- 21 CFR Part 11 Compliance: Ensuring electronic records and signatures are trustworthy, reliable, and equivalent to paper records.
- Data Quality Assurance: Processes to ensure data accuracy, completeness, and consistency.
Europe: DARWIN EU and Data Quality
The European Medicines Agency has taken a distributed approach through DARWIN EU (Data Analysis and Real World Interrogation Network), enabling federated analysis across 180 million patient records while maintaining GDPR compliance.[2]
The EMA’s focus remains heavily on data quality and standardization; its guidance emphasizes that the methodologies used for data extraction and processing must be transparent and validated.
“Sponsors should ensure that processes are in place to assure the quality of the data, including their accuracy, completeness, and traceability to the source.”
– Adapted from EMA Guidance on RWD Quality
Operational Implementation & Real-World Evidence
The Automated Evidence Workflow
The operational implementation of Castor Catalyst centers on a streamlined workflow that integrates patient consent, data retrieval, AI processing, and human review.
Figure 2: Automated Evidence Generation Workflow – From patient consent to validated data in EDC.
1. Consent and Authorization: The process begins with patient consent (ICF) and authorization (HIPAA Release or FHIR portal access). This establishes the legal foundation for data use.
2. Data Retrieval: Utilizing the hybrid approach (FHIR for speed, HIPAA for depth), medical records and claims data are retrieved.
3. AI Processing (CoPilot): The AI agent processes the data, structuring it into CDISC CDASH domains and generating the Visual Audit Trail.
4. Clinical Review (HITL): Medically trained Castor staff review the extracted data and audit trail, focusing on complex endpoints and low-confidence items.
5. Data Delivery: Validated, structured data is pushed to the EDC or delivered via API.
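At a high level, the workflow above can be read as a simple orchestration of five functions. The sketch below uses illustrative stubs rather than the platform's actual API:

```python
# High-level orchestration sketch of the five workflow steps; all bodies are illustrative stubs.
def obtain_consent_and_authorization(patient_id: str) -> dict:
    # Step 1: ICF plus HIPAA release or FHIR portal authorization.
    return {"patient_id": patient_id, "icf_signed": True, "pathway": "FHIR"}

def retrieve_records(consent: dict) -> list[dict]:
    # Step 2: hybrid retrieval (FHIR for speed, HIPAA release for depth).
    return [{"type": "Observation", "id": "obs-1"}]

def extract_to_cdash(records: list[dict]) -> list[dict]:
    # Step 3: AI structuring into CDASH domains, with an audit-trail pointer per value.
    return [{"domain": "LB", "value": "13.2", "confidence": 0.97, "source": r["id"]} for r in records]

def human_review(extracted: list[dict]) -> list[dict]:
    # Step 4: human-in-the-loop verification (placeholder: every item is reviewed).
    return extracted

def push_to_edc(reviewed: list[dict]) -> list[dict]:
    # Step 5: delivery of validated data to the EDC or via API.
    return reviewed

def run_workflow(patient_id: str) -> list[dict]:
    consent = obtain_consent_and_authorization(patient_id)
    records = retrieve_records(consent)
    extracted = extract_to_cdash(records)
    reviewed = human_review(extracted)
    return push_to_edc(reviewed)

print(run_workflow("patient-001"))
```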
Efficiency Gains and Cost Reduction
The shift from manual abstraction to automated extraction provides significant efficiency gains and cost reductions.
Reduced Transcription Time and Cost
Internal benchmarks comparing manual abstraction to the Castor Catalyst workflow demonstrate substantial improvements:
- Time Reduction: Extracting a full chart (labs, medications, history) takes approximately 6 minutes with Catalyst plus HITL review, down from roughly 30 minutes manually.[10]
- Cost Reduction: Direct cost per chart (fully loaded RN salary vs. technology cost + review cost) reduced by approximately 80%, from $48.75 to $9.75.[10]
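As a back-of-the-envelope check of these figures (the chart count is hypothetical):

```python
# Back-of-the-envelope check of the per-chart figures cited above; chart count is illustrative.
manual_cost, automated_cost = 48.75, 9.75  # USD per chart, from the internal benchmark
manual_minutes, automated_minutes = 30, 6  # minutes per chart

charts = 2_000  # hypothetical study size
savings = charts * (manual_cost - automated_cost)
hours_saved = charts * (manual_minutes - automated_minutes) / 60

print(f"Per-chart cost reduction: {1 - automated_cost / manual_cost:.0%}")  # -> 80%
print(f"Projected savings over {charts:,} charts: ${savings:,.0f}")         # -> $78,000
print(f"Abstraction hours avoided: {hours_saved:,.0f}")                     # -> 800
```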
Impact on Query Rates and SDV
By eliminating transcription errors at the source, which account for over 70% of data changes in EDC databases,[6] automated extraction significantly reduces the need for downstream query resolution and SDV. Furthermore, research indicates that AI can play a crucial role in data cleaning by detecting anomalies that might be missed by manual review, further reducing data management burden.[19]
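As a simplified, rule-based stand-in for the kind of automated data-cleaning checks referenced above, a range check over extracted lab values might look like this (reference ranges and field names are illustrative, not clinical guidance):

```python
# Simplified, rule-based stand-in for automated anomaly flagging during data cleaning.
REFERENCE_RANGES = {"HGB": (12.0, 17.5), "GLUC": (70.0, 140.0)}  # test code -> (low, high), illustrative

def flag_anomalies(lab_rows: list[dict]) -> list[dict]:
    """Return lab rows whose numeric result falls outside the expected range."""
    flagged = []
    for row in lab_rows:
        low, high = REFERENCE_RANGES.get(row["LBTESTCD"], (float("-inf"), float("inf")))
        if not low <= float(row["LBORRES"]) <= high:
            flagged.append({**row, "query": f"Value {row['LBORRES']} outside expected range {low}-{high}"})
    return flagged

print(flag_anomalies([{"LBTESTCD": "HGB", "LBORRES": "3.2"}]))  # flagged for review
```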
Evidence from Castor Catalyst Implementations
Castor Catalyst is actively deployed in several ongoing RWE projects across various therapeutic areas, demonstrating the platform’s capability in real-world settings.
Renovo Therapeutics
Focus: Rare disease registry, longitudinal data collection.
Implementation: Automated extraction of 10+ years of historical data per patient, enabling rapid cohort establishment and identification of prodromal phases.
Omnia Oncology
Focus: Prospective oncology study, integrating EHR and ePRO.
Implementation: Utilizing FHIR-mediated retrieval for continuous data updates aligned with the study visit schedule, processed via Catalyst for near real-time data access.
GLP-1 Observational Study
Focus: Post-market surveillance, large-scale decentralized recruitment.
Implementation: Patient-mediated access to medical and pharmacy claims to track real-world effectiveness and safety signals, bypassing traditional site bottlenecks.
Conclusion and Strategic Implications
The transformation of clinical evidence generation through automation is an operational imperative. As regulatory frameworks mature and technological capabilities advance, the adoption of automated approaches, balanced with rigorous validation and human oversight, is essential for accelerating drug development and ensuring data quality.
The integration of interoperability standards (FHIR), comprehensive data access (HIPAA), and advanced AI, coupled with critical methodologies like the Visual Audit Trail and HITL review, provides a pragmatic path to regulatory-grade RWE.
Key Takeaways
- Regulatory Acceptance: Major agencies (FDA, EMA) have established clear pathways for using RWD, emphasizing data provenance, quality, and validation.[1][2]
- Technical Pragmatism: AI is a powerful tool for data extraction, but its limitations in complex reasoning necessitate Human-in-the-Loop review and transparent audit trails for regulatory use.
- Data Access Strategy: A hybrid approach utilizing both FHIR (for speed) and HIPAA release (for depth) is optimal for diverse RWE needs.
- Economic Imperative: Automated extraction offers significant efficiency gains (approx. 5x faster) and cost reduction (approx. 80% per chart) compared to manual abstraction.[10]
Ready to Transform Your Evidence Generation?
Learn how Castor Catalyst can accelerate your path to regulatory approval with complete medical and pharmacy claims integration.
Frequently Asked Questions (FAQ)
How does Castor Catalyst handle AI hallucinations?
Hallucination is mitigated through the mandatory Visual Audit Trail. Every AI-extracted data point is visually linked to its source (e.g., a highlighted bounding box in a PDF or a FHIR resource ID). This allows human reviewers to verify the extraction instantly, ensuring that no confabulated data enters the final dataset.
Is AI accurate enough for regulatory submissions?
AI performance varies by task. While high F1-scores (0.85-0.95) are achievable for simple entity extraction, complex tasks (e.g., adverse event relation identification) show lower performance (0.60-0.80). Therefore, Catalyst employs a Human-in-the-Loop (HITL) approach where medically trained staff review and validate the AI output before it is committed to the EDC.
How often is the EMR data refreshed?
Data refreshes are typically aligned with the study’s schedule of assessments (e.g., quarterly). The platform does not perform continuous pulls, as there is no automated way to know if new, relevant information is available without refreshing the data.
What happens if the EHR record is incomplete?
If source data is missing (e.g., missing labs, medication start dates), the corresponding fields in the EDC will remain empty. The system accurately reflects the available source data.
How does Catalyst handle unit conversions?
If the units in the EHR do not match the units required by the study protocol/CDMS, the Catalyst AI will attempt to complete a conversion to an available unit of measurement or utilize an ‘other, specify’ option. These conversions are flagged for human review.
What checks does the Castor review team perform?
The medically trained review team confirms two critical aspects: 1) Accuracy: The processed data matches the EHR source via the Visual Audit Trail; and 2) Completeness: All relevant data from the EHR required for the study has been processed.
References
1. FDA. (2024, July). Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products. Final Guidance.
2. EMA. (2025, July). DARWIN EU® Third Annual Report (Feb 2024–Feb 2025).
3. Hirakawa, A., et al. (2023). Utilization of real-world data as an external control arm in drug development and regulatory review in Japan. Clinical Pharmacology & Therapeutics.
4. Garza, J. P., et al. (2024). Error Rates of Data Processing Methods in Clinical Research: A Systematic Review and Meta-Analysis. eGEMs (Generating Evidence & Methods to improve patient outcomes), 12(1). PMC10775420.
5. SCRS. (n.d.). EHR to EDC Efforts: One Sponsor’s Real World Experience and Learnings. Retrieved from myscrs.org.
6. Ng, C. J., et al. (2013). Evaluation of Data Entry Errors and Data Changes to an Electronic Data Capture Clinical Trial Database. Drug Information Journal, 47(5). PMC3777611.
7. Andersen, J. R., et al. (2015). The costs of translating clinical practice into clinical research: the tragedy of the clinical trials commons. British Journal of Clinical Pharmacology.
8. Abacum. (2025). Clinical Trial Costing: Phase-By-Phase Budget Guide. Retrieved from abacum.ai.
9. HHS ASPE. (n.d.). Examination of Clinical Trial Costs and Barriers for Drug Development.
10. Castor. (2025). Internal Benchmarking Data: Castor CoPilot vs. Manual Extraction. [Data on file].
11. ONC. (2025, August). Growth of Health IT-Enabled Patient Engagement Capabilities Among U.S. Hospitals: 2021–2024. Data Brief No. 79.
12. Liu, K., et al. (2024). Benchmarking performance of GPT models on medical natural language processing tasks. medRxiv. doi:10.1101/2024.06.10.24308699.
13. John Snow Labs. (n.d.). AI-Enhanced Oncology Data: Unlocking Insights from EHRs with NLP and LLMs. Retrieved from johnsnowlabs.com.
14. Wang, Y., et al. (2021). An Evaluation of Clinical Natural Language Processing Systems to Extract Symptomatic Adverse Events from Patient-Authored Free-Text Narratives. AMIA Annual Symposium Proceedings.
15. Jagannatha, A., et al. (2019). Overview of the First Natural Language Processing Challenge for Extracting Medication, Indication, and Adverse Drug Events from Electronic Health Record Notes (MADE 1.0). Drug Safety, 42(1). PMC6860017.
16. Vals AI. (2025). MedQA Benchmark. Retrieved from vals.ai/benchmarks/medqa.
17. Zuo, X., et al. (2025). MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. arXiv preprint arXiv:2501.18362.
18. Thiriet, C., et al. (2024). Large Language Models lack essential metacognition for reliable medical reasoning. Nature Communications, 15. PMC11733150.
19. Jia, Z., et al. (2022). Application of artificial intelligence in data cleaning of electronic data capture system. Chinese Journal of New Drugs.