The Cost of Bad Data
How IT-Powered Quality Control is Safeguarding Drug Development
Pablo Yarza, Founder and CEO, BIOINFILE
Poor data quality can cost pharma companies millions – delayed approvals, product recalls, and regulatory non-compliance are just the tip of the iceberg. Yet many still rely on uncalibrated or generic datasets that fail to deliver precise microbial insights. As digital transformation accelerates, IT-driven solutions are redefining quality control, offering advanced data curation, AI-assisted classification, and seamless regulatory reporting. This article examines how pharma companies can mitigate risks by adopting smarter, continuously updated reference databases, and why investing in quality-controlled data infrastructure is not just a regulatory necessity but a competitive advantage.

1. The Obsolescence of Data
Modern pharmaceutical research generates an unprecedented volume of information. Data from genomic sequencing, electronic health records, wearable sensors, and clinical trials is multiplying at a rate that far outpaces traditional management methods.
1.1. A Surge of Biomedical Data Is Coming
Healthcare alone now accounts for roughly 30% of the world’s data volume, and that share is growing quickly. Biomedical Data is projected to grow at a 36% compound annual growth rate (CAGR) through 2025 – faster than data in manufacturing, finance, or media. This explosion of data offers immense opportunities for insight but also reveals deep cracks in legacy data systems. As one recent review noted, the “explosive growth of Biomedical Big Data” brings formidable challenges, including highly heterogeneous formats, missing information, and sheer scalability issues. In essence, the surge of new Biomedical Data is rendering old data management practices obsolete, demanding more advanced approaches to avoid drowning in information.
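To make that figure concrete, compound growth is easy to project: a volume growing at rate r for n years scales by (1 + r)^n. The minimal Python sketch below uses an illustrative starting estate of 100 TB (an assumption, not a benchmark) to show how quickly a 36% CAGR compounds.

```python
def projected_volume(initial_volume: float, cagr: float, years: int) -> float:
    """Project a data volume forward at a compound annual growth rate."""
    return initial_volume * (1 + cagr) ** years

# At 36% CAGR, 100 TB today becomes ~465 TB in 5 years and ~2,165 TB in 10.
for years in (5, 10):
    print(f"{years} years: {projected_volume(100, 0.36, years):,.0f} TB")
```

At that rate, a data estate nearly quintuples in five years and grows more than twenty-fold in ten – a pace at which curation practices that coped a decade ago simply cannot keep up.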
1.2. Risks for the Pharmaceutical Industry
For pharmaceutical companies, poor data quality isn’t just a technical nuisance – it’s a direct threat to their financial, regulatory, and operational well-being. Bad data can derail drug development at any stage. When data integrity is compromised, trial results become unreliable, and regulatory submissions risk rejection. A cautionary example is the case of Zogenix and its epilepsy drug Fintepla. The company faced two FDA application denials in a row – first for submitting an incorrect version of a clinical dataset, and then again because critical trial data were missing. These missteps significantly delayed the drug’s approval, giving competitors a head start and tarnishing the sponsor’s reputation. More broadly, industry analyses warn that a lack of high-quality data can lead to costly delays in drug approvals, hefty compliance fines, or even product recalls. If clinical data are inconsistent or incomplete, companies may have to repeat studies or run additional “rescue” trials, burning through millions of dollars and precious time. In manufacturing, data integrity violations (for example, inaccurate lab results or mis-calibrated equipment logs) can trigger regulatory warning letters and plant shutdowns. Poor data practices thus amplify operational risks – from failed audits and supply chain disruptions to the nightmare of a post-market safety recall. Simply put, bad data is expensive, and in the life-and-death business of pharma, its cost is untenable.
2. The Opportunity to Accelerate R&D
The flip side of this data deluge is a tremendous opportunity: with the right technologies, pharma R&D teams can break data out of silos and transform raw data into a curated, integrated “R&D knowledge space” from which they can glean insights faster and make better decisions than ever before. Companies are increasingly looking to information technology and Artificial Intelligence (AI) not just to manage data, but to actively improve its quality and accelerate research workflows.
2.1. Technology Readiness Levels
As pharmaceutical R&D becomes increasingly data-driven, organizations must assess not only new drug candidates but also AI models, automation systems, and digital platforms before integrating them into critical workflows. Technology Readiness Levels (TRLs) offer a structured framework for evaluating the maturity of these innovations, ensuring they are robust enough to support the complexities of drug development. Originally developed by NASA and later embraced by organizations like the European Commission, TRLs provide a systematic approach to efficient innovation. In pharma, this means systematically de-risking innovations, optimizing investments, and accelerating regulatory approval. For example, an AI-powered drug discovery algorithm may start at TRL 3 (proof of concept), where it shows initial promise in academic research. Before pharma companies integrate it into their R&D pipelines, however, it must advance through TRL 7–8 (system prototype demonstrated and qualified in an operational environment), demonstrating reproducibility, scalability, and compliance with regulatory data integrity standards. By mapping drug development to TRLs, companies can allocate resources wisely along the innovation process, fast-tracking high-maturity technologies that are ready to deliver value and iterating on those that need more development.
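To illustrate how such a gate might be applied in practice, the sketch below encodes the standard nine-level TRL scale and a hypothetical policy of admitting only TRL 7+ technologies into regulated pipelines; the gate itself is an assumption for illustration, not a prescription.

```python
from enum import IntEnum

class TRL(IntEnum):
    """The nine Technology Readiness Levels (NASA/European Commission scale)."""
    BASIC_PRINCIPLES = 1
    CONCEPT_FORMULATED = 2
    PROOF_OF_CONCEPT = 3
    LAB_VALIDATION = 4
    RELEVANT_ENV_VALIDATION = 5
    RELEVANT_ENV_DEMO = 6
    OPERATIONAL_ENV_DEMO = 7
    SYSTEM_QUALIFIED = 8
    PROVEN_IN_OPERATIONS = 9

# Hypothetical policy: only TRL 7+ technologies enter regulated R&D pipelines.
PIPELINE_GATE = TRL.OPERATIONAL_ENV_DEMO

def ready_for_pipeline(current: TRL) -> bool:
    return current >= PIPELINE_GATE

print(ready_for_pipeline(TRL.PROOF_OF_CONCEPT))  # False: needs maturation
print(ready_for_pipeline(TRL.SYSTEM_QUALIFIED))  # True: ready to deliver value
```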

2.2. Machine Learning, Knowledge Graphs and Digital Twins
When two of the most transformative digital technologies available to pharma R&D today, Machine Learning (ML) and Knowledge Graphs (KGs), are combined, they unlock deeper insights into complex biological systems and are revolutionizing how scientists curate and integrate data.
A fascinating array of models exists today, from general-purpose Large Language Models (LLMs) such as OpenAI’s GPT series, to genomic models such as Evo 2 (which can generate entire genomes), to the protein structure predictor AlphaFold. These models excel at recognizing patterns in unstructured data and can draft summaries, answer natural-language questions, and even predict molecular interactions or safety concerns from early research findings. KGs, on the other hand, provide a structured way to connect disparate data sources – linking genes to diseases, compounds to targets, and trials to outcomes in a network of relationships. They excel at ensuring information is findable and semantically organized across the organization. An especially interesting use case is AI-assisted data curation, in which the KG is continuously enriched with the latest information, keeping R&D teams up to date. Moreover, LLMs can act as an intuitive interface to complex KGs: researchers can ask questions in plain language, and the LLM will traverse the graph to retrieve precise answers. This synergy is making data more accessible: scientists no longer need to be database experts to uncover hidden connections between, say, a genomic mutation and a clinical trial outcome.
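The KG side of this synergy can be sketched in a few lines with the networkx library; all entity names below are hypothetical placeholders. In production, an LLM would translate a plain-language question such as “which compounds act on genes linked to disease Y?” into exactly this kind of traversal.

```python
import networkx as nx

# A toy knowledge graph: typed edges linking genes, compounds, and trials.
kg = nx.DiGraph()
kg.add_edge("GENE_X", "DISEASE_Y", relation="associated_with")
kg.add_edge("COMPOUND_A", "GENE_X", relation="inhibits")
kg.add_edge("TRIAL_001", "COMPOUND_A", relation="evaluates")
kg.add_edge("TRIAL_001", "DISEASE_Y", relation="targets")

def compounds_for_disease(graph: nx.DiGraph, disease: str) -> list[str]:
    """Find compounds that inhibit genes associated with a disease."""
    genes = [u for u, v, d in graph.edges(data=True)
             if v == disease and d["relation"] == "associated_with"]
    return [u for u, v, d in graph.edges(data=True)
            if v in genes and d["relation"] == "inhibits"]

print(compounds_for_disease(kg, "DISEASE_Y"))  # ['COMPOUND_A']
```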
Finally, harnessing Machine Learning and Knowledge Graphs together opens a new frontier of analysis: Digital Twins (DTs). These are virtual simulations of complex biological systems that perform much of the modelling required for drug development and personalized medicine. For example, in microbiome research, DTs can simulate how microbial communities evolve in response to different diets or antibiotics: ML models analyze omics data, KGs map microbial-host interactions, and DTs predict long-term health impacts, allowing researchers to test probiotic interventions in silico before real-world trials.
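To give a flavor of the simplest possible microbial digital twin, the sketch below integrates a two-species generalized Lotka-Volterra model with Euler steps. All growth rates and interaction coefficients are illustrative placeholders; a real DT would be calibrated against omics data and KG-derived interaction maps.

```python
import numpy as np

r = np.array([0.8, 0.6])      # intrinsic growth rates (illustrative)
A = np.array([[-1.0, -0.5],   # self-limitation and pairwise competition
              [-0.5, -1.0]])
x = np.array([0.1, 0.1])      # initial species abundances
dt = 0.01

# Generalized Lotka-Volterra dynamics: dx_i/dt = x_i * (r_i + sum_j A_ij x_j)
for _ in range(5000):         # simple Euler integration over 50 time units
    x = x + dt * x * (r + A @ x)

print(x)  # converges to coexistence at roughly [0.67, 0.27]
```

An intervention such as an antibiotic pulse could then be modeled by temporarily perturbing the growth-rate vector r and checking whether the community returns to its previous steady state.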
2.3. New Strategies for Data Curation
In the era of big data, maintaining high-quality, continuously updated datasets requires new thinking and methodologies. Traditional, manual data cleaning is neither scalable nor sufficient. Here, IT-powered strategies come to the forefront. One emerging approach is the adoption of DataOps – a discipline that applies DevOps-style automation and continuous improvement to data pipelines. DataOps frameworks emphasize ongoing data monitoring, validation, and calibration to catch errors or drifts in real time. For example, if an assay instrument starts producing values slightly out of spec, automated DataOps systems can flag the anomaly and initiate recalibration or data correction before bad data cascades into research conclusions. Another important consideration is adhering to FAIR data principles (Findable, Accessible, Interoperable, and Reusable). In practice, this means using standardized formats and ontologies for lab and clinical data, so that every dataset can be easily linked and compared with others. Many pharma organizations are appointing data stewards and establishing data governance councils to enforce such standards and ensure each new batch of data is as clean and compatible as possible with existing knowledge bases. AI itself is now used for quality control: Machine Learning models can learn what “normal” data looks like and automatically detect outliers or inconsistencies (for instance, spotting a patient record that shows an impossible dosage, or a genomic sequence missing key metadata).
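As a concrete example of that last idea, the sketch below flags out-of-spec assay readings using a modified z-score built on the median and MAD – a robust statistic that is not skewed by the very anomalies it is trying to catch. The readings and the 3.5 cut-off are illustrative assumptions.

```python
import numpy as np

def flag_outliers(values: np.ndarray, threshold: float = 3.5) -> np.ndarray:
    """Flag readings whose modified z-score (median/MAD) exceeds a threshold."""
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

# Hypothetical assay readings; the last value is drifting out of spec.
readings = np.array([98.7, 99.1, 98.9, 99.0, 98.8, 104.6])
print(flag_outliers(readings))  # [False False False False False  True]
```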
Furthermore, continuous curation also involves regularly retraining AI models on updated data. A predictive algorithm in early drug discovery, for example, should be periodically refreshed with the latest experimental results so it doesn’t become stale. As highlighted in emerging Good Machine Learning Practice guidelines, regularly updating models with curated new data helps maintain their predictive power and reliability over time. In sum, the paradigm for data curation in pharma R&D is shifting from one-off cleaning to continuous calibration – an ongoing cycle of validating, integrating, and enriching data to keep it analysis-ready and trustworthy. This is how modern IT-powered quality control is safeguarding drug development: by building a foundation of reliable data, layer by layer, day by day.
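The sketch below shows what such a retrain-on-drift loop might look like, using scikit-learn with synthetic data standing in for curated experimental batches; the AUC threshold is an assumed acceptance criterion, not a regulatory figure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

RETRAIN_THRESHOLD = 0.75  # assumed minimum acceptable AUC

def maybe_retrain(model, X_new, y_new):
    """Evaluate on freshly curated data; retrain if performance has drifted."""
    auc = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
    if auc < RETRAIN_THRESHOLD:
        model.fit(X_new, y_new)  # refresh on the latest curated batch
        auc = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])  # in-sample, for brevity
    return model, auc

# Train on one batch; monitor on the next, whose distribution has shifted.
X_old, y_old = make_classification(n_samples=500, random_state=0)
X_new, y_new = make_classification(n_samples=500, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
model, auc = maybe_retrain(model, X_new, y_new)
print(f"AUC after drift check: {auc:.2f}")
```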
3. Sustainable Data Spaces
As data becomes the lifeblood of pharmaceutical R&D, ensuring its long-term availability and usability is a matter of sustainability. It’s not enough to curate data for one project; industry and academia must collaborate to maintain data ecosystems that can support innovation for decades.
3.1. Creating Sustainable Data Spaces: The GBC Case
A forward-thinking example of sustainable data management comes from the world of biodata. The Global Biodata Coalition (GBC) was formed by international research funders in recognition that certain core databases and knowledge repositories are absolutely critical infrastructure for the life sciences. Resources such as gene databases, protein archives, and clinical trial registries serve millions of users worldwide and underpin countless drug discovery projects. However, they have historically faced fragmented, uncertain funding. In 2020, the GBC launched an initiative to coordinate worldwide funding for these core biodata resources, aiming to guarantee their stability and open access. Essentially, GBC members identified a list of Global Core Biodata Resources (GCBRs) – the must-have data repositories that the research community cannot afford to lose – and are jointly ensuring these get the financial support needed to stay updated and freely available.
This is a blueprint for sustainability: treat key data sources as a shared public good and maintain them through collective effort. For the pharmaceutical industry, the GBC case underlines the importance of investing in data infrastructure. A pharma company might generate vast proprietary datasets, but it also relies on public databases for targets, pathways, safety signals, and more. By contributing to and leveraging sustainable data spaces, companies ensure that today’s data treasures don’t become tomorrow’s digital ruins. The GBC’s work shows that proactive governance and cross-sector collaboration can keep crucial data resources alive, curated, and growing – to the benefit of all stakeholders in drug development.
3.2. European Health Data Space: A New Era of Data Harmonization
Another landmark initiative is the European Health Data Space (EHDS), which exemplifies how regional policy can foster sustainable data ecosystems in healthcare and pharma. The EHDS is an upcoming EU-wide framework for health data access and sharing. It is building the rules, standards, and infrastructure needed to harmonize health data across all 27 EU member states.
One major goal is to empower patients with control over their own health records while also enabling those records to travel – so if a patient consents, their data can be used seamlessly by a doctor in another country or by researchers working on a new therapy. From an R&D perspective, the EHDS will create a “single market” for health data, making it easier for pharmaceutical researchers to access large, diverse datasets for drug discovery, pharmacovigilance, and outcome studies (with strong privacy safeguards in place). Crucially, the EHDS addresses the legal and interoperability barriers that often slow down data-driven research. It builds on the foundation of GDPR but goes further in defining health-specific standards and a governance framework for data exchange.
For example, it will standardize formats for electronic health records and establish authorized access mechanisms for secondary use of data (like clinical trial recruitment or public health research). In practice, a company developing a new oncology drug could, in the future, request access to anonymized patient registries from multiple EU countries through the EHDS, instead of negotiating data access country by country. The harmonization of health data on this scale promises not only greater efficiency but also improved data quality – as common standards reduce the inconsistencies that come from patchwork systems. The EHDS is therefore a significant step toward a sustainable, federated data space where high-quality health data is readily available for those with a legitimate need. It embodies the principle that by sharing and standardizing data, we collectively improve the substrate on which all drug development relies.
Final Remarks
The pharmaceutical industry stands at a crossroads where data can either be its greatest asset or its weakest link. As we have seen, calibration is needed at multiple levels. Laboratory instruments require regular calibration to produce accurate measurements, and by extension, digital systems and datasets require continuous calibration to maintain their integrity. Just as a mis-calibrated sensor yields faulty readings, an out-of-date or unvalidated dataset can yield misleading analytics. Experts emphasize that calibration plays a critical role in maintaining data integrity and reliability, underscoring that R&D efforts “heavily rely on accurate and reliable data”. In modern pharma R&D, this idea of calibration transcends the lab bench – it means routinely checking and correcting the course of our data streams, our AI models, and our processes.
Above all, the pursuit of data quality is not a one-time project but a continuous commitment. In an industry where a single data error can set back years of work, building a culture of data excellence is non-negotiable. This involves investing in the right technologies (from LLMs to advanced analytics) and in people and processes (like training staff in data literacy, and establishing clear data governance policies). It means fostering collaboration between IT specialists, data scientists, and domain experts so that quality control is baked into every stage of R&D, from early discovery to clinical trials to manufacturing. The rewards for getting this right are immense: faster development timelines, lower costs, greater regulatory confidence, and ultimately safer, more effective drugs reaching patients. The cost of bad data, conversely, is a price no company wants to pay – it manifests in delays, lost revenue, compliance penalties, and reputational damage that can take years to mend.
In summary, data integrity is the bedrock of innovation in pharmaceuticals. As Biomedical Data continues to skyrocket, companies that proactively implement IT-powered quality control will not only safeguard their drug development programs but also gain a competitive edge. By treating data as a valuable product – one that must be carefully curated, calibrated, and continuously improved – the pharma industry can accelerate R&D while upholding the highest standards of scientific rigor and patient safety. In the digital age of drug development, better data isn’t just an opportunity; it’s our responsibility. Every data point is a potential difference-maker, and ensuring its quality is how we honor the science and the patients we ultimately serve.
References
1. Food and Drug Administration. Data integrity and compliance with drug CGMP: Questions and answers, guidance for industry. U.S. Food and Drug Administration; 2018 Dec. Available from: https://www.fda.gov
2. European Medicines Agency. Guideline on computerized systems and electronic data in clinical trials. European Medicines Agency; 2023. Available from: https://www.ema.europa.eu
3. World Health Organization. WHO guideline on data integrity (Technical Report Series No. 1033, Annex 4). World Health Organization; 2021. Available from: https://www.who.int
4. Global Biodata Coalition. About the Global Biodata Coalition. Available from: https://globalbiodata.org
5. European Commission. European Health Data Space. Available from: https://health.ec.europa.eu
6. McKinsey & Company. Real-world data quality: opportunities and challenges. McKinsey & Company; 2023. Available from: https://www.mckinsey.com
7. Pleiner S, et al. Quality criteria for real-world data in pharmaceutical research and health care decision-making: Austrian expert consensus. JMIR Med Inform. 2022 Jun. Available from: https://pmc.ncbi.nlm.nih.gov
8. Arc Institute. Genome modeling and design across all domains of life with Evo 2. Arc Institute; 2025.
9. AlphaFold Protein Structure Database. AlphaFold protein structure database. Available from: https://alphafold.ebi.ac.uk







