Babita Singh, PhD
Senior Researcher, Scientific Advisor, Free(lance) Thinker, Product development, Project management
PhD in Biomedicine, Bioinformatics
My work revolves around scientific research on biological data, the genomics of disease, personalised medicine, and tracking the progress of ML/AI in the healthcare sector.
This website documents a summary of all scientific projects I have undertaken since 2013.
Large Language Models for Genomics - Integrating science, signs and symbols
This is Bhopalator: a molecular machine that processes 'cell language'. DNA function cannot be fully accounted for by the laws of physics and chemistry alone; it also has to be understood as a linguistic system. This should inspire new architectures for LLMs based on the principles of semiotics, the science of symbols and signs. Bhopalator was proposed by Sungchul Ji in 'Linguistics of DNA: Words, Sentences, Grammar, Phonetics, and Semantics' (1999). (Work in progress; a draft will be available soon.)
AI/ML models to efficiently profile patients with blood-related (haematological) disorders
First Horizon 2020 European grant awarded in the field of AI in genomics research (GenoMed4All). This study goes beyond current diagnostic approaches and uses federated learning, large language models (LLMs) and natural language processing (NLP) to extract valuable insights directly from clinical text reports, providing significantly enhanced patient stratification and outcome prediction in haematological diseases (blood-related disorders).
The pilots cover common and rare oncological (myelodysplastic syndromes and multiple myeloma) and non-oncological (sickle cell disease) haematological diseases, stratifying patients based on clinical reports, genomic profile, and other multimodal data.
Key Steps in Leveraging Generative AI for GenoMed4All
1. Model Adaptation with Domain-Specific Fine-Tuning: The project employs a pre-trained BERT framework with custom numerical embeddings. The model is fine-tuned using clinical reports from 1,328 haematology patients.
2. Text Embedding and Clustering: Patients are grouped into distinct clusters based on similarities in their clinical text.
3. Cluster Validation: Clusters are validated against known patient diagnoses, gene mutation patterns, and survival probabilities (Kaplan-Meier survival analyses).
4. Performance Benchmarking: Model performance is benchmarked against general clinical models on metrics such as pseudo-perplexity, accuracy, and F1 score.
Impact of GenoMed4All in Precision Medicine
This study underscored the value of domain-specific adaptations for LLMs in extracting critical features from specialized datasets.
The integration of NLP-driven models into clinical workflows marks the beginning of a new era in personalized medicine. As multimodal data becomes increasingly accessible, combining clinical text, genomics, and imaging with AI may unlock unprecedented capabilities for disease stratification and outcome prediction.
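The cluster-validation step (comparing survival across patient clusters with Kaplan-Meier analysis) can be illustrated with a minimal sketch. It implements the standard product-limit estimator in plain Python; the function names `kaplan_meier` and `survival_by_cluster` and the sample values are hypothetical, chosen for illustration, and are not taken from the GenoMed4All code or data.

```python
from collections import Counter, defaultdict


def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate.

    durations: follow-up time for each patient (any consistent unit).
    events: 1 if the event was observed at that time, 0 if censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    observed = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # events plus censorings per time point
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # S(t) = S(previous) * (1 - events_at_t / at_risk_just_before_t)
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def survival_by_cluster(clusters, durations, events):
    """Compute one Kaplan-Meier curve per cluster label."""
    grouped = defaultdict(lambda: ([], []))
    for c, t, e in zip(clusters, durations, events):
        grouped[c][0].append(t)
        grouped[c][1].append(e)
    return {c: kaplan_meier(ts, es) for c, (ts, es) in grouped.items()}


if __name__ == "__main__":
    # Synthetic, illustrative values only (not project data).
    clusters = ["A", "A", "A", "B", "B", "B"]
    months = [5, 8, 12, 20, 24, 30]
    died = [1, 1, 0, 0, 1, 0]
    curves = survival_by_cluster(clusters, months, died)
    for label, curve in sorted(curves.items()):
        print(label, curve)
```

Clusters whose curves separate clearly are candidates for clinically distinct patient groups; a formal comparison would normally add a statistical test such as the log-rank test, which this sketch omits.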
Biological Principles for Safe (Superintelligence) AI : Attention is Nature is all you need
The race is on to develop superintelligent AI, an artificial intelligence system that will surpass human cognitive abilities across virtually all domains. However, the development of such superintelligence would require fundamentally different approaches and safeguards compared to current AI models, and therefore a wider perspective and greater vigilance.
Can we count on 'Mother Nature' to provide us with some wisdom? Through millions of years of evolution, Nature has developed numerous strategies for creating intelligent, adaptive, and resilient systems. This paper explores the potential lessons that future superintelligence developers can learn from biological systems to create more responsible and robust artificial intelligence (AI). We examine ten key principles observed in nature and discuss their potential applications in AI development, providing examples from both biological sciences and software engineering perspectives.
A practical guide for federated-learning using multimodal data
Federated learning presents several challenges, especially when applied to multimodal healthcare data. Here we have published guidelines and good practices for the obstacles faced by research engineers, especially in the healthcare sector. This work addresses a few critical concerns and how to navigate them: regulations governing international datasets, interoperability standards for multimodal datasets, maintaining data quality and consistency, mitigating privacy and security concerns, addressing model complexity and validation, scalability, and stakeholder engagement.
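For readers new to the technique, the core idea of federated learning is that each site trains locally and only model parameters are shared and combined. The sketch below shows one common aggregation rule, a sample-size-weighted average in the style of federated averaging (FedAvg). The guide itself does not state which aggregation method it assumes, so the function name `federated_average` and the example numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Combine per-site parameter vectors into one global vector.

    client_weights: list of equal-length lists of floats, one per site.
    client_sizes: number of local training samples at each site.
    Each site contributes in proportion to its sample count; raw patient
    data never leaves the site, only the parameters do.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    n_params = len(client_weights[0])
    if any(len(w) != n_params for w in client_weights):
        raise ValueError("all clients must share the same parameter shape")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


if __name__ == "__main__":
    # Two hypothetical hospitals with different cohort sizes.
    hospital_a = [1.0, 2.0]
    hospital_b = [3.0, 4.0]
    print(federated_average([hospital_a, hospital_b], [100, 300]))
```

In practice this step would sit inside a training loop with secure transport and, often, additional privacy measures; those concerns are what the guidelines address and are outside the scope of this sketch.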
De-centralised data discovery : Building trust-worthy solutions for healthcare AI
Real patient data is valuable for the new era of personalized medicine, which uses AI-based tools to train models for unbiased, precise and faster diagnosis. However, such data is highly identifiable and therefore needs to be protected. Given the absolute necessity of real patient data for inclusive and unbiased AI model training, we simply cannot afford to keep the data locked away either.
The Beacon project is a solution that does not compromise on the privacy or ownership of the data while simultaneously making such data 'searchable', boosting worldwide research efforts that depend on 'big data'.
This is the first time that the genomics research community (GA4GH & ELIXIR) has come together to draft a specification for genomic and clinical data sharing that follows a set of rules and principles designed to favour both data owners, such as patients, clinicians and hospitals, and data requesters, such as researchers.
Real-time monitoring of virus evolution during the COVID-19 pandemic
A tale of rapid scientific pandemic response: navigating cross-border data exchange, handling terabytes of data per day, developing faster pipelines for real-time information exchange, data visualisation, and connecting remote-working teams from around the globe, all with one mission: to quickly trace the ever-evolving variants of the SARS-CoV-2 virus worldwide.
A pilot-project launched for early diagnosis of rare diseases: Connecting hospitals for rapid data exchange
This is the first pilot project to test the Beacon v2 API in a real-life setting, by connecting Spanish hospitals so that they can exchange patients' diagnostic data.
Hospitals are increasingly generating patient data through routine clinical practice. However, until now they could not exchange such information with other hospitals to speed up diagnosis, for example in cases of unknown or rare diseases.
This limits important advances and breakthroughs that could be possible through the use of emerging technologies such as AI and machine learning.
Researchers at the European Genome-Phenome Archive (EGA) of the Centre for Genomic Regulation (CRG) and ELIXIR Spain, in collaboration with the Global Alliance for Genomics and Health (GA4GH), came together to address this challenge by releasing Beacon v2, a ‘search engine’ that allows researchers to discover genomic and phenotypic data from patients around the world in a secure and private manner.
Six different hospitals of Catalunya (Spain) were connected in this pilot-project for rapid queries and information exchange.
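The discovery idea behind such a network can be sketched with a toy, in-memory model: each hospital keeps its records private and answers only "yes" or "no" to the question "have you seen this variant?". This is a conceptual illustration only. It does not reproduce the Beacon v2 specification's endpoints, schemas or access levels, and the class `HospitalBeacon`, the function `query_network`, and the sample records are all hypothetical.

```python
class HospitalBeacon:
    """Toy beacon: stores variants privately, answers presence queries only."""

    def __init__(self, name):
        self.name = name
        # Private store: variant key -> set of patient ids (never returned).
        self._variants = {}

    def add_record(self, patient_id, chrom, pos, ref, alt):
        key = (chrom, pos, ref.upper(), alt.upper())
        self._variants.setdefault(key, set()).add(patient_id)

    def has_variant(self, chrom, pos, ref, alt):
        # Only a boolean leaves the site; no identifiers, no counts.
        return (chrom, pos, ref.upper(), alt.upper()) in self._variants


def query_network(beacons, chrom, pos, ref, alt):
    """Ask every beacon the same question and collect yes/no answers."""
    return {b.name: b.has_variant(chrom, pos, ref, alt) for b in beacons}


if __name__ == "__main__":
    # Made-up sites and a made-up variant, for illustration only.
    site_1 = HospitalBeacon("hospital-1")
    site_2 = HospitalBeacon("hospital-2")
    site_1.add_record("patient-001", "7", 140453136, "A", "T")
    print(query_network([site_1, site_2], "7", 140453136, "A", "T"))
```

A researcher who receives a "yes" from a site can then follow that site's own access procedure to request more detail, which is how data stays under the owner's control while still being discoverable.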
Computational pipelines for exhaustive pattern search on human DNA
▪ MIRA is a computational pipeline for the exhaustive search for enriched mutations in coding and non-coding regions.
▪ MoSEA is a Python-based tool for motif enrichment analysis, i.e. for finding which short DNA sequences (k-mers) are over-mutated in disease versus control groups.
▪ SUPPA is a tool for fast detection of alternative splicing in large-scale analyses, using alignment-free mapping to greatly reduce the time required for genomics data analysis.
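The k-mer enrichment idea mentioned for MoSEA can be illustrated with a minimal sketch: count every k-mer in two sets of sequences and compare their relative frequencies. This is not MoSEA's actual implementation or scoring scheme. The function names, the pseudocount smoothing and the log2-ratio score are assumptions made for illustration, and the input is assumed to be short sequences taken around mutated positions in each group.

```python
from collections import Counter
from math import log2


def count_kmers(sequences, k):
    """Count all overlapping k-mers across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_enrichment(case_seqs, control_seqs, k, pseudocount=1.0):
    """log2 ratio of smoothed k-mer frequencies, case versus control.

    Positive scores mean the k-mer is relatively more frequent in the
    case (disease) sequences; negative scores mean the opposite.
    """
    case = count_kmers(case_seqs, k)
    ctrl = count_kmers(control_seqs, k)
    kmers = set(case) | set(ctrl)
    # The pseudocount avoids division by zero for k-mers seen in one group only.
    case_total = sum(case.values()) + pseudocount * len(kmers)
    ctrl_total = sum(ctrl.values()) + pseudocount * len(kmers)
    return {
        kmer: log2(
            ((case[kmer] + pseudocount) / case_total)
            / ((ctrl[kmer] + pseudocount) / ctrl_total)
        )
        for kmer in kmers
    }


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    disease = ["TTGCATGA", "GCATGCAT"]
    control = ["TTACCGGA", "AACCGGTT"]
    scores = kmer_enrichment(disease, control, k=4)
    for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(kmer, round(score, 2))
```

A real analysis would also need a significance test and multiple-testing correction before calling any k-mer enriched; the ratio alone only ranks candidates.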
RNA Biomarkers : Profiling >4000 cancer patients for precise mutational patterns.
This was one of the first extensive studies of how RNA-binding proteins regulate alternative splicing, carried out to search exhaustively for RNA-based biomarkers for the early detection of cancer. Together with my lab, Computational RNA Biology at Pompeu Fabra University in Barcelona, we studied how to use the process called 'alternative splicing' to examine distinct mRNA- and proteomics-based signatures, especially to identify early tumor subtypes.
The outcomes of these studies were a remarkable feat and were published in major journals, one of them as a cover page article.
* Genome Sequencing and RNA-Motif Analysis Reveal Novel Damaging Noncoding Mutations in Human Tumors. (Cover page issue) Molecular Cancer Research (2018)
* The role of alternative splicing in cancer. Singh B and Eyras E. Transcription (2017)
* SUPPA2 provides fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Entizne JC, Trincado JL, Hysenaj G, Singh B, Skalic M, Elliott DJ, Eyras E. bioRxiv (2017)
* Large-scale analysis of genome and transcriptome alterations in multiple tumors unveils novel cancer-relevant splicing networks. Sebestyén E, Singh B, Miñana B, Pagès A, Mateo F, Pujana MA, Valcárcel J, Eyras E. Genome Research (2016)
* Argonaute-1 binds transcriptional enhancers and controls constitutive and alternative splicing in human cells. Alló M, Agirre E, Bessonov S, Bertucci P, Gómez Acuña L, Buggiano V, Bellora N, Singh B, et al. Proc Natl Acad Sci USA (2014)
© Research profile - Dr. B. Singh. All rights reserved.