Can novel information generated with Language Models fill the gaps in hard data sets?
Well…partially, for now.
Let me explain: as language models (LMs) evolved from simpler count-based models to transformer-based LMs (e.g. BERT from Google), the question of their deployment in pharmaceutical R&D, clinical trials, regulatory processes, and post-market surveillance arose naturally. The basic idea is that novel understanding generated by LMs could complement knowledge derived from hard data sets (e.g. omics, bioassays).
How realistic is the idea?
It isn’t if we want to deploy a generalist LM (e.g. one trained on Wikipedia) widely across all domains, from language recognition to drug targeting. It could become applicable in the short term, and generate more value, if we focus on specific markets or application verticals. Think target ID, for example, which is still a very large vertical (e.g. ~40k druggable genes and 42 gene categories in the Drug Gene Interaction Database). This means that even within target ID we might need to deploy a specific LM, purpose-trained to be leveraged in a single target category (e.g. all genes associated with diffuse large B-cell lymphoma, from DisGeNET); a sketch of what that could look like follows below.
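To make the “narrow LM” idea concrete, here is a minimal sketch of one way such domain adaptation could be done: continuing masked-language-model pre-training of a general biomedical BERT on a corpus restricted to a single target category. The starting checkpoint, corpus file name, and hyperparameters are illustrative assumptions on my part, not something prescribed by the argument above.

```python
# Illustrative sketch (assumed checkpoint, corpus file, and hyperparameters).
# Continue masked-LM pre-training of a biomedical BERT on a narrow,
# disease-specific corpus, e.g. abstracts about diffuse large B-cell lymphoma.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base_model = "dmis-lab/biobert-base-cased-v1.1"   # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# One abstract per line; "dlbcl_abstracts.txt" is a hypothetical corpus file.
raw = load_dataset("text", data_files={"train": "dlbcl_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% masking for masked-LM domain adaptation.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dlbcl-lm",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The point of the sketch is the scoping, not the training loop: the value comes from restricting the corpus to one well-defined target category rather than from any particular architecture.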
Beyond being focused enough, the availability of curated, labeled, high-quality data on all relevant variables (e.g. all genes and gene products associated with a disease category) is crucial for training. Again, if the target category is small enough, this could be implementable in practice, provided a proper strategy is in place to handle missing or not directly comparable data.
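As a toy illustration of such a strategy (the sources, column names, and flagging rule below are my own assumptions, purely for illustration), one could harmonize gene identifiers across sources and explicitly flag records whose evidence is missing or sits on a non-comparable scale, rather than silently imputing them, before any training set is assembled.

```python
# Toy curation sketch (hypothetical column names and flagging rule).
import pandas as pd

# Two hypothetical gene-disease association sources with differing conventions.
disgenet = pd.DataFrame({
    "gene_symbol": ["MYC", "BCL2", "TP53"],
    "score": [0.9, 0.7, None],          # DisGeNET-style 0-1 association score
})
internal = pd.DataFrame({
    "gene": ["myc", "CD79B"],
    "assay_hits": [12, 3],              # count-based, not directly comparable
})

# Harmonize identifiers to upper-case symbols before merging.
disgenet["gene_symbol"] = disgenet["gene_symbol"].str.upper()
internal = internal.rename(columns={"gene": "gene_symbol"})
internal["gene_symbol"] = internal["gene_symbol"].str.upper()

merged = disgenet.merge(internal, on="gene_symbol", how="outer")

# Flag rather than impute: missing scores and count-based evidence on a
# non-comparable scale are marked for manual curation.
merged["needs_curation"] = merged["score"].isna() | merged["assay_hits"].notna()
print(merged)
```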
A last major point: like wet-lab R&D, dry-lab (e.g. AI) research faces a reproducibility crisis, and hence a credibility crisis, too. As of 2018, only 6% of 400 surveyed AI papers included code. This means crucial information is not shared and published results are not comparable.
Conclusion: the pharma & biotech industry needs to focus on specific subdomains (e.g. a single disease) where a single player can fully control all variables influencing the domain, produce all relevant high-quality, curated, and labeled data, and develop the right LM algorithm.
Baby steps…
References: