Over a decade ago, James Baurley, PhD, a software engineer by trade, and mathematician Carolyn Ervin, PhD, founded BioRealm, a genomic and artificial intelligence solutions provider that helps researchers at institutions like the United States National Institute on Drug Abuse (NIDA) identify, on a large scale, genetic variants that can be linked to addiction. The company’s flagship product, Smokescreen, is used by clinical researchers and healthcare providers to better understand the genetics of smoking, addiction, and treatment in study participants.
We sat down with Dr. Carolyn Ervin and Dr. James Baurley to talk about how they apply artificial intelligence to the rapidly growing genomic data set collected by Smokescreen users, as well as their thoughts on the ways biostatistics, bioinformatics, and data science will affect the future of healthcare.
What is Smokescreen, and how are researchers and healthcare providers making use of it?
James: Smokescreen is a genotyping array. There are points along the human genome that vary individual to individual and the Smokescreen array is designed to capture that information. In developing it, we really thought of it as a data generation tool; in other words, a standard way of capturing genomic information from a large set of addiction studies.
Basically, the process of using Smokescreen is end to end. Users want to know how genetic variants relate to certain outcomes, but they start with millions and millions of variables and have to narrow them down to a set of important predictors. The resulting genetic profiles can be used in different settings; for example, to identify smokers who respond well to a certain treatment or smokers who are more addicted.
How does Smokescreen make use of artificial intelligence?
James: Artificial intelligence came into play when we were thinking about how algorithms could be applied to the data. Typical genetic analysis studies look at the points of variation in the human genome one at a time, but these often have very small effects. Instead, we want to consider many different genetic variants all together and see how they relate to addiction. That brought us to AI and algorithms — we worked on categorizing content with the idea that AI algorithms could later use that information in learning.
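To make the contrast James draws more concrete, here is a minimal sketch in Python of why looking at many small-effect variants together can be more informative than testing them one at a time. Everything in it is an assumption made for illustration: the data are simulated, every variant is given the same small effect, and the "joint" model is a simple count of risk alleles. It is not BioRealm's algorithm or the Smokescreen analysis pipeline.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n_people=2000, n_variants=50, allele_freq=0.3, effect=0.1, seed=0):
    """Simulate genotypes (0, 1, or 2 copies of an allele) and a noisy outcome.

    All parameter values are hypothetical; each variant nudges the outcome
    by the same small amount, and the rest is random noise.
    """
    rng = random.Random(seed)
    genotypes = [
        [(rng.random() < allele_freq) + (rng.random() < allele_freq)
         for _ in range(n_variants)]
        for _ in range(n_people)
    ]
    outcome = [effect * sum(row) + rng.gauss(0, 1) for row in genotypes]
    return genotypes, outcome


genotypes, outcome = simulate()
n_variants = len(genotypes[0])

# One-at-a-time view: correlate each variant with the outcome separately.
single_r = [
    abs(pearson([row[j] for row in genotypes], outcome))
    for j in range(n_variants)
]

# Joint view: combine all variants into one additive score per person.
score_r = pearson([sum(row) for row in genotypes], outcome)

print(f"strongest single variant: |r| = {max(single_r):.3f}")
print(f"combined additive score:   r  = {score_r:.3f}")
```

In this toy setup, each variant on its own is only weakly related to the outcome, while the combined score tracks it much more closely. Real studies are harder: effects differ by variant, most variants have no effect at all, and the weights have to be learned from data.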
As more researchers use Smokescreen, you are able to collect more data — much like biotech and genealogy companies 23andMe and Ancestry.com. How is that new data affecting the model?
Carolyn: As we continue to add more data to the model, we are able to develop a better model. That’s what happens with 23andMe or Ancestry.com; they are always widening their base. With Smokescreen, we can also adapt it to different scenarios; for instance, we’re now looking at opioid addiction, gathering data from different NIDA opioid studies and combining it into one dataset. The way it is set up, the predictive models that we have will go through a learning process. The more data and the more times they’re run, the more able they are to identify the correct predictors for specific addictions.
James: You can think of the data coming from Smokescreen genotyping as a common language. As more people use the array and share data, these models get better and better.
What are some of the challenges of working with a data set that is increasing in size on a near-constant basis?
Carolyn: Sometimes I think that if the average person were to look at the amount of material that we go through, they would probably freak out. Some of the documents I deal with are 1,000 pages long. But I think once you get into the beauty of it (the merging and the harmonization of the data, which is one of the things we're doing with the opioid study), everything meshes really nicely. It's like a big ballroom dance routine where everybody finally gets in step, and that's the beauty of this research: when you get to that point, you don't have to reinvent the wheel. Unless, of course, you bring in another study with data that's a little bit off.
James: One of the primary barriers to using artificial intelligence is it's really hard to prepare the data. All of these algorithms make assumptions about where the data came from and how it's organized. Pretty much any time there's data harmonization or data merging, it's a difficult task.
Carolyn: Also, because biostatistics, bioinformatics, and artificial intelligence are newer sciences, there are a lot of growing pains. What will become the standard that everyone adheres to? It’s 75% there; we see this when we have to adjust merged data.
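As a rough illustration of the data harmonization James and Carolyn describe, the Python sketch below maps records from two studies onto one shared format before merging them. The study names, field names, codings, and the 20-cigarettes-per-pack conversion are all hypothetical and chosen only to show the idea; they do not come from NIDA datasets or from BioRealm's pipeline.

```python
# Assumption for this example: one pack holds 20 cigarettes.
PACK_SIZE = 20

# Two hypothetical studies that recorded the same information differently.
study_a = [
    {"id": "A1", "smoker": "Y", "cigs_per_day": 20},
    {"id": "A2", "smoker": "N", "cigs_per_day": 0},
]
study_b = [
    {"subject": "B7", "smoking_status": 1, "packs_per_day": 0.5},
]


def harmonize_a(row):
    """Map a study A record onto the shared schema."""
    return {
        "participant_id": row["id"],
        "current_smoker": row["smoker"] == "Y",
        "cigarettes_per_day": float(row["cigs_per_day"]),
        "source_study": "A",
    }


def harmonize_b(row):
    """Map a study B record onto the shared schema, converting packs to cigarettes."""
    return {
        "participant_id": row["subject"],
        "current_smoker": row["smoking_status"] == 1,
        "cigarettes_per_day": row["packs_per_day"] * PACK_SIZE,
        "source_study": "B",
    }


# Once every record uses the same fields and units, the studies can be pooled.
combined = [harmonize_a(r) for r in study_a] + [harmonize_b(r) for r in study_b]

for record in combined:
    print(record)
```

Each new study needs its own mapping function, which is one reason bringing in a study whose data is "a little bit off" means more work before any model can use it.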
A common theme across every industry is finding ways to identify insights and take action sooner. In many cases, that means leveraging the tools and expertise of third parties. How are the solutions you’re building making this possible for healthcare providers?
Carolyn: Something like Smokescreen, which takes the burden of identifying genetic variants linked to addiction off of researchers, can ultimately lessen the time it takes a physician to make a better diagnosis, identify the proper prescription, or see positive outcomes in a clinical trial.
One of the things that I've seen working on other biomarker studies that involve merging data is that the markers may change over time as more discoveries are made. There’s an algorithm set up within Smokescreen, or within the more common opioid model, that we would be able to adapt accordingly and push down the pipeline to the physician, maybe as a software upgrade to their app, and that would fine-tune the result.
I think this is all coming at a good time, because a lot of physicians now are recording most of their data on laptops or they come into the patient’s room with an iPad or other device. If it's adaptable at that level — which it will be — then they will be able to make use of these model outputs in a real-world setting.
About Carolyn M. Ervin, PhD
Carolyn is the cofounder of BioRealm, a genomic and artificial intelligence solutions provider. She earned her doctorate in epidemiology from the University of Southern California, Los Angeles, California; her master's in public health from the University of California, Los Angeles, California; and both her master's in applied mathematics and bachelor's in mathematics from Western Michigan University, Kalamazoo, Michigan.
About James W. Baurley, PhD
James is the cofounder of BioRealm, a genomic and artificial intelligence solutions provider. He earned his doctorate in statistical genetics and genetic epidemiology and his master's in biostatistics from the University of Southern California and his bachelor's in computer science from Clemson University.