In my previous post, I briefly discussed how advances in sequencing technology have allowed researchers to investigate the correlation between genetic codes and diseases. In this follow-up post, I am going to dive a bit deeper, demonstrating how data science is playing a vital role in understanding how genetic patterns are linked to disease, and in turn, drastically shaping the future of medicine. 

The State of Sequencing Today

As sequencing costs continue to decline, clinicians and researchers are able to perform large cohort studies to profile the genetic makeup of many diseases. These diseases often result from mutations in genes that are essential for normal cell functions. Many research studies have focused on identifying the correlation between a mutation profile and patients’ prognoses. By knowing the specific mutations that cause illness, doctors can deliver a more personalized treatment plan, thereby improving survival rates and reducing medical waste. This embodies the core idea of precision medicine.

Innovations in precision medicine are particularly relevant in the treatment plans for non-small cell lung cancer (NSCLC). NSCLC, the most common type of lung cancer and one of the leading causes of death in the world, is often tied to mutations in the EGFR, KRAS, and ALK genes. Interestingly, there are many different types of mutations that occur within the EGFR sequence. Some types are treatable with drugs, and some are not. Therefore, it is important for doctors to screen patients by sequencing. The sequencing is often done on a biopsy of the patient’s lung tissue. Because a slice of the tissue may contain both normal and tumor cells, a “targeted enrichment” approach is used, which zooms in on regions of the genome where mutations are most likely to occur. This ensures that a sufficient number of DNA molecules are surveyed to find the mutations. Comprehensive sequencing panels have been developed and validated clinically for a variety of cancers, and they have become a routine tool doctors use to determine the best treatment options for their patients.
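
To give a feel for why surveying enough DNA molecules matters, here is a minimal Python sketch. It assumes a simple binomial model in which each read covering a site independently carries the mutation with probability equal to the variant allele fraction, and it ignores sequencing errors. The function name, the 5% allele fraction, and the five-read threshold are illustrative assumptions, not values from any clinical panel.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_reads):
    """Probability of seeing at least `min_reads` mutant reads at one site.

    Assumes each of the `depth` reads is mutant independently with
    probability `allele_fraction` (a binomial model, no sequencing error).
    """
    # Sum the binomial probabilities of seeing fewer than `min_reads` mutant reads.
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical scenario: tumor cells are diluted by normal cells, so only
    # 5% of reads carry the mutation, and we require 5 supporting reads.
    for depth in (30, 100, 500, 1000):
        p = detection_probability(depth, allele_fraction=0.05, min_reads=5)
        print(f"depth {depth:>4}: detection probability {p:.3f}")
```

Under these assumptions, the chance of finding the mutation rises with the number of reads covering the site, which is the motivation for concentrating sequencing effort on a small set of regions rather than spreading it across the whole genome.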

How Data Science Can Help

Mutation profiles and patient metadata (e.g., sex, age, race, lifestyle) offer a rich set of information for scientists to model and extrapolate statistical relationships. Pharmaceutical companies can use data science to guide their target research and develop drugs that treat key mutations. Beyond monitoring disease, sophisticated deep learning algorithms can also be developed to help predict the probability of developing the disease. In a recent research article, scientists used AI to predict Alzheimer's disease based on MRI scans years before the symptoms emerge.
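
As a small illustration of the kind of statistical relationship mentioned above, the following Python sketch computes an odds ratio, with an approximate 95% confidence interval, for the association between carrying a mutation and responding to a treatment. The counts are invented purely for illustration, the function name is hypothetical, and the interval uses the standard log-odds (Woolf) approximation, which is only one of several reasonable choices.

```python
from math import exp, log, sqrt


def odds_ratio(a, b, c, d):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a: mutation carriers who responded
    b: mutation carriers who did not respond
    c: non-carriers who responded
    d: non-carriers who did not respond
    All four counts must be positive for this approximation.
    """
    estimate = (a * d) / (b * c)
    # Standard error of log(odds ratio) under the Woolf approximation.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(estimate) - 1.96 * se)
    upper = exp(log(estimate) + 1.96 * se)
    return estimate, lower, upper


if __name__ == "__main__":
    # Made-up counts for a hypothetical cohort, not real patient data.
    estimate, lower, upper = odds_ratio(a=20, b=5, c=10, d=15)
    print(f"odds ratio {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A real analysis would also adjust for metadata such as age and sex, for example with a regression model, rather than relying on a single 2x2 table.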

With all the potential data science holds for advancing medicine, we still need to keep several items in mind. First, genomic data is generated by laboratories and sequencing technologies around the globe. Therefore, standardization is crucial to remove artificial batch effects. Second, mutation profiles include beneficial, neutral, and detrimental mutations. Separating the malignant mutations from the benign ones will require patient sampling and carefully designed experiments. Third, DNA mutations are not the only cause of disease. Many other types of data, including MRI and CT scans, may provide other valuable information regarding the disease state. With that said, we are at a very exciting phase of big data integration, and data science holds the key to revolutionizing the way we look at our health.



Joe Liang

Joe Liang is a synthetic biologist and a data scientist at Synthetic Genomics specializing in gene synthesis and NGS technology development. Previously, he was a clinical research & development scientist at Biological Dynamics. Joe has a PhD in chemical engineering from Caltech.