Understanding the Human Genome

Ever since the Human Genome Project was declared complete in 2003, there has been an explosion of interest in understanding the complex information encoded by A, T, C, and G: the four building blocks of DNA. The human genome is a sequence of roughly three billion of these letters, which in turn help determine how we look and function as living organisms, each of us unique.

Numerous studies have focused on deciphering which DNA variations make us different from one another. But in spite of our unique qualities, any two human genomes are more than 99% identical, and it is the remaining difference of less than 1% that makes each of us one-of-a-kind. These differences, or genetic mutations, can be inherited from either parent or acquired later in life through aging, lifestyle, or environmental exposure. Sometimes, these mutations lead to serious diseases. If we can detect them early and take preventive action before the onset of disease, health outcomes can be significantly improved.
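To make the idea of "percent identical" concrete, here is a minimal Python sketch that compares two short DNA strings position by position. The sequences and the function name `percent_identity` are made up for illustration, and the sketch assumes the two sequences are already aligned and of equal length. Real genome comparisons involve alignment steps that are not shown here.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return the percentage of positions at which two aligned sequences match."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Two hypothetical 20-letter sequences that differ only at the last position.
reference = "ATCGGCTAAGCTTACGGATC"
sample = "ATCGGCTAAGCTTACGGATT"

identity = percent_identity(reference, sample)
print(f"{identity:.1f}% identical")
```

With only 20 letters, a single difference already drops the identity to 95%. Across roughly three billion letters, even a fraction of a percent still corresponds to millions of positions.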

In just the past decade, breakthroughs in technology have both accelerated the speed of genome sequencing and driven down the cost from billions to just thousands of dollars per human genome. Companies such as 23andMe and Helix are providing customers with an unprecedented level of detail about their DNA, and at low cost. Simultaneously, interest in precision medicine has exploded as treatment plans tailored to a patient's genetic profile are starting to become a reality. It's a very exciting time to be unlocking our genetic codes.

Where Data Science Fits In

Thanks to this explosion of information, we data scientists can play a vital role in sifting through complex and large data sets to understand the interconnectivity of our genetic codes and human diseases. To do so, we have to address several challenges. First, large cohort studies will have to be carried out to collect data from populations that represent diverse groups of different sexes, ages, races, family health histories, etc. We have to ensure that data collected from different technologies are standardized to remove artificial batch effects, and that we maintain statistical rigor when interpreting and analyzing these studies.
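As a rough illustration of what removing a batch effect can mean, the Python sketch below shifts each batch so that its mean matches the overall mean. This is only one simple approach (per-batch mean-centering), chosen here for clarity. The function name `center_by_batch`, the platform labels, and the numbers are all hypothetical. It also assumes the batches are biologically comparable. If they are not, centering would erase real signal along with the artifact.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(measurements):
    """Shift each batch so its mean equals the grand mean.

    measurements: list of (batch_id, value) pairs.
    Returns a list of (batch_id, adjusted_value) pairs in the same order.
    """
    by_batch = defaultdict(list)
    for batch, value in measurements:
        by_batch[batch].append(value)

    batch_means = {batch: mean(values) for batch, values in by_batch.items()}
    grand_mean = mean(value for _, value in measurements)

    return [
        (batch, value - batch_means[batch] + grand_mean)
        for batch, value in measurements
    ]


# Hypothetical readings of the same quantity from two technologies,
# where platform_B reads systematically 10 units higher.
raw = [
    ("platform_A", 10.0), ("platform_A", 12.0), ("platform_A", 14.0),
    ("platform_B", 20.0), ("platform_B", 22.0), ("platform_B", 24.0),
]

adjusted = center_by_batch(raw)
for batch, value in adjusted:
    print(batch, value)
```

After adjustment, both platforms share the same mean, while the spread of values within each platform is unchanged.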

Second, as new technologies and algorithms are developed, legacy data may have to be reevaluated. Providing scalable computing power to reprocess that data and to handle ever-larger datasets will be critical.

Last but not least, we should never forget that correlation does not imply causation. Data science can help researchers narrow down the list of potential drug candidates, but fundamental bench research is still required to truly understand the underlying mechanisms of these diseases.

Looking Towards the Future 

While our genetic codes are largely determined by our parents, lifestyle choices and our environment also have significant impacts on our long-term health outcomes. With the shift to electronic medical records, more medical data is available than ever before, and researchers are actively developing sophisticated machine learning algorithms to predict diseases. Combining our genetic information with data collected from medical visits and wearable devices is leading to more accurate predictions that we can use to improve our health.

Author: Joe Liang

Joe Liang is a synthetic biologist and a data scientist at Synthetic Genomics specializing in gene synthesis and NGS technology development. Previously, he was a clinical research & development scientist at Biological Dynamics. Joe has a PhD in chemical engineering from Caltech.