Genomics diagnosis of symptoms and targets the

Genomics is a branch of biology where structural and functional knowledge of genomes is studied. This lecture is mainly focused on Big Data challenges in Genomics, application of Artificial Intelligence and Machine Learning in solving such challenges.The amount of data that we are dealing with in genomics is going to reach a zettabyte by 2020, which is much more than the computing power that we are going to have by then. A normal human contains around 32 billion base pairs, so this means storing this data itself is a big challenge let alone analyzing and sequencing the genes. This gives an insight to the depth of the Big Data challenges in the field of genomics and also why efficient Artificial Intelligence, Machine Learning techniques are needed in this area. So this makes it an interesting research direction nowadays.Artificial Intelligence and Machine Learning are used in various aspects of Genomics such as sequencing, mapping and analyzing structure of RNA, DNA. Identifying genetic variants is an interesting use case of machine learning techniques in genomics. Genetic diseases such as cancer can be cured by properly identifying the genetic variant that is causing cancer, which helps in creating cancer vaccine that modifies the appropriate genome to cure the disease. Another interesting use case is Next Generation Sequencing (NGS). Next generation sequencing is a technique to sequence entire human genome. This enables researchers to study genetics at a level so deep which was never achieved before. That is why it is also called deep DNA sequencing technology. In this method the challenge is to analyze the very long human genome input to find the protein sequence in DNA. Techniques from Artificial Intelligence can be used to efficiently solve this problem.Coming to the achievements that Big Data Analytics brought into this field, the cost of sequencing came down from $100 Million to $1000 in the span of 15 years from 2000 to 2015. Thus there is a lot of research scope in the field of Big Data analytics to solve Genomics.Genome is the unique identity of every living organism and codes the genetic information. The field of genomics aims to diagnose the genome and discover knowledge from its structure and functions. This diagnosis is richer than the physical diagnosis of symptoms and targets the root cause of disease that is generally individualistic. Genomics uses bioinformatics which deals with collecting and analysing complex biological data like genome. This can be made efficient by using big data technologies. Eg: companies like Mayo Clinic have gathered and organized vast amounts of medical data relating to diseases and their symptoms,causes,etc in collaboration with IBM to create a knowledge base for extracting unique insights about diseases by leveraging Big Data. Use cases of Big Data include tasks like variant calling(Process of identifying variants from sequence data), variant prioritization (Identifying disease causing variants out of various possible candidates) and peptide discovery for cancer immunotherapy(Identifying antibodies that bind to the proteins synthesized by cancer cells). It can also be used for disease prevention eg: cardiovascular diseases. When a person experiences a stroke, his dna can be sampled and the genomic variant information that caused the stroke can be identified and used to prevent the risk to other members of the family. Medgenome aims to use state of the art technologies on the genetic data of south asians whose genetic makeup is considered to be pure and develop deep insights into the causes of rare diseases.Challenges:The sampling of genomes has become economically viable nowadays and the entire genome can now be sequenced at the cost of 1000$. This trend shift has caused a rapid increase in the amount of data collected but the corresponding compute infrastructure for storage, compute is becoming a bottleneck to greater knowledge discovery. A lot of work still needs to be done related to the pipeline of execution of big data processing of genomic data such as variant calling. Use of FPGA based hardware accelerators like Dragen processors has reduced the amount of time to sequence the data from 40 hours to 8 hours. Variant calling is a relevant question because once our genome is sampled, the next concern naturally would be “How different is my genome compared to the normally observed genomes?” This knowledge can be used to predict potential disease risks for the particular individual.String matching needs to be solved efficiently. How can we identify a particular sequence of base pairs in a genome? Algorithms like Burrows-wheeler transform have been used successfully to compress the data and indexing the data to find substrings. Suffix trees have also been used for string matching. Bayesian algorithms are being used to identify mutations in genetic sequences. Variant prioritization is another interesting question. “How can we use machine learning to predict the particular variant out of like 3.5 million variants of genomes that is disease causing?” This requires developing clever heuristics based on domain knowledge to eliminate the unnecessary variants and reaching the optimum. Cancer immunotherapy(Immuno-Oncology) is becoming increasingly possible due to big data technology which can perform peptide processing on large scale and discover the right peptide to be injected for immunity form possible tumours.This advance in technology would help us cut down the cost of curing say lung cancer which is very costly (around 150k dollars). Conclusion:Genome based diagnosis can be done from pre-womb to tomb. Knowledge of genomic data ¬†of parents before a baby is conceived can be used to identify potential threats to the baby and help the couple make a decision related to birth planning. The ultimate goal of genomics in healthcare is to realize individualized medicine that is very effective and tries to cure the disease at the root level. Big data technology which is still in its infancy has a huge role to play in this revolution and needs to be made more efficient and scalable.