Tracking 5.3 Billion Mutations: Using MySQL for Genomic Big Data
University of Montreal Tracks Genomic Data With Tokutek’s TokuDB.
Faster insertion rates, improved scalability and agility support lab’s fast growing research database as it grows from 100s of GBs to 1 TB and beyond.
Issue addressed: MySQL database used for genomic research must be able to quickly ingest huge amounts of incoming data – hundreds of thousands of records every day. It also must be able to retrieve data quickly in response to a diverse set of research requests.
Enabling the Hunt for New Cures for Diseases by Seamlessly Processing Billions of Mutations in Epidemiology Records
The Organization: The The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer. Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.
The Challenge: The lab’s genomic research database is following 1400 individuals with 3.7 million shared mutations, which means it is tracking 5.3 billion mutations. Because the representation of genomic sequence is a highly compressible series of letters, the database requires less hardware than a typical one. However, it must be able to store and retrieve data quickly in order to respond to research requests.
Thibault de Malliard, the researcher tasked with managing the lab’s data, adds hundreds of thousands of records every day to the lab’s MySQL database. The database must be able to process the records ASAP so that the researchers can make queries and find information quickly. However, as the database grew to 200 GB, its performance plummeted. de Malliard determined that the database’s MyISAM storage engine was having difficulty keeping up with the fire hose of data, pointing out that a single sequencing batch could take days to run.
Anticipating that the database could grow to 500 GB or even 1 TB within the next year, de Malliard began to search for a storage engine that would maintain performance no matter how large his database got.
“Our database is write-intensive and I need to keep it as up-to-date as possible, so insertion rates are important,” said de Malliard. “Also, researchers have new ideas every day so it’s important to be able to modify the database. With MyISAM we also noted that creating a new index or adding a column led to unacceptable long write locks.”
The Solution: de Malliard decided to benchmark the TokuDB database storage engine from Tokutek against MyISAM. To do so, he ran two autojoined views of MYSQL, one with MyISAM and the other running TokuDB. Tests were performed with a 200 GB table containing 2 billion records. It represents around 1500 samples with 1.3M positions where 1.3M positions is the lab’s current SNP set for CARTaGENE RNAseq. This was all performed on a Centos 5 server running with 48GB of RAM on 6 CPU Intel® Xeon® 2.27Ghz processors.
Insertion Performance: “For us, TokuDB proved to be over 50x faster to add or update data into big tables,” according to de Malliard. “Adding 1M records took 51 min for MyISAM, but 1 min for TokuDB. So inserting one sequencing batch with 48 samples and 1.5M positions would take 2.5 days for MyISAM but one hour with TokuDB.”
Scalability: “A huge amount of data is still manageable with TokuDB, but certainly not with MySQL alone,” according to de Malliard. “Even if we do not update the database often, I believe that TokuDB is the best match because of its insert speed and the fact that there is no index to rebuild. Also, while the lab’s storage also does on-the-fly compression, TokuDB also does a compression of the data. Hence, less data goes through the InfiniBand LAN, and less data is written to the storage.”
Agility: “Any change made to the database structure will lock the table being modified by MyISAM,” claimed de Malliard. “TokuDB allows the lab to write to the database while altering it, a boon for larger databases.”
Advanced Research: “Data management is very important for the genomic research lab. The researchers make a lot of queries, and they want their data at their fingertips. To find the rare record or line, which has not been seen in another DB in the world, can mean the discovery of a new mutation or a gene marker that causes a disease,” noted de Malliard. “With epidemiology data, we are searching to find some state for people who have an issue by comparing a subset of people vs. all the other people. TokuDB uniquely enables us to advance this research.”