Use SNAP with MapReduce to parallelize SNAP and make it possible to calculate every human protein mutation.
“MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries.
The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as their original forms.” Source: http://en.wikipedia.org/wiki/MapReduce
What have we done to date?
We started with profiling a whole mutation analysis of MT4_HUMAN (62 amino acids * 19 mutations = 1178 Mutations). This took 64 minutes on an 8 core 7GB ram EC2 instance (virtual server provided by Amazon).
95% of the total time was spent in runAll::extractAll (see table 1), this is a prof call for every mutation. The rest of the time is consumed by runAll, but this part can easily be loaded from the existing predictprotein cache.
We sliced the time intensive prof calls out of SNAP and created a MapReduce program that runs on Hadoop (http://hadoop.apache.org/) . We tested our implementation on AWS Elastic MapReduce.
Elastic MapReduce was introduced by Amazon in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Source: http://en.wikipedia.org/wiki/Apache_Hadoop#Amazon_Elastic_MapReduce
We created an elastic MapReduce workflow with 3 servers (4 cores, 7.5 GB ram), each one running 10 map processes (see Figure 1). We were able to reduce the processing time of a small protein (MT4_HUMAN) from 60 (High-CPU Extra Large) minutes to 2 minutes.
How we use Hadoop for parallelize Prof
In step (1) we upload a FASTA file with the MT4_HUMAN Sequence into a S3 Amazon Bucket.
The input is:
>sp|P47944|MT4_HUMAN Metallothionein-4 OS=Homo sapiens GN=MT4 PE=2 SV=1 MDPRECVCMSGGICMCGDNCKCTTCNCKTCRKSCCPCCPPGCAKCARGCICKGGSDKCSC CP
The Master Node pulls the FASTA file from S3 and defines n splits for the sequence. These splits are distributed (2) to the different slave nodes.
For example the first split that is transferred from the master to the first slave node looks like this:
HEADER=>sp|P47944|MT4_HUMAN Metallothionein-4 ……
We have written a mapper that executes Prof for the given sequence and range of amino acids. In the example above it would run Prof for all mutations of the first amino acid, resulting in 19 mutations to calculate on one slave node. We can scale this number up and let a mapper work on more mutations. We decided on splitting the sequence this way to further reduce network utilization, we could have also sent all mutated sequences to our map processes.
In step 3 the output of a map process is sent as a key value pair, containing the job name, substitution and the Prof results, to the reducer. The output looks like this:
This result is transferred from the mapper to the reducer. The key has to be unique. The reducer creates a zip file with all rdbProf results and uploads it to S3 (Step 4).
We can easily scale up Prof in a very robust way: We can easily attach further nodes to a running process, remove them again and cleverly handle most failures.
We have been thinking about different approaches. Our first approach was to create a new version of snapfun where every protein is handed over to one SnapMap process. Instead of calling the linear running runAll:extractAll it calls parallel MapReduce Prof program and saves the result to S3. To do the whole mutation analysis we would create a mapper that creates new mappers (see figure 2). This approach would allow us to do the calculations very fast. We can scale our number of running servers depending on the length of the current protein, we can also scale up the number of mappers computing the protein.
If the numbers of mapper were equal to numbers of proteins the whole runtime would only depend on the longest protein.
The problem with this approach is that it is starts an unnecessarily large amount of servers and is therefore very expensive.
The next idea is a more realistic: We have one master node with all proteins that distributes the Jobs straight to the map processes to generate the required Prof files for further computing. After we generate all prof results we move them back to the RostCluster to finish computing the SNAP results with SNAP in predictprotein mode and our generated Prof cache.
Why do we use Hadoop or Amazon?
Hadoop gives us excellent horizontal scaling possibilities and excellent tools to handle distributing the workload, nearly unaffected by any imaginable failures. We can easily loose or replace any amount of slave nodes during computation and even add more to compute faster.
Amazon provides a very intuitive and easy way to use interface and APIs. You only pay for what you need and they have huge amounts of computational power available. We have chosen Amazon because it is the cheapest cloud provider.
This solution scales vertically very well and is only limited by the network traffic.