BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics applications. It is written in C++, and the version used here is v2.2.23. `Twister <http://www.iterativemapreduce.org/>`_ is an iterative MapReduce framework that can be used for both iterative and non-iterative applications. Twister-BLAST is a Twister program that lets BLAST use the computing capability of Twister. Thanks to the flexibility of the Twister runtime environment, this application can run on a single machine, a cluster, or the Amazon EC2 cloud platform.
Twister-BLAST divides the original query file into small chunks and distributes them to all available computing nodes. It manages and schedules Map tasks to process each query chunk based on its location, and it can also collect the output. Compared with other parallel BLAST applications, Twister-BLAST is efficient and adds little overhead.
You can download the `Twister Blast <http://salsahpc.indiana.edu/tutorial/apps/twister-blast.tar.gz>`_ source code, along with the customized BLAST program and database archive (BlastProgramAndDB.tar.gz), from the Big Data for Science tutorial.
First, download and unzip the Twister-BLAST package (referred to as $TWISTER_BLAST_PROGRAM here), then copy the unzipped $TWISTER_BLAST_PROGRAM/blast/dist/Twister-Blast.jar to $TWISTER_HOME/apps. Next, download and unzip the BLAST program and database archive, and set $BLAST_HOME=/path/to/BlastProgramAndDB/.

Go to $TWISTER_BLAST_PROGRAM/blast/bin/ and, in twister_blast.properties, set the BLAST+ execution command (the execmd property) to the BLAST program (for example, blastx) under $BLAST_HOME/bin/. Execution options can be changed according to users' needs. However, the input option (-query) and the output option (-out) are not set in execmd but in inop and outop, so that the configuration stays compatible with both BLAST+ and BLAST. Twister-BLAST merges these command options itself when invoking BLAST+ in parallel. The execution command template inside **twister_blast.properties** is given below.
execmd = time /N/u/yangruan/Quarry/workflow/ncbi-blast-2.2.23+/bin/blastp -db /N/dc/scratch/yangruan/blast/db/cog/10k/cog.10000 -evalue 100 -max_target_seqs 1000000 -num_alignments 1000000 -outfmt 6 -seg no
inop = -query
outop = -out
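As an illustration of how these three properties might combine into one command per query chunk, here is a minimal Python sketch. It is not Twister-BLAST's own code: the helper name `build_command`, the shortened `execmd` string, and the file names `input_0.fa` and `blastOut_0` are hypothetical and chosen only for the example.

```python
import shlex

def build_command(execmd, inop, outop, query_path, out_path):
    """Append the input and output options to the base command.

    Hypothetical helper: it illustrates the merge described above and is
    not the actual Twister-BLAST implementation.
    """
    args = shlex.split(execmd)                    # split the base command into arguments
    args += [inop, query_path, outop, out_path]   # add e.g. -query <file> -out <file>
    return args

# Shortened, made-up execmd value; real paths depend on your installation.
execmd = "blastp -db cog.10000 -evalue 100 -outfmt 6 -seg no"
cmd = build_command(execmd, "-query", "-out", "input_0.fa", "blastOut_0")
print(" ".join(shlex.quote(a) for a in cmd))
```

Keeping the option names in separate properties means only `inop` and `outop` need to change when the underlying program expects different flags for its input and output files.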
Assume you have already downloaded the input FASTA file to a location called [input file path]. Use $TWISTER_BLAST_PROGRAM/blast/bin/blastNewFileSpliter.sh to split the input FASTA file into multiple partitions. The parameters are as follows:
args: [query_file] [sequence_count] [num_partition] [data_dir] [output_prefix] [output_map_file]
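To make the idea of partitioning concrete, the following Python sketch groups FASTA lines into records and splits them into contiguous chunks of nearly equal size. This is only an assumption about what such a split can look like. The internal logic of blastNewFileSpliter.sh is not described here, and the function names `read_records` and `partition_records` are hypothetical.

```python
def read_records(lines):
    """Group FASTA lines into records; each record starts with a '>' header."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            records.append([line])        # a header opens a new record
        elif records:
            records[-1].append(line)      # sequence line belongs to the current record
    return records

def partition_records(records, num_partitions):
    """Split records into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(records), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)   # the first `extra` chunks get one more
        chunks.append(records[start:start + size])
        start += size
    return chunks

# Tiny made-up input: three sequences split into two partitions.
fasta = [">seq1", "MKT", ">seq2", "AAG", "LLV", ">seq3", "GGH"]
chunks = partition_records(read_records(fasta), 2)
for i, chunk in enumerate(chunks):
    print(f"partition {i}: {[rec[0] for rec in chunk]}")
```

Splitting on record boundaries rather than on line counts keeps every sequence intact within a single partition, which BLAST needs in order to process each query correctly.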
After deploying the required files onto the file system, run the Twister-BLAST program with the following command:
./blastNew.sh 128 /N/dc/scratch/yangruan/fasta/cog/10000/400/ input_ .fa 400 /N/dc/scratch/yangruan/blast/result/cog/10k/eval_100_400p/ blastOut_
Here is the description of the above command:
args: [map number] [input folder] [input prefix] [input postfix (None for none)] [partition number] [output folder] [output prefix]
Parameter | Description |
map number | The number of map tasks (usually equal to the number of workers started) |
input folder | The folder containing the input FASTA file partitions |
input prefix | The prefix of the input FASTA file partitions |
input postfix | The postfix (file extension) of the input FASTA file partitions (default .fa) |
partition number | The number of input FASTA file partitions |
output folder | The folder in which to store the BLAST output |
output prefix | The prefix of the BLAST output files |
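The folder, prefix, postfix, and partition number together identify a set of files. The Python sketch below shows one way such file names could be enumerated. It assumes a zero-based index placed between prefix and postfix (for example `input_0.fa`), which is a guess for illustration and not a documented naming rule. The function name `partition_paths` and the `/data/...` folders are hypothetical.

```python
import posixpath

def partition_paths(folder, prefix, postfix, count):
    """List the expected partition paths; the string 'None' means no postfix.

    Assumption: files are numbered from 0 between prefix and postfix.
    """
    suffix = "" if postfix == "None" else postfix
    return [posixpath.join(folder, f"{prefix}{i}{suffix}") for i in range(count)]

# Made-up folders and a small partition count for readability.
inputs = partition_paths("/data/fasta", "input_", ".fa", 4)
outputs = partition_paths("/data/result", "blastOut_", "None", 4)
for src, dst in zip(inputs, outputs):
    print(src, "->", dst)
```

Listing the expected paths before launching a job is a cheap way to check that the partition number passed on the command line matches the files actually present in the input folder.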
If Twister-BLAST is running correctly, it prints Twister runtime messages similar to the following:
./blastNew.sh 128 /N/dc/scratch/yangruan/fasta/cog/10000/400/ input_ .fa 400 /N/dc/scratch/yangruan/blast/result/cog/10k/eval_100_400p/ blastOut_
time /N/u/yangruan/Quarry/workflow/ncbi-blast-2.2.23+/bin/blastp -db /N/dc/scratch/yangruan/blast/db/cog/10k/cog.10000 -evalue 100 -max_target_seqs 1000000 -num_alignments 1000000 -outfmt 6 -seg no
-query
-out
JobID: BlastNewac4d15a9-0997-11e1-81b4-5b7f60de01d2
Nov 7, 2011 11:24:43 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://149.165.229.100:61616
0 [main] INFO cgl.imr.client.TwisterDriver - MapReduce computation termintated gracefully.
Total Time of BLAST : 28.12Seconds
2 [Thread-1] DEBUG cgl.imr.client.ShutdownHook - Shutting down completed.
After the job finishes, use the following command to stop the MapReduce daemons and the broker:
$TWISTER_HOME/bin/stop_twister.sh