Running Twister Blast on FutureGrid HPC

BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics applications written in C++, and the version we are using is v2.2.23. `Twister <http://www.iterativemapreduce.org/>`_is an iterative mapreduce framework which can be used both for iterative and non-iterative applications. Twister Blast is an advanced Twister program which helps Blast, a bioinformatics application, utilizes the Computing Capability of Twister. With the flexibility of Twister run-time environment, this application can run on a single machine, a cluster, or Amazon EC2 cloud platform.

Twister-BLAST can divide original query file into small chunks, and distribute them to all available computing nodes. Twister-BLAST manages and schedules Map tasks to process each query chunk based on its location. Output can also be collected by Twister-BLAST. Compared with other parallel BLAST applications, Twister-BLAST is efficient and with little overhead.

You can download the`Twister Blast <http://salsahpc.indiana.edu/tutorial/apps/twister-blast.tar.gz>`_ Source code and customized Blast program and Database archive (BlastProgramAndDB.tar.gz) from Big Data for Science tutorial.

Requirement

  1. Login to FutureGrid Cluster and obtain compute nodes. (HPC/ OpenStack)
  2. Start Twister on compute nodes. (SalsaTwister Tutorial)
  3. Download and unzip Twister Blast Source code.
  4. Download customized Blast binary and Database archive BlastProgramAndDB.tar.gz
  5. Linux command experience.

Download and prepare the Twister-Blast

First, Download and unzip the Twister Blast package (named as $TWISTER_BLAST_PROGRAM here), then ​copy the unzipped ​$TWISTER_BLAST_PROGRAM/blast/dist/Twister-Blast.jar to the $TWISTER_HOME/apps. Also, we download and unzip the blast program and the database here, and set $BLAST_HOME=/path/to/BlastProgramAndDB/. Go to $TWISTER_BLAST_PROGRAM/blast/bin/, in twister_blast.properties, set the BLAST+ execution command (execmd property) to the BLAST program (blastx) under $BLAST_HOME/bin/. Execution options can be reset according to users’ needs. However, Input option (-query) and output option (-out) are not set in execmd but in inop and outop in order to be compatible with both BLAST+ and BLAST. Twister-BLAST will merge these command options by itself when invoking BLAST+ parallel. The execution command template inside**twister_blast.properties** is given below.

execmd = time /N/u/yangruan/Quarry/workflow/ncbi-blast-2.2.23+/bin/blastp -db /N/dc/scratch/yangruan/blast/db/cog/10k/cog.10000 -evalue 100 -max_target_seqs 1000000 -num_alignments 1000000 -outfmt 6 -seg no
inop = -query
outop = -out

Prepare Twister-Blast input

Assume you have already download the input fasta file into some location called [input file path]. Use the $TWISTER_BLAST_PROGRAM/blast/bin/blastNewFileSpliter.sh to split the input fasta file into multiple partitions. The parameters in as following:

args:  [query_file] [sequence_count]  [num_partition] [data_dir] [output_prefix] [output_map_file]
  • query_file: input fasta file
  • sequence_count: sequence count in the input fasta file
  • num_partition: number of partitions, this number should be larger or equal to the total worker number started with twister
  • data_dir: The output folder of partitioned fasta files
  • output_prefix: The output prefix of partitioned fasta files
  • output_map_file: The file contains the information of all the partitions width and height.

Execute Twister-Blast

After deploying those required files onto file system, run the twister-Blast program with the following commands:

./blastNew.sh 128 /N/dc/scratch/yangruan/fasta/cog/10000/400/ input_ .fa 400 /N/dc/scratch/yangruan/blast/result/cog/10k/eval_100_400p/ blastOut_

Here is the description of the above command:

args:  [map number] [input folder] [input prefix] [input postfix (None for none)] [partition number] [output folder] [output prefix]
Parameter Description
map number The map task number (usually equals to the number of worker started)
input folder The folder of input fasta file partitions
input prefix The prefix of input fasta file partitions
input postfix The postfix (file extension) of input fasta file partitions (default .fa)
partition number The number of input fasta file partitions
output folder The folder to store output blast result
output prefix The prefix of output blast result

If Twister Blast is running correctly, it will print twister running messages similar to the following:

./blastNew.sh 128 /N/dc/scratch/yangruan/fasta/cog/10000/400/ input_ .fa 400 /N/dc/scratch/yangruan/blast/result/cog/10k/eval_100_400p/ blastOut_
time /N/u/yangruan/Quarry/workflow/ncbi-blast-2.2.23+/bin/blastp -db /N/dc/scratch/yangruan/blast/db/cog/10k/cog.10000 -evalue 100 -max_target_seqs 1000000 -num_alignments 1000000 -outfmt 6 -seg no
-query
-out
JobID: BlastNewac4d15a9-0997-11e1-81b4-5b7f60de01d2
Nov 7, 2011 11:24:43 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://149.165.229.100:61616
0    [main] INFO  cgl.imr.client.TwisterDriver  - MapReduce computation termintated gracefully.
Total Time of BLAST : 28.12Seconds
2    [Thread-1] DEBUG cgl.imr.client.ShutdownHook  - Shutting down completed.

Finishing the Map-Reduce process

After finishing the Job, please use the command to kill the Map-Reduce daemon and broker:

$TWISTER_HOME/bin/stop_twister.sh

‹ Using Twister on FutureGrid up Eucalyptus and Twister on FutureGrid ›