.. _s-twister-blast: ********************************************************************** Running Twister Blast on FutureGrid HPC ********************************************************************** .. sidebar:: Page Contents Author: Tak-Lon Stephen Wu This page was original designed by `SalsaHPC `_ group for `Big Data for Science Workshop `_, you can see the original pages `here `_. .. contents:: :local: BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics applications written in C++, and the version we are using is v2.2.23. `Twister `_is an iterative mapreduce framework which can be used both for iterative and non-iterative applications. Twister Blast is an advanced Twister program which helps Blast, a bioinformatics application, utilizes the Computing Capability of Twister. With the flexibility of Twister run-time environment, this application can run on a single machine, a cluster, or Amazon EC2 cloud platform. Twister-BLAST can divide original query file into small chunks, and distribute them to all available computing nodes. Twister-BLAST manages and schedules Map tasks to process each query chunk based on its location. Output can also be collected by Twister-BLAST. Compared with other parallel BLAST applications, Twister-BLAST is efficient and with little overhead. You can download the`Twister Blast `_ Source code and customized Blast program and Database archive (`BlastProgramAndDB.tar.gz `_) from `Big Data for Science tutorial `_. Requirement ~~~~~~~~~~~ #. Login to FutureGrid Cluster and obtain compute nodes. (`HPC `_/ `OpenStack `_) #. Start Twister on compute nodes. (`SalsaTwister Tutorial `_) #. Download and unzip `Twister Blast `_ Source code. #. Download customized Blast binary and Database archive `BlastProgramAndDB.tar.gz `_ #. Linux command experience. Download and prepare the Twister-Blast ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, Download and unzip the `Twister Blast `_ package (named as $TWISTER\_BLAST\_PROGRAM here), then ​copy the unzipped ​$TWISTER\_BLAST\_PROGRAM/blast/dist/Twister-Blast.jar to the $TWISTER\_HOME/apps. Also, we download and unzip the blast program and the database `here `_, and set $BLAST\_HOME=/path/to/BlastProgramAndDB/. Go to $TWISTER\_BLAST\_PROGRAM/blast/bin/, in **twister\_blast.properties**, set the BLAST+ execution command (execmd property) to the BLAST program (blastx) under $BLAST\_HOME/bin/. Execution options can be reset according to users' needs. However, Input option (-query) and output option (-out) are not set in execmd but in inop and outop in order to be compatible with both BLAST+ and BLAST. Twister-BLAST will merge these command options by itself when invoking BLAST+ parallel. The execution command template inside**twister\_blast.properties** is given below. :: execmd = time /N/u/yangruan/Quarry/workflow/ncbi-blast-2.2.23+/bin/blastp -db /N/dc/scratch/yangruan/blast/db/cog/10k/cog.10000 -evalue 100 -max_target_seqs 1000000 -num_alignments 1000000 -outfmt 6 -seg no inop = -query outop = -out Prepare Twister-Blast input ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Assume you have already download the input fasta file into some location called [input file path]. Use the $TWISTER\_BLAST\_PROGRAM/blast/bin/blastNewFileSpliter.sh to split the input fasta file into multiple partitions. The parameters in as following: :: args:  [query_file] [sequence_count]  [num_partition] [data_dir] [output_prefix] [output_map_file] - query\_file: input fasta file - sequence\_count: sequence count in the input fasta file - num\_partition: number of partitions, this number should be larger or equal to the total worker number started with twister - data\_dir: The output folder of partitioned fasta files - output\_prefix: The output prefix of partitioned fasta files - output\_map\_file: The file contains the information of all the partitions width and height. Execute Twister-Blast ~~~~~~~~~~~~~~~~~~~~~ After deploying those required files onto file system, run the twister-Blast program with the following commands: :: ./blastNew.sh 128 /N/dc/scratch/yangruan/fasta/cog/10000/400/ input_ .fa 400 /N/dc/scratch/yangruan/blast/result/cog/10k/eval_100_400p/ blastOut_ Here is the description of the above command: :: args: [map number] [input folder] [input prefix] [input postfix (None for none)] [partition number] [output folder] [output prefix] +--------------------+-----------------------------------------------------------------------------+ | **Parameter** | **Description** | +--------------------+-----------------------------------------------------------------------------+ | map number | The map task number (usually equals to the number of worker started) | +--------------------+-----------------------------------------------------------------------------+ | input folder | The folder of input fasta file partitions | +--------------------+-----------------------------------------------------------------------------+ | input prefix | The prefix of input fasta file partitions | +--------------------+-----------------------------------------------------------------------------+ | input postfix | The postfix (file extension) of input fasta file partitions (default .fa) | +--------------------+-----------------------------------------------------------------------------+ | partition number | The number of input fasta file partitions | +--------------------+-----------------------------------------------------------------------------+ | output folder | The folder to store output blast result | +--------------------+-----------------------------------------------------------------------------+ | output prefix | The prefix of output blast result | +--------------------+-----------------------------------------------------------------------------+ If Twister Blast is running correctly, it will print twister running messages similar to the following: :: ./blastNew.sh 128 /N/dc/scratch/yangruan/fasta/cog/10000/400/ input_ .fa 400 /N/dc/scratch/yangruan/blast/result/cog/10k/eval_100_400p/ blastOut_ time /N/u/yangruan/Quarry/workflow/ncbi-blast-2.2.23+/bin/blastp -db /N/dc/scratch/yangruan/blast/db/cog/10k/cog.10000 -evalue 100 -max_target_seqs 1000000 -num_alignments 1000000 -outfmt 6 -seg no -query -out JobID: BlastNewac4d15a9-0997-11e1-81b4-5b7f60de01d2 Nov 7, 2011 11:24:43 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect INFO: Successfully connected to tcp://149.165.229.100:61616 0 [main] INFO cgl.imr.client.TwisterDriver - MapReduce computation termintated gracefully. Total Time of BLAST : 28.12Seconds 2 [Thread-1] DEBUG cgl.imr.client.ShutdownHook - Shutting down completed. Finishing the Map-Reduce process ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ After finishing the Job, please use the command to kill the Map-Reduce daemon and broker: :: $TWISTER_HOME/bin/stop_twister.sh `‹ Using Twister on FutureGrid `_ `up `_ `Eucalyptus and Twister on FutureGrid › `_