VirusDetect

>>VirusDetect

VirusDetect is a bioinformatics pipeline that can efficiently analyze large-scale small RNA (sRNA) datasets for both known and novel virus identification. VirusDetect performs both reference-guided assemblies, by aligning sRNA sequences to a curated virus reference database, and de novo assemblies of sRNA sequences, with automated parameter optimization and the option of host sRNA subtraction. The assembled contigs are compared to a curated and classified reference virus database for known and novel virus identification, and evaluated for their sRNA size profiles to identify novel viruses.

>>VirusDetect online version

VirusDetect online version can be accessed here
Quick Guide for VirusDetect Online

>>VirusDetect standalone program

Installation (standalone program)

System requirement and dependencies

·64-bit Linux system (Mac OS X is not supported)
·Perl version 5.10.0 or higher. Perl is installed by default on most Linux systems.
·BioPerl version 1.006 or higher. Please check http://www.bioperl.org and wiki/Installing_BioPerl for more details on installation of BioPerl.
·BWA 0.7.10. Provided in VirusDetect.
·SAMtools v0.1.18. Provided in VirusDetect.
·Velvet v1.1.07. Provided in VirusDetect.
·NCBI BLAST package 2.2.16. Provided in VirusDetect.
·HISAT (for RNA-Seq datasets).

Installation of VirusDetect is straightforward. Download VirusDetect and unzip the downloaded file.

$ tar -xzvf VirusDetect_v1.7.tar.gz


This will generate a folder named "VirusDetect-v1.7" (referred to below as the "VirusDetect home folder"). The home folder contains three subfolders: a "bin" folder, which contains all executables; a "databases" folder, which holds the reference virus sequence and host genome sequence databases; and a "tools" folder, which provides the virus classification script and the sRNA processing script. The home folder also contains a Perl script, virus_detect.pl, which is the core script for running the VirusDetect pipeline.
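Before the first run it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch, not part of the VirusDetect package: the function names are hypothetical, Bio::SeqIO is used only as a representative BioPerl module, and the BioPerl version itself is not checked.

```shell
# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the named Perl module can be loaded.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Prints one line per problem found; returns non-zero if any check fails.
check_prereqs() {
    status=0
    if [ "$(uname -s)" != "Linux" ] || [ "$(getconf LONG_BIT 2>/dev/null)" != "64" ]; then
        echo "unsupported platform: a 64-bit Linux system is required"
        status=1
    fi
    if ! have_cmd perl; then
        echo "perl not found on PATH"
        status=1
    elif ! perl -e 'exit($] >= 5.010 ? 0 : 1)'; then
        echo "Perl 5.10.0 or higher is required"
        status=1
    fi
    # Assumption: Bio::SeqIO stands in for "BioPerl is installed".
    if ! have_perl_module Bio::SeqIO; then
        echo "BioPerl does not appear to be installed"
        status=1
    fi
    return "$status"
}
```

Calling check_prereqs from the VirusDetect home folder prints nothing and returns 0 when every check passes. HISAT is not checked because it is only needed for RNA-Seq datasets.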

Run VirusDetect

Quick Start

1,Put the sRNA sequence files, in fasta or fastq format, into the VirusDetect home folder
2,Go to the VirusDetect home folder and run VirusDetect with the following command

$ perl virus_detect.pl input1 input2 ......


3,The program can take multiple files (input1, input2, ......) as input and processes them sequentially, one at a time. For each input file, the program generates an output folder (for example, result_input1) that contains all the output files for that input.
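When many samples are processed, a small wrapper can skip inputs that already have a result folder. This is a sketch under two assumptions that should be checked against an actual run: that the output folder is named "result_" followed by the input file name as given, and that rerunning a finished sample is not wanted. The function names and the DRY_RUN variable are hypothetical.

```shell
# Predict the output folder for an input file
# (assumes the "result_<input file name>" convention).
result_dir_for() {
    printf 'result_%s\n' "$(basename "$1")"
}

# Run virus_detect.pl on each input in turn, skipping inputs whose
# result folder already exists. With DRY_RUN=1 the command is only printed.
run_pending() {
    for input in "$@"; do
        out=$(result_dir_for "$input")
        if [ -d "$out" ]; then
            echo "skip: $input ($out already exists)"
        elif [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would run: perl virus_detect.pl $input"
        else
            perl virus_detect.pl "$input" || return 1
        fi
    done
}
```

From the VirusDetect home folder, "run_pending input1 input2" behaves like the command above, except that samples with an existing result folder are not processed again.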

Input files


VirusDetect takes one or more sequence files in fasta or fastq format as its input.
It is highly recommended to remove ribosomal RNA (rRNA) sequences from the input sequences before running VirusDetect. Users can align the sRNA reads to the Silva rRNA database and keep only the reads that do not align. Here is the command we recommend (assuming the sRNA sequence file is in fasta format):

$ bowtie -v 1 -k 1 --un cleaned_sRNA -f -p 15 Silva_rRNA_database sRNA_sequences sRNA_rRNA_match
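The command above uses -f because the reads are in fasta format; bowtie expects -q for fastq input. The sketch below picks the flag from the first character of the read file and prints the resulting command. The helper names are hypothetical, and the output names (cleaned_sRNA, sRNA_rRNA_match) and index name are simply those of the example above.

```shell
# Print the bowtie input-format flag for a read file:
# -f for FASTA (starts with ">"), -q for FASTQ (starts with "@").
bowtie_format_flag() {
    first=$(head -c 1 "$1")
    case "$first" in
        '>') echo "-f" ;;
        '@') echo "-q" ;;
        *)   echo "unrecognized read format: $1" >&2; return 1 ;;
    esac
}

# Print (without running) the rRNA-filtering command for a read file.
rrna_filter_cmd() {
    flag=$(bowtie_format_flag "$1") || return 1
    echo "bowtie -v 1 -k 1 --un cleaned_sRNA $flag -p 15 Silva_rRNA_database $1 sRNA_rRNA_match"
}
```

The reads written by --un (here cleaned_sRNA) are the ones that did not match rRNA and would be used as input to VirusDetect.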


Build virus reference and host genome databases

The virus reference database is available from GenBank (gbvrl). We classified these virus sequences into different kingdoms, including plant, vertebrate, invertebrate, fungus, bacteria, algae, archaea and protozoa, using the Virus Classification Pipeline we have developed. Unique virus sequence databases were generated for each host kingdom by removing redundant sequences at 100%, 97% and 95% identity, respectively. The classified and non-redundant databases are available on our FTP site. Users can also classify the most recent GenBank virus database using the Virus Classification Pipeline. The classified virus sequence databases (both nucleotide and protein) need to be properly built before being used as the reference (the formatted databases are also provided on our FTP site):

$ bin/bwa index databases/known_virus_reference_nt
$ bin/formatdb -i databases/known_virus_reference_nt -p F
$ bin/formatdb -i databases/known_virus_reference_prot -p T


Note: Reference virus sequences from multiple host kingdoms can be combined and used to identify viruses that may have hosts from different kingdoms.
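One way to combine references from several kingdoms is to concatenate the fasta files and then index the combined file as shown above. The sketch below is an illustration only: combine_fasta is a hypothetical helper, the file names in the comments (vrl_fungus, vrl_plant_fungus) are made up, and sequences shared between the inputs are not deduplicated.

```shell
# Concatenate FASTA files into one output file. Passing each input through
# awk guarantees it ends with a newline, so a header line never fuses with
# the last sequence line of the previous file.
combine_fasta() {
    out=$1
    shift
    : > "$out"
    for f in "$@"; do
        awk '{ print }' "$f" >> "$out"
    done
}

# Hypothetical usage from the VirusDetect home folder:
#   combine_fasta databases/vrl_plant_fungus databases/vrl_plant databases/vrl_fungus
#   bin/bwa index databases/vrl_plant_fungus
#   bin/formatdb -i databases/vrl_plant_fungus -p F
```

The protein database would presumably need to be combined and formatted in the same way (with -p T) so that both references cover the same set of viruses.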

The host reference sequence database, if available, can be used to subtract sRNA reads derived from the host. The database also needs to be properly built:

$ bin/bwa index databases/name_of_host_reference


For RNA-Seq datasets, HISAT is used to align the reads to the host reference sequence database. The database needs to be indexed:

$ hisat-build databases/name_of_host_reference output


Note: The virus reference and the host sequence databases must be put in the "databases" folder under the "VirusDetect home folder". A curated non-redundant plant virus sequence database (vrl_plant) is provided with the VirusDetect package. Virus sequence databases for other kingdoms can be obtained from our ftp site.

Parameters

$ perl virus_detect.pl --reference [FILE] [options] input1 input2 ......


Section 1: Basic parameters

Parameter                     Description [default]
--reference [String]          Name of the reference virus database [vrl_plant]
--host_reference [String]     Name of the host reference database used for host sequence subtraction [none]
--thread_num [Integer]        Number of CPUs used for alignments [8]


Section 2: BWA alignment parameters (alignments of reads to reference viruses or host sequences)

Parameter                     Description [default]
--max_dist [Integer]          Maximum edit distance [1]
--max_open [Integer]          Maximum number of gap opens [1]
--max_extension [Integer]     Maximum number of gap extensions [1]
--len_seed [Integer]          Seed length [15]
--dist_seed [Integer]         Maximum edit distance in the seed [1]


Section 3: HISAT options (align RNA-Seq reads to host references)

Parameter                     Description [default]
--hisat_dist [Integer]        Maximum edit distance for HISAT [5]


Section 4: blast alignment options (remove redundancy within virus contigs)

Parameter                     Description [default]
--min_overlap [Integer]       Minimum overlap length [30]
--max_end_clip [Integer]      Maximum length of end clips [6]
--min_identify [Integer]      Penalty score for a nucleotide mismatch [-3]
--mis_penalty [Integer]       Penalty score for a nucleotide mismatch [-3]
--gap_cost [Integer]          Cost to open a gap [-1]
--gap_extension [Integer]     Cost to extend a gap [-1]


Section 5: blast alignment options (align virus contigs to virus reference database)

Parameter                     Description [default]
--word_size [Integer]         Minimum word size [11]
--exp_value [Float]           Maximum e-value [1e-5]
--identity_percent [Float]    Minimum percentage identity [25]
--mis_penalty_b [Integer]     Penalty score for a nucleotide mismatch [-3]
--gap_cost_b [Integer]        Cost to open a gap [-1]
--gap_extension_b [Integer]   Cost to extend a gap [-1]


Section 6: result filter options

Parameter                     Description [default]
--hsp_cover [Float]           Coverage cutoff of a reported virus contig by reference virus sequences [0.75]
--coverage_cutoff [Float]     Coverage cutoff of a reported virus reference sequence by assembled virus contigs [0.1]
--depth_cutoff [Float]        Depth cutoff of a reported virus reference [5]
--siRNA_percent [Float]       Proportion cutoff of 21-nt and 22-nt siRNAs for viral-like contigs [0.5]
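To make the --siRNA_percent cutoff concrete, the sketch below computes the proportion of 21-nt and 22-nt sequences in a fasta file. It only illustrates the arithmetic behind such a proportion. It is not the pipeline's own code, the function name is hypothetical, and how VirusDetect selects the sRNAs belonging to each contig is not covered here.

```shell
# Print the fraction of sequences in a FASTA file that are 21 or 22 nt long
# (two decimal places), or "NA" if the file contains no sequences.
sirna_fraction() {
    awk '
        function tally() { total++; if (len == 21 || len == 22) hits++ }
        /^>/ { if (seen) tally(); seen = 1; len = 0; next }
             { len += length($0) }
        END  {
            if (seen) tally()
            if (total) printf "%.2f\n", hits / total
            else print "NA"
        }
    ' "$1"
}
```

For a file in which 3 of 4 sequences are 21 or 22 nt long, the function prints 0.75, a proportion above the default cutoff of 0.5.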

Output files

1,contig_sequences.fa, contig_sequences.blastn.fa, contig_sequences.blastx.fa, and contig_sequences.undetermined.fa

File                                Description
contig_sequences.fa                 Sequences of non-redundant contigs derived through reference-guided and de novo assemblies
contig_sequences.blastn.fa          Sequences of contigs that match virus references by BLASTN
contig_sequences.blastx.fa          Sequences of contigs that match virus references by BLASTX
contig_sequences.undetermined.fa    Sequences of contigs that do not match virus references
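A quick way to summarize a run is to count the contigs in each of these files. The following sketch assumes only the four file names listed above; count_contigs is a hypothetical helper, and the result folder is passed as its argument.

```shell
# Print "<file name><TAB><number of contigs>" for each contig FASTA file in
# a result folder, or "missing" if the file is not present.
count_contigs() {
    dir=$1
    for f in contig_sequences.fa contig_sequences.blastn.fa \
             contig_sequences.blastx.fa contig_sequences.undetermined.fa; do
        if [ -f "$dir/$f" ]; then
            # grep -c exits non-zero when the count is 0, hence "|| true".
            n=$(grep -c '^>' "$dir/$f" || true)
            printf '%s\t%s\n' "$f" "$n"
        else
            printf '%s\tmissing\n' "$f"
        fi
    done
}
```

For example, "count_contigs result_input1" prints one line per file for the first input of the Quick Start example.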


2,blastn.references.fa and blastx.references.fa


The reference virus sequences that have corresponding aligned non-redundant contigs by BLASTN or BLASTX.

3,blastn.html, blastx.html, undetermined_blast.html and undetermined.html

File                        Description
blastn.html, blastx.html    Files listing reference viruses that have corresponding virus contigs identified by BLASTN and BLASTX, respectively
undetermined_blast.html     File listing contigs that have hits in the virus reference database but are not assigned to any reference virus (they do not meet the "--hsp_cover", "--coverage_cutoff", or "--depth_cutoff" cutoffs)
undetermined.html           File listing contigs with no homology in the virus reference database, together with their siRNA size profiles


4,blastn.sam and blastx.sam


SAM format files containing the alignment information of each contig to its corresponding virus reference sequences. The files can be viewed with Tablet, IGV, and many other viewers.

5,blastn.xls and blastx.xls


Excel files containing the detailed alignment information between virus contigs and their corresponding virus reference sequences.

Disclaimer

The attached software is provided "as is", without warranty of any kind, express or implied, including but not limited to the fitness for a particular purpose and no infringement. In no event shall Boyce Thompson Institute, ARS/USDA and International Potato Center be liable for any claim, loss, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with this software. Any use of this software out of the context provided will remove Boyce Thompson Institute, ARS/USDA and International Potato Center's connection with the software.



For questions and suggestions, please contact us at bioinfo at cornell.edu