VirusDetect
>>VirusDetect
VirusDetect is a bioinformatics pipeline that efficiently analyzes large-scale small RNA (sRNA) datasets to identify both known and novel viruses. VirusDetect performs reference-guided assemblies, by aligning sRNA sequences to a curated virus reference database, and de novo assemblies of sRNA sequences, with automated parameter optimization and optional subtraction of host sRNAs. The assembled contigs are compared to a curated and classified reference virus database to identify known and novel viruses, and their sRNA size profiles are evaluated to identify novel viruses.
>>VirusDetect online version
The VirusDetect online version can be accessed here.
Quick Guide for VirusDetect Online
>>VirusDetect standalone program
Installation (standalone program)
System requirement and dependencies
·64-bit Linux system - Mac OS X is not supported
·Perl version 5.10.0 or higher. Perl is installed by default on most Linux systems
·BioPerl version 1.006 or higher. Please check http://www.bioperl.org and wiki/Installing_BioPerl for more details on installation of BioPerl.
·BWA 0.7.10. Provided in VirusDetect.
·SAMtools v0.1.18. Provided in VirusDetect.
·Velvet v1.1.07. Provided in VirusDetect.
·NCBI BLAST package 2.2.16. Provided in VirusDetect.
·HISAT (for RNA-Seq datasets).
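The requirements above can be checked with a short script before installation. This is a minimal sketch and not part of VirusDetect: the helper name `check_tool` is hypothetical, and the BioPerl test assumes that being able to load the `Bio::Seq` module indicates a working BioPerl installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of VirusDetect): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_tool perl
check_tool hisat-build   # only needed for RNA-Seq datasets

# Perl must be 5.10.0 or higher; $] holds the numeric Perl version.
if perl -e 'exit($] >= 5.010 ? 0 : 1)' 2>/dev/null; then
    echo "Perl version OK"
else
    echo "Perl 5.10.0 or higher not found"
fi

# Assumption: loading Bio::Seq succeeds only when BioPerl is installed.
if perl -MBio::Seq -e 1 2>/dev/null; then
    echo "BioPerl found"
else
    echo "BioPerl missing"
fi
```

BWA, SAMtools, Velvet and BLAST are not checked here because they are provided with VirusDetect.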
Installation of VirusDetect is straightforward. Download VirusDetect and unzip the downloaded file.
$ tar -xzvf VirusDetect_v1.7.tar.gz
This generates a folder named "VirusDetect-v1.7" (referred to below as the "VirusDetect home folder"). The home folder includes three subfolders: a "bin" folder containing all executables, a "databases" folder holding the reference virus sequence and host genome sequence databases, and a "tools" folder providing the virus classification script and the sRNA processing script. The home folder also contains a Perl script, virus_detect.pl, which is the core script for running the VirusDetect pipeline.
Run VirusDetect
Quick Start
1. Put the sRNA sequence files, in fasta or fastq format, into the VirusDetect home folder.
2. Go to the VirusDetect home folder and run VirusDetect with the following command:
$ perl virus_detect.pl input1 input2 ......
3. The program can take multiple files (input1, input2, ......) as input and processes them sequentially, one by one. For each input file, the program generates an output folder (for example, result_input1) that contains all the output files.
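The naming of the output folders described above can be sketched as follows. This is an illustration only: the helper `result_dir_for` is hypothetical, and it assumes the folder name is "result_" followed by the input file name without its directory, which the text only shows for the example input1.

```shell
#!/bin/sh
# Hypothetical helper: predict the output folder for one input file.
# Assumption: the folder is "result_" followed by the input file name
# without its directory part.
result_dir_for() {
    echo "result_$(basename "$1")"
}

# List the folders to look for after a run on several inputs.
for input in input1 input2; do
    echo "$input -> $(result_dir_for "$input")"
done
```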
Input files
VirusDetect takes one or more sequence files in fasta or fastq format as its input.
We highly recommend removing ribosomal RNA (rRNA) sequences from the input before running VirusDetect. To do so, users can align the sRNA reads to the Silva rRNA database and keep the reads that do not align. Here is the command we recommend (assuming the sRNA sequence file is in fasta format); the unaligned reads are written to cleaned_sRNA:
$ bowtie -v 1 -k 1 --un cleaned_sRNA -f -p 15 Silva_rRNA_database sRNA_sequences sRNA_rRNA_match
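The command above assumes fasta input (`-f`). For fastq input, bowtie uses `-q` instead. The sketch below picks the flag from the file extension and prints the resulting command rather than running it. The helper `bowtie_format_flag`, the extension list and the file name `sRNA_sequences.fq` are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: pick the bowtie input-format flag from the file extension.
# bowtie reads FASTA with -f and FASTQ with -q.
bowtie_format_flag() {
    case "$1" in
        *.fa|*.fasta) echo "-f" ;;
        *.fq|*.fastq) echo "-q" ;;
        *) echo "unknown" ;;
    esac
}

# Print (rather than run) the rRNA-filtering command for one input file.
reads="sRNA_sequences.fq"   # placeholder file name
flag=$(bowtie_format_flag "$reads")
echo "bowtie -v 1 -k 1 --un cleaned_sRNA $flag -p 15 Silva_rRNA_database $reads sRNA_rRNA_match"
```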
Build virus reference and host genome databases
The virus reference database is available from GenBank (gbvrl). We classified these virus sequences into different kingdoms, including plant, vertebrate, invertebrate, fungus, bacteria, algae, archaea and protozoa, using the Virus Classification Pipeline we have developed. Unique virus sequence databases were generated for each host kingdom by removing redundant sequences at 100%, 97% and 95% identity, respectively. The classified and non-redundant databases are available on our FTP site. Users can also classify the most recent GenBank virus database using the Virus Classification Pipeline. The classified virus sequence databases (both nucleotide and protein) need to be properly built before being used as the reference (the formatted databases are also provided on our FTP site):
$ bin/bwa index databases/known_virus_reference_nt
$ bin/formatdb -i databases/known_virus_reference_nt -p F
$ bin/formatdb -i databases/known_virus_reference_prot -p T
Note: Reference virus sequences from multiple host kingdoms can be combined and used to identify viruses that may have hosts from different kingdoms.
The host reference sequence database, if available, can be used to subtract sRNA reads derived from the host. The database also needs to be properly built:
$ bin/bwa index databases/name_of_host_reference
For RNA-Seq datasets, HISAT is used to align the reads to the host reference sequence database. The database needs to be indexed:
$ hisat-build databases/name_of_host_reference output
Note: The virus reference and the host sequence databases must be put in the "databases" folder under the "VirusDetect home folder". A curated non-redundant plant virus sequence database (vrl_plant) is provided with the VirusDetect package. Virus sequence databases for other kingdoms can be obtained from our ftp site.
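Before a run, it can be useful to confirm that a database in the "databases" folder has already been indexed. The sketch below checks for the five files that `bwa index` writes next to the reference (suffixes .amb, .ann, .bwt, .pac and .sa). The helper name `bwa_index_complete` is hypothetical, and the script is assumed to be run from the VirusDetect home folder.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if all five BWA index files exist for a prefix.
# "bwa index" writes files with the suffixes .amb, .ann, .bwt, .pac and .sa.
bwa_index_complete() {
    for suffix in amb ann bwt pac sa; do
        [ -f "$1.$suffix" ] || return 1
    done
    return 0
}

db="databases/vrl_plant"
if bwa_index_complete "$db"; then
    echo "BWA index present for $db"
else
    echo "BWA index incomplete for $db; run: bin/bwa index $db"
fi
```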
Parameters
$ perl virus_detect.pl --reference [FILE] [options] input1 input2 ......
Section 1: Basic parameters
| Parameter | Description [default] |
|---|---|
| --reference | [String] Name of the reference virus database [vrl_plant] |
| --host_reference | [String] Name of the host reference database used for host sequence subtraction [none] |
| --thread_num | [Integer] Number of CPUs used for alignments [8] |
Section 2: BWA alignment parameters (alignments of reads to reference viruses or host sequences)
| Parameter | Description [default] |
|---|---|
| --max_dist | [Integer] Maximum edit distance [1] |
| --max_open | [Integer] Maximum number of gap opens [1] |
| --max_extension | [Integer] Maximum number of gap extensions [1] |
| --len_seed | [Integer] Seed length [15] |
| --dist_seed | [Integer] Maximum edit distance in the seed [1] |
Section 3: HISAT options (align RNA-Seq reads to host references)
| Parameter | Description [default] |
|---|---|
| --hisat_dist | [Integer] Maximum edit distance for HISAT [5] |
Section 4: BLAST alignment options (remove redundancy within virus contigs)
| Parameter | Description [default] |
|---|---|
| --min_overlap | [Integer] Minimum overlap length [30] |
| --max_end_clip | [Integer] Maximum length of end clips [6] |
| --min_identify | [Integer] Penalty score for a nucleotide mismatch [-3] |
| --mis_penalty | [Integer] Penalty score for a nucleotide mismatch [-3] |
| --gap_cost | [Integer] Cost to open a gap [-1] |
| --gap_extension | [Integer] Cost to extend a gap [-1] |
Section 5: BLAST alignment options (align virus contigs to virus reference database)
| Parameter | Description [default] |
|---|---|
| --word_size | [Integer] Minimum word size [11] |
| --exp_value | [Float] Maximum e-value [1e-5] |
| --identity_percent | [Float] Minimum percentage identity [25] |
| --mis_penalty_b | [Integer] Penalty score for a nucleotide mismatch [-3] |
| --gap_cost_b | [Integer] Cost to open a gap [-1] |
| --gap_extension_b | [Integer] Cost to extend a gap [-1] |
Section 6: result filter options
| Parameter | Description [default] |
|---|---|
| --hsp_cover | [Float] Coverage cutoff of a reported virus contig by reference virus sequences [0.75] |
| --coverage_cutoff | [Float] Coverage cutoff of a reported virus reference sequences by assembled virus contigs [0.1] |
| --depth_cutoff | [Float] Depth cutoff of a reported virus reference [5] |
| --siRNA_percent | [Float] Proportion cutoff of 21-nt and 22-nt siRNAs for viral-like contigs [0.5] |
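As an illustration of how options from the tables above combine on one command line, the sketch below assembles a command and prints it instead of executing it. The option values are the documented defaults, `name_of_host_reference` is the placeholder database name used earlier, and `input1` stands for any input file.

```shell
#!/bin/sh
# Options taken from the parameter tables; values are the documented defaults,
# except the host reference, which is a placeholder name.
opts="--reference vrl_plant --host_reference name_of_host_reference --thread_num 8"
opts="$opts --coverage_cutoff 0.1 --depth_cutoff 5"

# Dry run: print the command instead of executing it.
cmd="perl virus_detect.pl $opts input1"
echo "$cmd"
```

Printing the command first makes it easy to review the chosen cutoffs before starting a long run from the VirusDetect home folder.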
Output files
1. contig_sequences.fa, contig_sequences.blastn.fa, contig_sequences.blastx.fa, and contig_sequences.undetermined.fa
| File | Description |
|---|---|
| contig_sequences.fa | Sequences of non-redundant contigs derived through reference-guided and de novo assemblies |
| contig_sequences.blastn.fa | Sequences of contigs that match virus references by BLASTN |
| contig_sequences.blastx.fa | Sequences of contigs that match virus references by BLASTX |
| contig_sequences.undetermined.fa | Sequences of contigs that do not match virus references |
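A quick way to summarize these files is to count the contigs in each one. The sketch below counts FASTA header lines. The helper `count_fasta_records` is hypothetical, and `result_input1` is assumed to be the output folder of an earlier run on a file named input1.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Assumption: result_input1 is the output folder of an earlier run on "input1".
for f in contig_sequences.fa contig_sequences.blastn.fa \
         contig_sequences.blastx.fa contig_sequences.undetermined.fa; do
    path="result_input1/$f"
    if [ -f "$path" ]; then
        echo "$f: $(count_fasta_records "$path") contigs"
    else
        echo "$f: not found"
    fi
done
```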
2. blastn.references.fa and blastx.references.fa
Reference virus sequences to which non-redundant contigs were aligned by BLASTN or BLASTX, respectively.
3. blastn.html, blastx.html, undetermined_blast.html and undetermined.html
| File | Description |
|---|---|
| blastn.html, blastx.html | Files listing reference viruses that have corresponding virus contigs identified by BLASTN and BLASTX, respectively |
| undetermined_blast.html | File listing contigs that have hits in the virus reference database but are not assigned to any reference virus (they do not meet the "--hsp_cover", "--coverage_cutoff", or "--depth_cutoff" thresholds) |
| undetermined.html | File listing contigs with no homology in the virus reference database, together with their siRNA size profiles |
4. blastn.sam and blastx.sam
SAM format files containing the alignment information of each contig to its corresponding virus reference sequences. These files can be viewed with Tablet, IGV, and many other tools.
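Besides graphical viewers, the SAM files can be summarized on the command line. The sketch below counts alignment records per reference sequence, using the standard SAM layout in which header lines start with "@" and the third column holds the reference name. The helper `sam_hits_per_reference` is hypothetical, and `result_input1` is assumed to be the output folder of an earlier run.

```shell
#!/bin/sh
# Hypothetical helper: count alignment records per reference in a SAM file.
# Header lines start with "@"; column 3 (RNAME) is the reference name,
# or "*" when the record is unaligned.
sam_hits_per_reference() {
    awk -F '\t' '!/^@/ && $3 != "*" { n[$3]++ }
                 END { for (r in n) print r "\t" n[r] }' "$1" | sort
}

sam="result_input1/blastn.sam"   # assumed output of an earlier run on "input1"
if [ -f "$sam" ]; then
    sam_hits_per_reference "$sam"
else
    echo "$sam not found"
fi
```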
5. blastn.xls and blastx.xls
Excel files containing the detailed alignment information between virus contigs and their corresponding virus reference sequences.
Disclaimer
The attached software is provided "as is", without warranty of any kind, express or implied, including but not limited to the fitness for a particular purpose and no infringement. In no event shall Boyce Thompson Institute, ARS/USDA and International Potato Center be liable for any claim, loss, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with this software. Any use of this software out of the context provided will remove Boyce Thompson Institute, ARS/USDA and International Potato Center's connection with the software.
For questions and suggestions, please contact us at bioinfo at cornell.edu
