Welcome to SIDR’s documentation!¶
SIDR (Sequence Identification with Decision tRees, pronounced cider) is a tool to filter Next Generation Sequencing (NGS) data based on a chosen target organism. SIDR uses data from BLAST (or similar classifiers) to train a Decision Tree model that classifies sequence data as either belonging to the target organism or belonging to something else. This classification can be used to filter the data for later assembly.
There are two ways to run SIDR. The first, or default, method takes a number of bioinformatics files as input and calculates relevant statistics from them. The second, or custom, method takes a specifically formatted comma-delimited file with your chosen statistics and uses that directly to train the model.
Documentation¶
Installing SIDR¶
SIDR can be installed either using pip, the Python package manager, or manually using the included setup.py file.
Dependencies¶
SIDR is able to install all of its dependencies from PyPI automatically. This should work in most cases. If you are installing under Python 3, you may need to manually install Cython with:
pip install cython
If you have a locally installed version of HTSLib, you can include it by using the commands:
export HTSLIB_LIBRARY_DIR=/usr/local/lib
export HTSLIB_INCLUDE_DIR=/usr/local/include
before installing SIDR.
OSX¶
Users running a recent version of OSX may need to install an alternative python distribution like that provided by Homebrew. If SIDR installation fails with permission errors, this is the most likely solution.
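For example, assuming Homebrew is already installed, a Homebrew-provided Python (and its accompanying pip) can be installed with:
brew install python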
Using a Virtualenv¶
Some cluster users may need to set up a Python virtualenv, since packages often cannot be installed system-wide in a cluster environment. A virtualenv can be set up with the commands:
virtualenv venv
. venv/bin/activate
If necessary, virtualenv can be installed in the user’s home directory (~/.local/bin must be in $PATH) with the following command:
pip install --user virtualenv
Installing from PyPI¶
Installing from PyPI is the easiest method, and thus the recommended one. To install SIDR:
pip install sidr
Installing from Source with pip¶
Note
When installing from source, setuptools will attempt to contact PyPI to install dependencies. If this is not an option, dependencies will need to be installed manually.
If PyPI is not an option or if you’d like to run the latest development version, SIDR can be installed by running the following command:
pip install git+https://github.com/damurdock/SIDR.git
If you’re installing SIDR in order to develop it, download the source from GitHub and install it by running the following command in the unzipped source directory:
pip install --editable .
Data Preparation¶
In order to run SIDR, you will need to perform several analyses of your data. For the default analysis, you will need:
- A preliminary assembly
- An alignment back to that preliminary assembly
- A BLAST classification of that assembly
- A copy of the NCBI Taxonomy Dump
Alternatively, you can precalculate the data you wish to use to train the model, and save it in a specific format for input. This is explained here: Running With a Runfile.
Assembly¶
SIDR requires a preliminary assembly of your data built with standard de novo assembly techniques. The scaffolds from this assembly will be used as input for the machine learning model. During testing, the ABySS assembler was used to generate preliminary assemblies; however, no testing has yet been done on the effect the choice of preliminary assembler has on downstream assembly.
Regardless of the tools used, the final scaffold FASTA file will be used for input into SIDR.
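As a rough illustration only, a paired-end ABySS assembly might be generated with a command along these lines (the k-mer size and read file names are placeholders, not recommendations):
abyss-pe name=preliminary k=64 in='reads_1.fastq reads_2.fastq'
In this case, the resulting preliminary-scaffolds.fa file would be the scaffold FASTA passed to SIDR.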
Alignment¶
The second piece of data required is an alignment of your raw reads to the preliminary assembly. The alignment can be constructed using any standard alignment tool; during testing, GSNAP was used. Regardless of the tools used, the alignment must be provided as a sorted and indexed BAM file, which can be created from a SAM alignment using the following samtools commands:
samtools view -Sb /path/to/alignment.sam -o /path/to/alignment.bam
samtools sort -o /path/to/alignment_sorted.bam /path/to/alignment.bam
samtools index /path/to/alignment_sorted.bam
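For reference, the SAM alignment itself could be produced with GSNAP along these lines (the database name, database directory, and read file names are placeholders):
gmap_build -D /path/to/gmapdb -d preliminary /path/to/assembly.fasta
gsnap -D /path/to/gmapdb -d preliminary -A sam reads_1.fastq reads_2.fastq > /path/to/alignment.sam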
BLAST¶
The last piece of data that must be precalculated is a BLAST classification of the preliminary assembly. This may be constructed with a tool besides command-line BLAST, so long as it is properly formatted. To make a properly-formatted BLAST result file, you can use the command:
blastn \
-task megablast \
-query /path/to/FASTA \
-db nt \
-outfmt '6 qseqid staxids' \
-culling_limit 5 \
-evalue 1e-25 \
-out /path/to/output
Currently SIDR assumes that BLAST input will have the sequence ID in the first column, and the NCBI Taxonomy ID in the second column. Any alternative classification tool may be used so long as it can produce this output. Any additional columns in the BLAST output will be ignored.
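For instance, a valid input file might begin with lines like these (contig names and taxonomy IDs are purely illustrative):
scaffold_1	6279
scaffold_2	543
scaffold_3	6279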
Taxonomy Dump¶
SIDR uses the NCBI Taxonomy to translate the BLAST results into the desired classification. The Taxonomy dump can be downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
After downloading, extract it and note its location. By default, SIDR checks the directory listed in $BLASTDB; however, this can be changed at runtime.
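For example, the dump could be downloaded and unpacked into a dedicated directory (whose path would then be passed to SIDR at runtime) with:
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p /path/to/taxdump
tar -xzf taxdump.tar.gz -C /path/to/taxdump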
Running With Raw Input¶
By default, SIDR will analyse your data and construct a Decision Tree model based on the GC content and average coverage of your contigs. Before running SIDR, you will need to prepare some data based on your input. This is described here: Data Preparation
To run SIDR with the default settings on raw data, enter a command like:
sidr default -d [taxdump path] \
-b [bamfile] \
-f [assembly FASTA] \
-r [BLAST results] \
-k tokeep.contigids \
-x toremove.contigids \
-t [target phylum]
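Filled in with hypothetical file names, and with Nematoda as an example target phylum, this might look like:
sidr default -d /path/to/taxdump \
-b alignment_sorted.bam \
-f assembly.fasta \
-r blast.out \
-k tokeep.contigids \
-x toremove.contigids \
-t Nematoda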
Running With a Runfile¶
SIDR can take a “runfile” with pre-computed variables as input. The runfile should be a comma-delimited file starting with a header row. It must contain an “ID” column with the contig ID and an “Origin” column with the name of the organism identified by BLAST, for those contigs where one was found. All other columns are used as variables for the decision tree, except any titled “Covered_bases”, “Plus_reads”, or “Minus_reads”, as those appear in BBMap’s default output yet should not contribute to model construction.
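As a minimal sketch, a runfile using GC content and average coverage as the model variables might begin like this (all names and values are illustrative only):
ID,Read_GC,Avg_fold,Origin
scaffold_1,0.42,35.1,Caenorhabditis elegans
scaffold_2,0.61,210.7,Escherichia coli
scaffold_3,0.44,33.8,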
To run SIDR in runfile mode, enter a command like:
sidr runfile -d [taxdump path] \
-i [runfile path] \
-k tokeep.contigids \
-x toremove.contigids \
-t [target phylum]
Runfile Example¶
SIDR’s runfile mode accepts any comma-delimited file with at least an “ID” column with the contigID, an “Origin” column with the species name as identified by an outside classification tool, and one or more variable columns with which to construct the model. One way to construct a runfile with GC content and coverage variables calculated with BBTools is described below. For this example, you will need:
- A preliminary assembly
- An alignment back to that preliminary assembly
- A copy of the NCBI Taxonomy Dump
- A local BLAST database
BLAST the assembled data:
blastn \
-task megablast \
-query [assembly FASTA] \
-db nt \
-outfmt '6 qseqid qlen staxids bitscore std sscinames sskingdoms stitle' \
-evalue 1e-25 \
-max_target_seqs 2 \
-out blast.out
Select the best BLAST hits:
cat blast.out | awk '!_[$1]++' | cut -f 1,2,15 | sed 's/scaffold_//g' | sort -k1n > scaffold_identities.txt
Use BBTools to calculate GC content and coverage:
pileup.sh countgc=t out=[organism].out in=[assembly BAM] ref=[assembly FASTA]
Format the output from BBTools:
cat [organism].out | sed '1d' | sed 's/scaffold_//g' | sort -k1n > [organism].sorted
Combine the BBTools and BLAST outputs:
paste [organism].sorted scaffold_identities.txt | cut -f 1-9,12 | sed 1i"ID,Avg_fold,Length,Ref_GC,Covered_percent,Covered_bases,Plus_reads,Minus_reads,Read_GC,Origin" | tr '\t' ',' > [organism].csv