Runfile Example
SIDR’s runfile mode accepts any comma-delimited file that has at least an “ID” column containing the contig ID, an “Origin” column containing the species name assigned by an outside classification tool, and one or more variable columns from which to construct the model. One way to build a runfile with GC content and coverage variables calculated with BBTools is described below.
For this example, you will need:
- A preliminary assembly
- An alignment of the reads back to that preliminary assembly, in BAM format
- A copy of the NCBI Taxonomy Dump
- A local BLAST database
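Before working through the steps, it may help to see the target format. The sketch below writes a made-up minimal runfile and checks that its header carries the two required columns. The file name, the `GC` and `Coverage` variable columns, and all values are hypothetical; only the “ID” and “Origin” column names come from the description above.

```shell
# Write a tiny, invented runfile: two required columns plus two variable columns.
workdir=$(mktemp -d)
printf 'ID,GC,Coverage,Origin\n1,0.41,35.2,Species_a\n2,0.63,4.8,Species_b\n' \
    > "$workdir/demo_runfile.csv"

# Report whether the header row contains both required column names.
header_ok=$(head -n 1 "$workdir/demo_runfile.csv" | awk -F',' '{
    for (i = 1; i <= NF; i++) seen[$i] = 1
    print ((seen["ID"] && seen["Origin"]) ? "yes" : "no")
}')
echo "Required columns present: $header_ok"
```

The same header check could be pointed at the CSV produced at the end of this example before passing it to SIDR.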
BLAST the assembled data:
blastn \
  -task megablast \
  -query [assembly FASTA] \
  -db nt \
  -outfmt '6 qseqid qlen staxids bitscore std sscinames sskingdoms stitle' \
  -evalue 1e-25 \
  -max_target_seqs 2 \
  -out blast.out
Select the best BLAST hits:
cat blast.out | awk '!_[$1]++' | cut -f 1,2,17 | sed 's/scaffold_//g' | sort -k1n > scaffold_identities.txt
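The `awk '!_[$1]++'` idiom in the command above keeps only the first line seen for each value of column 1, which selects the best hit on the assumption that BLAST lists the strongest hit for each query first. A minimal sketch on invented data, with the array renamed from `_` to `seen` for readability (the file name and hit labels are hypothetical):

```shell
# Invented three-line table: two hits for scaffold_1, one for scaffold_2.
workdir=$(mktemp -d)
printf 'scaffold_1\thitA\nscaffold_1\thitB\nscaffold_2\thitC\n' > "$workdir/demo_blast.tsv"

# seen[$1]++ is 0 (false) the first time a query ID appears, so the negation
# is true only for that first line; later lines for the same ID are skipped.
best_hits=$(awk '!seen[$1]++' "$workdir/demo_blast.tsv")
printf '%s\n' "$best_hits"
```

Only the `hitA` and `hitC` lines survive, one per query ID.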
Use BBTools to calculate GC content and coverage:
pileup.sh countgc=t out=[organism].out in=[assembly BAM] ref=[assembly FASTA]
Format the output from BBTools:
cat [organism].out | sed '1d' | sed 's/scaffold_//g' | sort -k1n > [organism].sorted
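The formatting pipeline above drops the header line, strips the `scaffold_` prefix, and sorts numerically on the first column so that the rows end up in the same order as the BLAST identities file. The sketch below runs the same three stages on a made-up, trimmed table (the file name and values are hypothetical) to show why the numeric sort matters: a plain lexical sort would place 10 before 2.

```shell
# Invented pileup-style table: one header line plus three contigs, out of order.
workdir=$(mktemp -d)
printf '#ID\tAvg_fold\nscaffold_10\t4.0\nscaffold_2\t9.5\nscaffold_1\t7.1\n' \
    > "$workdir/demo.out"

# Drop the header, strip the prefix, then sort column 1 as a number.
sorted_ids=$(sed '1d' "$workdir/demo.out" | sed 's/scaffold_//g' | sort -k1n \
    | cut -f 1 | tr '\n' ' ')
echo "$sorted_ids"
```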
Combine the BBTools and BLAST outputs:
paste [organism].sorted scaffold_identities.txt | cut -f 1-9,12 | tr '\t' ',' | sed '1iID,Avg_fold,Length,Ref_GC,Covered_percent,Covered_bases,Plus_reads,Minus_reads,Read_GC,Origin' > [organism].csv
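Because `paste` joins the two files purely by line position, the result is only meaningful if both inputs list the same contig IDs in the same order; a contig with no BLAST hit passing the e-value cutoff would shift every following row. A quick check along these lines may be worth running before the combine step. The helper name `ids_aligned` and the demo files are hypothetical stand-ins for `[organism].sorted` and `scaffold_identities.txt`.

```shell
workdir=$(mktemp -d)

# Invented stand-ins for the BBTools table and the BLAST identities table.
printf '1\t10.5\t500\n2\t3.2\t800\n' > "$workdir/demo.sorted"
printf '1\t500\tSpecies_a\n2\t800\tSpecies_b\n' > "$workdir/demo_identities.txt"

ids_aligned() {
    # Succeeds only if column 1 of both tab-delimited files matches line for line.
    cut -f 1 "$1" > "$workdir/ids_a"
    cut -f 1 "$2" > "$workdir/ids_b"
    cmp -s "$workdir/ids_a" "$workdir/ids_b"
}

if ids_aligned "$workdir/demo.sorted" "$workdir/demo_identities.txt"; then
    ids_aligned_result=yes
else
    ids_aligned_result=no
fi
echo "ID columns aligned: $ids_aligned_result"
```

If the check reports a mismatch, the two tables would need to be joined on the ID column rather than pasted side by side.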