
SNP Discovery and Analysis Pipeline

[Figure: pipeline overview (pipeline_v1.jpg)]

parse_agen.pl automates the downloading, reorganization, processing, and error checking of data as it arrives from Agencourt. It checks for proper directory structure, null files, file sizes, file locations, etc.
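The kind of checks described above can be sketched as follows. This is a minimal illustration in Python, not the parse_agen.pl code itself: the function name `check_delivery`, its parameters, and the idea that a "null file" means a file smaller than `min_bytes` are all assumptions made for the example.

```python
import os

def check_delivery(root, required_subdirs, min_bytes=1):
    """Return a list of human-readable problems found under `root`.

    Hypothetical helper: checks that each expected subdirectory exists
    and flags files smaller than `min_bytes` as null files.
    """
    problems = []
    for sub in required_subdirs:
        path = os.path.join(root, sub)
        if not os.path.isdir(path):
            problems.append(f"missing directory: {sub}")
            continue
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isfile(full) and os.path.getsize(full) < min_bytes:
                problems.append(f"null file: {sub}/{name}")
    return problems
```

A call such as `check_delivery("run_01", ["seq_dir", "qual_dir", "phd_dir"])` would return an empty list for a clean delivery; the directory names here are borrowed from the commands further down the page and `run_01` is a placeholder.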

Custom phredPhrap

The following Phred options are used in a modified phredPhrap run:

-phred -sd ../seq_dir/ -qd ../qual_dir/ -trim_alt 0.01 -trim_cutoff 0.01 -trim_fasta -st fasta -trim_phd

Ace2Fasta (ace_fasta_contigs.pl) converts the ACE format to FASTA:

OUTPUTS:

  • [amplicon_name]_contigs.fasta
  • Multiple FASTA files
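As a rough illustration of the contig half of this conversion, the sketch below pulls the consensus sequence out of each `CO` record of an ACE file and writes it as a FASTA record. It is not ace_fasta_contigs.pl: the function name and the `strip_pads` option are assumptions, and the per-contig read files are not covered.

```python
def ace_contigs_to_fasta(ace_text, strip_pads=True):
    """Extract contig consensus sequences from ACE text as FASTA records.

    Assumes the usual ACE layout: a line starting with "CO <name> ..."
    followed by sequence lines that end at the first blank line.
    """
    records = []
    lines = iter(ace_text.splitlines())
    for line in lines:
        if line.startswith("CO "):
            name = line.split()[1]
            seq_parts = []
            # Consume sequence lines from the same iterator until a blank line.
            for seq_line in lines:
                if not seq_line.strip():
                    break
                seq_parts.append(seq_line.strip())
            seq = "".join(seq_parts)
            # ACE marks consensus pads with '*'; drop them or turn them into gaps.
            seq = seq.replace("*", "" if strip_pads else "-")
            records.append(f">{name}\n{seq}")
    return "\n".join(records) + "\n" if records else ""
```

Whether pads should be removed or kept as gaps depends on what the downstream aligner expects, so the default chosen here is only a guess.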

ProbconsRNA

probcons [amplicon_name]_contigs.fasta > [amplicon_name]_contigs.fasta_align

ProbconsRNA Information

alignedcontgi2readfasta.pl reads in [amplicon_name]_contigs.fasta_align, then reads in the reads aligned to each contig from the individual contig files produced by ace2fasta.

OUTPUTS

  • [amplicon_name]_aligned.fasta

A single multi-FASTA file in which every read is positioned according to both its alignment to its contig (from phredPhrap) and that contig's alignment to the other contigs (from ProbconsRNA).
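One way to picture this step: wherever the contig-to-contig alignment opened a gap in a contig, the same gap is opened in every read that belongs to that contig. The sketch below shows that idea under a simplifying assumption that each read is already laid out column-for-column against its ungapped contig; the function name and that input layout are hypothetical and may differ from what the script does.

```python
def project_onto_alignment(aligned_contig, read_on_contig, gap="-"):
    """Re-space a read so it follows the gaps added to its contig.

    aligned_contig : contig sequence after multiple alignment (contains gaps)
    read_on_contig : read laid out against the ungapped contig, same length
                     as the ungapped contig, gap-padded outside the read
    """
    ungapped_len = sum(1 for c in aligned_contig if c != gap)
    if ungapped_len != len(read_on_contig):
        raise ValueError("read layout does not match contig length")
    out = []
    i = 0  # index into the ungapped contig / read layout
    for c in aligned_contig:
        if c == gap:
            out.append(gap)  # column added by the contig alignment
        else:
            out.append(read_on_contig[i])
            i += 1
    return "".join(out)
```

For example, with the aligned contig `AC--GTA` and the read layout `-CGT-`, the projected read is `-C--GT-`.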

Fasta2Ace will convert back to Ace format:

Fasta2Ace.pl [amplicon_name]_aligned.fasta ../phd_dir/ > [amplicon_name]_aligned.ace.2

Polybayes Parameters

polybayes -maskAmbiguousMatches -reportOut pb_[amplicon_name].out -aceIn [amplicon_name]_aligned.ace.2 -readPhdFiles -phdFilePathIn ../phd_dir -inputFormat ace -thresholdSnp .1 -screenSnps -preScreenSnpsMinimumBaseQuality 20 -priorPoly .01333 -priorPoly2 .99666 -priorPoly3 .00333 -priorPoly4 .00001 -priorPolyAC .1666 -priorPolyAT .1666 -priorPolyAG .1666 -priorPolyCG .1666 -priorPolyACG .25 -priorPolyACT 25 -maxTerms 60 -displayQuality

Polybayes

PolyPhred Parameters

polyphred -snp hom -f 50 -indel -o pp_[amplicon_name].out

PolyPhred

polybayes_parse.pl and polyphred_parse.pl extract SNP locations, surrounding bases, and probability scores. These two scripts are currently being wrapped into pb_pp_parser.pl to facilitate both the extraction and a quick comparison between the two sets of calls.
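The comparison between the two call sets could look something like the sketch below. It assumes each parser yields a mapping from alignment position to score; that data shape and the function name `compare_calls` are illustrative and are not taken from pb_pp_parser.pl.

```python
def compare_calls(polybayes_calls, polyphred_calls):
    """Split SNP positions into shared and caller-specific groups.

    Each argument maps an alignment position to that caller's score.
    """
    pb, pp = set(polybayes_calls), set(polyphred_calls)
    return {
        "both": sorted(pb & pp),
        "polybayes_only": sorted(pb - pp),
        "polyphred_only": sorted(pp - pb),
    }
```

Positions reported by only one caller are the natural candidates for closer inspection, which is why they are kept in separate groups.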

feature_extract.pl gathers sequence information derived from phredPhrap and the alignments. It also extracts information from the polybayes parameter calculations, the polybayes probabilities, and the polyphred scores.
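Two of the simpler quantities of this kind, a windowed average quality and the major/minor allele frequencies at an alignment column, might be computed as sketched here. The function names are hypothetical, and the choices to include the candidate base itself in the window and to ignore gap characters when counting alleles are assumptions.

```python
from collections import Counter

def window_mean_quality(quals, pos, flank=10):
    """Mean quality of the bases within `flank` positions of `pos` (inclusive)."""
    lo = max(0, pos - flank)
    hi = min(len(quals), pos + flank + 1)
    window = quals[lo:hi]
    return sum(window) / len(window)

def allele_frequencies(column, gap="-"):
    """Return (major_freq, minor_freq) for one alignment column."""
    counts = Counter(b.upper() for b in column if b != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    ranked = counts.most_common(2)  # two most frequent alleles
    major = ranked[0][1] / total
    minor = ranked[1][1] / total if len(ranked) > 1 else 0.0
    return major, minor
```

The window is clipped at the ends of the read, so positions near either end are averaged over fewer bases.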

All 14 of the following metrics will be considered for classification and/or learning. Two decision-tree classifiers (ID3 and C4.5) and one feed-forward back-propagation neural network (ANN) will be used for evaluation. The classification trees should give us a better indication of which parameters are critical.

Feature                                  Representation        Algorithm Applied
Base Quality                             numeric, continuous   ID3/C4.5/ANN
INDEL                                    categorical (1/0)     ID3/C4.5/ANN
Species                                  categorical (1-6)     ID3/C4.5/ANN
Average Quality +/- 10 bases             numeric, continuous   ID3/C4.5/ANN
Average Global Quality                   numeric, continuous   ID3/C4.5/ANN
Frequency of Major Allele                numeric, continuous   ID3/C4.5/ANN
Frequency of Minor Allele                numeric, continuous   ID3/C4.5/ANN
Average Quality of Minor Allele          numeric, continuous   ID3/C4.5/ANN
Average Quality of Major Allele          numeric, continuous   ID3/C4.5/ANN
Relative Position                        numeric, continuous   ID3/C4.5/ANN
Relative Position of Next Variation      numeric, continuous   ID3/C4.5/ANN
Local SNP Frequency +/- 30               numeric, continuous   ID3/C4.5/ANN
Polybayes Probability                    numeric, continuous   ID3/C4.5/ANN
PolyPhred Score                          numeric, continuous   ID3/C4.5/ANN
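To make concrete how a classification tree can point to the critical parameters: ID3 ranks candidate splits by information gain, and C4.5 uses a normalized variant (gain ratio) and handles continuous features by testing thresholds. The sketch below computes the plain information gain of one threshold split. It is a generic illustration with hypothetical function names, not the project's classifier code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain from splitting a continuous feature at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder
```

A feature whose best threshold yields a high gain, for instance a base quality cutoff that cleanly separates validated SNPs from false positives, would end up near the root of the tree.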

The training set consists of 300 validated sequences, divided to reflect the relative contribution of each sequence source: 66% UGA (198 sequences), 12% UMN (36), and 22% Agencourt (66). Testing and validation will be performed on a set of similar composition.
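The per-source counts follow directly from the percentages, as this small sketch shows. The function name is illustrative; the same arithmetic would apply when sizing the testing and validation set.

```python
def stratum_counts(total, percentages):
    """Convert integer source percentages into per-source sequence counts."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return {source: total * pct // 100 for source, pct in percentages.items()}

counts = stratum_counts(300, {"UGA": 66, "UMN": 12, "Agencourt": 22})
# counts == {"UGA": 198, "UMN": 36, "Agencourt": 66}
```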
