parse_agen.pl automates the downloading, reorganization,
processing, and error checking of data as it arrives from
Agencourt. It checks for proper directory structure, null files,
file sizes, file locations, etc.
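The kind of checks described above can be sketched as follows. This is a minimal illustration in Python, not the logic of parse_agen.pl itself: the function name `check_delivery`, its parameters, and the directory names in the usage note are all assumptions made for the example.

```python
from pathlib import Path


def check_delivery(root, expected_dirs, min_bytes=1):
    """Return a list of problems found under `root`.

    `expected_dirs` and `min_bytes` are hypothetical parameters; the real
    script's rules for directory layout and file size are not shown here.
    """
    root = Path(root)
    problems = []

    # Directory structure: every expected subdirectory must exist.
    for name in expected_dirs:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}")

    # Null files / file sizes: flag any file smaller than min_bytes.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_size < min_bytes:
            rel = path.relative_to(root).as_posix()
            problems.append(f"null or undersized file: {rel}")

    return problems
```

A call such as `check_delivery("incoming", ["chromat_dir"])` (both names hypothetical) returns an empty list when nothing is wrong, so the result can be used directly as a pass/fail test.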
Custom phredPhrap
The following Phred options have been implemented for use in a modified phredPhrap run:
-phred -sd ../seq_dir/ -qd ../qual_dir/ -trim_alt 0.01 -trim_cutoff 0.01 -trim_fasta -st fasta -trim_phd
Ace2Fasta (ace_fasta_contigs.pl) will convert the ACE format to FASTA:
OUTPUTS:
alignedcontgi2readfasta.pl reads in [amplicon_name]_contigs.fasta_align, then reads in the reads aligned to each contig from the individual contig files produced by Ace2Fasta.
OUTPUTS:
a single multi-FASTA file in which every read is placed
according to its alignment to its contig (from phredPhrap) and that
contig's alignment to the other contigs (from probconsRNA).
Fasta2Ace will convert back to Ace format:
Fasta2Ace.pl [amplicon_name]_aligned.fasta ../phd_dir/ > [amplicon_name]_aligned.ace.2
Polybayes Parameters
polybayes -maskAmbiguousMatches -reportOut pb_[amplicon_name].out
-aceIn [amplicon_name]_aligned.ace.2 -readPhdFiles -phdFilePathIn
../phd_dir -inputFormat ace -thresholdSnp .1 -screenSnps
-preScreenSnpsMinimumBaseQuality 20 -priorPoly .01333 -priorPoly2
.99666 -priorPoly3 .00333 -priorPoly4 .00001 -priorPolyAC .1666
-priorPolyAT .1666 -priorPolyAG .1666 -priorPolyCG .1666 -priorPolyACG
.25 -priorPolyACT .25 -maxTerms 60 -displayQuality
Polybayes
PolyPhred Parameters
polyphred -snp hom -f 50 -indel -o pp_[amplicon_name].out
PolyPhred
polybayes_parse.pl and polyphred_parse.pl extract SNP locations, surrounding bases, and probability scores. These two scripts are currently being wrapped into pb_pp_parser.pl to facilitate both the extraction and a quick comparison between the two sets of calls.
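The comparison between the two call sets could look roughly like the sketch below. It is written in Python for illustration only; the function `compare_calls` and the position-to-score dictionaries are assumptions, not the actual output format of the parsers.

```python
def compare_calls(polybayes_calls, polyphred_calls):
    """Split SNP positions into shared and tool-specific groups.

    Each argument maps a SNP position (int) to that tool's score.
    This dictionary layout is a hypothetical stand-in for the parsed
    Polybayes and PolyPhred reports.
    """
    pb = set(polybayes_calls)
    pp = set(polyphred_calls)
    return {
        "both": sorted(pb & pp),            # called by both tools
        "polybayes_only": sorted(pb - pp),  # called by Polybayes alone
        "polyphred_only": sorted(pp - pb),  # called by PolyPhred alone
    }
```

Keying on position keeps the comparison independent of the two tools' different score scales; the scores themselves stay available in the input dictionaries for later feature extraction.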
feature_extract.pl is responsible for gathering
sequence information derived from phredPhrap and the alignments. In
addition, it extracts information from the polybayes parameter
calculations, the polybayes probabilities, and the polyphred scores.
All of the following 14 metrics will be considered for
classification and/or learning. Two types of classification tree (ID3
and C4.5) and one feed-forward back-propagation neural network (ANN)
will be used for evaluation. The classification trees should give us a
better indication of which parameters are critical.
| Feature | Representation | Algorithms Applied |
| Base Quality | num continuous | ID3/C4.5/ANN |
| INDEL | categorical (1/0) | ID3/C4.5/ANN |
| Species | categorical (1-6) | ID3/C4.5/ANN |
| Average Quality +/- 10 bases | num continuous | ID3/C4.5/ANN |
| Average Global Quality | num continuous | ID3/C4.5/ANN |
| Frequency of Major Allele | num continuous | ID3/C4.5/ANN |
| Frequency of Minor Allele | num continuous | ID3/C4.5/ANN |
| Average Quality of Minor Allele | num continuous | ID3/C4.5/ANN |
| Average Quality of Major Allele | num continuous | ID3/C4.5/ANN |
| Relative Position | num continuous | ID3/C4.5/ANN |
| Relative Position of Next Variation | num continuous | ID3/C4.5/ANN |
| Local SNP Frequency +/- 30 | num continuous | ID3/C4.5/ANN |
| Polybayes Probability | num continuous | ID3/C4.5/ANN |
| PolyPhred Score | num continuous | ID3/C4.5/ANN |
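As an illustration of how two of the features in the table might be computed, here is a minimal Python sketch. It is not taken from feature_extract.pl: the function names, the choice to include the site itself in the +/- 10 window, and the decision to ignore gaps and ambiguous bases are all assumptions.

```python
from collections import Counter


def window_mean_quality(quals, pos, flank=10):
    """Mean quality of the bases within `flank` positions of `pos`.

    `pos` is a 0-based index into `quals`. The window is clipped at the
    read ends and includes the site itself (an assumption).
    """
    start = max(0, pos - flank)
    end = min(len(quals), pos + flank + 1)
    window = quals[start:end]
    return sum(window) / len(window)


def allele_frequencies(column):
    """Return (major, minor) allele frequencies for one alignment column.

    Only A, C, G and T are counted; gaps and ambiguity codes are ignored
    (an assumption). The minor allele is the second most common base,
    with frequency 0.0 if the column is monomorphic.
    """
    counts = Counter(b.upper() for b in column if b.upper() in {"A", "C", "G", "T"})
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    ranked = counts.most_common(2)
    major = ranked[0][1] / total
    minor = ranked[1][1] / total if len(ranked) > 1 else 0.0
    return major, minor
```

For example, `allele_frequencies("AAAG-")` counts three A's and one G among four usable bases, giving a major allele frequency of 0.75 and a minor allele frequency of 0.25.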
The training set is composed of a total of 300 validated sequences.
These have been divided to reflect the relative percentages of each
sequence source: 66% UGA, 12% UMN, and 22% Agencourt. This gives 198
UGA sequences, 36 UMN, and 66 Agencourt. Testing and validation will be
performed on sets of similar composition.
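The per-source counts follow directly from the percentages. The short Python sketch below reproduces that arithmetic; the function name `source_counts` is made up for the example, and the same function could be reused to size testing and validation sets of similar composition.

```python
def source_counts(total, percentages):
    """Convert per-source percentages into whole-sequence counts.

    Raises ValueError if rounding makes the counts disagree with `total`,
    rather than silently adjusting any one source.
    """
    counts = {src: round(total * pct / 100) for src, pct in percentages.items()}
    if sum(counts.values()) != total:
        raise ValueError("rounded counts do not sum to the requested total")
    return counts


# Training set composition stated in the text.
training = source_counts(300, {"UGA": 66, "UMN": 12, "Agencourt": 22})
```

Here `training` holds 198 UGA, 36 UMN, and 66 Agencourt sequences, matching the figures above.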