NUCLEOTIDE DIVERSITY
Nucleotide diversity is a measure of polymorphism in a sample of gene sequences. It is a summary statistic used to represent patterns of molecular diversity within a sample of gene copies. This concept is tied to measures of diversity in other biological fields (e.g., diversity metrics in ecology) and is a similar measure to the expected heterozygosity (= gene diversity) for a sample of allelic states at a single locus. This concept was introduced by Nei and Li (1979).
Their formula for nucleotide diversity is:
$$\pi = \sum_{ij} x_i x_j \pi_{ij},$$
where $\pi_{ij}$ is the proportion of different nucleotides between the ith and jth types of DNA sequences, and $x_i$ and $x_j$ are the respective frequencies of these sequences. The proportion of different nucleotides $\pi_{ij}$ can be modified using a variety of statistical models describing DNA sequence evolution (e.g., Jukes-Cantor, Kimura 2 Parameter, General Time Reversible).
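As one illustration of such a correction, the Jukes-Cantor model converts an observed proportion of differences p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The following is a minimal Python sketch of that single correction; the function name is chosen here for illustration and is not part of any package mentioned in this tutorial.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites.

    The correction is only defined for 0 <= p < 0.75; at p >= 0.75 the
    logarithm's argument is zero or negative (saturation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least as large as the observed proportion, because the model allows for multiple substitutions at the same site.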
The summation is taken over all distinct pairs i and j without repetition. That is:
$$\pi = 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} x_i x_j \pi_{ij},$$
where n is the number of sequences in the sample.
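The calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: sequences are assumed to be aligned and of equal length, differences are uncorrected proportions of differing sites, and each distinct pair of sequence types is counted once and doubled (equivalent to summing over all ordered pairs). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(sequences):
    """Frequency-weighted sum of pairwise differences over distinct sequence types."""
    n = len(sequences)
    counts = Counter(sequences)  # sequence type -> number of copies in the sample
    pi = 0.0
    # Each unordered pair of distinct types is visited once, hence the factor 2.
    for a, b in combinations(counts, 2):
        x_a = counts[a] / n
        x_b = counts[b] / n
        pi += 2 * x_a * x_b * p_distance(a, b)
    return pi


sample = ["ACGT", "ACGT", "ACGA", "TCGA"]
print(nucleotide_diversity(sample))
```

For this toy sample the three sequence types have frequencies 0.5, 0.25, and 0.25, and the pairwise proportions of differences are 0.25, 0.5, and 0.25, which gives a value of 0.21875.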
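Tajima (1983) and Nei (1987) scaled this measure to the length of the gene sequences being considered (L). When the difference between the ith and jth sequences is counted as a number of differing sites, $d_{ij}$, rather than as a proportion, and $x_i$ and $x_j$ are the frequencies of these sequences, the formula becomes:
$$\pi = \frac{1}{L} \sum_{ij} x_i x_j d_{ij}.$$
Note: Often $\sum_{ij} x_i x_j d_{ij}$, the average number of pairwise nucleotide differences, is symbolized by $k$, so that nucleotide diversity itself becomes $\pi = k/L$ or, for a sample estimate, $\hat{\pi} = \hat{k}/L$.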
Tajima (1983) and Nei (1987) also provided an estimator for the total variance (sampling and stochastic) of nucleotide diversity.
It is given by the following formula:
$$V(\hat{\pi}) = \frac{n + 1}{3(n - 1)L}\,\hat{\pi} + \frac{2(n^2 + n + 3)}{9n(n - 1)}\,\hat{\pi}^2,$$
where $\hat{\pi}$ is the point estimate of nucleotide diversity, n is the sample size (i.e., number of gene sequences sampled), and L is the length of the gene sequences.
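As a rough illustration, the total variance can be evaluated for a given point estimate, sample size, and sequence length. The sketch below assumes the form of the estimator given in Nei (1987), V = (n + 1) / (3(n - 1)L) × π + 2(n² + n + 3) / (9n(n - 1)) × π², which assumes no recombination within the sequences. The function and argument names are hypothetical.

```python
import math


def nucleotide_diversity_variance(pi_hat, n, L):
    """Total variance of a nucleotide diversity estimate.

    pi_hat: point estimate of nucleotide diversity (per site)
    n:      number of gene sequences sampled
    L:      length of the gene sequences in sites
    """
    if n < 2:
        raise ValueError("at least two sequences are needed")
    if L < 1:
        raise ValueError("sequence length must be positive")
    linear_term = (n + 1) / (3 * (n - 1) * L) * pi_hat
    quadratic_term = 2 * (n * n + n + 3) / (9 * n * (n - 1)) * pi_hat ** 2
    return linear_term + quadratic_term


variance = nucleotide_diversity_variance(pi_hat=0.01, n=10, L=1000)
print(variance, math.sqrt(variance))  # variance and standard deviation
```

The quadratic term does not shrink toward zero as n grows (it tends to 2π²/9), so adding more sequences reduces the variance only up to a point.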
REFERENCE
NEI, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.

