Code genotypes as integers (we count the number of derived alleles per genotype):
$AA \rightarrow 0, \quad Aa \rightarrow 1, \quad aa \rightarrow 2$
Let G_i denote a single row: genotype of individual i
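The integer coding can be sketched in a few lines of Python. This is a minimal illustration, assuming genotypes arrive as two-letter strings and that the lowercase letter "a" marks the derived allele; the helper name `encode_genotype` and the toy data are hypothetical.

```python
def encode_genotype(genotype, derived="a"):
    """Return the number of derived alleles (0, 1 or 2) in a diploid genotype."""
    if len(genotype) != 2:
        raise ValueError(f"expected a diploid genotype such as 'Aa', got {genotype!r}")
    return sum(allele == derived for allele in genotype)

# Toy data (made up): 3 individuals (rows) x 4 SNPs (columns).
raw = [
    ["AA", "Aa", "aa", "AA"],
    ["Aa", "Aa", "AA", "aa"],
    ["aa", "AA", "Aa", "Aa"],
]

# Genotype matrix: one row per individual, one column per SNP.
G = [[encode_genotype(g) for g in row] for row in raw]

# G_i for i = 0 is simply the first row.
G_0 = G[0]
print(G)
```

Each row of `G` is then the integer-coded genotype vector of one individual.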
Early work performed hierarchical clustering on row-wise differences between individuals.
This simple idea was enough to cluster individuals by continent.
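To make the clustering idea concrete, here is a rough sketch using only the Python standard library. The notes do not say which distance or linkage the early work used, so the Manhattan distance and average linkage below are assumptions, and the function names and toy genotypes are hypothetical.

```python
def manhattan(g_i, g_j):
    """Sum of absolute per-SNP differences between two genotype rows."""
    return sum(abs(a - b) for a, b in zip(g_i, g_j))

def average_linkage(G, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(G))]

    def dist(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(manhattan(G[i], G[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)

# Toy genotype matrix (made up): individuals 0-1 and 2-3 resemble each other.
G = [
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [2, 2, 1, 2, 2, 1],
    [2, 1, 2, 2, 2, 2],
]
clusters = average_linkage(G, n_clusters=2)
print(clusters)
```

On real data one would normally reach for a library routine rather than this quadratic loop; the sketch only shows that nothing more than row-wise differences is needed.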
Assumptions:
We assume K populations and L SNPs, and record the population allele frequencies in a matrix P.
We want to model the genotypes of an individual given ancestry components and population allele frequencies.
Assume no admixture, e.g., Q_i = (0, 0, 0, 1, 0): all of the individual's ancestry comes from a single population.
The genotype at SNP l is then obtained by sampling alleles according to the allele frequency of that population at SNP l.
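The no-admixture case can be simulated as follows. This is a sketch under stated assumptions: P is stored with one row per population and one column per SNP, each entry is the derived-allele frequency, and the two alleles of a genotype are drawn independently (Hardy-Weinberg proportions). The function names and the frequencies are made up for illustration.

```python
import random

def sample_genotype(p, rng):
    """Draw two alleles independently; each is derived with probability p."""
    return sum(rng.random() < p for _ in range(2))

def sample_unadmixed(q_i, P, rng):
    """Genotypes of an individual whose ancestry vector q_i is one-hot."""
    k = q_i.index(1)  # the single population the individual belongs to
    return [sample_genotype(p_kl, rng) for p_kl in P[k]]

# Made-up allele frequencies: K = 5 populations (rows), L = 4 SNPs (columns).
P = [
    [0.10, 0.80, 0.50, 0.30],
    [0.20, 0.70, 0.40, 0.60],
    [0.90, 0.10, 0.50, 0.50],
    [0.05, 0.95, 0.60, 0.20],
    [0.50, 0.50, 0.50, 0.50],
]

rng = random.Random(1)      # fixed seed so the run is reproducible
Q_i = [0, 0, 0, 1, 0]       # all ancestry from the fourth population
G_i = sample_unadmixed(Q_i, P, rng)
print(G_i)
```

Only the row of P selected by Q_i is ever consulted, which is exactly what "no admixture" means here.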
Now suppose an individual has ancestry from multiple populations. We must therefore first sample a population according to Q_i and then, given that population, sample the genotype from the corresponding entry of P. Here we assume Q_i = (0, 0.25, 0, 0.75, 0).
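The two-step sampling under admixture can be sketched like this. It assumes that the population of origin is drawn separately for each of the two allele copies at every SNP, and that P holds derived-allele frequencies with one row per population; both are modelling assumptions made for this example, and the function name and frequencies are hypothetical.

```python
import random

def sample_admixed(q_i, P, rng):
    """Sample genotypes for one individual with ancestry proportions q_i."""
    populations = range(len(q_i))
    n_snps = len(P[0])
    genotypes = []
    for snp in range(n_snps):
        count = 0
        for _ in range(2):  # two allele copies per diploid genotype
            # Step 1: sample the population of origin with probabilities q_i.
            z = rng.choices(populations, weights=q_i)[0]
            # Step 2: sample the allele from that population's frequency.
            count += rng.random() < P[z][snp]
        genotypes.append(count)
    return genotypes

# Made-up allele frequencies: K = 5 populations (rows), L = 4 SNPs (columns).
P = [
    [0.10, 0.80, 0.50, 0.30],
    [0.20, 0.70, 0.40, 0.60],
    [0.90, 0.10, 0.50, 0.50],
    [0.05, 0.95, 0.60, 0.20],
    [0.50, 0.50, 0.50, 0.50],
]

rng = random.Random(1)
Q_i = [0, 0.25, 0, 0.75, 0]
G_i = sample_admixed(Q_i, P, rng)
print(G_i)
```

Populations with zero weight in Q_i are never drawn, so with a one-hot Q_i the procedure reduces to the no-admixture case.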
Here we assume that P and Q are known; we could, for instance, use compiled databases of population allele frequencies. However, P and Q can also be estimated directly from the data, which is what the software packages ADMIXTURE (Alexander et al., 2009) and STRUCTURE (Pritchard et al., 2000) do.
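To give a feel for what estimating P and Q from data involves, the sketch below evaluates a binomial log-likelihood of the genotypes for candidate values of Q and P, in the spirit of the likelihood that maximum-likelihood methods such as ADMIXTURE optimise. It is not either package's implementation: the function name, the clipping constant `eps`, and the toy numbers are assumptions, and the constant binomial coefficient is left out.

```python
import math

def log_likelihood(G, Q, P, eps=1e-12):
    """Log-likelihood of genotype matrix G given ancestries Q and frequencies P.

    G: N x L derived-allele counts (0, 1, 2)
    Q: N x K ancestry proportions (each row sums to 1)
    P: K x L derived-allele frequencies
    """
    total = 0.0
    for g_i, q_i in zip(G, Q):
        for snp, g in enumerate(g_i):
            # Individual-specific allele frequency: mixture over populations.
            f = sum(q_ik * p_k[snp] for q_ik, p_k in zip(q_i, P))
            f = min(max(f, eps), 1 - eps)  # avoid log(0)
            total += g * math.log(f) + (2 - g) * math.log(1 - f)
    return total

# Toy example (made up): one individual, two SNPs, two populations.
P = [[0.9, 0.1],
     [0.1, 0.9]]
G = [[2, 0]]

ll_pop1 = log_likelihood(G, [[1.0, 0.0]], P)  # all ancestry from population 1
ll_pop2 = log_likelihood(G, [[0.0, 1.0]], P)  # all ancestry from population 2
print(ll_pop1, ll_pop2)
```

In this toy case the genotypes are far more probable under ancestry from the first population, so a search over Q would move towards it; an estimation procedure would repeat such comparisons while updating P as well.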
Population Genomics in Practice