Much of this work is due to my PhD student Kristiina Ausmees.
Our preprint, currently being revised for resubmission, is available at https://www.biorxiv.org/content/10.1101/2020.09.30.320994v1.full
Much of what we discuss will be beyond what's available there.
In order to use convolutional networks, which work well for image data, we need to take the differences between images and genomes into account.
$-\sum_i \left[ y_i \log \hat y_i + (1-y_i) \log(1-\hat y_i) \right]$
Binary crossentropy, symmetric under swapping the labels 0 and 1; $y_i$ are target values, $\hat y_i$ model outputs.
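A minimal NumPy sketch of this loss (the function name and example values are ours, for illustration only):

```python
import numpy as np

# Binary crossentropy: y holds 0/1 targets, y_hat holds model outputs in (0, 1).
def binary_crossentropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.7])
print(binary_crossentropy(y, y_hat))  # small when predictions match targets
```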
$-\sum_i \sum_j y_{ij} \log \hat y_{ij}$
Not a single scalar as output, but three values.
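For concreteness, a minimal NumPy sketch of this categorical crossentropy over three genotype classes (names and values are illustrative, not from the paper):

```python
import numpy as np

# Categorical crossentropy with three genotype classes
# (0, 1, or 2 copies of the alternate allele) per marker.
def categorical_crossentropy(Y, Y_hat, eps=1e-12):
    # Y: one-hot targets, shape (markers, 3); Y_hat: predicted class
    # probabilities of the same shape, each row summing to 1.
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)))

Y = np.array([[0, 1, 0],    # true genotype: heterozygous
              [1, 0, 0]])   # true genotype: homozygous reference
Y_hat = np.array([[0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1]])
print(categorical_crossentropy(Y, Y_hat))
```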
How do we create this output? Tried two main options:
When would you expect either of these to work well?
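As an illustration, two common ways to produce such an output (shown here as plausible examples, not necessarily the exact options tried) are a per-marker softmax over the three genotype classes, or a single sigmoid dosage value:

```python
from tensorflow.keras import layers

n_markers = 1000
features = layers.Input(shape=(n_markers, 8))  # illustrative decoder feature shape

# Option A: per-marker softmax over the three genotype classes,
# giving output shape (batch, n_markers, 3).
out_softmax = layers.Dense(3, activation="softmax")(features)

# Option B: a single value in (0, 1) per marker, interpreted as
# allele dosage / 2 and rounded to {0, 1, 2} at prediction time.
out_scalar = layers.Dense(1, activation="sigmoid")(features)
```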
We might have a shortage of training data.
Main strategies:
Create more input data from your input data.
For images:
For us:
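For genotype data, one plausible augmentation (an assumption on our part, not necessarily the exact strategy used here) is to randomly mask a fraction of markers as missing, so the model sees a new variant of each sample every epoch:

```python
import numpy as np

# Hypothetical genotype augmentation: mask random markers as "missing".
# The masking value and rate are our choices, not taken from the paper.
def mask_genotypes(genotypes, missing_rate=0.1, missing_value=-1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    augmented = genotypes.copy()
    mask = rng.random(genotypes.shape) < missing_rate
    augmented[mask] = missing_value
    return augmented

g = np.array([0, 1, 2, 0, 2, 1, 0, 0], dtype=np.int8)
print(mask_genotypes(g))  # e.g. [ 0  1 -1  0  2  1 -1  0]
```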
How do we evaluate a dimensionality reduction?
Remember, we train our model to do genotype reconstruction, but we do not (solely) care about genotype reconstruction.
PCA 2D visualization
Our 2D visualization
Main hues, indicating continents, are separated more clearly. Individual colors and markers, indicating subpopulation labels, are separated within each hue. More of the space is used. A "better" clustering.
Can we quantify this?
The model did not see our population labels. It just considered unsupervised genome reconstruction. Similar genomes end up together in the embedding.
We can try to predict each individual's label from its neighbors in the embedding using k-Nearest Neighbors (kNN). A good score there should indicate a good separation between subpopulations. Being able to separate subpopulations without training for it would indicate a well-resolved embedding.
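A minimal sketch of this evaluation using scikit-learn; the random arrays stand in for real embedding coordinates and subpopulation labels, and the parameters (k=3, 5-fold cross-validation) are our choices:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score how well subpopulation labels can be recovered from the
# 2D embedding coordinates alone.
embedding = np.random.rand(500, 2)           # stand-in for real coordinates
labels = np.random.randint(0, 10, size=500)  # stand-in for subpopulation labels

knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, embedding, labels, cv=5)
print("mean kNN accuracy:", scores.mean())   # higher = better-resolved embedding
```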
In population genetics studies on humans, admixture models as implemented in ADMIXTURE have been popular: a specific tool with a specific model, sort of the opposite of doing PCA.
Assume $k$ founder populations, find the proportions $p_{ij}$ for the contribution of each of the $k$ populations to individual $i$, summing to $1$.
Change the latent layer: rather than 2 components with regularization and a variational term, just use $k$ dimensions, and normalize all dimensions to sum to 1.
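A minimal Keras sketch of such a latent layer; using a softmax for the normalization is our assumption, and the layer sizes are illustrative:

```python
from tensorflow.keras import layers

k = 5
encoder_features = layers.Input(shape=(256,))  # illustrative encoder output size
proportions = layers.Dense(k, activation="softmax")(encoder_features)
# proportions[i, j] plays the role of p_ij: the contribution of founder
# population j to individual i, non-negative and summing to 1 over j.
```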
What we have now is really four parts:
If we move from population to quantitative genetics, we want to know one thing:
Master's thesis student Karthik Nair tried this. Various hurdles, including a pandemic...
The categorical crossentropy loss will only look at individual genotypes in isolation.
What if we could get the model to identify anything "weird"?
We have an encoder $E$ and a decoder $D$. Both are trained on the reconstruction loss.
In a typical generative adversarial network (GAN), there is a generator $G$ and a discriminator; note that $D$ here already denotes our decoder.
Joint training on reconstruction loss, and on fooling the discriminator.
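A conceptual sketch of one such joint training step, assuming a simple GAN-style setup; the models, optimizers, placeholder reconstruction loss, and the weighting factor `adv_weight` are all illustrative, not from the paper:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(x, encoder, decoder, discriminator, ae_opt, disc_opt, adv_weight=0.1):
    with tf.GradientTape() as ae_tape, tf.GradientTape() as disc_tape:
        x_hat = decoder(encoder(x), training=True)
        # Placeholder reconstruction loss; the real model would use a
        # categorical crossentropy over genotype classes.
        recon_loss = tf.reduce_mean(tf.square(x - x_hat))

        real_logits = discriminator(x, training=True)
        fake_logits = discriminator(x_hat, training=True)

        # Discriminator: real genotypes -> 1, reconstructions -> 0.
        disc_loss = (bce(tf.ones_like(real_logits), real_logits)
                     + bce(tf.zeros_like(fake_logits), fake_logits))
        # Autoencoder: reconstruct well AND make the discriminator say "real".
        ae_loss = recon_loss + adv_weight * bce(tf.ones_like(fake_logits), fake_logits)

    ae_vars = encoder.trainable_variables + decoder.trainable_variables
    ae_opt.apply_gradients(zip(ae_tape.gradient(ae_loss, ae_vars), ae_vars))
    disc_vars = discriminator.trainable_variables
    disc_opt.apply_gradients(zip(disc_tape.gradient(disc_loss, disc_vars), disc_vars))
```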
What network can we use as the basis for the discriminator?
Convolutional autoencoders can perform better than plain autoencoders.
A neural system can be tuned to do tasks that would require custom-made code in the classical world. The bulk of the work goes into finding a proper architecture, training, and augmentation strategy. Those can be shared. Co-training for multiple losses can also be beneficial.