Exercise 2.1: Abba Baba test
Genotype data
A genotype matrix looks conceptually like this:
SNPs / Individuals -> Ind1 Ind2 Ind3 Ind4 ...
|
V
Snp1 G A A G . . .
Snp2 G T T C . . .
Snp3 T A T G . . .
Snp4 G A A C . . .
. . . . . .
. . . . . .
. . . . . .
We typically deal with dozens to tens of thousands of individuals, and hundreds of thousands to several million SNPs. In practice, to make this matrix easier computable, it is divided into three files:
- A file denoting the list of individuals (i.e. the column names)
- A file denoting the list of SNPs (i.e. the row names)
- A genotype matrix, containing the actual genotype data.
Let’s take a look at the individual file first. Type the following command into the “Terminal” in the lower-left pane in RStudio:
cat data/Abba_Baba_dataset/Abba_Baba_dataset.indNA12878.SG F CEU.SG
NA19240.SG F YRI.SG
VindijaG1_final_provisional.SG F VindijaG1_Neanderthal.SG
Chimp.REF M Chimp.REF
All commands in this exercise are carried out in the “Terminal”, not the “Console”!
where you see four individuals. In each row, the first column denotes the sample name, the second column the sex, and the third column a group label, which is irrelevant for now. Here is a quick summary for the four:
NA12878.SGis a present-day human from the US (with European ancestry).NA19240.SGis a present-day human from Nigeria.VindijaG1_final_provisional.SGis a Neanderthaler from Vindija Cave in Croatia, who died approximately 50,000 years ago.Chimp.REFis the Chimpanzee reference genome.
Next, let’s view the first few SNPs:
head data/Abba_Baba_dataset/Abba_Baba_dataset.snprs3094315 1 2.013e-2 752566 G A
rs12124819 1 2.0242e-2 776546 G A
rs28765502 1 2.2137e-2 832918 C T
rs7419119 1 2.2518e-2 842013 G T
rs950122 1 2.272e-2 846864 C G
rs113171913 1 2.3436e-2 869303 T C
rs13302957 1 2.4116e-2 891021 G A
rs59986066 1 2.4183e-2 893462 C T
rs112905931 1 2.426e-2 896271 T C
rs6696609 1 2.4457e-2 903426 T C
Each row contains a SNP, with an ID (first column), chromosome (second column), position in base-pairs along the chromosome (fourth column), and the two alleles present at this SNP (last two columns).
Let’s see how many SNPs there are:
cat data/Abba_Baba_dataset/Abba_Baba_dataset.snp | wc -l 1233013
So around 1.2 million SNPs.
Let’s now take a look at the genotype file:
head data/Abba_Baba_dataset/Abba_Baba_dataset.geno0009
0099
0299
0009
0090
0000
0092
9999
9909
0009
Here, we have genotype data encoded in the EIGENSTRAT format, which - in our case - contains the following code:
0-> Allele 1 (the “alternative allele”)2-> Allele 2 (the “reference allele”)9-> Missing data
Technically, the format also contains 1 for heterozygous genotypes, but the data we analyse here is “pseudo-haploid”, so represents only one allele for each individual.
Exercise: Obviously, the genotype file contains the right number of columns (four) for the four individuals we have. Check that it also contains the right number of rows, i.e. the same number that you obtained above using the wc command.
Exploring Neanderthal-human genetic similarity
Given that we now have a large number of genotypes at our hands, between Neanderthals and modern humans, we can ask a very simple question: Is the Neanderthal genome genetically more similar to one of the two modern humans than the other?
To answer this, we need to count in how many positions the Neanderthal matches each of the two modern human genomes. And to make this more meaningful, we focus on genetic variants which are different from Chimpanzee, i.e. mutations which appeared along the human lineage since the split from Chimpanzee.
We first select from the genotype file the relevant columns. For example, to select genotypes from the US-individual (first column), Neanderthal (third column) and Chimnpanzee (fourth column) we can use the command cut -c1,3,4, like this:
cut -c1,3,4 data/Abba_Baba_dataset/Abba_Baba_dataset.geno | head009
099
099
009
090
000
092
999
909
009
You may get a warning that says “Broken Pipe”, which can safely ignore. Piping a result of a computation into head is a very common trick in exploratory data analysis in the shell. It avoids a super-long output. Here, we just would like to see the first ten lines of the result to see whether it’s working as expected. head simply aborts the output after a 10 lines.
OK, this looks good. Now we have to filter lines at which the two genotypes in Neanderthal and Human are similar, but different from Chimpanzee. In other words, we are looking for two possible patterns in the three individuals: 002 or 220. We use the command grep, and then count the lines, using wc:
cut -c1,3,4 data/Abba_Baba_dataset/Abba_Baba_dataset.geno | grep -e 002 -e 220 | wc -l 65283
Task: Repeat the same exercise for the Nigerian individual (which is in column 2 instead of column 1 of the genotype file). You should get 61933.
Comparing the two numbers, we see that the US individual shares around 3000 more variants with the Croatian Neanderthal than the Nigerian individual does (65283 vs 61933).
The ABBA-BABA test
The analysis that we did above is very similar to what has been called the ABBA-BABA test in the population genetic literature (Green et al. (2010)). Specifically, when considering the quadruple of individuals as above, in the order:
- Outgroup (such as Chimp)
- Neanderthal
- Individual with Non-African ancestry
- Individual with African ancestry
we consider two types of allelic patterns, ABBA and BABA. The four letters refer to the alleles in the four individuals, and A and B simply denote the two alleles. The two patterns correspond to the above computed similarities of derived variants from Chimp:
- ABBA -> Here, the Non-African, but not the African, shares the derived allele with Neanderthal
- BABA -> Here, the African, but not the Non-African, shares the derived allele with Neanderthal.
Green et al. (2010) then defined a simple statistic:
\[ D = \frac{n_{ABBA} - n_{BABA}}{n_{ABBA} + n_{BABA}}. \]
Task: Above we have computed counts of the slightly simpler pattern AAB in their two realisations 002 and 220. To compute the D-Statistic, we need to operate on all four individuals, where ABBA is realised by 0220 and 2002, and BABA is realised by 2020 and 0202. Compute the count of ABBA and BABA using the above grep pipeline. You should get \(n_{ABBA} = 21787\) and \(n_{BABA} = 20250\).
Task: Open an R console to compute the D-Statistic itself, using the two numbers. The number you get is a (very rough) estimate of the admixture proportion of Neanderthal ancestry in modern humans with Non-African ancestry today.