About Imputation

Imputation is the process of replacing missing information based on existing information, using statistical models.

In genetics, imputation means using statistical inference to deduce unknown genotypes from known ones.
DNA.Land uses the ShapeIT and Impute2 programs as part of the imputation pipeline.

To illustrate, consider this sentence. Can you tell which letter is missing (i.e. impute the missing letter)?

I saw a blue ca_ on your head

If you are an English speaker, you should have no problem completing this sentence.
But wait! How did you do that? If any letter is acceptable, there are 26 possible sentences!

Of course, as an English speaker, you can impose some requirements, such as insisting that the result is a real word in the English language. Then b, d, l, m, n, p, r, t become much more likely than other letters (corresponding to cab, cad, cal, cam, can, cap, car, cat, respectively).

Then you can impose further requirements, such as preferring more commonly used words, leaving b, n, p, r, t as likely answers.

Last, with the even stricter requirement that the sentence makes sense in the real physical world, the only letter that fits is p, and the sentence is: I saw a blue cap on your head.

Genetic imputation works the same way. We use whole genome sequencing data to create a dictionary of genomic 'text' (known as haplotypes). Then we ask the imputation algorithm to use this dictionary to fill in the missing letters (genotypes) so that they make sense and match the words in the dictionary.
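
To make the dictionary idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the method used by ShapeIT or Impute2, which rely on far more sophisticated statistical models. The reference panel, the "_" marker for a missing genotype, and the function name impute_missing are all made up for illustration.

```python
from collections import Counter

# Hypothetical reference panel: each string is one known haplotype,
# i.e. one "word" in the dictionary. Real panels are vastly larger.
REFERENCE_HAPLOTYPES = [
    "ACGTA",
    "ACGTA",
    "ACCTA",
    "TCGGA",
]


def impute_missing(observed, panel, missing="_"):
    """For each missing position, count the alleles seen among the panel
    haplotypes that agree with every observed (non-missing) position."""
    matches = [
        hap
        for hap in panel
        if len(hap) == len(observed)
        and all(o == missing or o == h for o, h in zip(observed, hap))
    ]
    return {
        i: Counter(hap[i] for hap in matches)
        for i, o in enumerate(observed)
        if o == missing
    }


# The third letter is unknown; three panel haplotypes match the rest.
votes = impute_missing("AC_TA", REFERENCE_HAPLOTYPES)
print(votes)
```

In this toy panel, two of the three matching haplotypes carry G at the missing position and one carries C, so G is the more likely completion, just as "cap" was the most sensible word in the sentence above.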

The above qualitative explanation applies to DNA.Land's VCF files as well:
Uploaded genotype files (e.g. from 23andMe) contain between 500,000 and 1,000,000 SNPs.
DNA.Land's imputation pipeline imputes (i.e. infers the value of) an additional 38,000,000 SNPs.
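
To put these numbers in proportion, the short Python calculation below works out what share of the final SNPs was directly measured and what share was imputed. It uses only the figures quoted above; the function name measured_share is ours.

```python
IMPUTED_SNPS = 38_000_000  # additional SNPs inferred by the pipeline


def measured_share(genotyped, imputed=IMPUTED_SNPS):
    """Fraction of all SNPs (genotyped + imputed) that was measured directly."""
    return genotyped / (genotyped + imputed)


for genotyped in (500_000, 1_000_000):
    share = measured_share(genotyped)
    print(f"{genotyped:,} genotyped: {share:.1%} measured, {1 - share:.1%} imputed")
```

Either way, only about 1% to 3% of the resulting SNPs come from the uploaded file; the rest are inferred.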

However, imputation is not always so easy. Think about the sentence:
I saw a blue ca_ yesterday.
Here our linguistic requirements do not work so well, and there are several possible ways to complete the sentence. However, we can associate these possibilities with likelihoods. For example, "car" will have a higher likelihood than "can", "cat" or "cab" because the latter words are less common.
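
One simple way to turn "how common is each word" into likelihoods is to normalize usage counts so they sum to 1. The Python sketch below does exactly that. The counts are invented for illustration and are not real word frequencies, and the function name completion_likelihoods is hypothetical.

```python
# Hypothetical usage counts, made up for illustration only.
WORD_COUNTS = {"car": 60, "cat": 20, "can": 12, "cab": 8}


def completion_likelihoods(counts):
    """Normalize raw counts into likelihoods that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


likelihoods = completion_likelihoods(WORD_COUNTS)
best = max(likelihoods, key=likelihoods.get)
print(best, likelihoods)
```

With these made-up counts, "car" is the most likely completion, but the other words keep a non-zero likelihood, so the answer can still be wrong.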

The take-home message is that each imputed genotype has some likelihood of being incorrect. As we saw in the example above, common words are easy to impute while rare words are harder. The same is true of genetic variants: common variants that appear at a frequency of >5% in the population are usually imputed more accurately than very rare variants. Thus, using imputed data to check whether you are a carrier for a rare genetic disorder (e.g. Joubert syndrome) is a bad idea.

Another complication is that imputation shows variable success rates for different populations. You can think about imputing human genomes as imputing different languages. Our imputation algorithm knows a large number of "languages". However, certain populations have more "dialects" (genetic variations) than others. For example, we expect to do a better job on the genomes of Europeans and East Asians than on those of Africans, because genetic diversity in Africa is the highest of all continents.
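
The frequency rule of thumb above can be applied mechanically. The Python sketch below separates imputed variants into "common" and "rare" using the >5% cutoff from the text. The variant names and frequencies are invented, and split_by_frequency is a hypothetical helper, not part of DNA.Land's pipeline.

```python
COMMON_THRESHOLD = 0.05  # the ">5% of the population" cutoff from the text

# Hypothetical imputed variants: (identifier, population frequency).
variants = [
    ("variant_a", 0.31),
    ("variant_b", 0.04),
    ("variant_c", 0.001),
]


def split_by_frequency(variants, threshold=COMMON_THRESHOLD):
    """Separate variants above the cutoff from those at or below it."""
    common = [name for name, freq in variants if freq > threshold]
    rare = [name for name, freq in variants if freq <= threshold]
    return common, rare


common, rare = split_by_frequency(variants)
print("usually more accurate:", common)
print("treat with extra caution:", rare)
```

Being above the cutoff does not guarantee a correct call; it only means the variant falls in the group that is usually imputed more accurately.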

No reported value should be taken as-is without further careful analysis.


