Why You Should Merge Your Factory DNA Kits:  SNP Overlap Demystified


In this post, I want to briefly discuss why it is best practice to merge your factory kits from different testing companies before working with third-party raw DNA tools such as Borland Genetics or GEDmatch.  By factory kits, I simply mean raw DNA data files that the testing companies created directly by processing actual DNA samples.

The first concept to understand is that different testing companies sample different sets of data points along your chromosomes.  We refer to the totality of the data points sampled in a single DNA kit as that kit’s “template.”  Furthermore, individual testing companies have changed their chips over the years, which also results in different kit templates.  When you hear people talking about an Ancestry v1, Ancestry v2, or 23andMe v4 kit, they are referring to their tests by template revision.  The ISOGG Wiki does a great job of tracking precisely when templates changed at the different testing companies.

An individual data point sampled and reported in your raw DNA data file is called an SNP, short for single nucleotide polymorphism, and the abbreviation is usually pronounced “snip.”  At each SNP, a raw data file records the observed genotype at a specific location on a chromosome.  The genotype might be something like AA or CT or GG, and is written as two letters, each standing for one of the four nucleobases: adenine (A), cytosine (C), guanine (G), or thymine (T).  (Some SNPs may also be reported as I or D for insertions or deletions.)  We will return to SNPs and genotypes in greater detail in future Borland Genetics learning materials.  For this discussion, it is sufficient to know that factory kits from different companies, or from the same company if they were processed years apart, will likely have different templates, consisting of different sets of SNPs.
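As a rough sketch of what one such data point looks like in code, the Python below models a single SNP record.  The class name `SnpCall`, its fields, and the sample values are assumptions made purely for illustration; actual raw data files differ in layout from company to company.

```python
from dataclasses import dataclass

# The four nucleobases plus the insertion/deletion codes mentioned above.
VALID_ALLELES = set("ACGTID")


@dataclass(frozen=True)
class SnpCall:
    """One data point from a raw DNA file (hypothetical structure)."""
    chromosome: str
    position: int
    genotype: str  # for example "AA", "CT", or "GG"

    def is_valid(self) -> bool:
        # A genotype is treated here as exactly two allele codes,
        # each drawn from the allowed set.
        return len(self.genotype) == 2 and set(self.genotype) <= VALID_ALLELES


# Made-up example values, not taken from any real kit.
call = SnpCall(chromosome="1", position=1000, genotype="CT")
print(call.is_valid())
```

A kit's template, in these terms, is simply the set of chromosome and position pairs for which the kit holds a record.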

The next concept we need to understand is that in order to accurately determine whether two DNA kits share one or more matching segments, their templates must be sufficiently compatible.  That is, there must be sufficient overlap in the SNPs tested by the two testing companies.  Some pairings are notoriously incompatible and are known to cause major problems for matching algorithms, such as comparing an Ancestry v1 kit with a 23andMe v5 kit.  When you compare these two types of kits, you will generally not get accurate results, because they have poor SNP overlap.  GEDmatch flags these types of comparisons in red or pink to point this out to its users.


The chart above depicts a portion of a raw DNA file from my Ancestry v1 kit, side by side with my 23andMe v4 kit, to illustrate the concept of SNP overlap.  Here, I’ve used green to denote an SNP that both testing companies have sampled.  These data points can be used when making cross-platform comparisons between these types of kits.  I’ve used red to denote an SNP that only Ancestry sampled (red like an apple).  I’ve used orange to denote an SNP that only 23andMe sampled (orange like an orange).  Since of course you cannot compare apples to oranges, these SNPs are relatively useless when making comparisons involving these two types of DNA kits.  (Note for advanced readers:  I’m setting aside the concept of imputation for now, as it is beyond the scope of this discussion.)  The useful overlapping SNPs for comparison total only 8 in this example, out of 30 combined SNPs tested by the two companies.  This is a clear example of poor SNP overlap.
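The green, red, and orange groups can be sketched as set operations in Python.  The SNP names below are invented stand-ins, and the split of 6 Ancestry-only and 16 23andMe-only SNPs is inferred from the counts of 14 and 24 cited later in this post; real templates hold hundreds of thousands of SNPs.

```python
# Hypothetical SNP identifiers standing in for the 30 SNPs in the chart.
shared = {f"snp{i}" for i in range(1, 9)}                # green: 8 SNPs
ancestry_only = {f"snp{i}" for i in range(9, 15)}        # red: 6 SNPs
twenty_three_only = {f"snp{i}" for i in range(15, 31)}   # orange: 16 SNPs

ancestry_template = shared | ancestry_only
twenty_three_template = shared | twenty_three_only


def overlap_summary(template_a, template_b):
    """Count SNPs sampled by both templates, by only one, and in total."""
    return {
        "both": len(template_a & template_b),
        "only_a": len(template_a - template_b),
        "only_b": len(template_b - template_a),
        "combined": len(template_a | template_b),
    }


summary = overlap_summary(ancestry_template, twenty_three_template)
print(summary)
# {'both': 8, 'only_a': 6, 'only_b': 16, 'combined': 30}
```

Only the SNPs counted under `both` can take part in a direct comparison between the two factory kits.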

The third and final concept requisite to this discussion is that accurately comparing two strands of DNA, whether on the same template or on different templates, requires a minimum overlapping SNP density.  That is, there have to be enough compatible SNPs (green) per unit of length, or we’re not going to get an accurate result when comparing the kits to seek matching segments.  Through my own research, I have determined that the threshold is approximately 75 overlapping SNPs per cM.  Below this threshold, we will run into trouble in the form of an abundance of falsely matching segments.  With respect to a small 7 cM matching segment, for example, it would be wise to seriously second-guess the validity of a “matching” segment if there are not at least 525 compatible SNPs being compared (525 being 75 times 7).  Even with a 20 cM segment, it is far more likely to be a valid match if there are at least 3,000 overlapping SNPs compared.
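The 75-per-cM rule of thumb is easy to turn into a quick check.  The Python below is a minimal sketch: the function names are my own for illustration, and the threshold is the approximate figure given above, not a setting taken from any particular matching tool.

```python
import math

# Approximate threshold from the discussion above (overlapping SNPs per cM).
MIN_SNPS_PER_CM = 75


def min_overlapping_snps(segment_cm, density=MIN_SNPS_PER_CM):
    """Smallest overlapping SNP count that meets the density threshold."""
    return math.ceil(segment_cm * density)


def meets_density(overlapping_snps, segment_cm, density=MIN_SNPS_PER_CM):
    """True if a segment was compared on enough overlapping SNPs."""
    return overlapping_snps >= min_overlapping_snps(segment_cm, density)


print(min_overlapping_snps(7))   # 525
print(meets_density(400, 7))     # False: too few SNPs for a 7 cM segment
print(meets_density(600, 7))     # True
```

A segment that fails this check is not necessarily false, but it deserves the second look described above.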

Just as accurate DNA matching relies upon sufficient SNP overlap, SNP overlap also comes into play when reconstructing kits for your ancestors.  The two primary quality metrics we have for assessing synthetic DNA kits, such as those created using Borland Genetics, are coverage and resolution.  We want to create kits with as high a resolution as possible (defined as calculated SNPs per cM), so that they will perform as expected when used in comparisons with other kits.  Kits with higher resolution will generally be more likely to have sufficient overlapping SNPs for use in comparisons with the different types of factory kits.

One thing that we can and should do, if we’ve tested with more than one testing company, is create a merged kit.  Both the GEDmatch “super-kit” tool and the Borland Genetics “Humpty Dumpty” tool (option 3) accomplish this task; that is, they both take all of the data from the SNPs reported by both testing companies and create a kit for you on an expanded or combined template.
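Conceptually, a merge combines the SNP records of both kits into one.  The Python below is only a sketch of that idea and is not how the GEDmatch or Borland Genetics tools are implemented.  The function name `merge_kits`, the SNP names, and the rule of keeping the first kit's genotype when both kits report the same SNP are all assumptions made for illustration.

```python
def merge_kits(kit_a, kit_b):
    """Combine two kits (dicts of SNP name -> genotype) onto one template."""
    merged = dict(kit_a)
    for snp, genotype in kit_b.items():
        # Assumption: where both kits report a SNP, keep kit_a's genotype.
        merged.setdefault(snp, genotype)
    return merged


# Made-up genotypes for a handful of hypothetical SNPs.
ancestry_kit = {"snp1": "AA", "snp2": "CT", "snp3": "GG"}
twenty_three_kit = {"snp2": "CT", "snp4": "TT"}

merged_kit = merge_kits(ancestry_kit, twenty_three_kit)
print(sorted(merged_kit))
# ['snp1', 'snp2', 'snp3', 'snp4']
```

The merged kit's template is the union of the two input templates, which is what raises its overlap with kits from either company.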




Now, let’s return to some template comparison charts.  The comparison above illustrates the SNP overlap between a Borland Genetics merged kit and a factory Ancestry kit.  The overlap issue has been resolved, since when comparing the Borland Genetics merged kit to an Ancestry kit, now 14 SNPs will be factored into the comparison, just as if both kits were from Ancestry.


For the sake of completeness, let’s finally compare the Borland Genetics merged kit and the 23andMe kit.  In this comparison, 24 SNPs will be considered.  Remember, a comparison between an Ancestry kit and a 23andMe kit over this same span of chromosome would consider only 8 SNPs.  The rest were apples or oranges.  A comparison between this type of merged kit and a factory 23andMe kit from another donor will be as accurate as if both kits were 23andMe kits.  In other words, the merged kit performs well in comparisons with kits from either factory template.
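The three comparisons can be tallied in a few lines of Python.  As before, the SNP names are hypothetical placeholders chosen only so that the counts match the charts (8 shared, 14 on the Ancestry template, 24 on the 23andMe template).

```python
# Hypothetical SNP identifiers sized to match the charts in this post.
shared = {f"snp{i}" for i in range(1, 9)}                             # 8 SNPs
ancestry_template = shared | {f"snp{i}" for i in range(9, 15)}        # 14 SNPs
twenty_three_template = shared | {f"snp{i}" for i in range(15, 31)}   # 24 SNPs

# The merged kit's template is the union of the two factory templates.
merged_template = ancestry_template | twenty_three_template


def comparable_snps(template_a, template_b):
    """Number of SNPs that both templates sampled."""
    return len(template_a & template_b)


print(comparable_snps(ancestry_template, twenty_three_template))  # 8
print(comparable_snps(merged_template, ancestry_template))        # 14
print(comparable_snps(merged_template, twenty_three_template))    # 24
```

Because the merged template contains every SNP from both factory templates, its overlap with either one is that factory template in full.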

Furthermore, when you match this merged kit with, say, a MyHeritage or FTDNA kit, the merged kit will generally have increased resolution and will perform better in cross-platform matching with tests from other companies, even though the merge was between an Ancestry kit and a 23andMe kit.

Likewise, when you use one of these merged kits as input for a DNA reconstruction workflow (whether the GEDmatch Lazarus tool or tools in the Borland Genetics suite), the output, or your reconstructed ancestor kit, is going to have much higher resolution than had you used either one of the individual factory kits as input.

Finally, I have attached a link to a YouTube tutorial on how to create a Humpty Dumpty kit using the Borland Genetics web tools.

