Why You Should Merge Your Factory DNA Kits: SNP Overlap Demystified
In this post, I want to briefly discuss why it is best
practice to merge your factory kits from different testing companies before
working with third-party raw DNA tools such as Borland Genetics or GEDmatch. By factory kits, I simply mean raw DNA data
files that were created directly by the testing companies by processing actual
DNA samples.
The first concept to understand is that different testing
companies sample different sets of data points along your chromosomes. We refer to the totality of the data points
sampled in a single DNA kit as that kit’s “template.” Furthermore, individual testing companies
have changed their chips over the years, also resulting in different kit
templates. When you hear people talking
about an Ancestry v1, Ancestry v2, or 23andMe v4 kit, they are referring
to their tests by template version. The ISOGG Wiki does a great job of tracking information about precisely when changes in
templates occurred at the different testing companies.
An individual data point sampled and reported in your raw
DNA data file is called an SNP, short for single nucleotide polymorphism, and the
abbreviation is usually pronounced “snip.”
At each SNP, a raw data file records the observed genotype at a specific
location on a chromosome. The genotype might
be something like AA or CT or GG, and is written using the abbreviations for two of the
four nucleobases: adenine (A), cytosine (C), guanine (G), or thymine (T).
(Some SNPs may also
be reported as I or D for insertions or deletions.) We will return to SNPs and genotypes in greater
detail in future Borland Genetics learning materials. For this discussion, it is sufficient to know
that factory kits from different companies, or from the same company if they
were originally processed years apart, will likely have different templates,
consisting of different sets of SNPs.
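To make the idea of a template concrete, here is a minimal Python sketch that reads raw-data text and collects the SNPs it reports. The layout is an assumption on my part: tab-separated lines of rsid, chromosome, and position, followed either by a single genotype column (as in 23andMe downloads) or by two allele columns (as in Ancestry downloads), with comment lines starting with "#". The function name read_template and the sample rsids are made up for illustration.

```python
def read_template(raw_text):
    """Return {rsid: genotype} for every SNP line found in raw_text.

    Assumed layout (not checked against any official spec):
    rsid <TAB> chromosome <TAB> position <TAB> genotype
    or
    rsid <TAB> chromosome <TAB> position <TAB> allele1 <TAB> allele2
    """
    snps = {}
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if fields[0].lower() == "rsid" or len(fields) < 4:
            continue  # skip the header row and malformed lines
        # Join one genotype column ("CT") or two allele columns ("C", "T").
        snps[fields[0]] = "".join(fields[3:5])
    return snps


# Hypothetical sample rows, not real data.
sample = (
    "# example file\n"
    "rsid\tchromosome\tposition\tgenotype\n"
    "rs0001\t1\t1000\tAA\n"
    "rs0002\t1\t2000\tCT\n"
)
kit = read_template(sample)
template = set(kit)  # the kit's template is simply the set of SNPs it reports
print(len(template), kit["rs0002"])
```

The keys of the returned dictionary are the kit's template; two kits have different templates whenever those key sets differ.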
The next concept we need to understand is that in order to
accurately determine whether two DNA kits share one or more matching segments,
their templates must be sufficiently compatible. That is, there must be sufficient overlap in
the SNPs tested by the two testing companies.
Some types of comparisons are notoriously incompatible and are known to
cause major problems for matching algorithms, such as when one attempts to compare
an Ancestry v1 kit with a 23andMe v5 kit.
When you compare these two types of kits, you will generally not get
accurate results because they have poor SNP overlap. GEDmatch flags these
types of comparisons in red or pink to point that out to its
users.
The chart above depicts a portion of a raw DNA file from my
Ancestry v1 kit, side by side with my 23andMe v4 kit, to illustrate the
concept of SNP overlap. Here, I’ve used
green to denote an SNP that both testing companies have sampled. These data points can be used when making cross-platform
comparisons between these types of kits.
I’ve used red to denote an SNP that only Ancestry sampled (red like an
apple). I’ve used orange to denote an
SNP that only 23andMe sampled (orange like an orange). Since of course you cannot compare apples to
oranges, these SNPs are relatively useless when making comparisons involving
these two types of DNA kits. (Note for
advanced readers: I’m setting aside the
concept of imputation for now as it is beyond the scope of this discussion.) The number of useful overlapping SNPs for
comparison totals only 8 in this example, out of 30 combined SNPs
tested by the two companies. This is a
clear example of poor SNP overlap.
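The green, red, and orange groups are just set operations on the two templates. The sketch below uses made-up SNP IDs arranged to reproduce the counts in this example (14 SNPs on the Ancestry side, 24 on the 23andMe side, 8 shared, 30 combined); the function name classify_overlap is hypothetical and is not part of any tool mentioned here.

```python
def classify_overlap(template_a, template_b):
    """Split two templates into shared SNPs and SNPs unique to each side."""
    a, b = set(template_a), set(template_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


# Made-up IDs: snp0..snp13 on the Ancestry side, snp6..snp29 on the 23andMe side.
ancestry = {f"snp{i}" for i in range(14)}
twenty_three = {f"snp{i}" for i in range(6, 30)}

groups = classify_overlap(ancestry, twenty_three)
combined = len(ancestry | twenty_three)

print(len(groups["both"]))    # green: usable in a cross-platform comparison
print(len(groups["only_a"]))  # red: sampled only by Ancestry
print(len(groups["only_b"]))  # orange: sampled only by 23andMe
print(combined)               # all SNPs tested by either company
```

Only the "both" group contributes to a comparison between the two factory kits, which is 8 of the 30 combined SNPs here.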
The third and final concept requisite to this discussion is
that accurately comparing two strands of DNA, whether on the same template or
on different templates, requires a minimum overlapping SNP density. That is, there have to be enough compatible
SNPs (green) per unit of length, or we’re not going to get an accurate result
when comparing the kits to seek matching segments. Through my own research, I have determined
that the threshold is approximately 75 overlapping SNPs per cM. Below this threshold, we will run into
trouble in the form of an abundance of falsely matching segments. With respect to a small 7 cM matching
segment, for example, it would be wise to seriously second-guess the validity
of a “matching” segment if there are not at least 525 compatible SNPs being
compared (525 being 75 times 7). Even
with a 20 cM segment, it is far more likely to be a valid match if there are at
least 3,000 overlapping SNPs compared.
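The 7 cM arithmetic can be written as a tiny check. This is only a sketch of the rule of thumb above, using the approximate threshold of 75 overlapping SNPs per cM; the function names are hypothetical and do not come from any matching tool.

```python
MIN_SNPS_PER_CM = 75  # approximate threshold quoted above


def min_overlapping_snps(segment_cm, per_cm=MIN_SNPS_PER_CM):
    """Smallest overlapping-SNP count that meets the density threshold."""
    return per_cm * segment_cm


def is_dense_enough(overlapping_snps, segment_cm, per_cm=MIN_SNPS_PER_CM):
    """True if a segment was compared on enough overlapping SNPs."""
    return overlapping_snps >= min_overlapping_snps(segment_cm, per_cm)


print(min_overlapping_snps(7))   # 525
print(is_dense_enough(400, 7))   # False: second-guess this segment
print(is_dense_enough(600, 7))   # True
```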
Just as accurate DNA matching relies upon sufficiency of SNP
overlap, SNP overlap also comes into play when reconstructing kits for your
ancestors. The two primary quality
metrics we have for assessing synthetic DNA kits, such as those created using
Borland Genetics, are coverage and resolution.
We want to create kits with as high a resolution as possible (defined as
calculated SNPs per cM), so that they will perform as expected when used in
comparisons with other kits. Kits with
higher resolution will generally be more likely to have sufficient overlapping
SNPs for use in comparisons with the different types of factory kits.
One thing that we can and should do if we’ve tested with
more than one testing company is create a merged kit. Both the GEDmatch “super-kit” tool and the
Borland Genetics “Humpty Dumpty” tool (option 3) accomplish this task; that is,
they both take all of the data from the SNPs reported by both testing companies
and create a kit for you on an expanded or combined template.
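Conceptually, a merge is a union of the two kits keyed by SNP. The sketch below is not how GEDmatch or Borland Genetics actually implement their tools; it is a minimal illustration under my own assumptions: kits are dictionaries of rsid to genotype, "--" stands for a no-call, and a disagreement between the two kits is recorded as a no-call. All names and sample values are hypothetical.

```python
def merge_kits(kit_a, kit_b, no_call="--"):
    """Combine two {rsid: genotype} kits from the same person.

    Assumed policy: keep any real call over a no-call, treat "CT" and "TC"
    as the same genotype, and mark genuinely conflicting calls as no-calls.
    """
    merged = dict(kit_a)
    for rsid, genotype in kit_b.items():
        if rsid not in merged or merged[rsid] == no_call:
            merged[rsid] = genotype  # new SNP, or fills in a no-call
        elif genotype != no_call and sorted(genotype) != sorted(merged[rsid]):
            merged[rsid] = no_call   # the two kits disagree
    return merged


# Hypothetical genotypes, not real data.
kit_a = {"rs1": "AA", "rs2": "CT", "rs3": "--", "rs5": "AA"}
kit_b = {"rs2": "TC", "rs3": "GG", "rs4": "AG", "rs5": "AG"}

merged = merge_kits(kit_a, kit_b)
print(merged)
```

The merged kit reports every SNP that either factory kit reported, which is what gives it the expanded template.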
Now, let’s return to some template comparison charts. The comparison above illustrates the SNP
overlap between a Borland Genetics merged kit and a factory Ancestry kit. The overlap issue has been resolved, since when
comparing the Borland Genetics merged kit to an Ancestry kit, now 14 SNPs will
be factored into the comparison, just as if both kits were from Ancestry.
For the sake of completeness, let’s finally compare the Borland Genetics merged kit and the
23andMe kit. In this comparison, 24
SNPs will be considered. Remember, a
comparison between an Ancestry and a 23andMe kit over this same span of
chromosome would only consider 8 SNPs. The
rest were apples or oranges. A
comparison between this type of merged kit and a factory 23andMe kit of
another donor will be of the same accuracy as if both kits were 23andMe
kits. In other words, the merged kit performs
well in comparisons with kits from either factory template.
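The 14, 24, and 8 figures follow directly from treating the merged template as the union of the two factory templates. Here is a short check using made-up SNP IDs sized to match this example; usable_snps is a hypothetical helper name.

```python
def usable_snps(template_a, template_b):
    """Number of SNPs available when comparing kits on these two templates."""
    return len(set(template_a) & set(template_b))


# Made-up IDs sized to match the example: 14 Ancestry SNPs, 24 23andMe SNPs.
ancestry = {f"snp{i}" for i in range(14)}
twenty_three = {f"snp{i}" for i in range(6, 30)}
merged = ancestry | twenty_three  # the merged kit's expanded template

print(usable_snps(merged, ancestry))        # 14, same as Ancestry vs. Ancestry
print(usable_snps(merged, twenty_three))    # 24, same as 23andMe vs. 23andMe
print(usable_snps(ancestry, twenty_three))  # 8, the two factory kits directly
```

Because the merged template contains each factory template in full, the merged kit never loses SNPs against either one.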
Furthermore, when you match this merged kit with, say, a
MyHeritage or FTDNA kit, even though the merge was between an Ancestry and a
23andMe kit, the merged kit will generally have increased resolution
and will also perform better in cross-platform matching with tests from
other companies.
Likewise, when you use one of these merged kits as input for
a DNA reconstruction workflow (whether the GEDmatch Lazarus tool or tools in
the Borland Genetics suite), the output, or your reconstructed ancestor kit, is
going to have much higher resolution than if you had used either one of the
individual factory kits as input.
Finally, I have attached a link to a YouTube tutorial on how
to create a Humpty Dumpty kit using the Borland Genetics web tools.