What's the Difference Between Build 36 and Build 37, and Should I Care?
To tackle this question, we first need to understand what a “build” is.
Perhaps you have seen references to “build 36” or “build 37” on FTDNA,
GEDmatch or Borland Genetics, or to “build 38” on SNPedia. Never heard of SNPedia? We’ll cover that in a future post on this blog. To get to the point, these
terms refer to sequential versions of the map of the human genome accepted
by the scientific community (or, more precisely, by a dedicated group of scientists known as the Genome Reference Consortium). Each
build represents a refined understanding of the sequence of base pairs along
our chromosomes.
A loose analogy to build versions would be the sequential
editions of the Rand McNally road atlas, but with one major difference. When Rand McNally publishes a new edition
of its atlas, it generally does so because the network of roads has changed:
new roads have been built and existing routes have been
modified. However, when we talk about
subsequent builds of the human genome, the differences from the previous version
are not characterized by changes in the geography of our chromosomes, but
rather reflect a refinement in our understanding of where our genes are located
(and have been located all along) on our chromosomes. So a better analogy might be if Rand McNally
made an atlas of Mars, and each subsequent version was based on clearer
photographs, as the ability to photograph the planet improved over time. Build 37 of Rand McNally’s Atlas of Mars
would therefore represent the 37th time their cartographers decided that improvements
in technology warranted replacing the previous version, or build, with a new set
of maps.
Ridges and trenches of the Medusae Fossae as viewed from space. |
When Rand McNally releases a new version of their atlas, they also change the index, because maybe their cartographers discovered that a ridge of the Medusae Fossae (giant trenches on the planet) was on the map grid at D-16, but refinement led the mapmakers to move it a bit to E-16 based on higher resolution images of the planet, requiring that the atlas be re-indexed. If you want to be really technical, it’s more like the previous images only showed 8 ridges, and the new clearer image showed 10, so the ridges themselves need to be renamed on the map. Since they were previously named from west to east, Ridge #1 kept its position, but all the way to the east, Ridge #8 had to be renamed Ridge #10, and maybe Ridge #5 got renamed Ridge #6. If you’re keen on math, you could refer to the differences between the maps as differences in coordinate systems. If you’re not, don’t fret, but you might see that terminology used sometimes, i.e. “build 37 coordinates.”
Good times, guiding Relative Race team green (season 2) across Florida with my outdated road maps! |
When finances are tight, maybe you can’t afford Rand McNally’s 37th Atlas of Mars, so you’ll rely on the 36th edition. What will the consequences be? You’ll probably be fine on your journey if you are traveling in the west parts of the region, but you might miss your exit entirely when navigating the eastern portions of the range, if the exit signs use build 37 enumeration.
I think you’ve got it now, so let’s move the discussion
forward to builds of the human genome as they apply to practical genetic
genealogy. First, let’s get an easy concept
out of the way. No currently popular genetic
genealogy websites, software or tools that I can think of (and I’ve been known
to dabble) indicate their results in builds 1 through 35, nor accept builds 1-35
coordinates as input. You can think of
these as historical atlases for our purposes.
They were important and revolutionary in their time, but the Tim Hortons
currently in Medusa Trench #8 hadn’t even been built when the ridges and valleys
were mapped and assigned their numbers.
So we would only even consider using builds 36 through 38 if we’re
attempting to navigate our chromosomes for modern purposes.
The next big concept is that whenever a website or tool
gives you segment data (or asks you to input or transfer segment data), you
need to know (or the website or tool needs to know), in which build coordinates the
start and stop points on your chromosomes are being reported. Otherwise, depending on what page of the
atlas (which chromosome) and where on the map your segment lies (toward the
west or toward the east), you might be providing miles while the website or tool
might be expecting kilometers. On sites
like DNA Painter, where you are not prompted for build information, you might unknowingly
be providing some segment data (say from 23andMe) that’s measured in kilometers/build
37 and some data (say from FTDNA) that’s measured in miles/build 36, while
expecting your results to make sense.
Of all the major genetic genealogy tools on the market, only
GEDmatch has some features where you can work in build 38 (but they also
provide data in builds 36 and 37), and the website SNPedia (sort of a Wikipedia
of human genes) refers to build 38 positions unless you go back in the article
history. All other players in the market continue to use build 36, build 37 or both, and that’s not likely
to change soon, as any anticipated benefit in terms of accuracy would be far
outweighed by the cost of overhauling/converting data and/or algorithms.
Depicted is the gene responsible for my colorblindness, as indexed on SNPedia. Note the reference to GRCh38, which is a fancy abbreviation for Genome Reference Consortium build 38. |
Now that we know we only have to worry about build 36 or build 37, let’s talk about where it makes a difference. First off, let’s talk raw data transfer. As far as providing users their DNA data as a raw data file, all sites with the exception of FTDNA exclusively provide build 37 data. FTDNA, instead, provides users with flexible but confusing options of builds 36 or 37, and then asks users to select whether the data is to be concatenated or not (i.e., whether or not to include the X chromosome). For purposes of transferring the raw DNA information to other sites (including Borland Genetics), you will always select “build 37 concatenated” (although GEDmatch may still also allow “build 36 concatenated” uploads and provide a conversion). So that’s easy; when it comes to raw DNA data, build 37 is king!
Let’s move on to segments.
Tool suites like DNA Gedcom, that gather segment data from the various testing
company sites, form their clusters and perform their analyses using genetic
networks that do not rely upon build.
You do not have to worry about “which build?” on these types of sites,
because they form their genetic networks using data from one site at a time.
Sites like DNA Painter and Borland Genetics, however, allow
for cross-comparisons of segment data in some form or another. Each of these sites takes a different
approach. With DNA Painter, you are in
total control. You are not prompted for a
build number, so if you want to keep track of it, you should record which
build you used in compiling each of your chromosome maps in the caption at the
top of each map, and you should be consistent on any given map. (I’ll illustrate the results of inconsistency
a bit further down.) 23andMe,
MyHeritage and Borland Genetics report all segment data in build 37. FTDNA reports segments in build 36, as did “classic
GEDmatch.” Ancestry takes an
aggressively prudent approach to segment analysis, and simply refuses to
provide segment data to its customers. The
new post-merger GEDmatch gives users the option to select between build 36, 37
or 38 for output, but build 37 is the default.
Borland Genetics prompts the user to choose build 36 or build 37
coordinates when providing segment data, and will automatically make the
necessary conversions at the time when its reconstruction tools are applied within segment boundaries.
Borland Genetics “Extract Segments” tool interface, prompting the user to select between builds. |
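If you like to keep your bookkeeping in code, the reporting builds described above can be captured in a small lookup table. What follows is only a sketch in Python: the names SEGMENT_BUILD, builds_on_map and is_consistent are made up for illustration and are not part of any site's software, and GEDmatch is listed at its default of build 37 even though users can choose another build there.

```python
# Build in which each source reports segment data, as described in this post.
# Assumption: GEDmatch is recorded at its default (37); users may select 36 or 38.
SEGMENT_BUILD = {
    "23andMe": 37,
    "MyHeritage": 37,
    "Borland Genetics": 37,
    "GEDmatch": 37,
    "FTDNA": 36,
}


def builds_on_map(sources):
    """Return the set of builds that these segment sources would put on one map."""
    return {SEGMENT_BUILD[source] for source in sources}


def is_consistent(sources):
    """True when every source on the map reports in the same build."""
    return len(builds_on_map(sources)) <= 1


# A map built only from build 37 sources is fine; adding FTDNA mixes builds.
same_build = is_consistent(["23andMe", "MyHeritage"])
mixed_build = is_consistent(["23andMe", "FTDNA"])
```

A check like this only tells you that a conversion is needed before the segments go on the same map; it does not perform the conversion.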
What if we’re not consistent? In other words, what if we create a DNA
Painter chromosome map and use some segments from FTDNA and some from
23andMe? Or what if we select the wrong build
in Borland Genetics and then try to extract segments corresponding to the
ancestors we’ve mapped out? The short answer
is that it will depend on which page in the atlas we’re looking at (which
chromosome), and how far west to east along the chromosome our mapped segments
fall. The chart below illustrates the
maximum degree of divergence on each chromosome, according to my calculations.
Green is OK. Red is bad. Note the units of MBP (mega-base pairs), which do not translate 1:1 to cM. |
So if you’re mixing and matching builds on chromosomes 2, 4, 5, 6, 7, 8, 10, or the X chromosome, it’s not going to be a big deal (but if the goal is to create accurately reconstructed ancestors using Borland Genetics, it pays to be consistent). Chromosomes marked in orange will cause you minor problems when mixing and matching builds. Chromosomes 1, 15, 17 and 19 will be completely wonky! If you attempt to compare apples and oranges on the east side of chromosome 19, the 4.7 MBP offset at that end will result in a major discrepancy. That portion of the chromosome also has a very high recombination rate, so 4.7 MBP at the east tip corresponds to almost 20 cM.
That is
to say, if you use build 36 coordinates for a sizable 20 cM segment at the east
tip of chromosome 19, and compare it to an identical segment recorded in build
37 coordinates, when you enter the segment data into DNA Painter, the segments will
appear to not even overlap! Therefore,
if you wish to make an accurate map that includes both FTDNA and 23andMe
segments, you’ll need an app that converts between builds before adding the
build 36 FTDNA data to your map. For example, you can
use the free conversion tool listed under the “New Free Tools” menu at Borland
Genetics, which allows you to paste data with columns of “chromosome start stop”
from Excel and convert between builds.
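To see why two copies of the same segment can fail to overlap, here is a rough sketch in Python. The segment positions are hypothetical, and the sketch assumes the whole 4.7 MBP divergence acts as one flat shift, with build 36 positions being the larger ones. A real conversion varies by position along the chromosome, so treat this as an illustration of the problem and leave actual conversions to a proper tool.

```python
def overlap_bp(a_start, a_stop, b_start, b_stop):
    """Number of base pairs shared by two segments on the same chromosome."""
    return max(0, min(a_stop, b_stop) - max(a_start, b_start))


# Maximum divergence at the east tip of chromosome 19, per the chart above.
OFFSET_BP = 4_700_000

# Hypothetical segment near the east tip, in build 37 coordinates.
segment_b37 = (54_000_000, 58_000_000)

# The same stretch of DNA as it might be reported in build 36 coordinates
# (assumption: a flat shift, with build 36 positions larger).
segment_b36 = (segment_b37[0] + OFFSET_BP, segment_b37[1] + OFFSET_BP)

# Mixing builds: the 4 MBP segment is shorter than the 4.7 MBP shift,
# so the two copies share no positions at all.
naive_overlap = overlap_bp(*segment_b37, *segment_b36)

# After shifting the build 36 copy back, the two copies line up exactly.
converted = (segment_b36[0] - OFFSET_BP, segment_b36[1] - OFFSET_BP)
fixed_overlap = overlap_bp(*segment_b37, *converted)

print(naive_overlap, fixed_overlap)
```

Under these assumptions the mixed-build comparison reports zero shared base pairs, while the converted comparison recovers the full 4 MBP segment.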
For the sake of completeness, the Borland Genetics “Phase Map Locker” allows you to upload data in build 36 or build 37 (and you must select which). It then allows for download in the same coordinates as originally uploaded. However, when Borland Genetics applies the map in a reconstruction workflow, it automatically makes any necessary conversions, assuming the user properly identified the build when the map was first uploaded.
If you got lost anywhere above, trying to make sense of the
technical details, don’t worry. As long
as you understand the following principles of genetic genealogy, you will be
ahead of the pack:
- Build 37 is king! (For everyone except FTDNA)
- Don’t cross the streams! (Use build 36 or build 37, but never both)
- For raw data transfers from FTDNA, if you select “build 37 concatenated,” you will never be led astray!
- Don’t get off at Exit 6 and expect to find a Tim Hortons. They renumbered the exit to 8 when they switched to the metric system. Especially on chromosome 19!!!