What's the Difference Between Build 36 and Build 37, and Should I Care?


To tackle this question, we first need to understand what a “build” is.  You may have seen references to “build 36” or “build 37” on FTDNA, GEDmatch or Borland Genetics, or to “build 38” on SNPedia.  Never heard of SNPedia?  We’ll cover that in a future post on this blog.  But to get to the point, these terms refer to sequential versions of the map of the human genome accepted by the scientific community (or by a dedicated group of scientists within it known as the Genome Reference Consortium).  Each build represents a refined understanding of the sequence of base pairs along our chromosomes.

A loose analogy to build versions would be the sequential editions of the Rand McNally road atlas, but with one major difference.  When Rand McNally publishes a new edition of its atlas, it generally does so because the road network has changed: new roads have been built and existing routes have been modified.  However, when we talk about subsequent builds of the human genome, the differences from the previous version do not reflect changes in the geography of our chromosomes, but rather a refinement in our understanding of where our genes are located (and have been located all along) on our chromosomes.  So a better analogy might be if Rand McNally made an atlas of Mars, and each subsequent version was based on clearer photographs, as the ability to photograph the planet improved over time.  Build 37 of Rand McNally’s Atlas of Mars would therefore be the 37th set of maps, issued because their cartographers decided that improvements in technology warranted replacing the previous version, or build.
Ridges and trenches of the Medusae Fossae as viewed from space.

When Rand McNally releases a new version of their atlas, they also change the index.  Maybe a ridge of the Medusae Fossae (giant trenches on the planet) was on the map grid at D-16, but higher resolution images of the planet led the mapmakers to move it a bit to E-16, requiring that the atlas be re-indexed.  If you want to be really technical, it’s more like the previous images only showed 8 ridges, and the new clearer image showed 10, so the ridges themselves need to be renumbered on the map.  Since they were previously numbered from west to east, Ridge #1 kept its number, but all the way to the east, Ridge #8 had to be renamed Ridge #10, and maybe Ridge #5 got renamed Ridge #6.  If you’re keen on math, you could refer to the differences between the maps as differences in coordinate systems.  If you’re not, don’t fret, but you might see that terminology used sometimes, e.g. “build 37 coordinates.”
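For the mathematically inclined, here is a tiny Python sketch of the ridge renumbering described above.  It is only a toy: where the two newly spotted ridges sit is my own assumption, chosen so the result matches the story (Ridge #1 stays put, Ridge #5 becomes #6, Ridge #8 becomes #10).

```python
# Ridges seen in the sharper image, listed west to east.  Each entry is the
# ridge's OLD number, or None for a ridge the older image missed.
# The positions of the two new ridges are made up for illustration.
new_survey = [1, 2, 3, 4, None, 5, 6, 7, None, 8]


def old_to_new(survey):
    """Return {old_number: new_number} for every previously known ridge."""
    return {
        old: new
        for new, old in enumerate(survey, start=1)
        if old is not None
    }


mapping = old_to_new(new_survey)
print(mapping[1], mapping[5], mapping[8])  # 1 6 10
```

Nothing on the planet moved; only the numbering did.  The farther east you go, the more newly counted ridges lie behind you, and the bigger the gap between old and new numbers.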

Good times, guiding Relative Race team green (season 2) across Florida with my outdated road maps!

When finances are tight, maybe you can’t afford Rand McNally’s 37th Atlas of Mars, so you’ll rely on the 36th edition.  What will the consequences be?  You’ll probably be fine on your journey if you are traveling in the western parts of the region, but you might miss your exit entirely when navigating the eastern portions of the range, if the exit signs use build 37 numbering.

I think you’ve got it now, so let’s move the discussion forward to builds of the human genome as they apply to practical genetic genealogy.  First, let’s get an easy concept out of the way.  No currently popular genetic genealogy websites, software or tools that I can think of (and I’ve been known to dabble) report their results in builds 1 through 35, or accept build 1 through 35 coordinates as input.  You can think of these as historical atlases for our purposes.  They were important and revolutionary in their time, but the Tim Hortons currently in Medusa Trench #8 wasn’t even built yet when the ridges and valleys were mapped and assigned their numbers.  So we would only even consider using builds 36 through 38 if we’re attempting to navigate our chromosomes for modern purposes.

The next big concept is that whenever a website or tool gives you segment data (or asks you to input or transfer segment data), you need to know (or the website or tool needs to know) in which build coordinates the start and stop points on your chromosomes are being reported.  Otherwise, depending on what page of the atlas (which chromosome) and where on the map your segment lies (toward the west or toward the east), you might be providing miles while the website or tool might be expecting kilometers.  On sites like DNA Painter, where you are not prompted for build information, you might unknowingly be providing some segment data (say from 23andMe) that’s measured in kilometers/build 37 and some data (say from FTDNA) that’s measured in miles/build 36, and still expecting your results to make sense.
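If you keep your own segment spreadsheets or scripts, one way to avoid the miles-versus-kilometers trap is to store the build right alongside every segment.  Here is a minimal Python sketch of that idea.  The `Segment` class, the function names and the coordinates are all hypothetical and do not come from any of the sites mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chrom: str    # chromosome label, e.g. "19" or "X"
    start: int    # start position in base pairs
    end: int      # end position in base pairs
    build: int    # 36 or 37: the coordinate system of start/end
    source: str   # where the segment was copied from


def builds_in_use(segments):
    """Return the sorted list of distinct builds among the segments."""
    return sorted({s.build for s in segments})


def require_single_build(segments):
    """Raise ValueError if the segments mix coordinate systems."""
    builds = builds_in_use(segments)
    if len(builds) > 1:
        raise ValueError(f"segments mix builds {builds}; convert before comparing")


# Made-up coordinates, for illustration only.
my_map = [
    Segment("19", 54_000_000, 58_500_000, build=37, source="23andMe"),
    Segment("19", 58_700_000, 63_200_000, build=36, source="FTDNA"),
]
print(builds_in_use(my_map))  # [36, 37]
```

Calling `require_single_build(my_map)` on this list raises an error, which is exactly the warning you would want before painting both segments on the same map.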

Of all the major genetic genealogy tools on the market, only GEDmatch has some features where you can work in build 38 (but they also provide data in builds 36 and 37), and the website SNPedia (sort of a Wikipedia of human genes) refers to build 38 positions unless you go back in the article history.  All other players in the market continue to use build 36, build 37 or both, and that’s unlikely to change soon, as any anticipated benefit in terms of accuracy would be far outweighed by the cost of overhauling or converting data and algorithms.

Depicted is the gene responsible for my colorblindness, as indexed on SNPedia.  Note the reference to GRCh38, which is a fancy abbreviation for Genome Reference Consortium build 38.

Now that we know we only have to worry about build 36 or build 37, let’s talk about where it makes a difference.  First off, let’s talk raw data transfer.  As far as providing users their DNA data as a raw data file, all sites with the exception of FTDNA exclusively provide build 37 data.  FTDNA, instead, provides users with flexible but confusing options of builds 36 or 37, and then asks users to select whether the data is to be concatenated or not (i.e., whether or not to include the X chromosome).  For purposes of transferring the raw DNA information to other sites (including Borland Genetics), you will always select “build 37 concatenated” (although GEDmatch may still also allow “build 36 concatenated” uploads and provide a conversion).  So that’s easy; when it comes to raw DNA data, build 37 is king!

Let’s move on to segments.  Tool suites like DNA Gedcom, that gather segment data from the various testing company sites, form their clusters and perform their analyses using genetic networks that do not rely upon build.  You do not have to worry about “which build?” on these types of sites, because they form their genetic networks using data from one site at a time.

Sites like DNA Painter and Borland Genetics, however, allow for cross-comparisons of segment data in some form or another.  Each of these sites takes a different approach.  With DNA Painter, you are in total control.  You are not prompted for build number, and if you want to keep track of it, you should record which build you used in compiling each of your chromosome maps in the caption at the top of each map, and you should be consistent on any given map.  (I’ll illustrate the results of inconsistency a bit further down.)  23andMe, MyHeritage and Borland Genetics report all segment data in build 37.  FTDNA reports segments in build 36, as did “classic GEDmatch.”  Ancestry takes an aggressively prudent approach to segment analysis, and simply refuses to provide segment data to its customers.  The new post-merger GEDmatch gives users the option to select between build 36, 37 or 38 for output, but build 37 is the default.  Borland Genetics prompts the user to choose build 36 or build 37 coordinates when providing segment data, and will automatically make the necessary conversions at the time when its reconstruction tools are applied within segment boundaries.
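For those who like a cheat sheet, the reporting builds listed above fit in a small lookup table.  The Python sketch below simply restates this paragraph as of this writing.  The dictionary keys and the `needs_conversion` helper are my own labels for illustration, not anything the sites themselves provide.

```python
# Build used for segment output, as described in the paragraph above.
# Ancestry is absent because it does not provide segment data at all.
REPORTED_BUILD = {
    "23andMe": 37,
    "MyHeritage": 37,
    "FTDNA": 36,
    "GEDmatch (classic)": 36,
    "GEDmatch (new, default setting)": 37,
}


def needs_conversion(sources, target_build=37):
    """Return the sources whose segments are not already in target_build."""
    return [s for s in sources if REPORTED_BUILD[s] != target_build]


print(needs_conversion(["FTDNA", "23andMe", "MyHeritage"]))  # ['FTDNA']
```

If you would rather keep a map in build 36, pass `target_build=36` and the list flips to show the build 37 sources instead.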

Borland Genetics “Extract Segments” tool interface, prompting user to select between builds.

What if we’re not consistent?  In other words, what if we create a DNA Painter chromosome map and use some segments from FTDNA and some from 23andMe?  Or what if we select the wrong build in Borland Genetics and then try to extract segments corresponding to the ancestors we’ve mapped out?  The short answer is that it will depend on which page in the atlas we’re looking at (which chromosome), and how far west to east along the chromosome our mapped segments fall.  The chart below illustrates the maximum degree of divergence on each chromosome, according to my calculations.

Green is OK.  Red is bad.  Note that the units are MBP (mega-base pairs), which do not translate 1:1 to cM.

So if you’re mixing and matching builds on chromosomes 2, 4, 5, 6, 7, 8, 10, or the X chromosome, it’s not going to be a big deal (but if the goal is to create accurately reconstructed ancestors using Borland Genetics, it pays to be consistent).  Chromosomes marked in orange will cause you minor problems when mixing and matching builds.  Chromosomes 1, 15, 17 and 19 will be completely wonky!  If you attempt to compare apples and oranges on the east side of chromosome 19, the 4.7 MBP offset at that end will result in a major discrepancy.  That portion of the chromosome also has a very high recombination rate, so 4.7 MBP at the east tip corresponds to almost 20 cM.

That is to say, if you use build 36 coordinates for a sizable 20 cM segment at the east tip of chromosome 19, and compare it to an identical segment recorded in build 37 coordinates, when you enter the segment data into DNA Painter, the segments will appear to not even overlap!  Therefore, if you wish to make an accurate map that includes both FTDNA and 23andMe segments, you’ll need an app that converts between builds before adding the build 36 FTDNA data to your map.  For example, you can use the free conversion tool listed under the “New Free Tools” menu at Borland Genetics, which allows you to paste data with columns of “chromosome start stop” from Excel and convert between builds.
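To see why the two copies of that segment fail to overlap, here is a rough Python sketch.  It parses pasted “chromosome start stop” rows and compares a build 36 segment with its build 37 twin, before and after conversion.  Big caveat: the coordinates are made up, and the “conversion” is a placeholder that applies one constant 4.7 MBP shift.  The size and direction of that shift are assumptions for illustration only.  Real conversions use a position-by-position lookup between builds, and this is not how the Borland Genetics tool is implemented.

```python
def parse_rows(pasted):
    """Parse whitespace- or tab-separated 'chromosome start stop' rows."""
    rows = []
    for line in pasted.strip().splitlines():
        chrom, start, stop = line.split()
        rows.append((chrom, int(start), int(stop)))
    return rows


def overlap_bp(a, b):
    """Base pairs shared by two (chrom, start, stop) segments; 0 if none."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


# PLACEHOLDER ONLY: one constant shift per chromosome standing in for a
# real build 36 -> build 37 lookup.  Roughly the east-tip offset on 19.
TOY_SHIFT_36_TO_37 = {"19": -4_700_000}


def toy_36_to_37(row):
    """Shift a build 36 row into (pretend) build 37 coordinates."""
    chrom, start, stop = row
    shift = TOY_SHIFT_36_TO_37.get(chrom, 0)
    return (chrom, start + shift, stop + shift)


# Hypothetical rows: the same stretch of DNA written in each build.
ftdna_b36 = parse_rows("19\t58700000\t63200000")[0]
other_b37 = ("19", 54_000_000, 58_500_000)

print(overlap_bp(ftdna_b36, other_b37))                # 0
print(overlap_bp(toy_36_to_37(ftdna_b36), other_b37))  # 4500000
```

Because the segment is shorter than the offset, the unconverted copy lands entirely past the end of its twin.  After conversion, the two line up along their full 4.5 MBP length.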

For sake of completeness, the Borland Genetics “Phase Map Locker” allows you to upload data in build 36 or build 37 (and you must select which).  It then allows for download in the same coordinates as originally uploaded.  However, when Borland Genetics applies the map in a reconstruction workflow, it automatically makes any necessary conversions, assuming the user properly identified the build when the map was first uploaded.

If you got lost anywhere above, trying to make sense of the technical details, don’t worry.  As long as you understand the following principles of genetic genealogy, you will be ahead of the pack:
  1. Build 37 is king!  (Of everyone except FTDNA)
  2. Don’t cross the streams!  (Use build 36 or build 37, but never both)
  3. For raw data transfers from FTDNA, if you select “build 37 concatenated,” you will never be led astray!
  4. Don’t get off at Exit 6 and expect to find a Tim Hortons.  They renumbered the exit to 8 when they switched to the metric system.  Especially on chromosome 19!!!

