Help! My Segments Are So Sticky!
Back in the day, it used to be popular to refer to certain
segments as “sticky” when they appeared to be passed down from generation to
generation untouched. That sort of
name-calling has reduced greatly now that we have a clearer understanding of
the statistical rules that our chromosomes follow as a result of random
recombination. It turns out that the
smaller a segment, the more likely it is to escape the chopping block of
recombination in each generation and instead either be passed to the child in
full or not at all. Let’s take a look at
some numbers and see how this plays out.
As our starting point, we’re going to go back to our
definition of centiMorgan, as explained in my blog from a few weeks ago about
the statistical impossibility of two full siblings not sharing any DNA
segments. If you missed that one, that’s
OK, here’s the way I like to think about a cM:
A cM is a unit that denotes a span of a chromosome that has exactly 1% likelihood
of being split at least once by recombination, within a single generation. Conversely, a 1 cM span of chromosome has a
99% chance of avoiding recombination per generation.
So from this definition, let’s look at a 7 cM segment on a chromosome
in terms of “stickiness.” There are
exactly three possibilities in our inheritance model: A)
This 7 cM segment is passed intact from parent to child; B) the segment
is not passed at all and instead the parent passes genetic material from his or
her opposite parent to the child; or C) at least one recombination
occurs across this span and the parent passes pieces of both copies of his or
her chromosome across this span (say 4 cM from one grandparent and 3 cM from the
other).
First, let’s calculate the odds of no recombination occurring
across a 7 cM span of chromosome. We use
an “AND” operator and multiply the probabilities of no recombination on each 1
cM span that comprises the 7 cM in question.
So the odds of no recombination across the 7 cM span can be restated as the
odds of no recombination on the first cM AND no recombination on the second cM
AND no recombination on the third cM, etc.
In statistics, independent probabilities linked by an “AND” operator can
be simply multiplied. Therefore, the
odds of no recombination across a 7 cM span of chromosome in a single
generation = 0.99 * 0.99 * 0.99 * 0.99 * 0.99 * 0.99 * 0.99, which we can express in
exponent notation as (0.99)^7, which according to my calculator is about
93%. The two possibilities (inherited in
the entirety and not inherited at all) must occupy this 93% equally (47% each,
rounding to the nearest whole percent), whereas the odds of the segment getting
the chop-chop are only 7%. Therefore, we
can say that a 7 cM segment is necessarily “sticky”: the odds of recombination
in a generation (as opposed to acting all sticky) are low. This is not a property special to your 7 cM
segment. Rather, this applies to all 7
cM segments, whether or not you have matches on them and whether or not
endogamy is at play, and with total disregard for the age of this segment
(failing to recombine in a hundred years doesn’t make it any more likely to recombine
in the next generation).
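If you'd like to check this arithmetic yourself, here is a minimal Python sketch. The function name `segment_outcomes` and its parameters are made up for illustration. It assumes, as the text does, that each 1 cM span independently escapes recombination with probability 0.99 and that the "passed intact" and "not passed at all" outcomes split the no-recombination probability evenly.

```python
def segment_outcomes(length_cm, per_cm_escape=0.99):
    """Return (intact, not_inherited, recombined) odds for one generation.

    Assumes each 1 cM span avoids recombination independently (the post's model).
    """
    no_recombination = per_cm_escape ** length_cm
    intact = no_recombination / 2          # whole segment passed to the child
    not_inherited = no_recombination / 2   # the other grandparent's copy passed instead
    recombined = 1 - no_recombination      # at least one cut inside the span
    return intact, not_inherited, recombined


intact, not_inherited, recombined = segment_outcomes(7)
print(f"no recombination: {intact + not_inherited:.1%}")
print(f"intact: {intact:.1%}, not inherited: {not_inherited:.1%}, recombined: {recombined:.1%}")
```

Calling `segment_outcomes` with other lengths gives the same kind of breakdown for any segment size.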
Now, let’s extend this concept to segments of different lengths
in cM using Excel. Here’s the story from
7 cM all the way up to 40 cM:
cM | Odds of no recombination in a generation | Odds of at least one recombination across segment in a generation | Odds of inheriting entire segment | Odds of not inheriting segment at all
7 | 93% | 7% | 47% | 47%
8 | 92% | 8% | 46% | 46%
9 | 91% | 9% | 46% | 46%
10 | 90% | 10% | 45% | 45%
11 | 90% | 10% | 45% | 45%
12 | 89% | 11% | 44% | 44%
13 | 88% | 12% | 44% | 44%
14 | 87% | 13% | 43% | 43%
15 | 86% | 14% | 43% | 43%
16 | 85% | 15% | 43% | 43%
17 | 84% | 16% | 42% | 42%
18 | 83% | 17% | 42% | 42%
19 | 83% | 17% | 41% | 41%
20 | 82% | 18% | 41% | 41%
21 | 81% | 19% | 40% | 40%
22 | 80% | 20% | 40% | 40%
23 | 79% | 21% | 40% | 40%
24 | 79% | 21% | 39% | 39%
25 | 78% | 22% | 39% | 39%
26 | 77% | 23% | 39% | 39%
27 | 76% | 24% | 38% | 38%
28 | 75% | 25% | 38% | 38%
29 | 75% | 25% | 37% | 37%
30 | 74% | 26% | 37% | 37%
31 | 73% | 27% | 37% | 37%
32 | 72% | 28% | 36% | 36%
33 | 72% | 28% | 36% | 36%
34 | 71% | 29% | 36% | 36%
35 | 70% | 30% | 35% | 35%
36 | 70% | 30% | 35% | 35%
37 | 69% | 31% | 34% | 34%
38 | 68% | 32% | 34% | 34%
39 | 68% | 32% | 34% | 34%
40 | 67% | 33% | 33% | 33%
Can you guess why I stopped at 40 cM? It’s because that’s where a segment will
exhibit equal probabilities of each of the 3 described scenarios, which I would
consider to be not very sticky.
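The 40 cM stopping point can also be found directly rather than by reading down a table. The three scenarios are equally likely when the odds of no recombination equal 2/3, so under the same 0.99-per-cM assumption we can solve 0.99^n = 2/3 for n. The variable name below is only for illustration.

```python
import math

# Solve 0.99**n == 2/3: the length at which "intact", "not inherited"
# and "recombined" each have probability 1/3 under the post's model.
equal_thirds_cm = math.log(2 / 3) / math.log(0.99)
print(f"equal-thirds length: {equal_thirds_cm:.1f} cM")
```

The result comes out a little above 40 cM, which is why the rounded percentages all read 33% in the last row.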
But there’s more to this story. Surely, “stickiness” relates somehow to the
expected age of a segment. That is, let’s
ask the question “How old is my sticky little 7 cM segment likely to be?” In this calculation, we start with the premise
that it’s inherited in its entirety from one parent and that it’s a valid
segment of genetic material from just one copy of that parent’s chromosome. We’ve already calculated that probability at 93%,
but let’s switch our rounding to tenths of a percent for a bit more accuracy. Our 7 cM segment has a 93.2% chance of being
inherited without recombination in a generation. How about 2 generations? Well, it has to be passed down in one
generation AND another, so our probability of a segment being at least two
generations old is 0.932^2 = 86.9%. If
we continue with this drill, we will find the odds are still at 70.3% that the
segment is at least 5 generations. Let’s
switch from generations to years, since as genealogists we really care about
whether our matches are related to us in a historical timeframe when there are
records with which we can build our trees.
Let’s assume that the average generation span is 25 years, and
accordingly, 400 years ago (around the beginning of the genealogical era) takes
us back 16 generations. So what are the
odds that our 7 cM segment is older than 400 years? The answer is going to shock some people who
insist that autosomal matches only go back 400 years. In fact, the odds of a 7 cM segment on my genome
exceeding 400 years in age are 30.2%.
That is to say that for any given 7 cM segment, there’s about a 70%
chance that all of your matches on that segment have a common ancestor who was
born less than 400 years ago, but over 30% of the time the common ancestor for a segment this size
is going to be further back.
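Here is a small Python sketch of the generation-by-generation calculation, for anyone who wants to reproduce the numbers. The function `survival` is a hypothetical helper, and it uses the unrounded 0.99^7 rather than 0.932, so the last digit can differ slightly from hand calculations. One thing worth noticing when you run it: exactly 16 transmissions gives about 32.4%, while the 30.2% figure corresponds to 17 transmissions, which is one way to read "older than 16 generations."

```python
def survival(length_cm, generations, per_cm_escape=0.99):
    """Odds that a segment passes through `generations` transmissions uncut."""
    return per_cm_escape ** (length_cm * generations)


for generations in (1, 2, 5, 16, 17):
    years = generations * 25  # 25-year generation span, as assumed in the text
    print(f"{generations:>2} generations (~{years} years): {survival(7, generations):.1%}")
```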
Well, how far back? I’ve
taken it upon myself to carry out some advanced statistical calculations with
which I won’t bore you, but I’ll give you some interesting figures, and I’ll show
you a few colorful charts that might change the way you think about your DNA
matches. First, let’s talk
quartiles. For a 7 cM segment, we can
divide the possible ages of our segment into four statistical quartiles, each carrying an
even 25% probability of containing the segment’s true age. These quartiles are as follows: 0-100 years, 100-250 years, 250-500 years,
and >500 years. Based on this, our
median age is 250 years, but it’s just as likely that the segment is
over 500 years old as it is that the segment is, say, 100 to 250 years old. Let’s keep it going and talk 100-year ranges
instead of quartiles. I made a pie chart
to show you the probability that a 7 cM segment dates back to some different time-frames:
Yes, you’re reading this correctly. There’s a 6% chance that our 7 cM segment has
been passed down untouched for over 1000 years!
Now that’s what I call a sticky segment.
Eew! So what, it’s just 6%, right? Well, we have over 7000 cM of real estate on
our chromosomes and that’s not even counting the X (we’ll get to that in a
bit). Carve that into 7 cM pieces and you get about 1000 of them, so at 6% each that’s an expected 60 little
super-sticky segments where you’re never ever going to find your common ancestor
because he/she walked the earth over 1000 years ago. Now, when you see someone say from a
traditionally endogamous population, and they have like a zillion matches on a
lot of their segments, I want you to understand this. Their common ancestor from which they all
inherited their hyper-sticky segments will often have lived over 1000 years ago
and may be an ancestor of a large swath of the population (and many times over
due to intermarriage among even distantly related descendants over time). That doesn’t make a segment like this any
less real though, just less useful for genealogy in terms of finding the most
recent common ancestor you share with other matches thereon.
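The quartiles, the 6% figure, and the "expected 60" count can all be sketched in a few lines of Python. This is not the calculation behind the pie chart, just one plausible reconstruction: it treats segment age as continuous, and the helper names `survival` and `age_at_survival` are hypothetical. It also takes the 7000 cM total and the 25-year generation straight from the text.

```python
import math

YEARS_PER_GENERATION = 25   # generation span assumed in the text
PER_CM_ESCAPE = 0.99        # chance a 1 cM span avoids recombination per generation


def survival(length_cm, years):
    """Chance a segment has gone `years` without being cut (age treated as continuous)."""
    generations = years / YEARS_PER_GENERATION
    return PER_CM_ESCAPE ** (length_cm * generations)


def age_at_survival(length_cm, prob):
    """Age in years at which the chance of still being uncut has dropped to `prob`."""
    generations = math.log(prob) / (length_cm * math.log(PER_CM_ESCAPE))
    return generations * YEARS_PER_GENERATION


# Quartile boundaries: ages where 75%, 50% and 25% of segments are still uncut.
quartile_edges = [age_at_survival(7, p) for p in (0.75, 0.50, 0.25)]
print("7 cM quartile boundaries (years):", [round(edge) for edge in quartile_edges])

over_1000 = survival(7, 1000)
expected_count = (7000 / 7) * over_1000   # about 1000 seven-cM stretches in 7000 cM
print(f"older than 1000 years: {over_1000:.1%}")
print(f"expected number of such 7 cM segments: {expected_count:.0f}")
```

The boundaries land close to the rounded 100, 250, and 500 year marks quoted above.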
Next, let’s move on to 20 cM. Why 20 cM?
Because that’s where Ancestry sets the threshold for shared
matches. Here, the picture is very
different, with quartiles (rounding to the nearest generation) being 0-25 years,
25-75 years, 75-175 years, and >175 years. It’s no surprise that Ancestry considers this
to be the fourth cousin boundary, since the average fourth cousin shares an
ancestor born about 125 years prior, and since a 20 cM single-segment match is
likely to be related at fourth cousin level or closer about 75.5% of the time! Here’s that same pie chart for a 20 cM
segment:
Awesome. There’s now
a 96% chance that our common ancestor for a shared segment of 20 cM lived
within the past 400 years. Note,
however, there is still a 1.8% chance that a 20 cM segment is over 500 years old!
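The 20 cM figures follow from the same model, and the "rounding to the nearest generation" step is easy to see in code. The sketch below is an illustration under the post's assumptions (0.99 per cM, 25-year generations); the variable names are made up.

```python
import math

YEARS_PER_GENERATION = 25
LENGTH_CM = 20

# Log-odds of a 20 cM segment getting through one generation uncut.
log_escape = LENGTH_CM * math.log(0.99)

# Quartile boundaries, rounded to whole generations before converting to years.
edges = []
for still_uncut in (0.75, 0.50, 0.25):
    generations = round(math.log(still_uncut) / log_escape)
    edges.append(generations * YEARS_PER_GENERATION)
print("20 cM quartile boundaries (years):", edges)

within_400 = 1 - 0.99 ** (LENGTH_CM * 400 / YEARS_PER_GENERATION)
over_500 = 0.99 ** (LENGTH_CM * 500 / YEARS_PER_GENERATION)
print(f"common ancestor within 400 years: {within_400:.0%}")
print(f"segment older than 500 years: {over_500:.1%}")
```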
Finally, the million dollar question that everybody’s
asking. What about that crazy X
chromosome? I heard those segments are
ancient if they’re not at least 15 cM.
Well, maybe so, but first let's talk about whether they’re even real
segments (IBD). One problem with the X
chromosome is that some parts are poorly sampled, with very low SNP counts per
cM. I personally recommend that a
segment have at least 75 tested SNPs per cM (and a minimum of 7 cM) before you
can rely on it being IBD. This just ain’t
happening on some parts of the X due to low sampling rate (SNP per cM). But let’s assume we’ve got a nice segment
with some good SNP density, but it’s not too long. Let’s take everyone’s favorite 15 cM
threshold and see what kind of stats we get in the context of segment dating.
First, we need another tool in our arsenal, and I’m going to
call it “effective generation span.”
While a generation span in real life might be 25 years or so, the “effective
generation span” of an X chromosome is 37.5 years. Here’s why.
Let’s talk about the last time an X chromosome segment had any chance of
recombining. That would be when a female
ancestor had it. Males’ X chromosomes
aren’t recombined when passed to their daughters because males have only one X
and therefore nothing with which to recombine.
I’m ignoring PAR (pseudo-autosomal regions) because they’re puny and
practically useless for genealogy. So,
the last chance an X segment had to recombine was in a donor’s mother, or in a
donor’s father’s mother. Whether a
segment was inherited from either of those two ancestors we’ll assume is
equally likely for purposes of our discussion, and I’ll assert that this is a
reasonable assumption. So, from the last
opportunity for recombination, there’s a 50% chance of one generation (25
years) and a 50% chance of two generations (50 years). To calculate our effective generation span for
the X, we simply take the weighted average:
(0.5 * 25) + (0.5 * 50) = 37.5 years.
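As a rough check, the weighted average and its effect on a 15 cM X segment can be sketched in Python. This simply swaps the 37.5-year effective span into the autosomal formula, on the same 0.99-per-cM assumption and with segment age treated as continuous; the names are hypothetical.

```python
# Effective generation span for the X: equal odds that the last chance to
# recombine was one generation back (mother) or two (father's mother).
effective_span = 0.5 * 25 + 0.5 * 50
print(f"effective generation span: {effective_span} years")


def x_survival(length_cm, years):
    """Chance an X segment of this length has gone `years` without being cut."""
    effective_generations = years / effective_span
    return 0.99 ** (length_cm * effective_generations)


within_400 = 1 - x_survival(15, 400)
print(f"15 cM X segment, common ancestor within 400 years: {within_400:.0%}")
```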
Then, we can use the same methodology as we did on the autosomal
chromosomes to calculate the age ranges of our favorite 15 cM segment on the X. Turns out there’s about an 80% chance that
such a match shares a MRCA within 400 years, matching the common wisdom in our
community that 15 cM is a nice place to start examining our X matches (given
that our comparison to the other donor includes at least 15 * 75 = 1125 SNPs). Here's the same pie chart for date ranges for a 15 cM X segment:
That’s all for this week.
Despite the liberties I’ve taken using the word “sticky,” as you now
know, there’s nothing inherently sticky about one segment vs. the other, but
rather segments only appear to “stick” because of their length in cM. Any apparent stickiness is simply a direct
result of the statistical nature of DNA inheritance, and the phenomenon applies
across the board to all small segments.
If you’ve enjoyed this post, I encourage you to check out my
website www.borlandgenetics.com
where I’m accepting uploads to an autosomal database that focuses on making
simple and powerful (and for the most part free) DNA reconstruction tools accessible
to the average genetic genealogist.
Nice read Borland!
Very interesting! Thanks for this. And while I tell people (other eastern Polynesians like myself) who are predicted 2nd - 3rd cousin matches to me or my relatives to look at the largest segment of at least 30cM in order to determine a true 2nd cousin relationship, this chart makes sense except for the 20cM.
Ancestry's shared matches are based on TOTAL shared. While we can have a good 20cM total shared, the number of segments can be as much as 5 segments (just looking at my own). So if say there are 5 segments, make it 3 segments (I had a lot of 3 and 2 segments), that's about 6.6cM.
Would love to see you work with my data! ;) Thanks again for this though, definitely enlightening!
Excellent, Kevin. Thanks for explaining this. I still don't like the term "sticky" since there's nothing to prevent recombination from selecting the other chromosomal segment in the same location (as indicated in the column about odds of not passing on the segment at all). The probabilities you've given for various shared amounts of DNA and segment longevity are very helpful. It seems I routinely get confronted by outliers. Recently on behalf of someone looking for his great grandfather's father, we approached the grandfather of a match who shares 118 cM. For the number of generations, this indicated to us that this particular line was the correct one. But the grandfather shared only 121 cM.
This is incredibly in-depth, but you've done a great job of explaining it. Thank you for this!
Great post, Kevin! Thank you for making this readable and understandable and for adding the graphics! This was a very helpful post.
Although the calculations in this piece are mathematically correct, I think they are conceptually wrong for genetic genealogy. The probabilities calculated are in a forward direction, answering the question "If two people share an ancestor n generations ago what is the probability that they share a segment of x cM from that ancestor?" Generally that is not the question we are interested in, we know for certain that two people share a segment of x cM and we want to know how long ago the ancestor was. This is the fundamental difference between this approach and the Speed and Balding approach summarised here https://isogg.org/wiki/Identical_by_descent. An analogous question, with known relationship and question about inheritance, would be If A and B are siblings, what is the probability they have the same colour eyes? (Answer: reasonably high). The converse is, a question with known genetics and unknown relationship: If A and B both have blue eyes how probable is it that they are siblings? (Answer: quite low.)
I'd calculate the “effective generation span” of an X chromosome to be 25/3 (male X) + 25/3 (female X from mother) + 50/3 (female X from father) = 33.3 years.
I was working tonight on writing code to generalize my equations for creating the pie charts in the article, and I revisited your question. The calculation of 37.5 years as the "effective generation span" considers the unique inheritance paths of the X chromosome and whether recombination is possible.
For a female, the X chromosome can be inherited from either parent:
When inherited from the mother, recombination is possible every generation, yielding a generational span of 25 years.
When inherited from the father, the X chromosome is passed intact from the paternal grandmother without recombination. This path effectively "skips" a generation, leading to a span of 50 years.
The weighted average takes into account the equal probability (50%) of inheriting the X chromosome from each parent for a female. We calculate the average by multiplying the chances by the years and then adding them together:
Chance of inheriting from the mother: 50% times 25 years equals 12.5 years.
Chance of inheriting from the father: 50% times 50 years equals 25 years.
Adding these together gives us 37.5 years as the weighted average.
This weighted average is representative of the "effective generation span" for an X chromosome segment, considering the distinct inheritance patterns and the potential for recombination.
The alternative scenario described would suggest dividing the years by three and summing them up, yielding approximately 33.33 years:
Inheriting a male X (from the mother): One third of 25 years equals approximately 8.33 years.
Inheriting a female X (from the mother): One third of 25 years equals approximately 8.33 years.
Inheriting a female X (from the father): One third of 50 years equals approximately 16.67 years.
While the alternative method is a valid mathematical approach, it doesn't capture the generational "skip" when the X chromosome is inherited from father to daughter without recombination. Recombination and non-recombination should be equally weighted due to their equal probability. Therefore, 37.5 years more accurately reflects the "effective generation span" for the X chromosome in the context we're discussing.
I hope this explains the reasoning behind the calculation. If you have more thoughts or questions, I'm open to continuing the discussion!
Would you grant permission to quote you and a chart to a Family Genealogy Group? Great blog post!
Sure, no problem.
Much appreciated.
This rather connects with my own experience. My wife is Spanish and she has two Arab matches via her autosomal results, plus a Jewish one. And both Jews and Arabs were finally evicted from Spain over four hundred years ago. My mother from Northern Ireland has a match with an Icelandic lady and the latter only has Icelandic connections in her family tree.
It would be interesting to look at how increased generation span affected the figures. (25yrs is insufficient in west Cornwall women where married at 26+ and your ancestor is on average her middle child five+ years later). Would also be good to reflect on factors that affect recombination, e.g. maternal age and chromosome length. X chromosome in particular often does not recombine. I guess I was disappointed not to see these caveats in the calculations. Though the general point made is useful.
I am glad you posted this / I only started doing my Ancestry in Feb 2021 when I received my dna results since then I've had them labels stapled on me had them spread so must bull that they get me banned off wikitree can you believe that a bunch of old people who claim to be professional genealogist branding this on someone behind the back I only found out by stumbling on to their conversation left on a post by the time I saw it it was already to late everything snowballed the amount of incest connebts and you dad's not your dad etc that's just the being of it .. I wish I had 9f seen this then / I am still waiting for their evidence proving their claims / I've lots spark for it / theirs no point I understood the way you presented it way better than some other ones I've seen so thanks
I have a DNA match with someone on Ancestry, where we only have single segment being shared, 106cm, we can dismiss 4 generations of common ancestors, as one family migrated thousands of miles away in 1912, additionally, the daughter of the match has performed a DNA test on Ancestry, and matches 99cm with me, albeit, now in two segments, 92cm and 7cm. So surely this is sticky? My question is, if only 7cm are lost in a generation by such a large sized segment, what figures can be extrapolated from your modelling? Are the segments that are unwilling to be recombined?
Could you explain why "a 1 cM span of chromosome has a 99% chance of avoiding recombination per generation"? I don't think I understand the math.
That's just the definition of a centi-Morgan, and why the unit of measurement has the prefix "centi-" in it. A cM span of a chromosome is a statistical unit specifically defined by having a 1/100 chance of a recombination event across it in a generation.
Have you calculated the odds of segments longer than 40 cM being passed down a given number of generations? That would be of interest to me and the unknown commentator a couple of lines above.
I should probably turn the calculation to a tool on the Borland Genetics site if other people are interested in this kind of thing. My next "programming marathon" for the site will begin in July and I'll put it on my list of ideas for new site content. Thanks!
Thanks Kevin. That would be great--there's no rush to respond. I tried running some numbers for a 47 cM shared segment (the size my dad shares with someone who I think could be a 4th cousin twice removed). This match's ancestor did have 15 or 16 kids that may have had offspring. If I am reading the formula right, the chance of inheriting the 47 cM segment intact is 62.5% for each generation distant. So the odds of sharing a large segment are very low, but it's also hard to be confident in which generation the lines connect. I'm reading about a 37% chance that it is at the 4C2R level versus something more distant (adding up the rows below 0.57% until they get close to 0). However, we have other shared segments in the 30-40 cM range with folks who are 5th cousins of this match, so that would seem to push the odds of the closer relationship much higher again. I think I'm imagining a single large segment version of the WATO calculator....
P(Shared 47cm segment) Steps Relationship
62.50% 1 sibling
39.06% 2 sibling 1R
24.41% 3 1C
15.26% 4 1C1R
9.54% 5 2C
5.96% 6 2C1R
3.73% 7 3C
2.33% 8 3C1R
1.46% 9 4C
0.91% 10 4C1R
0.57% 11 5C
0.36% 12 5C1R
0.22% 13 6C
0.14% 14 6C1R
0.09% 15 7C
0.05% 16 7C1R
0.03% 17 8C
0.02% 18 8C1R
0.01% 19 9C
0.01% 20 9C1R
0.01% 21 10C
0.00% 22 10C1R
You are going to be getting a ton of questions from me! I'll start with this sentence: "There are exactly three possibilities in our inheritance model: A) This 7 cM segment is passed intact from parent to child; B) the segment is not passed at all and instead the parent passes genetic material from his or her opposite parent to the child; or C) at least one recombination occurs across this span and the parent passes pieces of both copies of his or her chromosome across this span (say 4 cM from one grandparent and 3 cM from the other)." You are saying that your computer model has exactly 3 possibilities not that there are only 3 possibilities in reality right? The other possibilities is that the child receives somewhere between 37.5% and 74.99% of their DNA from each parent and that might mean getting about half of that 7 cm chunk or it could mean getting none of it or it could mean getting the whole thing so long as the end result is that 37.5% to 74.99% of their DNA came from that parent. So your computer model is only addressing the segments that are inherited unchanged?
You're absolutely right that the overall genetic inheritance from each parent to a child is a complex process with many possible outcomes. However, when we focus on a single genetic segment and its inheritance through generations, we are indeed limited to three primary outcomes for that specific segment:
The segment is inherited in full, without recombination.
The segment is not inherited at all because another segment from the alternate chromosome is chosen.
The segment undergoes recombination, and only a part (or parts) of it is inherited.
This model allows us to calculate the statistical likelihood of a segment being inherited in a particular way and to estimate its age based on the known rates of recombination, given the hindsight that the segment exists and is a certain fixed length in a descendant (as evident perhaps by match start/stop coordinates). While multiple recombination events can occur along a chromosome, our study is concerned with the inheritance of one specific segment and what it can tell us about our ancestry.
For the purposes of this analysis and the genetic tools we use, we're examining the inheritance of this single segment to make conclusions about its age (the most recent estimated date when that segment was cut to its current size). By doing so, we can infer information about the common ancestor from whom the segment was inherited.
I hope this explanation clarifies the focus of the model and the article. I should also point out that I wish to refine the probabilities by taking into account whether the segment in question is the ONLY segment shared with a match. Because if so, for example, as a first-order perturbation to our model, we can probably safely remove the <100 years wedge of the pie chart and renormalize the remaining percentages in the other wedges to add up to 100% since someone that closely related likely shares multiple segments. I say first order, because a higher order analysis would take into account additional information that might be available to us besides the length of our segment under study, such as the number of and statistical lengths of all shared segments with a match.
Thank you.
"Next, let’s move on to 20 cM. Why 20 cM? Because that’s where Ancestry sets the threshold for shared matches. Here, the picture is very different, with quartiles (rounding to the nearest generation) being 0-25 years, 25-75 years, and 75-175 years, and >175 years. It’s no surprise that Ancestry considers this to be the fourth cousin boundary, since the average fourth cousin shares an ancestor born about 125 years prior, and since a 20 cM single-segment match is likely to be related at fourth cousin level or closer about 75.5% of the time! Here’s that same pie chart for a 20 cM segment:" Ancestry's white paper does not actually address 4th cousins except tangentially they are not listed in their recombination chart. I have been studying this because Ancestry is inconsistent in what relationship categories are assigned to people who share between 9 and 18 centimorgans - their main match page frustratingly misleadingly labels these matches as I think 6th to 8th cousin or 5th to distant cousin, but when you click on the individual match a page opens up with it's long list of category batches which are really just all the categories that belong to various degrees of relatedness. At the top ranked batch for 9-18 cM is always exactly as it should be the 9th degree of relatedness because there are exactly specifically and pointedly nine times DNA has to divide on the path between Tester 1 and Tester 2 in all those relationship categories. 4th cousin is in the top ranked batch and I have many known source documented 4th cousin matches between 9 and 18 centimorgans because that is .20% of their max centimorgans of 6600. 20 cM is an 8th degree relationship where DNA only divides 8 times.
The only type of 4th cousin whose DNA divides 8 times is when they descend from a set of identical twins - basically you have a path that is 9 people long or 9 division instances long but dna does not divide for the twin sibling so therefore the 4th cousins end up sharing more DNA as if they are 1/2 3rd cousins. A 20 cm amount of shared DNA is not generating Ancestry estimates of 4th cousin (although if they added one more line to their meiosis chart in the white paper, 20 cM is what you'd get at 9 meiosis - problem with their meiosis chart is that they call people to their siblings 2 meiosis as if they are 25% 2nd degree relatives and they are not they are 1st degree relatives. Ancestry's white paper and their description of siblings as 2nd degree relatives differs from the definitions of 1st degree relationships at the Human Genome Project and also with Gina the federal act on genetic privacy.