What Are the Featured Functions of the DNAChron YTree?

October 16, 2024

Differences Between DNAChron YTree and Other Trees

	DNAChron YTree	Others
Mutation Analysis Accuracy	An average of one mutation every 35 years, making it easier to clarify family relationships within rapidly expanding clusters and genealogies.	An average of one mutation every 80-144 years.
Confidence Interval	Provides specific probability distributions and confidence intervals.	Some paternal trees provide a 95% confidence interval; others do not.
True Confidence Level	Yes, calculated based on actual coverage.	Fixed age estimation based on 8.4M data length, not varying with samples. Currently, commercial sequencing generally exceeds 14M.
Age Estimation Accuracy	Up to an average of one mutation every 50 years, offering higher accuracy and stability.	Fixed at an average of 144.44 years per mutation for age estimation.
True Age Estimation	Yes	Does not consider relationships between upstream and downstream branches, leading to issues such as multiple mutations being estimated at zero years, zero years for consecutive branch estimates, and a single mutation estimated near a thousand years. These discrepancies from actual years result in inaccurate estimations.
Increased Age Estimation Accuracy with Sample Size	Yes	When the number of samples and branches reaches a certain level, further accuracy cannot be improved due to age estimation contradictions between upstream and downstream branches.

Featured Functions of the DNAChron YTree

High-Precision Mutation Analysis: On average, each branch has 100% more mutations than traditional YTree products. Based on existing data, we can help you identify more private mutations. In clustered branches, we can pinpoint more mergeable branches to clarify inter-branch relationships. Newly discovered mutations will be named with a C prefix (chronicle).
Pure T2T Tree: Utilizing the latest T2Tv2 reference sequence, combined with DNAChron’s unique high-precision mutation analysis algorithm, we can tackle tens of millions of bases that traditional analysis methods struggle with, effectively doubling the number of mutations, resulting in a revolutionary product. To achieve this, we will re-align all sequencing data (BAM/CRAM/FASTQ) for free to the T2T reference sequence.
High-Precision Age Estimation: We provide a probability distribution chart for different estimated ages, improving estimation accuracy by 3 to 5 times.
True Age Estimation: When processing continuous branch age estimation, there will be no contradictions between upstream and downstream branch age estimates or consistency issues with continuous branches. We avoid severe deviations in age differences and mutation counts between branches. This resolves the traditional algorithm’s systemic bias of underestimating ages as the number of branches increases. The more samples we have, the more accurate the age estimation.
We provide matching between historical versions of ISOGG numbers and the DNAChron YTree, facilitating a quick understanding of the upstream branches of each branch, and helping to find the correspondence between old ISOGG numbers from extensive literature and the current branches.
For samples with insufficient sequencing coverage, we can show potential merges with other branches and even possible affiliations with downstream branches of a sibling branch. This represents your Tree Uncertainty position.
Register to browse the analysis results and raw data of all public research samples.
Rapid and standardized analysis process. You can complete the tree placement within 2 to 5 working days after data import.
Supports importing paternal analysis results, including formats like txt, csv, and vcf.
Provides features for mutation queries, branch queries, ISOGG number queries, browsing raw data, browsing reference sequences, and publishing personal information.

About DNAChron - The Origin of Our Name

Chronicle by DNA
At DNAChron, we believe that every individual’s genetic makeup tells a unique story. Our name reflects our commitment to documenting history through genetics, allowing us to explore and celebrate the distinct narratives that shape who we are. Join us on this journey of discovery, where your DNA becomes a key to unlocking your ancestral heritage.

High-Precision Mutation Analysis

The Role of Mutation Analysis Accuracy in Ancestry Analysis

The number of mutations that can be detected does not depend on sequencing alone. The Y chromosome, due to its unique independent inheritance pattern, has produced many low-complexity or highly repetitive regions that are difficult to analyze. In particular, most of the sequence added in the new T2T reference consists of such regions. By exploring new analytical algorithms, we can discover many new mutations.

1. The Higher the Accuracy of Mutation Analysis, the More Refined the YTree Can Be

Genetic mutations are a typical Poisson process[9]. The probability distribution of the years required for a mutation to occur is shown below:

[Figure: Poisson distribution]

The horizontal axis represents years, while the vertical axis represents probability. This distribution is based on an average of one mutation every 100 years.

Most mutations occur near the mean, but it is also possible for a long time to pass before one occurs. Based on sequencing coverage and the accuracy of mutation analysis (traditional algorithms), the actual interval between paternal mutations is about 4 to 6 generations.

Therefore, while there may be branches on the paternal tree where mutations occur within a single generation, most branches will show no differences across several generations, and in extreme cases, there may be no differences for over ten generations. This results in different generations being spread across the same paternal branch downstream within family branch clusters.
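To make the waiting-time argument concrete, here is a minimal sketch. It assumes mutations arrive as a homogeneous Poisson process and uses a hypothetical generation length of 25 years, which is not a figure taken from this article. The function name is illustrative only.

```python
import math

def prob_no_mutation(generations, years_per_mutation, generation_years=25.0):
    """Probability that a paternal line shows no detectable mutation.

    Under a Poisson process with one mutation per `years_per_mutation` on
    average, the chance of zero events in t years is exp(-t / mean).
    `generation_years` is an assumed value, not one stated in the article.
    """
    elapsed_years = generations * generation_years
    return math.exp(-elapsed_years / years_per_mutation)

# Compare average intervals quoted in the article (144, 100 and 35 years).
for years_per_mutation in (144.0, 100.0, 35.0):
    row = [round(prob_no_mutation(g, years_per_mutation), 3) for g in (1, 4, 10)]
    print(years_per_mutation, row)
```

The shorter the average interval between detectable mutations, the less likely it is that several generations share an identical set of mutations.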

For example:

Assume a family consisting of 4 brothers and 2 descendants undergoing sequencing:

[Figure: Actual kinship]

Assuming the father in the diagram has no mutation, the results on the YTree are as follows:

[Figure: Kinship displayed on the paternal tree]

People from different generations, such as uncles and nephews, are spread across the same paternal branch downstream. If a mutation occurs on average every 4 to 6 generations, this phenomenon will be even more pronounced, with individuals potentially spanning many generations hanging on the same branch.
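The collapse of generations onto one branch can be sketched in a few lines. The family data below is invented for illustration (the mutation names "F1" and "C1" and the choice of who carries them are assumptions, not taken from the diagrams). Samples are grouped by identical mutation sets, which is a simplification of real tree placement.

```python
from collections import defaultdict

# Hypothetical data: each person maps to the set of Y mutations inherited
# down the paternal line. "F1" stands for the family's shared mutations;
# "C1" is an invented private mutation arising in one brother's line.
samples = {
    "brother_1": {"F1"},
    "brother_2": {"F1"},
    "brother_3": {"F1"},
    "brother_4": {"F1", "C1"},
    "son_of_brother_1": {"F1"},        # no new mutation in this generation
    "son_of_brother_4": {"F1", "C1"},
}

def place_on_tree(samples):
    """Group samples by identical mutation sets; each group is one branch."""
    branches = defaultdict(list)
    for name, mutations in samples.items():
        branches[frozenset(mutations)].append(name)
    return dict(branches)

for mutations, members in place_on_tree(samples).items():
    print(sorted(mutations), members)
```

In this toy family, an uncle and his nephew land on the same branch because no mutation separates them.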

The accuracy of tree structure has significant implications for branches that retain many descendants. This is particularly impactful for modern families or ancient rapidly expanding clusters.

The DNAChron mutation analysis algorithm, when utilizing whole genome data, can achieve an average of one mutation every 35 years, approximately equivalent to 1.5 generations, significantly enhancing the accuracy of tree structures.

For instance, the R-P312 branch, which is spread out flat under traditional algorithms, shows a clear hierarchical structure in the DNAChron tree:

[Figure: R-P312]

2. The higher the mutation analysis accuracy, the more private mutations

Private mutations are genetic markers unique to one individual. The higher the accuracy of mutation analysis, the more private mutations can be identified and covered.

DNAChron Mutation Analysis Algorithm Technical Specifications

	DNAChron	Others
Number of analyzable bases	39 million	Unknown, estimated to be between 8 million and 15 million in other products
Types of analyzable mutations	SNP, INDEL, MNP/complex	Primarily SNP, some products may analyze INDEL

Currently, the average number of mutations in the DNAChron T2T YTree is twice that of other YTrees.

Unique Features of the DNAChron Mutation Analysis Algorithm

The core principle of the DNAChron algorithm is to maximize mutation analysis accuracy and utilize as many mutations as possible.

No Preset Regions, No Preset Mutation Types

It is well known that the Y chromosome is unique in that it does not recombine with other chromosomes. Over the course of evolution, this has led to the accumulation of many low-complexity sequences in certain regions, including repetitive sequences, multi-copy regions, and palindromic regions. These highly repetitive regions pose significant challenges for genetic analysis. To address this, researchers traditionally divide the Y chromosome into different sequencing regions and exclude the difficult-to-analyze parts, focusing only on the high-complexity regions for in-depth study. In mutation screening they also adopt stricter standards, focusing mainly on single nucleotide polymorphisms (SNPs). While this approach reduces analytical complexity, it also sacrifices some accuracy. Most current algorithms and scientific institutions can reliably analyze only 15 million base pairs, and in many cases the range actually analyzed is even smaller, between 8 and 10 million base pairs. Given that the Y chromosome in the T2T reference sequence contains about 60 million base pairs, analyzing only a portion leads to a significant loss of information.

When designing the DNAChron algorithm, we adhered to the core principle of pursuing the highest possible accuracy. The algorithm does not preset any region or mutation type; instead, it directly searches for reliable mutations across the entire 60 million base pairs of the Y chromosome. After extensive exploration, we included 39 million base pairs in the analysis. The remaining regions not covered are largely those that current second-generation sequencing technologies struggle to reach. Thus, we have effectively utilized all analyzable regions.

Of course, the challenges we faced during this process increased exponentially. However, the DNAChron algorithm successfully overcame these challenges, achieving a revolutionary level of accuracy, with an average of 1 mutation every 35 years.

You can download our 39M BED file, dnachron.t2t.chrY.bed.gz, from GitHub.
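One way to check how many bases a BED file covers is to sum its interval lengths. The sketch below assumes the standard BED layout (chromosome, 0-based start, exclusive end, separated by whitespace) and non-overlapping intervals. The function name is made up for this example.

```python
import gzip

def bed_total_bases(path):
    """Sum interval lengths in a BED file, gzipped or plain.

    BED intervals are 0-based and half-open, so each length is end - start.
    Overlapping intervals, if any, would be counted more than once.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and the optional BED header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            total += int(fields[2]) - int(fields[1])
    return total
```

Calling `bed_total_bases("dnachron.t2t.chrY.bed.gz")` on a local copy should return a figure close to 39 million if the file matches the description above.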

Unique Recombination-Aware Mutation Analysis Algorithm

A unique feature of the Y chromosome is that large parts of it allow internal recombination, which is crucial for maintaining genetic stability on the Y chromosome. However, this internal recombination presents challenges, as it can lead to unstable mutation states. Mutations may revert to their unmutated states through recombination, complicating the construction of the YTree. This instability is inherent, and even advanced technologies like the combined third-generation sequencing and second-generation sequencing used to produce the T2T reference sequence cannot fully resolve this issue. To tackle this challenge, traditional algorithms typically avoid recombinable regions, as they struggle to provide reliable mutation data.

DNAChron’s algorithm innovatively introduces recombination probability analysis, stringent quality control, and anomaly-handling mechanisms to overcome this obstacle. We not only make full use of these traditionally avoided regions but also ensure the reliability of the mutation data obtained. By effectively leveraging recombinable regions, we have significantly enhanced the accuracy of mutation analysis, with mutations from these regions now accounting for one-third of our total mutations.

DNAChron Naming of Newly Discovered Mutations

Newly discovered mutations will be named starting with C (Chronicle), representing DNAChron.

High-Precision Genetic Age Estimation

Key Features of DNAChron’s High-Precision Age Estimation Algorithm

High precision in age estimation, with narrower confidence intervals.
Provides the probability for different age estimates, allowing for more accurate high-probability ranges, improving effective age estimation precision by 3 to 5 times.

[Figure: E-BY4877]

Solves the causality problem natively, ensuring that downstream branches’ age estimates never exceed those of upstream branches[10], avoiding the need to artificially adjust estimates for logical consistency. This greatly improves the accuracy of age estimates for clades or continuously splitting branches, as well as modern family branches.
Real precision based on actual sequencing coverage, as indicated by the number of bases on the right side of the Age Estimation graph. The higher the sequencing coverage[11] and the more samples available, the narrower the confidence interval.
Age estimation precision is transferable. When a branch achieves high-precision age estimation, its upstream and downstream branches also benefit, improving overall accuracy.
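The link between sequencing coverage and precision can be illustrated with simple arithmetic. The sketch below takes the two figures quoted for other trees in the comparison table (8.4M bases and 144.44 years per mutation), derives the implied per-base rate, and rescales it to other coverage levels. Treating the rate as uniform across all regions is an assumption made here for illustration, not a claim about DNAChron's algorithm.

```python
# Figures quoted in the comparison table at the top of this article.
REFERENCE_BASES = 8.4e6
REFERENCE_YEARS_PER_MUTATION = 144.44

# Implied per-base, per-year rate; treating it as uniform is an assumption.
RATE_PER_BASE_PER_YEAR = 1.0 / (REFERENCE_BASES * REFERENCE_YEARS_PER_MUTATION)

def years_per_mutation(covered_bases):
    """Average years between mutations for a given number of covered bases."""
    return 1.0 / (RATE_PER_BASE_PER_YEAR * covered_bases)

for bases in (8.4e6, 14e6, 39e6):
    print(f"{bases / 1e6:.1f}M bases -> {years_per_mutation(bases):.1f} years per mutation")
```

A sample covering more bases accumulates detectable mutations faster, so a fixed figure derived from 8.4M bases understates what a higher-coverage sample can resolve.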

How DNAChron Improves Age Estimation Precision

The improvement in age estimation precision comes primarily from three aspects:

1. High-Precision Mutation Analysis Algorithm

The more detailed the tree, the closer it gets to the true branching relationships, and the more accurate the estimate.
More mutations can be identified and used in age estimation.

2. Utilizing the full 39 million base pairs for mutation age estimation, including SNPs, INDELs, and MNP/complex mutations, doubling the base accuracy of age estimation

Traditional age estimation algorithms, to minimize interference from mutation analysis errors and simplify calculations, typically do not consider differences in sequencing coverage. Most only use around 8 million base pairs of SNP mutations, which are the most stable and detectable by sequencing companies.

Genetic mutations follow a Poisson process[9]. According to Poisson distribution confidence interval estimation methods, the higher the average mutation rate, the narrower the confidence interval, leading to higher precision in age estimation. Limiting the selection of base pairs and mutation types effectively reduces the mutation rate.
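The effect of the mutation rate on interval width can be checked numerically. The sketch below computes an exact (Garwood-style) Poisson confidence interval by bisection, using only the standard library. It is a generic statistical illustration, not DNAChron's implementation, and the 1000-year branch is a hypothetical example.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu); adequate for counts up to a few hundred."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def _mu_where_cdf_equals(j, target):
    """Bisection for mu with P(X <= j | mu) = target (the CDF decreases in mu)."""
    lo, hi = 0.0, 10.0 * (j + 10)
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(j, mid) > target:
            lo = mid  # CDF still too large, so mu must increase
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(k, confidence=0.95):
    """Exact two-sided interval for the Poisson mean given an observed count k."""
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else _mu_where_cdf_equals(k - 1, 1 - alpha / 2)
    upper = _mu_where_cdf_equals(k, alpha / 2)
    return lower, upper

def relative_width(k, confidence=0.95):
    """Interval width divided by the observed count (k must be positive)."""
    lower, upper = poisson_interval(k, confidence)
    return (upper - lower) / k

# A hypothetical 1000-year branch seen at two average mutation intervals.
for years_per_mutation in (144.0, 35.0):
    k = round(1000 / years_per_mutation)  # mutations expected on the branch
    lower, upper = poisson_interval(k)
    print(years_per_mutation, k,
          round(lower * years_per_mutation), round(upper * years_per_mutation))
```

The branch observed with more mutations has a narrower interval relative to its age, which is the point made above.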

DNAChron’s YTree uses a high-precision mutation analysis method, resulting in a higher average mutation rate than other products. It only excludes recombinable mutations and imposes no other restrictions on mutation types, thus effectively increasing the base accuracy of age estimation.

3. Innovation in Analysis Algorithms

Traditional Age Estimation Algorithms work roughly as follows:

[Figure: Tree I]

Take a simple branch from the diagram as an example. The number of mutations from sample A and sample B to the parent branch is calculated separately, and based on the average mutation rate (mutations per year), each sample’s time is estimated. The final parent branch age is the average of these estimates.

[Figure: Tree II]

When the number of branches increases, the same method is applied to calculate the age of all downstream branches, and their average is taken to estimate the parent branch’s age.

[Figure: Tree III]

When the branch structure becomes complex, as shown in the diagram, traditional algorithms face challenges.

Suppose Branch F diverged shortly after the parent branch, and its downstream branches G and H have many mutations, making F’s age estimate larger. Averaging the age estimates of A, B, D, and F gives a parent branch estimate close to the actual time, but that estimate may then be smaller than F’s. This is known as a causal error: a descendant branch’s estimated age precedes its ancestor’s, violating the principle of causality. It is common in rapidly diverging large clusters and modern family clusters, and greatly affects age estimation accuracy. Other YTree products usually address this by either forcing downstream branches’ age estimates to equal the upstream ones or setting them 10-20 years lower than the upstream branches to fit biological constraints. Both methods adjust the age estimates manually rather than producing reasonable results in the first place.
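The averaging scheme and the resulting causal error can be reproduced with a toy tree. The sketch below follows the description of traditional algorithms given here. The tree shape, the mutation counts, and the names D1 and D2 are invented, and 144 years per mutation is used only because the article quotes roughly that figure for other trees.

```python
YEARS_PER_MUTATION = 144.0  # illustrative value

# Hypothetical tree: name -> (mutations on the edge above it, children).
# Entries without children are sequenced samples living in the present.
tree = {
    "parent": (0, ["A", "B", "D", "F"]),
    "A": (5, []),
    "B": (6, []),
    "D": (1, ["D1", "D2"]),
    "D1": (4, []),
    "D2": (5, []),
    "F": (1, ["G", "H"]),
    "G": (14, []),
    "H": (16, []),
}

def branch_age(name):
    """Average, over children, of the child's age plus the time on its edge."""
    _, children = tree[name]
    if not children:
        return 0.0
    estimates = [
        branch_age(child) + tree[child][0] * YEARS_PER_MUTATION
        for child in children
    ]
    return sum(estimates) / len(estimates)

def causal_errors():
    """List (upstream, downstream) pairs where the downstream branch looks older."""
    errors = []
    for name, (_, children) in tree.items():
        for child in children:
            if branch_age(child) > branch_age(name):
                errors.append((name, child))
    return errors

print(branch_age("parent"), branch_age("F"), causal_errors())
```

With these invented counts the parent branch comes out at 1170 years while F comes out at 2160 years, so the only causal error reported is the pair (parent, F).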

Another challenge lies in calculating the confidence interval. In Trees I and II, the confidence intervals are relatively easy to calculate. But in Tree III, Branches D and F have downstream branches with narrower confidence intervals, while A and B’s single-sample intervals are wider. How other YTree algorithms handle this is unclear, but a sound method must pass the high-confidence data from D and F to the parent branch’s age estimation. Otherwise, no matter how much downstream data there is, it won’t improve the upstream branches’ age estimation accuracy.

DNAChron’s Age Estimation Algorithm can utilize more information to transfer confidence between upstream and downstream branches.

[Figure: Tree IV]

Looking at Tree IV’s binary structure, when calculating the parent branch’s age, in addition to using A and B’s mutation counts, we can also subtract the time from the parent branch to Branch F to obtain another independent piece of age information, significantly improving the accuracy.

The age derived from Branch F and A and B must be independent events to combine them in the calculation. To ensure independence, only data from Branch C is used when estimating F’s age.

DNAChron’s algorithm utilizes the rest of the YTree, excluding the “parent branch,” to calculate F’s age, then subtracts the parent branch to F time to derive a third age estimate for the parent branch, maximizing accuracy.
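The article does not say how the separate estimates are combined. A standard way to pool independent estimates is inverse-variance weighting, sketched below purely as an illustration. The ages and standard deviations are hypothetical, and nothing here should be read as DNAChron's actual method.

```python
import math

def combine_independent(estimates):
    """Inverse-variance weighted mean of independent (age, std_dev) pairs.

    Returns the pooled age and its standard deviation. This is only valid
    if the estimates are independent, which is why the text insists that
    each estimate draws on a separate part of the tree.
    """
    weights = [1.0 / (sd * sd) for _, sd in estimates]
    total_weight = sum(weights)
    age = sum(w * a for w, (a, _) in zip(weights, estimates)) / total_weight
    return age, math.sqrt(1.0 / total_weight)

# Hypothetical inputs in years: two estimates from samples A and B, plus a
# third derived from the rest of the tree via Branch F.
estimates = [(900.0, 300.0), (1100.0, 300.0), (1000.0, 200.0)]
age, sd = combine_independent(estimates)
print(round(age), round(sd))
```

The pooled standard deviation is always smaller than that of the best single input, which is why an extra independent estimate helps.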

[Figure: Tree III]

For complex structures like Tree III, where Branch F diverged shortly after the parent branch and its downstream branches G and H have many mutations, traditional algorithms can easily lead to causal errors.

Branch F’s substructure is also a binary tree. By subtracting the intermediate ages from upstream branches, we can obtain a third age estimate.

When the time intervals are short, the age estimation error is smaller, and the confidence is higher. Thus, the high-precision age of the parent branch, after subtracting the time from the parent branch to F, still maintains high accuracy, often surpassing that obtained from the downstream branches G and H.

The age estimate of Branch F derived this way is significantly more accurate than traditional algorithms and eliminates the possibility of causal errors.

Similarly, Branch D and its upstream branches’ age estimation accuracy can also be improved in this manner.

This approach can be extended across the entire YTree, layer by layer, resulting in today’s high-precision age estimation results.

References

[1] The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes

[2] Generation of high-resolution a priori Y-chromosome phylogenies using “next-generation” sequencing data

[3] The Y-chromosome point mutation rate in humans

[4] Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data

[5] Improved Models of Coalescence Ages of Y-DNA Haplogroups

[6] The study of human Y chromosome variation through ancient DNA

[7] Present-Day DNA Contamination in Ancient DNA Datasets

[8] Computational challenges in the analysis of ancient DNA

[9] Poisson process https://en.wikipedia.org/wiki/Poisson_point_process

[10] Cancer samples and other non-normal samples, ancient DNA samples, ultra-low coverage samples, and samples in preliminary analysis can cause their upstream branches to deviate from the overall age estimation algorithm. These branches may be over- or under-estimated in age (causal issues).

[11] The sequencing coverage for age estimation refers to the number of bases sequenced at least twice in a branch’s downstream branches. As long as a base is sequenced twice, it can be known whether there was a mutation at that base when the branch diverged.

Last updated on October 16, 2024