First complete assembly of human X chromosome

Although the current human reference genome is the most accurate and complete vertebrate genome ever produced, there are still gaps in the DNA sequence, even after two decades of improvements. Now, for the first time, scientists have determined the complete sequence of a human chromosome from one end to the other ('telomere to telomere') with no gaps and an unprecedented level of accuracy.

The publication in Nature is a landmark achievement for genomics researchers. The lead author said the project was made possible by new sequencing technologies that enable "ultra-long reads," such as nanopore sequencing.

Repetitive DNA sequences are common throughout the genome and have always posed a challenge for sequencing because most technologies produce relatively short "reads" of the sequence, which then have to be pieced together like a jigsaw puzzle to assemble the genome. Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are.
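The jigsaw problem described above can be illustrated with a small sketch. It is a toy model, not the team's software: the repeat unit, the flanking sequences, and the read lengths are made-up values, reads are assumed to be error-free, and read depth is ignored. Under those assumptions, two "genomes" that differ only in repeat copy number yield exactly the same set of short reads, while reads long enough to span the repeat tell them apart.

```python
def reads(seq, k):
    """Return the set of all length-k substrings (idealized, error-free reads)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def make_genome(copies, unit="ACGTTGCA",
                left="GGGCCCTTTAAAG", right="CTTAGGATCCGAT"):
    """Build a toy sequence: unique flank + tandem repeat + unique flank.

    All sequences here are hypothetical and chosen only for illustration.
    """
    return left + unit * copies + right


g5 = make_genome(5)   # 5 copies of the repeat unit
g9 = make_genome(9)   # 9 copies of the repeat unit

# Reads shorter than the repeat array: the two read sets are identical,
# so the reads alone cannot reveal the copy number.
short_same = reads(g5, 12) == reads(g9, 12)

# Reads long enough to span the 5-copy array plus part of both flanks:
# the read sets now differ.
long_same = reads(g5, 60) == reads(g9, 60)

print(short_same, long_same)
```

In this toy case the short-read comparison reports identical read sets and the long-read comparison does not, which is the intuition behind using ultra-long reads for repeat regions.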

"These repeat-rich sequences were once deemed intractable, but now we've made leaps and bounds in sequencing technology," the lead author said. "With nanopore sequencing, we get ultra-long reads of hundreds of thousands of base pairs that can span an entire repeat region, so that bypasses some of the challenges."

Filling in the remaining gaps in the human genome sequence opens up new regions of the genome where researchers can search for associations between sequence variations and disease and for other clues to important questions about human biology and evolution.

"We're starting to find that some of these regions where there were gaps in the reference sequence are actually among the richest for variation in human populations, so we've been missing a lot of information that could be important to understanding human biology and disease," the lead author said.

To finish the X chromosome, the team had to manually resolve several gaps in the sequence. Two segmental duplications were resolved with ultra-long nanopore reads that completely spanned the repeats and were uniquely anchored on either side. The remaining break was at the centromere, a notoriously difficult region of repetitive DNA found in every chromosome.
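As a rough illustration of what "completely spanned the repeats and were uniquely anchored on either side" means, the check below treats a read and a repeat as coordinate intervals. The function name, the coordinates, and the 1,000-base minimum anchor are assumptions made for this sketch, not values from the study.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end,
                 min_anchor=1000):
    """Return True if a read covers the whole repeat and extends at least
    `min_anchor` bases into the unique sequence on both sides.

    Coordinates are positions along the chromosome; `min_anchor` is an
    illustrative threshold, not a value taken from the article.
    """
    left_anchor = repeat_start - read_start    # unique bases before the repeat
    right_anchor = read_end - repeat_end       # unique bases after the repeat
    return left_anchor >= min_anchor and right_anchor >= min_anchor


# A hypothetical 200 kb read covering a 70 kb duplication with room on both sides.
anchored = spans_repeat(0, 200_000, 50_000, 120_000)

# A read that starts inside the repeat cannot be placed unambiguously.
unanchored = spans_repeat(60_000, 200_000, 50_000, 120_000)

print(anchored, unanchored)
```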

In the X chromosome, the centromere encompasses a region of highly repetitive DNA spanning 3.1 million base pairs (the bases A, C, T, and G form pairs in the DNA double helix and encode genetic information in their sequence). The team was able to identify variants within the repeat sequence to serve as markers, which they used to align the long reads and connect them together to span the entire centromere.
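The marker idea can be sketched in a few lines. Here each long read is reduced to the ordered list of marker variants it contains, and reads are chained together wherever the end of one shares markers with the start of another. This is a simplified greedy illustration under stated assumptions: the marker names, the `min_shared` threshold, and the chaining strategy are hypothetical and are not the team's actual method.

```python
def marker_overlap(a, b, min_shared=2):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    counted in markers; 0 if fewer than `min_shared` markers match."""
    for n in range(min(len(a), len(b)), min_shared - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def chain_reads(reads, min_shared=2):
    """Greedily extend a marker path, starting from the first read.

    `reads` is a list of marker-ID lists, one list per long read.
    Reads that never overlap the growing path are left out.
    """
    path = list(reads[0])
    remaining = [list(r) for r in reads[1:]]
    extended = True
    while remaining and extended:
        extended = False
        for r in remaining:
            n = marker_overlap(path, r, min_shared)
            if n and n < len(r):          # the read adds new markers
                path.extend(r[n:])
                remaining.remove(r)
                extended = True
                break
    return path


# Hypothetical reads, given out of order, each seeing a few markers.
example_reads = [
    ["m1", "m2", "m3"],
    ["m4", "m5", "m6"],
    ["m2", "m3", "m4", "m5"],
]
path = chain_reads(example_reads)
print(path)
```

Requiring at least two shared markers is a design choice in this sketch: a single shared marker is weaker evidence that two reads come from the same place in a highly repetitive array.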

"For me, the idea that we can put together a 3-megabase-size tandem repeat is just mind-blowing. We can now reach these repeat regions covering millions of bases that were previously thought intractable," the lead author said.

The next step was a polishing strategy using data from multiple sequencing technologies to ensure the accuracy of every base in the sequence.

"We used an iterative process over three different sequencing platforms to polish the sequence and reach a high level of accuracy," the lead author explained. "The unique markers provide an anchoring system for the ultra-long reads, and once you anchor the reads, you can use multiple data sets to call each base."
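The idea of using multiple data sets to call each base can be pictured as a vote at every position. The sketch below is a deliberately simple majority vote, assuming the reads have already been anchored so that each position has a list of observed bases. The function names and the tie-breaking rule are assumptions for illustration; real polishing tools use statistical models rather than a plain vote.

```python
from collections import Counter


def call_base(draft_base, observations):
    """Pick the most common observed base; keep the draft base when there
    is no evidence or when the top two candidates are tied."""
    if not observations:
        return draft_base
    top = Counter(observations).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return draft_base                 # tie: do not change the draft
    return top[0][0]


def polish(draft, pileups):
    """Return a polished copy of `draft`.

    `pileups[i]` lists the bases observed at position i across all
    data sets (hypothetical input, already aligned to the draft).
    """
    return "".join(call_base(d, p) for d, p in zip(draft, pileups))


draft = "ACGT"
pileups = [
    ["A", "A", "A"],   # all data sets agree with the draft
    ["C", "T", "T"],   # majority disagrees: corrected to T
    [],                # no coverage: draft base kept
    ["T", "G"],        # tie: draft base kept
]
polished = polish(draft, pileups)
print(polished)
```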

Nanopore sequencing, in addition to providing ultra-long reads, can also detect bases that have been modified by methylation, an "epigenetic" change that does not alter the sequence but has important effects on DNA structure and gene expression. By mapping patterns of methylation on the X chromosome, the team was able to confirm previous observations and reveal some intriguing trends in methylation patterns within the centromere.

The new human genome sequence, derived from a human cell line called CHM13, closes many gaps in the current reference genome, known as Genome Reference Consortium build 38 (GRCh38).