2.1: The Human Genome Project — Biology's Moon Shot
On June 26, 2000, President Bill Clinton stood at a White House podium and declared, "Today we are learning the language in which God created life." He was announcing the working draft of the human genome produced by the Human Genome Project, the largest collaborative biological research effort in history. Launched in 1990, the project brought together scientists from six countries (the US, UK, France, Germany, Japan, and China) across roughly twenty institutions; it took thirteen years and cost approximately 2.7 billion dollars. Its goal: read every one of the three billion letters in the human genome and make the data freely available to the world.
The technology that powered it was Sanger sequencing, developed by Frederick Sanger in 1977. Sanger sequencing is accurate and reliable but slow and expensive: each run reads only about a thousand letters. Sequencing an entire genome this way is like reading a three-billion-letter encyclopedia one sentence at a time. The direct sequencing costs for the first complete human genome ran to hundreds of millions of dollars.
The cost curve since then has been dramatic. By 2014, sequencing a human genome had dropped to about one thousand dollars. By 2024, companies like Illumina and Element Biosciences were pushing toward two hundred dollars per genome. This decline has been faster than Moore's Law. During certain periods, sequencing costs halved every few months. When sequencing a genome costs less than a pair of running shoes, the question shifts from "Can we read the code?" to "What do we do with the information?"
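The arithmetic behind that comparison can be checked with the passage's own round numbers. The $100 million starting figure below is an assumption standing in for "hundreds of millions of dollars," and the 2003 to 2014 window is approximate:

```python
# Back-of-the-envelope check on the cost-decline claim, using round
# numbers from the passage. $100M is an assumed stand-in for
# "hundreds of millions of dollars"; Moore's Law doubles roughly
# every 18-24 months, so a sub-12-month halving time beats it.
import math

start_cost, end_cost = 100_000_000, 1_000    # dollars
years = 2014 - 2003                           # first genome to the $1,000 era
halvings = math.log2(start_cost / end_cost)   # number of price halvings
months_per_halving = years * 12 / halvings
print(f"{halvings:.1f} halvings, one every {months_per_halving:.0f} months")
```

Even averaged over the whole decade, the price halved roughly every eight months; during the fastest stretches, it was faster still.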
2.2: How Modern Sequencing Works
The technology that replaced Sanger sequencing is called next-generation sequencing, or NGS. The dominant platform, made by Illumina, uses massively parallel sequencing. Instead of reading one stretch of DNA at a time, you shatter the genome into millions of tiny fragments, attach them to a glass surface called a flow cell, and read them all simultaneously, with cameras imaging fluorescently labeled bases as each one is added. Each fragment is about one hundred and fifty letters long. A computer takes the overlapping fragments and stitches them back together, reconstructing the original sequence. It is like photocopying a book a million times, cutting each copy into random sentences, and using a computer to reconstruct the order by matching overlapping phrases.
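That stitching step can be sketched in miniature. The toy "genome" and fragments below are invented, and real assemblers and aligners use far more sophisticated graph-based methods, but a greedy overlap merge shows the core idea of reconstruction by matching overlapping phrases:

```python
# Toy sketch of shotgun reassembly: repeatedly merge the pair of
# fragments with the longest overlap until one sequence remains.
# This is an illustration, not the Illumina pipeline.

def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments: list, min_overlap: int = 3) -> str:
    reads = fragments[:]
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; real data needs a reference or graph
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

# Invented 20-letter "genome", shattered into overlapping fragments:
fragments = ["ATGCGTAC", "GTACGTTAGC", "TAGCATCG", "CATCGGA"]
print(greedy_assemble(fragments))  # prints "ATGCGTACGTTAGCATCGGA"
```

Real genomes make this vastly harder: repetitive regions produce many fragments with identical overlaps, which is exactly the problem long reads help resolve.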
A newer generation, third-generation or long-read sequencing, takes a different approach. Pacific Biosciences (PacBio) can read individual DNA molecules ten thousand to one hundred thousand letters long in a single pass. Oxford Nanopore Technologies' MinION, roughly the size of a USB stick, reads DNA by threading it through a molecular pore and measuring electrical current changes. Long-read sequencing is particularly valuable for resolving complex genomic regions that short reads cannot untangle, like repetitive sequences or structural rearrangements common in cancer.
The horizon is a genome for under one hundred dollars, and companies like Element Biosciences are racing to get there. At that price, routine genomic profiling at birth becomes feasible. Pharmacogenomics, tailoring drug prescriptions to your genetic makeup, becomes standard. For cancer patients, sequencing a tumor to identify its specific mutations becomes a baseline expectation, not a luxury.
2.3: What Rosie's Sequencing Revealed
When Conyngham brought Rosie's case to UNSW researchers, the first step was conceptually straightforward but technically demanding: sequence both the tumor and normal tissue, then compare. This paired tumor-normal sequencing is standard in cancer genomics. At the UNSW Ramaciotti Centre for Genomics it cost three thousand dollars, a price that would have been unthinkable a decade ago and will likely fall to a fraction of that within a few more years.
The raw data from sequencing is a massive text file in FASTQ format, containing billions of short DNA reads along with quality scores for each letter. The next step is computational: align those reads to a reference genome (the dog genome, in this case), then compare tumor reads to normal reads. This process, variant calling, uses tools such as Mutect2 (part of the GATK toolkit) and Strelka. These algorithms distinguish real somatic mutations, changes present in cancer cells only, from sequencing errors, germline variants, and noise. The output is a VCF file: the genetic fingerprint of Rosie's cancer.
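Two ideas from this step can be made concrete in a few lines. The FASTQ record and the variant tuples below are made up, and real variant callers weigh read depth, strand bias, and much else, but the shape of the data is genuine: each base carries a quality score, and a somatic mutation is, at its simplest, a tumor call absent from the matched normal sample:

```python
# Illustrative sketch, not GATK: (1) a FASTQ record pairs each base
# with a Phred quality score, and (2) somatic variants are tumor
# calls not found in the matched normal tissue.

def parse_fastq(text: str):
    """Yield (read_id, sequence, qualities) from FASTQ-formatted text."""
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):   # each record is exactly 4 lines
        read_id = lines[i][1:]          # line 1: '@' header
        seq = lines[i + 1]              # line 2: the bases
        # line 3 is the '+' separator; line 4 encodes per-base quality
        quals = [ord(c) - 33 for c in lines[i + 3]]  # Phred+33 encoding
        yield read_id, seq, quals

record = "@read1\nACGTAC\n+\nIIIIH#"    # invented single-read example
rid, seq, quals = next(parse_fastq(record))
# 'I' decodes to Phred 40 (~99.99% confidence); '#' to Phred 2 (junk)

# Toy somatic filter: variants as (position, ref_base, alt_base) tuples.
tumor_calls  = {(1042, "A", "T"), (5530, "G", "C"), (9001, "C", "T")}
normal_calls = {(5530, "G", "C")}   # germline: present in healthy tissue too
somatic = tumor_calls - normal_calls
print(sorted(somatic))              # mutations unique to the tumor
```

The set difference is the conceptual heart of paired tumor-normal analysis; the hard part, which Mutect2 and Strelka handle statistically, is deciding which calls to trust in the first place.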
But a mutation list is just the start. Rosie's tumor had thousands of mutations, and the question was: which produce proteins the immune system can recognize and attack? Not all mutations yield neoantigens. The mutated protein has to be expressed, chopped into the right-sized fragments, and displayed on the cell surface where T-cells can see it. Finding those needles in the haystack requires a computational pipeline drawing on protein structure prediction, immune binding algorithms, and careful ranking. AI became not just helpful but essential.
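The filtering logic just described can be sketched as a toy ranking. The mutation records, expression flags, and affinity numbers below are invented, and real pipelines get their binding predictions from dedicated tools such as NetMHCpan, but the keep-expressed, keep-binders, rank-by-affinity shape is the same:

```python
# Hedged sketch of neoantigen candidate filtering: a mutation only
# matters if its protein is expressed AND the mutant peptide is
# predicted to bind MHC tightly enough to be displayed to T-cells.
# All records here are invented for illustration.

mutations = [
    # (gene, expressed_in_tumor, predicted_mhc_affinity_nM)
    ("KIT",   True,   45.0),   # tight binder (lower nM = stronger binding)
    ("TP53",  True,  820.0),   # too weak to be displayed reliably
    ("BRAF",  False,  30.0),   # not expressed: never reaches the surface
    ("PTEN",  True,  150.0),   # moderate binder
]

BINDING_CUTOFF_NM = 500.0  # common rule-of-thumb threshold for a "binder"

candidates = [
    (gene, affinity)
    for gene, expressed, affinity in mutations
    if expressed and affinity <= BINDING_CUTOFF_NM
]
candidates.sort(key=lambda x: x[1])  # tightest binders first
print(candidates)  # prints [('KIT', 45.0), ('PTEN', 150.0)]
```

With thousands of mutations per tumor and dozens of features per mutation, this is where hand inspection fails and machine-learned ranking takes over.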
Key Takeaways
- The Human Genome Project cost roughly $2.7 billion; sequencing a genome now costs under $200, a more than 13-million-fold price drop in about 25 years.
- NGS works by shattering DNA into millions of fragments, reading them all in parallel, then reassembling them computationally.
- Paired tumor-normal sequencing compares cancer DNA to healthy DNA to identify somatic mutations unique to the tumor.
- The output is a VCF file — the starting material for all neoantigen prediction and vaccine design steps.