The human genome

NOTE

I planned to publish this page on September 11. Today’s assassination has destroyed any joy from this project.

Dedicated to the memory of Charlie Kirk.

Introduction

The plummeting cost of high-depth whole genome sequencing means that amateurs like me can play with the raw data. In the pursuit of a different personal hobby project, in September 2024 Sally and I each purchased 100X Whole Genome Sequencing from Nebula Genomics. We received the data about six weeks later. I ultimately ramped up that project in June 2025.

In August 2025, as a diversion, I wandered from the goal of that project and slightly modified the method I was using so that I could play with and better understand the raw sequencing data itself. The results that I got the next day, August 18, were so surprising and baffling that within 24 hours I started writing up a paper, even though I had no idea what the explanation was.

I planned to publish the completed paper here on September 11, 2025. I found that our haploid genome is twice as long as the reference human genome, with half of our sequences missing from the reference genome. I speculated that this is true for all humans.

The Shadow Genome Project

Once I realized that half of our genome is missing from the reference human genome, I knew that it would be important to map out what I’m calling the “shadow genome.” I continued to refine the rudimentary algorithm described in the above paper, which grows sequences one base at a time. The goal was not full assemblies, but sequences long enough to demonstrate that my analysis was not flawed and to serve as a proof of concept for a more extensive study. I also thought of new ways to check whether I had made a mistake in concluding that the haploid genome is twice as long as we think it is.
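For readers unfamiliar with the general idea of growing a sequence one base at a time, here is a minimal sketch in C. It is not the algorithm from the paper, which is not reproduced on this page. It is a generic greedy extension, and the names grow_right, count_occurrences, MAX_K and min_count are all hypothetical. It assumes the reads are plain ACGT strings held in memory and that a naive substring search is fast enough for a toy example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_K 32  /* hypothetical upper bound on needle length for this sketch */

/* Count (possibly overlapping) occurrences of pattern across all reads. */
static unsigned long count_occurrences(const char *const *reads,
                                       size_t n_reads, const char *pattern)
{
    unsigned long total = 0;
    size_t r;

    for (r = 0; r < n_reads; r++) {
        const char *hit = strstr(reads[r], pattern);
        while (hit != NULL) {
            total++;
            hit = strstr(hit + 1, pattern);
        }
    }
    return total;
}

/*
 * Extend seq to the right, one base at a time.
 * seq is NUL-terminated, holds at least k bases, and lives in a buffer of
 * `capacity` bytes. At each step the last k-1 bases are combined with each
 * of A, C, G, T; the step succeeds only if exactly one of the four
 * resulting k-base needles occurs at least min_count times in the reads.
 * Growth stops on a dead end, on a fork, or when the buffer is full.
 * Returns the number of bases added.
 */
static size_t grow_right(char *seq, size_t capacity, size_t k,
                         const char *const *reads, size_t n_reads,
                         unsigned long min_count)
{
    static const char bases[] = "ACGT";
    char needle[MAX_K + 1];
    size_t len = strlen(seq);
    size_t added = 0;

    if (k < 2 || k > MAX_K || len < k)
        return 0;

    while (len + 1 < capacity) {
        char next = '\0';
        int supported = 0;
        size_t b;

        memcpy(needle, seq + len - (k - 1), k - 1);
        needle[k] = '\0';
        for (b = 0; b < 4; b++) {
            needle[k - 1] = bases[b];
            if (count_occurrences(reads, n_reads, needle) >= min_count) {
                next = bases[b];
                supported++;
            }
        }
        if (supported != 1)
            break;  /* dead end (0) or fork (2 or more) */
        seq[len++] = next;
        seq[len] = '\0';
        added++;
    }
    return added;
}
```

With the two toy reads "ACGTTGCA" and "GTTGCAAT", a seed of "ACGT", k = 4 and min_count = 1, this sketch extends the seed across the overlap between the reads and stops when no further needle is supported. A real run over whole-genome data would need a precomputed needle frequency table rather than a substring search.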

I am providing here supplementary papers summarizing the progress of that work:

Sally and I pondered the question of whether we should really release our genomic sequences publicly. It’s normally advised against, since insurance companies or potential employers could use the information in them to discriminate against us. But Sally and I decided that at this point in our lives there is little harm that these companies could now inflict upon us. Go ahead. Make our day.

I have not yet computed the actual genomic sequence files. Please check back soon.

We therefore release the following three custom-format text files for public use under the MIT-0 License:

The format is a modified version of the FASTA format:

As the three files above arguably provide much higher-fidelity intelligence on our genomes than the raw data, we are more than happy to provide the actual raw data files that we received from Nebula Genomics to anyone who wants to replicate my analysis. But as they're 0.82 TB in total, it's not feasible for me to serve them from this site. If anyone is willing to host them, please contact me.

Code

I performed the analysis described in the paper using simple code that I wrote in ANSI C, contained entirely in the following archive file:

There are 16 programs included:

nebula_fastq_to_genome_file
Write the unambiguous bases from a pair of Nebula files to my GenomeFile format.
nebula_fastq_dump
Dump the unambiguous bases from a pair of Nebula files into a text file of ACGT bases, with a newline only after each sequence.
genome_file_dump
Dump the bases in one of my GenomeFile files in the same way.
genome_file_metadata
Print the metadata stored at the start of a GenomeFile, and optionally the number of bases in each fragment.
genome_file_head
Like the head utility, create a subset of a given GenomeFile that contains just the first n bases.
genome_file_freqs
Compute the frequencies of all 16-base needles for a GenomeFile and save the results in a GenomeFreqs file.
ref_fasta_freqs_positions
Compute the frequencies and positions of reference genome needles, saving to GenomeFreqs and RefGenomePositions files.
genome_freqs_metadata
Show some basic metadata for a GenomeFreqs file.
genome_freqs_freqs
Compute the frequency table of the needle frequencies less than 255 in a GenomeFreqs file.
genome_freqs_unigrams
Compute the unigram frequencies from the needle frequencies in a GenomeFreqs file.
genome_freqs_bigrams
Compute the bigram frequencies from the needle frequencies in a GenomeFreqs file.
genome_freqs_joint_freqs
Compute the joint frequency table for the needle frequencies less than 255 in two GenomeFreqs files.
genome_freqs_joint_freqs_shadow
Same, filtering to those in the shadow genome or the reference genome.
genome_freqs_combine
Combine two GenomeFreqs files into a single file with the sum of frequencies for each needle.
nebula_fastq_biases
Compute the frequency of each base at each read position for each of a pair of Nebula files.
grow_genome
Grow the genome using the method described in my paper.
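
To illustrate what counting needle frequencies and then tabulating "the needle frequencies less than 255" can look like, here is a minimal sketch in C. It is not taken from the archive, and the GenomeFreqs file format is not described on this page. The 2-bit base encoding, the one-byte counters that saturate at 255, and the names count_needles, needle_index, freq_of_freqs and MAX_SKETCH_K are all assumptions. The sketch caps the needle length at 12 so that the table stays small; a table for 16-base needles would need 4^16 = 4,294,967,296 counters, or 4 GiB at one byte each.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_SKETCH_K 12  /* hypothetical cap: 4^12 one-byte counters = 16 MiB */

/* Map a base to its 2-bit code, or -1 for anything else (for example N). */
static int base_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/*
 * Count every k-base needle in seq. Returns a calloc'd table of 4^k
 * one-byte counters that saturate at 255, or NULL on bad k or failed
 * allocation. An ambiguous base restarts the sliding window, so no
 * needle spans it. The caller frees the table.
 */
static unsigned char *count_needles(const char *seq, size_t len, unsigned k)
{
    unsigned long table_size, mask, needle = 0;
    unsigned valid = 0;  /* consecutive unambiguous bases in the window */
    unsigned char *counts;
    size_t i;

    if (k < 1 || k > MAX_SKETCH_K)
        return NULL;
    table_size = 1UL << (2 * k);
    mask = table_size - 1;
    counts = (unsigned char *)calloc(table_size, 1);
    if (counts == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        if (code < 0) {
            valid = 0;
            needle = 0;
            continue;
        }
        needle = ((needle << 2) | (unsigned long)code) & mask;
        if (valid < k)
            valid++;
        if (valid == k && counts[needle] < 255)
            counts[needle]++;
    }
    return counts;
}

/* Table index of a needle given as text; all k bases must be A, C, G or T. */
static unsigned long needle_index(const char *needle, unsigned k)
{
    unsigned long index = 0;
    unsigned i;

    for (i = 0; i < k; i++)
        index = (index << 2) | (unsigned long)base_code(needle[i]);
    return index;
}

/*
 * Frequency table of needle frequencies below 255:
 * hist[f] = number of distinct needles counted exactly f times.
 */
static void freq_of_freqs(const unsigned char *counts, unsigned k,
                          unsigned long hist[255])
{
    unsigned long table_size = 1UL << (2 * k);
    unsigned long i;

    for (i = 0; i < 255; i++)
        hist[i] = 0;
    for (i = 0; i < table_size; i++)
        if (counts[i] < 255)
            hist[counts[i]]++;
}
```

Saturated counters are left out of the histogram because a value of 255 only says "255 or more", which may be why the program descriptions above restrict themselves to frequencies less than 255.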

The archive also provides 69 unit-test and death-test executables for the core libraries (not the genome-specific libraries).

Instructions for building the code are contained in the _README file within the archive.

Disclaimers

This page describes personal hobby research that I undertook in 2025. All opinions expressed herein are mine alone. All code provided here is from my personal codebase, and all code and data are supplied under the MIT-0 License.