The human genome

Introduction

The plummeting cost of high-depth whole genome sequencing means that amateurs like me can play with the raw data. In the pursuit of a different personal hobby project, in September 2024 Sally and I each purchased 100X Whole Genome Sequencing from Nebula Genomics. We received the data about six weeks later. I ultimately ramped up that project in June 2025.

In August 2025, as a diversion, I wandered from the goal of that project and slightly modified the method I was using so that I could play with and better understand the raw sequencing data itself. The results that I got the next day, August 18, were so surprising and baffling that within 24 hours I started writing up a paper, even though I had no idea what the explanation was.

I planned to publish the completed paper here at 8:46 a.m. EDT on September 11, 2025, but the devastating announcement by our President at 4:40 p.m. EDT on September 10 that Charlie Kirk had been assassinated ended my ability to do anything more.

I found that our haploid genome is twice as long as the reference human genome, with half of our sequences missing from the reference genome. I speculated that this is true for all humans.

An amateur analysis of the structure of the human genome (September 10, 2025)

The Shadow Genome Project

Once I realized that half of our genome is missing from the reference human genome, I knew that it would be important to map out what I’m calling the “shadow genome.” I continued to build out and refine the rudimentary algorithm described in the above paper, which grows out sequences one base at a time. The aim was not full assemblies, but sequences long enough to demonstrate that my analysis was not flawed and to serve as a proof of concept for a more extensive study. I also thought of new ways to check whether I had made a mistake in concluding that the haploid genome is twice as long as we think it is.

I am providing here supplementary papers summarizing the progress of that work.

In the first progress report I started refining my estimates of the number of bases in our diploid genomes, and had a preliminary estimate of 11.1 billion bases, but that work was cut short by the assassination:

The Shadow Genome Project: progress report 1 (September 10, 2025)

Some days later I picked this work back up again, and concluded that the above work in “PR1” (Progress Report 1) was unclear. I withdrew the estimate of 11.1 billion bases. Returning to more reliable methods, I determined that our genomes are about 11.5 billion bases long, i.e., about 82% longer than the reference genome. The reference genome is therefore missing about 45% of our sequences (0.82 / 1.82 ≈ 0.45):

The Shadow Genome Project: progress report 2 (September 16, 2025)

I then turned my attention to the other extreme of the raw sequencing data: the highly-repetitive bulk sequences in the genome. I found that there was more that I could do with them than I originally believed:

The Shadow Genome Project: progress report 3 (September 21, 2025)

I wrote up another PR on a Markov chain analysis, but it was cut a little short by a surprise birthday party that Sally sprang on me:

The Shadow Genome Project: progress report 4 (September 24, 2025)

(I am working on PR5, to fill in some of the gaps caused by that interruption to the end of PR4, but I won't have it here until after we return from vacation.)

Sally and I pondered the question of whether we should really release our genomic sequences publicly. It’s normally advised against, since insurance companies or potential employers could use the information in them to discriminate against us. But Sally and I decided that at this point in our lives there is little harm that these companies could inflict upon us. Go ahead. Make our day.

We therefore release the following three custom-format text files for public use under the MIT-0 License (I have not yet computed the actual genomic sequence files; please check back soon):

smc.jpcgenome.gz (XXX KB; MD5: XXX): Sally’s data.
jpc.jpcgenome.gz (XXX KB; MD5: XXX): My data.
smc-jpc.jpcgenome.gz (XXX KB; MD5: XXX): Our combined data.

The format is a modified version of the FASTA format:

A line starting with # is a comment, and can be ignored.
A line starting with > marks the start of a sequence. The rest of the line will usually contain metadata, but can be ignored.
Lines are no longer than 80 characters, excluding the newline. Newlines within sequence data (that is, all newlines other than those ending the comment and header lines described above) must be ignored.
Only the base codes A, C, G, and T are used.
A pair of square brackets signifies that variant subsequences were seen at that position of the sequence. Inside the square brackets is a list of two or more variants, in alphabetical order. Each variant consists of its sequence of bases in parentheses, followed by the number of haploid genomes in which it was seen, in decimal, in curly braces. The counts in a list sum to twice the number of people whose data has been combined (so 2 or 4 for our data files above). Examples:
- ACGC[(A){1}(T){3}]GTCA signifies that after ACGC the variant A was seen in one haploid genome and the variant T in three haploid genomes. All four genomes then continue with GTCA. (E.g., a single-nucleotide variant for our combined data.)
- [(A){2843}(C){8467}] signifies that the variant A was seen in 2,843 haploid genomes and the variant C was seen in 8,467 haploid genomes. (E.g., a single-nucleotide variant for the combined data of 5,655 people.)
- [(){1}(ACG){1}] signifies that there were no bases between the surrounding parts of the sequence in one haploid genome, and ACG was seen in the other haploid genome. (E.g., an indel in Sally’s data.)
- [(){18374}(CAGCAG){75397}(TAT){54917}] signifies that an empty subsequence was seen in 18,374 haploid genomes, CAGCAG was seen in 75,397 haploid genomes, and TAT was seen in 54,917 haploid genomes. (E.g., a more complicated set of variants for 74,344 people.)
A variant specification in square brackets can be broken by a newline character, because newline characters are ignored.
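
As an illustration of the rules above, here is a minimal Python sketch of a reader for this format. It is not the code that will produce the files; the names read_records, parse_body, and haploid_total are hypothetical, and skipping any text that appears before the first > line is my assumption, since the rules above do not cover that case.

```python
import re

# One variant inside brackets, e.g. "(ACG){1}" or "(){18374}".
VARIANT_RE = re.compile(r"\(([ACGT]*)\)\{(\d+)\}")
# Either a run of plain bases, or one bracketed group of variants.
TOKEN_RE = re.compile(r"([ACGT]+)|\[((?:\([ACGT]*\)\{\d+\})+)\]")


def read_records(lines):
    """Yield (metadata, body) pairs; the body has all newlines removed."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            continue  # comment line
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Assumption: text before the first ">" line is skipped.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_body(body, haploid_total=None):
    """Split a body into plain runs (str) and variant groups (list of tuples).

    If haploid_total is given, each group's counts must sum to it.
    """
    segments, pos = [], 0
    while pos < len(body):
        m = TOKEN_RE.match(body, pos)
        if m is None:
            raise ValueError(f"unexpected text at offset {pos}")
        if m.group(1) is not None:
            segments.append(m.group(1))
        else:
            variants = [(seq, int(n)) for seq, n in VARIANT_RE.findall(m.group(2))]
            if len(variants) < 2:
                raise ValueError("a variant group needs at least two variants")
            total = sum(n for _, n in variants)
            if haploid_total is not None and total != haploid_total:
                raise ValueError(f"counts sum to {total}, expected {haploid_total}")
            segments.append(variants)
        pos = m.end()
    return segments
```

Because read_records joins the lines of a sequence before parse_body runs, a variant specification that is broken across lines is handled without special cases. For a gzipped file, the lines could come from gzip.open(path, "rt"), with haploid_total set to 2 for a one-person file or 4 for the combined file.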

As the three files above arguably provide much higher-fidelity intelligence on our genomes than the raw data, we are more than happy to provide the actual raw data files that we received from Nebula Genomics to anyone who wants to replicate my analysis. But as they're 0.82 TB in total, it's not feasible for me to serve them from this site. If anyone is willing to host them, please contact me.

Code

I performed the analysis described in the paper using simple code that I wrote in ANSI C, contained entirely in the following archive file:

genome.tar.gz (263 KB; MD5: 243ed6dfa35267ccfbc2ced4e7c5daf0)
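
For anyone who wants to check a download against the MD5 value listed above, a minimal Python sketch follows. The function name md5_hex is hypothetical, and the file path passed to it is whatever location the archive was saved to.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 digest of a file as lowercase hex, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the archive was saved as genome.tar.gz in the current directory, md5_hex("genome.tar.gz") should equal the value given above for an intact download.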

There are 23 programs included:

nebula_fastq_to_genome_file: Write the unambiguous bases from a pair of Nebula files to my GenomeFile format.
nebula_fastq_dump: Dump the unambiguous bases from a pair of Nebula files into a text file of ACGT bases, with a newline only after each sequence.
genome_file_dump: Dump the bases in one of my GenomeFile files in the same way.
genome_file_metadata: Print the metadata stored at the start of a GenomeFile, and optionally the number of bases in each fragment.
genome_file_head: Like the head utility, create a subset of a given GenomeFile that contains just the first n bases.
genome_file_freqs: Compute the frequencies of all 16-base needles for a GenomeFile and save the results in a GenomeFreqs file.
ref_fasta_freqs_positions: Compute the frequencies and positions of reference genome needles, saving to GenomeFreqs and RefGenomePositions files.
genome_freqs_metadata: Show some basic metadata for a GenomeFreqs file.
genome_freqs_freqs: Compute the frequency table of the needle frequencies less than 255 in a GenomeFreqs file.
genome_freqs_unigrams: Compute the unigram frequencies from the needle frequencies in a GenomeFreqs file.
genome_freqs_bigrams: Compute the bigram frequencies from the needle frequencies in a GenomeFreqs file.
genome_freqs_joint_freqs: Compute the joint frequency table for the needle frequencies less than 255 in two GenomeFreqs files.
genome_freqs_joint_freqs_shadow: Same, filtering to those in the shadow genome or the reference genome.
genome_freqs_combine: Combine two GenomeFreqs files into a single file with the sum of frequencies for each needle.
nebula_fastq_biases: Compute the frequency of each base at each read position for each of a pair of Nebula files.
grow_genome: Grow the genome using the method described in my paper.
genome_freqs_bias: Compute the frequency-dependent bias in a GenomeFreqs file relative to a given GenomeFreqs file, per Progress Report 1.
genome_freqs_freqs_shadow: The same as genome_freqs_freqs but filtering to needles in the reference genome.
ref_fasta_to_genome_file: Write the unambiguous bases from the reference human genome to my own GenomeFile format created just for this experiment.
ref_fasta_dump: Dump the unambiguous bases from the reference human genome into a text file to compare with that from the GenomeFile.
genome_freqs_most_popular: Extract the list of the highest-frequency needles in a GenomeFreqs file and their frequencies, as described in PR3.
genome_freqs_solid_ropes: Extract the frequencies of needles relevant for “solid ropes,” i.e. runs of the same repeated base, as described in PR3.
genome_freqs_singles_markov: Compute the Markov chain transition probabilities for singles in a GenomeFreqs file being singles in another such file or not.
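
To give a feel for what the needle-frequency programs compute, here is a small Python sketch of the idea, not a port of the C code. The function names are hypothetical. Capping counts at 255 is my reading of "frequencies less than 255" above, and splitting fragments at any character other than A, C, G, or T is an assumption about how ambiguous bases are kept out of needles. The sketch does not consider reverse complements, and it uses a dictionary where a real implementation would need something far more compact.

```python
import re
from collections import Counter

K = 16          # needle length, from the program descriptions above
SATURATE = 255  # assumed cap on a stored needle frequency


def needle_freqs(fragments, k=K, cap=SATURATE):
    """Count every k-base window in each fragment, capping counts at `cap`."""
    freqs = Counter()
    for fragment in fragments:
        # Split at non-ACGT characters so no needle spans an ambiguous base.
        for run in re.split(r"[^ACGT]+", fragment.upper()):
            for i in range(len(run) - k + 1):
                needle = run[i:i + k]
                if freqs[needle] < cap:
                    freqs[needle] += 1
    return freqs


def freq_of_freqs(freqs, cap=SATURATE):
    """Tabulate how many needles were seen once, twice, and so on, below the cap."""
    return Counter(f for f in freqs.values() if f < cap)
```

With k=3, the fragment ACGTACG contains the windows ACG, CGT, GTA, TAC, and ACG, so needle_freqs gives ACG a frequency of 2 and the other three a frequency of 1, and freq_of_freqs then reports three needles seen once and one needle seen twice.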

Also provided are 69 executables of unit tests and death tests for the core libraries (not the genome-specific libraries).

Instructions for building the code are contained in the _README file within the archive.

Disclaimers

This page describes personal hobby research that I undertook in 2025. All opinions expressed herein are mine alone. All code provided here is from my personal codebase, and all code and data are supplied under the MIT-0 License.