The plummeting cost of high-depth whole genome sequencing means that amateurs like me can play with the raw data. In the pursuit of a different personal hobby project, in September 2024 Sally and I each purchased 100X Whole Genome Sequencing from Nebula Genomics. We received the data about six weeks later. I ultimately ramped up that project in June 2025.
In August 2025, as a diversion, I wandered from the goal of that project and slightly modified the method I was using so that I could play with and better understand the raw sequencing data itself. The results that I got the next day, August 18, were so surprising and baffling that within 24 hours I started writing up a paper, even though I had no idea what the explanation was.
I planned to publish the completed paper here at 8:46 a.m. EDT on September 11, 2025, but the devastating announcement by our President at 4:40 p.m. EDT on September 10 that Charlie Kirk had been assassinated ended my ability to do anything more.
I found that our haploid genome is twice as long as the reference human genome, with half of our sequences missing from the reference genome. I speculated that this is true for all humans.
Once I realized that half of our genome is missing from the reference human genome, I knew that it would be important to map out what I’m calling the “shadow genome.” I continued to build out and refine the rudimentary algorithm described in the above paper to grow out the sequences one base at a time, not into full assemblies, but at least into sequences long enough to demonstrate that my analysis was not flawed, and as a proof of concept for a more extensive study. I also thought of new ways to establish whether I had made a mistake when concluding that the haploid genome is twice as long as we think it to be.
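To illustrate what growing a sequence one base at a time can look like, here is a minimal sketch in C. It is not the code from the archive and makes no claim to match the algorithm in the paper: the function names `next_base` and `grow_sequence`, the fixed overlap length `k`, the in-memory array of reads, and the "strict majority wins, stop on a tie" rule are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look at the last k bases of seq, find every place
 * that k-mer occurs in the reads, and tally the base that follows it.
 * Returns the base with strictly the most votes, or '\0' if no read
 * extends the sequence or the top vote is tied. */
static char next_base(const char *seq, size_t k,
                      const char *const *reads, size_t n_reads)
{
    static const char bases[] = "ACGT";
    size_t votes[4] = {0, 0, 0, 0};
    size_t len = strlen(seq);
    size_t r, i, b, best = 0;
    int tied = 0;
    const char *tail;

    assert(k > 0 && len >= k);
    tail = seq + (len - k);

    for (r = 0; r < n_reads; r++) {
        size_t rlen = strlen(reads[r]);
        /* i + k < rlen guarantees a following base exists in the read. */
        for (i = 0; i + k < rlen; i++) {
            if (strncmp(reads[r] + i, tail, k) == 0) {
                const char *p = strchr(bases, reads[r][i + k]);
                if (p != NULL)          /* ignore N and other symbols */
                    votes[p - bases]++;
            }
        }
    }

    for (b = 1; b < 4; b++) {
        if (votes[b] > votes[best]) {
            best = b;
            tied = 0;
        } else if (votes[b] == votes[best]) {
            tied = 1;
        }
    }
    if (votes[best] == 0 || tied)
        return '\0';
    return bases[best];
}

/* Hypothetical driver: append one base at a time to seq (a NUL-terminated
 * buffer of capacity cap) until no unambiguous extension remains or the
 * buffer is full. Returns the final length of seq. */
static size_t grow_sequence(char *seq, size_t cap, size_t k,
                            const char *const *reads, size_t n_reads)
{
    size_t len = strlen(seq);

    while (len + 1 < cap) {
        char c = next_base(seq, k, reads, n_reads);
        if (c == '\0')
            break;
        seq[len++] = c;
        seq[len] = '\0';
    }
    return len;
}
```

With the toy reads `"ACGTAC"`, `"GTACGG"` and `"ACGGTT"`, a seed of `"ACGT"` and `k = 4`, calling `grow_sequence` extends the seed to `"ACGTACGGTT"` and then stops, because no read continues past `GGTT`. Real data would also need handling for reverse complements, sequencing errors and repeats, none of which this sketch attempts.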
I am providing here supplementary papers summarizing the progress of that work.
In the first progress report I started refining my estimates of the number of bases in our diploid genomes, and had a preliminary estimate of 11.1 billion bases, but that work was cut short by the assassination:
Some days later I picked this work back up again, and concluded that the above work in “PR1” (Progress Report 1) was unclear. I withdrew the estimate of 11.1 billion bases. Returning to more reliable methods, I determined that our diploid genomes are about 11.5 billion bases long; that is, about 82% longer than the reference genome, which is therefore missing about 45% of our sequence:
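The two percentages are the same measurement seen from opposite ends, which a short calculation makes explicit. The symbols $G$ (the estimated genome length) and $R$ (the corresponding reference length) are introduced here only for this check, and the value of $R$ shown is simply what the quoted figures imply, not an independent measurement.

```latex
\begin{align*}
G &= 1.82\,R
  && \text{(genome is about 82\% longer than the reference)} \\
\frac{G - R}{G} &= \frac{0.82}{1.82} \approx 0.45
  && \text{(fraction of the genome missing from the reference)} \\
R &= \frac{G}{1.82} \approx \frac{11.5 \times 10^{9}}{1.82} \approx 6.3 \times 10^{9}
  && \text{(implied reference length, in bases)}
\end{align*}
```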
I then turned my attention to the other extreme of the raw sequencing data: the highly-repetitive bulk sequences in the genome. I found that there was more that I could do with them than I originally believed:
I wrote up another PR on a Markov chain analysis, but it was cut a little short by a surprise birthday party that Sally sprang on me:
(I am working on PR5, to fill in some of the gaps caused by that interruption to the end of PR4, but I won't have it here until after we return from vacation.)
Sally and I pondered the question of whether we should really release our genomic sequences publicly. It’s normally advised against, since insurance companies or potential employers could use the information in them to discriminate against us. But Sally and I decided that at this point in our lives there is little harm that these companies could inflict upon us. Go ahead. Make our day.
I have not yet computed the actual genomic sequence files. Please check back soon.
We therefore release the following three custom-format text files for public use under the MIT-0 License:
The format is a modified version of the FASTA format:
As the three files above arguably provide much higher-fidelity intelligence on our genomes than the raw data, we are more than happy to provide the actual raw data files that we received from Nebula Genomics to anyone who wants to replicate my analysis. But as they're 0.82 TB in total, it's not feasible for me to serve them from this site. If anyone is willing to host them, please contact me.
I performed the analysis described in the paper using simple code that I wrote in ANSI C, contained entirely in the following archive file:
There are 23 programs included:
Also provided are 69 unit-test and death-test executables for the core libraries (not the genome-specific libraries).
Instructions for building the code are contained in the _README file within the archive.
This page describes personal hobby research that I undertook in 2025. All opinions expressed herein are mine alone. All code provided here is from my personal codebase, and all code and data are supplied under the MIT-0 License.
© 2025 John Costella