The human genome
NOTE
I planned to publish this page on September 11.
Today’s assassination has destroyed any joy from
this project.
Dedicated to the memory of Charlie Kirk.
Introduction
The plummeting cost of
high-depth whole genome sequencing
means that amateurs like me can play with the raw data.
In the pursuit of a
different personal hobby project,
in September 2024
Sally and I each purchased
100X
Whole Genome Sequencing
from Nebula Genomics.
We received the data about six weeks later.
I ultimately ramped up that project in June 2025.
In August 2025, as a diversion,
I wandered from the goal of that project and slightly modified
the method I was using so that I could play with and better
understand the raw sequencing data itself.
The results that I got the next day, August 18,
were so surprising and baffling
that within
24 hours
I started writing up a paper, even though I had no idea
what the explanation was.
I planned to publish
the completed paper here
on September 11, 2025.
I found that our haploid genome is twice as long as the
reference human genome, with half of our
sequences missing from the reference genome.
I speculated that this is true for all humans.
The Shadow Genome Project
Once I realized that half of our genome is missing from the
reference human genome,
I knew that it would be important to map out
what I’m calling the “shadow genome.”
I continued to build out and refine the
rudimentary algorithm described in the above paper to grow
out the sequences one base at a time, not into full
assemblies, but at least into sequences long enough to
demonstrate that my analysis was not flawed,
and as a proof of concept for a more extensive study.
I also thought of new ways to establish whether I had made a mistake
when concluding that the haploid genome is twice as long as we
think it to be.
I am providing here supplementary papers summarizing the progress of
that work:
Sally and I pondered the question of whether we should really release our
genomic sequences publicly.
It’s normally advised against, since
insurance companies or potential employers could use the information
in them to discriminate
against us.
But Sally and I decided that at this point in our lives there is little
harm that these companies could now inflict upon us.
Go ahead.
Make our day.
At the current time, I do not have the actual genomic sequence files
computed yet.
Please check back soon.
We therefore release the following three custom-format text files
for public use under the MIT-0 License:
The format is a modified version of the FASTA format:
- A line starting with # is a comment, and can be ignored.
- A line starting with > marks the start of a sequence.
  The rest of the line will usually contain metadata, but can be ignored.
- Lines are no longer than 80 characters, excluding the newline,
  and newlines other than those described above must be ignored.
- Only the base codes A, C, G, and T are used.
- A pair of square brackets signifies that variant subsequences
  were seen at that position of the sequence.
  Inside the square brackets is a list of two or more variants,
  in alphabetical order.
  Each variant begins with the sequence of bases in parentheses,
  followed by the number of haploid genomes in which it was seen,
  in decimal, in curly braces (which must sum to 2 or 4 for our data
  files above, but which in general will sum to twice the number of
  people whose data has been combined).
  Examples:
  - ACGC[(A){1}(T){3}]GTCA
    signifies that after ACGC the variant A was seen in one haploid
    genome and the variant T in three haploid genomes.
    All four genomes then continue with GTCA.
    (E.g., a single-nucleotide variant for our combined data.)
  - [(A){2843}(C){8467}]
    signifies that the variant A was seen in 2,843 haploid genomes
    and the variant C was seen in 8,467 haploid genomes.
    (E.g., a single-nucleotide variant for the combined data
    of 5,655 people.)
  - [(){1}(ACG){1}]
    signifies that there were no bases between the surrounding parts
    of the sequence in one haploid genome, and ACG was seen in the
    other haploid genome.
    (E.g., an indel in Sally’s data.)
  - [(){18374}(CAGCAG){75397}(TAT){54917}]
    signifies that an empty subsequence was seen in 18,374 haploid
    genomes, CAGCAG was seen in 75,397 haploid genomes, and TAT was
    seen in 54,917 haploid genomes.
    (E.g., a more complicated set of variants for 74,344 people.)
- A variant specification in square brackets can be broken by a newline
  character, because newline characters are ignored.
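The variant notation above can be read mechanically. The following is only a minimal sketch of one way to do so in C; it is not taken from the code archive, and the names skip_newlines and parse_variant_group are hypothetical. It assumes the whole variant specification is already in memory as a NUL-terminated string, and it does not check the alphabetical ordering of the variants.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the archive): skip newline characters,
 * which the format says must be ignored. */
static const char *skip_newlines(const char *p)
{
    while (*p == '\n' || *p == '\r') p++;
    return p;
}

/* Parse one variant group that starts at '[' in s.  On success, store the
 * number of variants and the sum of the haploid-genome counts, and return
 * a pointer just past the closing ']'.  Return NULL if the group is
 * malformed or lists fewer than two variants. */
static const char *parse_variant_group(const char *s, int *num_variants,
                                       long *total_count)
{
    const char *p;

    assert(s != NULL && num_variants != NULL && total_count != NULL);
    *num_variants = 0;
    *total_count = 0;
    p = skip_newlines(s);
    if (*p != '[') return NULL;
    p = skip_newlines(p + 1);
    while (*p == '(') {
        long count = 0;
        int digits = 0;
        /* Bases of this variant; an empty pair () is allowed. */
        for (p = skip_newlines(p + 1); *p != ')'; p = skip_newlines(p + 1)) {
            if (*p != 'A' && *p != 'C' && *p != 'G' && *p != 'T')
                return NULL;
        }
        p = skip_newlines(p + 1);
        if (*p != '{') return NULL;
        /* Decimal count of haploid genomes showing this variant. */
        for (p = skip_newlines(p + 1); isdigit((unsigned char)*p);
             p = skip_newlines(p + 1)) {
            count = count * 10 + (*p - '0');
            digits++;
        }
        if (digits == 0 || *p != '}') return NULL;
        p = skip_newlines(p + 1);
        *total_count += count;
        (*num_variants)++;
    }
    if (*p != ']' || *num_variants < 2) return NULL;
    return p + 1;
}
```

Called on the last example above, [(){18374}(CAGCAG){75397}(TAT){54917}], this sketch would report three variants and a total of 148,688 haploid genomes, which is twice the 74,344 people quoted. Returning the position just past the closing bracket lets a caller carry on reading the ordinary bases that follow.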
As the three files above arguably provide much higher-fidelity intelligence
on our genomes
than the raw data, we are
more than happy to provide the actual raw data files that we received
from Nebula Genomics to anyone who wants to replicate my analysis.
But as they're 0.82 TB in total, it's not feasible for me to serve them
from this site.
If anyone is willing to host them, please
contact me.
Code
I performed the analysis described in the paper
using simple code that I wrote in
ANSI C, contained entirely in the
following archive file:
There are 16 programs included:
- nebula_fastq_to_genome_file
  Write the unambiguous bases from a pair of Nebula files to my
  GenomeFile format.
- nebula_fastq_dump
  Dump the unambiguous bases from a pair of Nebula files
  into a text file of ACGT bases, with a newline only after
  each sequence.
- genome_file_dump
  Dump the bases in one of my GenomeFile
  files in the same way.
- genome_file_metadata
  Print the metadata stored at the start of a GenomeFile,
  and optionally the number of bases in each fragment.
- genome_file_head
  Like the head utility, create a subset of a given
  GenomeFile that contains just the first n bases.
- genome_file_freqs
  Compute the frequencies of all 16-base needles for a
  GenomeFile and save the results
  in a GenomeFreqs file.
- ref_fasta_freqs_positions
  Compute the frequencies and positions of
  reference genome needles, saving to
  GenomeFreqs and RefGenomePositions files.
- genome_freqs_metadata
  Show some basic metadata for a GenomeFreqs file.
- genome_freqs_freqs
  Compute the frequency table of the needle frequencies
  less than 255 in a GenomeFreqs file.
- genome_freqs_unigrams
  Compute the unigram frequencies from the needle frequencies
  in a GenomeFreqs file.
- genome_freqs_bigrams
  Compute the bigram frequencies from the needle frequencies
  in a GenomeFreqs file.
- genome_freqs_joint_freqs
  Compute the joint frequency table for the needle frequencies
  less than 255 in two GenomeFreqs files.
- genome_freqs_joint_freqs_shadow
  Same, filtering to those in the shadow genome or the reference
  genome.
- genome_freqs_combine
  Combine two GenomeFreqs files into a single
  file with the sum of frequencies for each needle.
- nebula_fastq_biases
  Compute the frequency of each base at each read position for each
  of a pair of Nebula files.
- grow_genome
  Grow the genome using the method described in my paper.
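Several of the programs above work with the frequencies of 16-base needles. As an illustration only, here is a rough sketch in C of how such frequencies could be tallied. It is not the code in the archive. The names base_code and count_needles, the 2-bit coding A=0, C=1, G=2, T=3, and the choice of one-byte counters that saturate at 255 are all assumptions of this sketch. The needle length is a parameter, so the idea can be tried on small tables; a full 16-base table would need 4^16 one-byte entries (4 GiB).

```c
#include <assert.h>
#include <stddef.h>

/* Map a base to a 2-bit code, or -1 for anything else (e.g. N).
 * The A=0, C=1, G=2, T=3 assignment is an assumption of this sketch. */
static int base_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Count every needle of `len` bases (1..16) in `seq`, using a rolling
 * 2-bit-per-base code as the index into `table`, which must hold 4^len
 * zero-initialized entries.  Counts saturate at 255.  Any needle that
 * would contain a non-ACGT character is skipped.  Returns the number of
 * needles seen (including those whose counter was already saturated). */
static unsigned long count_needles(const char *seq, int len,
                                   unsigned char *table)
{
    unsigned long code = 0, mask, counted = 0;
    int valid = 0;  /* consecutive valid bases seen, capped at len */
    size_t i;

    assert(seq != NULL && table != NULL && len >= 1 && len <= 16);
    /* Mask keeping the low 2*len bits; the special case avoids shifting
     * a 32-bit unsigned long by 32 bits when len is 16. */
    mask = (len == 16) ? 0xFFFFFFFFUL : ((1UL << (2 * len)) - 1UL);

    for (i = 0; seq[i] != '\0'; i++) {
        int c = base_code(seq[i]);
        if (c < 0) {
            /* Ambiguous base: start over after it. */
            valid = 0;
            code = 0;
            continue;
        }
        code = ((code << 2) | (unsigned long)c) & mask;
        if (valid < len) valid++;
        if (valid == len) {
            if (table[code] < 255) table[code]++;
            counted++;
        }
    }
    return counted;
}
```

The rolling update means each new base costs one shift, one OR, and one mask, rather than re-encoding the whole needle. With a needle length of 2, the sequence ACGTAC would give five needles, with AC (code 1) counted twice.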
Also provided are 69 executables of unit tests and death tests for the core
libraries (but not for the genome-specific libraries).
Instructions for building the code are contained in the _README
file within the archive.
Disclaimers
This page describes personal hobby research that I undertook in 2025.
All opinions expressed herein are mine alone.
All code
provided here is from my personal codebase, and all code and data
are supplied under the
MIT-0 License.
© 2025 John Costella