If you have not worked through the simple tutorial and the intermediate tutorial, I strongly recommend that you work through them first.

In the simple tutorial we saw that Bear’s model for two
nondegenerate bivariate data points found a nontrivial model matching
the input data perfectly in
(8.0336 ± 0.0024)% of updates
(with 95% confidence), and I promised to explain analytically
why that happens in this tutorial.
So that is what I will now do.
Beyond what is in the simple tutorial and below, I have not yet put
up a full description of how Bear works on this website; it is,
however, documented in more detail in the codebase (in `Bear.h`).
If you want to fully understand what follows, please read that
description first.

As described briefly in the simple tutorial, Bear’s fundamental statistical test is applied when it “reduces” a Bear hypercube. It decides whether to merge two adjacent indexes of a given field by comparing the empirical density distributions of the two hyperslices perpendicular to those indexes. It computes an approximate chi-squared value of the density difference between adjacent hypercells, approximating the Poisson process by a Normal distribution and using an approximate “spontaneous emission” estimate of the variance of the estimate of the Poisson mean (an estimate justified by a Bayesian analysis performed by Dr. Dan Merl in 2014).

For two nondegenerate bivariate data points, the Bear hypercube
is just a 2×2 square,
with a frequency of 1 in two diagonally opposite cells and
0 in the other two diagonally opposite cells.
When analyzing either dimension, the difference in density between
any adjacent pair of cells is therefore always just 1.
Using the above approximations, the variance of this
estimated difference is 3.
Each of the two pairs of hypercells compared therefore contributes
1²/3 = 1/3, so the chi-squared value for any decision is just 2/3.
With two degrees of freedom (the number of hypercells compared,
with no parameters or constraints having been applied that would
reduce this number),
the chi-squared distribution
(e.g., in R it is just `pchisq(2/3,2)`)
will tell you that the
approximate probability that these density distributions are
statistically significantly different is 0.2834687.
Of course, this is a fairly pathological
corner case, for which the approximations are least accurate,
and so this probability doesn’t
represent anything fundamentally correct—but it is what Bear computes
for this case.
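The arithmetic above can be reproduced in a few lines of Python. This is just a sketch of the numbers in this tutorial, using the closed-form CDF of the chi-squared distribution with two degrees of freedom, 1 − e^(−x/2), in place of R’s `pchisq`:

```python
import math

# Density difference between adjacent hypercells, and the approximate
# "spontaneous emission" variance of that estimated difference
diff = 1.0
variance = 3.0

# Two pairs of hypercells are compared; each contributes diff**2 / variance
chi_squared = 2 * diff**2 / variance  # = 2/3

# For two degrees of freedom, the chi-squared CDF is simply 1 - exp(-x/2),
# so this reproduces R's pchisq(2/3, 2)
p = 1.0 - math.exp(-chi_squared / 2.0)

print(f"{p:.7f}")  # 0.2834687
```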

Because Bear Forest is a Monte Carlo engine,
it “rolls the dice”
and 28.3% of the time it decides to *not* merge these two indexes.
Let us consider one of those cases.
Bear then considers the *other* dimension.
Again, there is a 28.3% chance that it will decide to not merge the
two indexes.
Since these random decisions are statistically
independent, there is therefore an approximately
8% chance that it decides that the entire 2×2 square is
statistically significant, and leaves it as such rather than merging
it down into a single uniform square of probability density.
A more precise value of the rate is
0.2834687² ≈ 0.080355,
or 8.0355%,
which agrees with our empirical estimate
of (8.0336 ± 0.0024)%.
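The Monte Carlo argument can also be checked numerically with a toy simulation (purely illustrative, not Bear’s actual code): make two independent “keep” decisions, each succeeding with probability 0.2834687, and count how often both succeed:

```python
import math
import random

random.seed(2023)  # arbitrary seed, for reproducibility

# Per-dimension probability of deciding *not* to merge, from the
# chi-squared CDF with two degrees of freedom evaluated at 2/3
p_keep = 1.0 - math.exp(-1.0 / 3.0)  # ≈ 0.2834687

trials = 1_000_000
kept = sum(
    1
    for _ in range(trials)
    if random.random() < p_keep and random.random() < p_keep
)

print(f"analytic:  {p_keep**2:.6f}")      # ≈ 0.080355
print(f"simulated: {kept / trials:.4f}")  # ≈ 0.0804 (varies with seed)
```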

At this time I have shown you two main programs
in these tutorials:
`memory_bear` and `bear_predict`.
They are general-purpose,
and are convenient if you want to read and write
text files.
I intend to provide more such programs, such as for Convolutional Bear
and Bear AI, in the future.

These programs are thin wrappers around the actual
`Bear` class that does the work.
You are free to roll your own executables to wrap the Bear engine
however you like (or you can ask me if I’d consider writing them myself).
The API for the `Bear` class is very simple: apart from
some tiny ancillary methods,
there are just two main methods that do the work:

- `bear_new()`: Constructor. You give it training data and a time budget, and it builds a model.
- `bear_predict()`: Make a prediction. You give it the feature values for an example and it predicts the label values.

Bear sets a default limit on
“swappable memory” of three-quarters of total memory,
and automatically swaps data stored in this swappable memory out to
storage when the limit is hit.
This limit can be changed using the `-g` option in
`memory_bear`.
Most (but not all) of the data used by Bear is stored in this swappable
memory.

However, this version of Bear is still “mainly in-memory” for the training phase: the input data should fit into memory.

I plan to build a fully distributed, Big Data version of Bear (“Big Bear”) that will be able to stream arbitrary amounts of training data from storage through distributed Bear training engines in a standard map–reduce architecture.

In the initial release of Bear I showed that by processing the MNIST images into pyramids of images, diff images, and edges, and extracting the JPEG DCT coefficients of those three pyramids of images, the original implementation of Bear was able to classify the MNIST database with over 98% accuracy with only five minutes of training.

In May 2023 I thought I had found a good generalization of the image pre-processing phase, which I dubbed “Convolutional Bear,” by analogy with Convolutional Neural Networks. However, by the end of that month I proved to myself that this initial attempt was wrong.

I currently have working plans for a correct formulation of Convolutional Bear, which I plan to develop in the future. In the meantime I have removed all the MNIST programs and tutorials from this site, as superseded.

Play with Bear. Tell me about any bugs. Tell me all the stupid decisions I made that I will kick myself over once you explain them to me, after which I’ll buy you a beer for making Bear that much better.

Please leave me this feedback on Bear’s Facebook page (really, where did you think I was going to put it?).

Happy Bear-wrangling!

© 2023–2024 John Costella