Advanced Bear tutorial

If you have not yet worked through the simple tutorial and the intermediate tutorial, I strongly recommend that you do so first.

Explaining Bear’s model for two nondegenerate bivariate data points

In the simple tutorial we saw that Bear’s model for two nondegenerate bivariate data points found a nontrivial model matching the input data perfectly in (8.0336 ± 0.0024)% of updates (with 95% confidence), and I promised to explain analytically in this tutorial why that happens. So that is what I will now do. Apart from what is in the simple tutorial and below, I have not yet put up a full description of how Bear works on this website; it is, however, documented in more detail in the codebase (in Bear.h). If you want to fully understand that description, please read it first.

As described briefly in the simple tutorial, Bear’s fundamental statistical test is applied when it “reduces” a Bear hypercube. It decides whether to merge two adjacent indexes of a given field by comparing the empirical density distribution of the two hyperslices perpendicular to each of those indexes. It computes an approximate chi-squared value of the density difference between adjacent hypercells, approximating the Poisson process by a Normal distribution, and using an approximate “spontaneous emission” estimate of the variance of the estimate of the Poisson mean (which was justified by a Bayesian analysis performed by Dr. Dan Merl in 2014).
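
To make the shape of that test concrete, here is a minimal sketch in C of comparing two adjacent hyperslices cell by cell. This is not Bear’s actual code (see Bear.h for that); in particular, the variance estimate used here, count + 1 per cell, is only an assumption chosen to reproduce the numbers in the worked example below.

#include <stdio.h>

/* Sketch only: compare two adjacent hyperslices, cell by cell, each pair
 * contributing (difference)^2 / variance to an approximate chi-squared
 * statistic. The per-cell variance of (count + 1) is an ASSUMPTION for
 * illustration; Bear's actual "spontaneous emission" estimate is
 * documented in Bear.h. */
static double slice_chi_squared(const unsigned *slice_a,
                                const unsigned *slice_b,
                                unsigned num_cells) {
    double chi_squared = 0.0;
    for (unsigned i = 0; i < num_cells; i++) {
        double difference = (double) slice_a[i] - (double) slice_b[i];
        double variance = (slice_a[i] + 1.0) + (slice_b[i] + 1.0);  /* assumed */
        chi_squared += difference * difference / variance;
    }
    return chi_squared;  /* compared against chi-squared with num_cells degrees of freedom */
}

int main(void) {
    /* The two hyperslices of the 2x2 example discussed below: counts (1, 0) and (0, 1). */
    unsigned slice_a[] = { 1, 0 };
    unsigned slice_b[] = { 0, 1 };
    printf("chi-squared = %g\n", slice_chi_squared(slice_a, slice_b, 2));  /* prints 0.666667 */
    return 0;
}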

For two nondegenerate bivariate data points, the Bear hypercube is just a 2×2 square, with a frequency of 1 in two diagonally opposite cells and 0 in the other two diagonally opposite cells. When analyzing either dimension, the difference in density between any adjacent pair of cells is therefore always just 1. Using the above approximations, the variance of this estimated difference is 3. The chi-squared value for any decision is therefore just 2/3. With two degrees of freedom (the number of hypercells compared, with no parameters or constraints having been applied that would reduce this number), the chi-squared distribution (e.g., in R it is just pchisq(2/3,2)) will tell you that the approximate probability that these density distributions are statistically significantly different is 0.2834687. Of course, this is a fairly pathological corner case, for which the approximations are least accurate, and so this probability doesn’t represent anything fundamentally correct—but it is what Bear computes for this case.
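
To see where 0.2834687 comes from without firing up R: for two degrees of freedom the chi-squared cumulative distribution function has the closed form 1 - exp(-x/2), so pchisq(2/3, 2) can be checked with a couple of lines of C.

#include <math.h>
#include <stdio.h>

int main(void) {
    double chi_squared = 2.0 / 3.0;  /* the value derived above */
    /* For 2 degrees of freedom, the chi-squared CDF is 1 - exp(-x/2),
     * which reproduces R's pchisq(2/3, 2). */
    double p = 1.0 - exp(-chi_squared / 2.0);
    printf("p = %.7f\n", p);  /* prints p = 0.2834687 */
    return 0;
}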

Because Bear Forest is a Monte Carlo engine, it “rolls the dice” and 28.3% of the time it decides not to merge these two indexes. Let us consider one of those cases. Bear then considers the other dimension. Again, there is a 28.3% chance that it will decide not to merge the two indexes. Since these random decisions are statistically independent, there is therefore an approximately 8% chance that it decides that the entire 2×2 square is statistically significant, and leaves it as such rather than merging it down into a single uniform square of probability density. A more precise value of the rate is 0.2834687² ≈ 0.080355, or 8.0355%, which agrees with our empirical estimate of (8.0336 ± 0.0024)%.
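
The same closed form makes that final rate a one-liner to check; this sketch simply squares the per-dimension probability computed above.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Probability that one dimension is left unmerged, from the closed form above. */
    double p = 1.0 - exp(-(2.0 / 3.0) / 2.0);
    /* The two decisions are independent, so the rate at which the whole
     * 2x2 square survives is just p squared. */
    printf("p = %.7f, p^2 = %.7f\n", p, p * p);
    /* Prints p = 0.2834687, p^2 = 0.0803545: about 8.035%, consistent with
     * the empirical (8.0336 +/- 0.0024)% from the simple tutorial. */
    return 0;
}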

Roll your own executables

So far in these tutorials I have shown you two main programs: memory_bear and bear_predict. They are general-purpose, and are convenient if you want to read and write text files. I intend to provide more such programs in the future, such as for Convolutional Bear and Bear AI.

These programs are thin wrappers around the actual Bear class that does the work. You are free to roll your own executables to wrap the Bear engine however you like (or you can ask me whether I’d consider writing one myself). The API for the Bear class is very simple: apart from some tiny ancillary methods, there are just two main methods that do the work (a rough usage sketch follows the list):

bear_new()
Constructor. You give it training data and a time budget, and it builds a model.
bear_predict()
Make a prediction. You give it the feature values for an example and it predicts the label values.
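
To give a feel for what such a wrapper might look like, here is a minimal sketch. Only the two method names come from the list above; the types, signatures, and arguments are placeholders invented for this illustration, so consult Bear.h for the real declarations before writing anything against them.

/* HYPOTHETICAL wrapper sketch. The declarations below are placeholders
 * invented for illustration; the real API is declared in Bear.h and its
 * signatures will differ. */
#include <stddef.h>
#include <stdio.h>

typedef struct Bear Bear;  /* assumed opaque handle */
Bear *bear_new(const double *training_data, size_t num_examples,
               size_t num_fields, double time_budget_in_seconds);  /* assumed signature */
void bear_predict(const Bear *bear, const double *features,
                  double *labels_out);  /* assumed signature */

int main(void) {
    /* Two bivariate training examples, like the worked example earlier. */
    double training_data[] = { 0.0, 0.0,
                               1.0, 1.0 };

    /* bear_new(): give it training data and a time budget; it builds a model. */
    Bear *bear = bear_new(training_data, 2, 2, 10.0);

    /* bear_predict(): give it the feature values for an example; it predicts
     * the label values. */
    double features[] = { 0.0 };
    double label;
    bear_predict(bear, features, &label);
    printf("predicted label: %g\n", label);

    return 0;
}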

Memory efficiency of Bear

By default, Bear limits its “swappable memory” to three-quarters of total memory, and automatically swaps data stored in this swappable memory out to storage when that limit is hit. The limit can be changed using the -g option in memory_bear. Most (but not all) of the data used by Bear is stored in this swappable memory.

However, this version of Bear is still “mainly in-memory” for the training phase: the input data should fit into memory.

In the future I plan to build a fully distributed, Big Data version of Bear (“Big Bear”), which will be able to stream arbitrary amounts of training data from storage through distributed Bear training engines in a standard map–reduce type of architecture.

Where did “Bear on the MNIST database” go?

In the initial release of Bear I showed that by processing the MNIST images into pyramids of images, diff images, and edges, and extracting the JPEG DCT coefficients of those three pyramids, the original implementation of Bear was able to classify the MNIST database with over 98% accuracy after only five minutes of training.

In May 2023 I thought I had found a good generalization of the image pre-processing phase, which I dubbed “Convolutional Bear,” by analogy with Convolutional Neural Networks. However, by the end of that month I proved to myself that this initial attempt was wrong.

I currently have working plans for a correct formulation of Convolutional Bear, which I intend to develop in the future. In the meantime I have removed all the MNIST programs and tutorials from this site, as they have been superseded.

What’s next?

Play with Bear. Tell me about any bugs. Tell me about all the stupid decisions I made, so that I can kick myself once you explain them to me, and then buy you a beer for making Bear that much better.

Please leave me this feedback on Bear’s Facebook page (really, where did you think I was going to put it?).

Happy Bear-wrangling!