You can learn how Bear works by reading or working through this simple tutorial.

You have two options:

- build and run Bear using this guide; or
- don’t build or run Bear: just download the output I provide, if you trust me.

All the commands I use below are listed here.

If you have built Bear and added it to your path, then execute this command:

`
$ memory_bear
`

You should see output that looks something like this (details of all screenshots may vary):

You can see from the “`Arguments:`”
section at the bottom of
this screenshot that `memory_bear` has two mandatory arguments:
`INPUT_FILENAME` and `TIME_BUDGET`.
Let’s just create an empty input file,

`
$ touch empty.csv
`

and run `memory_bear` on it, specifying a time budget of,
say, one minute:

Okay, so we learn that it’s mandatory to specify the label column(s) using one of these four options. Let’s just specify it as column 0:

So we also need to specify a filename for saving the Bear model or writing an output file with predictions (or both). Let’s just specify a Bear model filename:

Now we’re getting somewhere! Bear fired up with a welcome message, and then gave us feedback on how it parsed what we asked of it. At the left side of each log line you will always see the local time (to the minute) and the time elapsed since the previous log line. (The first log line tells you the local date when execution started.)

We can see that Bear did a first pass over the input file, but then it told us that an empty data file isn’t allowed!

So let’s create the simplest possible dataset: a single example (row),
with no features, and just a label value,
in `single.csv`:

We can now run `memory_bear` on this dataset successfully:

All of these steps will become clearer as we work through these tutorials,
but in the end we see that Bear built a model and saved it to
`single.bear.gz`.
If we take a look at the decompressed bytes in that file,

we can see that it consists of binary data within plain text tags; this is
the general way that Bear saves objects.
Bear then runs this through `gzip`, compressing these 179 bytes down to 73.

We can get an overview of what is in this model file using the supplied program
`bear_model_details`:

That’s not too enlightening in this case!
We can get a bit more detail using the `--verbose` option:

We see that there is just an “empty” model, which is the model that Bear creates without using any features at all. This makes sense, because there were no features! This empty model records that the minimum and maximum allowed label values are 42, because this was the only label value in the input data (and Bear never extrapolates), and its single prediction for the label is likewise 42.

We can stream feature data through this model and get it to make predictions
using the supplied program `bear_predict`.
Its command-line options are similar to those of
`memory_bear`:

As the argument specifications at the bottom of this help screen show, we can
run it in “interactive” mode by specifying
`stdin` and `stdout` for the input and output files,
although we need to specify the filetypes for each:

At this point, the program is waiting for us to specify feature values for an
example.
In this case there are no features, so if we hit the `return` key, it
spits out its prediction:

We can do this as many times as we want:

If we’ve taken more than five seconds to do this, we’ll even be given a “progress update” on the number of rows processed so far:

After doing this a fourth time, the fun has probably worn off, and we can
finish our input by pressing `control-D` and `return`:

Let’s make things slightly more interesting by having more than
one example in our dataset.
For example,
`10-labels.csv`:

If we run `memory_bear` on this dataset,
now using short option names,

`
$ memory_bear 10-labels.csv 1m -l 0 -s 10-labels
`

and look at the details of the model created,
`10-labels.bear.gz`,

we can see that the empty model now records a minimum allowed label of −370, a maximum allowed label of 120, and a constant prediction of −23.2. The first two of these are just the bounds of the 10 input labels; Bear never extrapolates beyond the data it is given. Likewise, the prediction of −23.2 is just the mean value of those 10 input labels, which minimizes the MSE loss (the default for Bear) if the empirical probabilities are taken as the best estimate of the true probability distribution.
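As a quick check of that last claim, here is a sketch showing that a constant prediction equal to the mean minimizes the MSE. The label values below are hypothetical, chosen only to share the quoted mean (−23.2) and bounds (−370 and 120); the tutorial does not list the actual 10 labels:

```python
# Sketch: the mean is the constant prediction that minimizes MSE.
# These labels are hypothetical; only their mean (-23.2), minimum (-370),
# and maximum (120) match the dataset described above.

def mse(labels, constant):
    """Mean squared error of predicting `constant` for every label."""
    return sum((y - constant) ** 2 for y in labels) / len(labels)

labels = [-370.0, -1.3, 4.7, 120.0, -23.2, -50.0, 10.0, 30.0, -40.0, 87.8]
mean = sum(labels) / len(labels)           # -23.2

# The mean beats any other constant prediction on MSE:
for other in (mean - 10.0, mean - 0.5, mean + 0.5, mean + 10.0):
    assert mse(labels, mean) < mse(labels, other)
```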

As before, we can run `bear_predict` in interactive mode
to stream example feature data (again, here we have no features)
through the model:

where this time our fun was expended after two hits of the
`return` key, after which I hit `control-D`
and `return` to end the input datastream.

Let’s now add a frequency column to our 10 examples, to create
`frequencies.csv`:

This just means that we have 4 examples with a label value of −1.3, one example with 4.7, and so on. This is completely equivalent to having a data file with four rows with label value −1.3, etc.
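This equivalence is easy to sketch. The (label, frequency) pairs below are hypothetical apart from the first two, which match the description above:

```python
# A frequency column is just shorthand for repeating rows: four examples
# of -1.3, one of 4.7, and (hypothetically) three of 2.0.
rows = [(-1.3, 4), (4.7, 1), (2.0, 3)]

# Expand into one row per example, as if the file had no frequency column:
expanded = [label for label, freq in rows for _ in range(freq)]

assert expanded == [-1.3, -1.3, -1.3, -1.3, 4.7, 2.0, 2.0, 2.0]
assert len(expanded) == sum(freq for _, freq in rows)
```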

We can specify that our input file has a frequency column by
using the `--has-frequency-column` and
`--frequency-column` options
(here in their short forms `-f` and `-c`):

`
$ memory_bear frequencies.csv 1m -l0 -f -c1 -sfrequencies
`

which Bear parses and includes in its feedback to us:

We now see that the model,
`frequencies.bear.gz`,
is similar to the previous one,

except that the prediction is now −35.84838709677419.
Note that my codebase automatically includes separators
in its logging, but these are never added in output
files.
We can confirm both of these points
if we run `bear_predict`
on the model:

This is just the weighted average of the input label values, where each weight is just the relative frequency; e.g., for the first label value of −1.3 it is 4 / 31, since the total frequency is 31; and so on.
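That weighted average can be sketched as follows. Only the first pair (−1.3 with frequency 4) and the total frequency of 31 are taken from the tutorial; the remaining pairs are hypothetical:

```python
def weighted_mean(pairs):
    """Mean of the labels, weighting each by its relative frequency f/total."""
    total = sum(f for _, f in pairs)
    return sum(y * (f / total) for y, f in pairs)

# Hypothetical (label, frequency) pairs with a total frequency of 31:
pairs = [(-1.3, 4), (4.7, 1), (10.0, 6), (-20.0, 5),
         (0.0, 3), (7.0, 2), (-5.0, 10)]

assert sum(f for _, f in pairs) == 31
# The weight of the first label is 4/31, as described above:
assert pairs[0][1] / sum(f for _, f in pairs) == 4 / 31
# And the weighted mean equals the plain mean over the expanded rows:
expanded = [y for y, f in pairs for _ in range(f)]
assert abs(weighted_mean(pairs) - sum(expanded) / len(expanded)) < 1e-12
```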

OK, enough with datasets with just labels and no features.
Let’s add a feature!
Here is a simple dataset
`linear-1.csv`
where the second (label) column is
obviously linearly dependent on the first (feature) column:

We know how to run Bear on this, where we now just have to specify that the label column is column 1 (i.e., the second column):

`
$ memory_bear linear-1.csv 1m -l1 -slinear-1
`

We see Bear doing a lot more than it did for the empty models.
Skipping these details for the moment,
we excitedly run `bear_predict` to see the results of
our linear regression, now typing a feature value before hitting
`return`:

Well that was disappointing! No matter what feature value we entered, the model gave us a label prediction of 53. It even did this if we didn’t specify a feature value at all!

Why didn’t we get any linear regression?
We can again examine the model
`linear-1.bear.gz`,
to try to debug this:

This is just the empty model again! Its constant prediction of 53 is just the mean value of the input labels. But why did Bear just give us the empty model?

The answer is that Bear only gives us *statistically significant*
structure that it finds in the data.
In this case it decided that these 9 data points didn’t give it
any statistically significant signal of anything more than just the
empty model.
And that sounds fair enough: without any other information about what
sort of relationship we are expecting to find, in general it would be
difficult to draw any concrete conclusions from just 9 data points.

So let’s give Bear more of our linear data, so that it might have a chance of finding something statistically significant. One easy way to do that is to add a frequency column to our dataset, and set the frequency of each of our nine examples to, say, 20:

If we run Bear on this,

`
$ memory_bear linear-2.csv 1m -l1 -fc2 -slinear-2
`

and then run `bear_predict` on the created model
`linear-2.bear.gz`,

we see that Bear has modeled the 9 data points exactly! Of course, that’s only because our dataset had no noise at all: every feature value mapped exactly to a single label value for every one of its 20 examples, and Bear decided that each of these mappings was statistically significant in itself. This perfect modeling is reflected in Bear declaring the “overall model strength” (which I will describe later) to be 10,000,000,000, which is an arbitrary upper bound that I apply in the code. Real datasets will not generally be both noiseless and statistically significant.

Looking more closely at my play with
`bear_predict` above,
you can see that
if we specify a feature value between
two values in our original dataset—here 1.4 and 1.6—Bear
doesn’t linearly interpolate, like you might expect from
linear regression; rather it
gives us 13 (the prediction for feature value 1) for the former,
and 23 (the prediction for feature value 2) for the latter.
It seems to be using the *nearest* feature value in the original
dataset.
Moreover, if we specify the feature to be less than 1, it predicts the smallest
label, 13; if we specify the feature to be greater than 9, it predicts
the largest label, 93, so it doesn’t extrapolate either.
These are general features of Bear: its predictions are
*piecewise constant*, and do not exceed the bounds of the input label
data.
In this case there are 9 of these pieces, which surround each of the 9 feature
values in the input dataset, with the pieces on the ends continuing on
to negative and positive infinity.
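This observed behavior can be sketched as a nearest-neighbor lookup over the `linear-2.csv` training data (feature values 1 through 9, labels 10x + 3). This is only an illustration of what the predictions look like from the outside, not Bear’s actual algorithm:

```python
import bisect

# Training data from linear-2.csv: feature x = 1..9, label y = 10*x + 3.
xs = list(range(1, 10))
ys = [10 * x + 3 for x in xs]

def predict(x):
    """Piecewise-constant prediction: the label of the nearest training
    feature value, with midpoint ties going to the larger value (as
    observed at 1.5 above).  Beyond the ends it returns the first or
    last label, so it never extrapolates."""
    i = bisect.bisect_left(xs, x)
    if i == 0:
        return ys[0]
    if i == len(xs):
        return ys[-1]
    midpoint = (xs[i - 1] + xs[i]) / 2
    return ys[i] if x >= midpoint else ys[i - 1]

assert predict(1.4) == 13 and predict(1.6) == 23   # nearest, not interpolated
assert predict(1.5) == 23                          # jump at the midpoint
assert predict(-2) == 13 and predict(12) == 93     # no extrapolation
```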

To show this more explicitly, I have created a file
`linear-2-test-features.csv`
of feature values
spanning the interval from −2 to +12, stepping by 0.1 each time.
We can pass those into
`bear_predict`, and ask it to write its predictions out to
the file
`linear-2-test-out.csv`:

You can graph the results using whatever program you like; for simplicity, I have just used Microsoft Excel:

This shows you visually that Bear has created its “perfect” model of this noiseless data as piecewise constant.

If you play around with `bear_predict` some more, you will find
that the prediction does indeed jump up to 23 at a feature value of 1.5,
or half-way between the two input feature values of 1 and 2.
But if you bisect even more, you might be surprised that it actually jumps up
at around 1.498046875.
What’s going on here?

The answer is that Bear internally uses a custom 16-bit floating-point
representation, which I dubbed “`paw`,”
in the core engine that does the statistical modeling.
The `paw` format
is very similar to Google Brain’s `bfloat16` format,
except that `paw` has 7 bits of exponent and
8 bits of mantissa, whereas `bfloat16` has
8 bits of exponent and
7 bits of mantissa.
Google chose one less bit of precision than I did for Bear because
they
had competing design goals due to a legacy codebase
that made it advantageous
for `bfloat16` to have the same dynamic range as the standard
32-bit `float`.
I had no such constraints, and could
let `paw` have one extra bit of precision,
since the dynamic range of
`paw`
of around 10^{±19}
is more than sufficient for all practical purposes,
compared to around
10^{±38} for `bfloat16`.

The result is that feature values greater than 1.5 − 1 / 512 round up to 1.5 in this core modeling.
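The rounding itself is easy to sketch. This models only the 8-bit mantissa of `paw` (not its sign bit or 7-bit exponent), so it is a precision illustration rather than a faithful implementation of the format:

```python
import math

def quantize_paw(x):
    """Round a positive value to 8 explicit mantissa bits -- a sketch of
    the 'paw' format's precision (the real format also has a sign bit and
    a 7-bit exponent, which are not modeled here)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)          # x = m * 2**e, with 0.5 <= m < 1
    ulp = 2.0 ** (e - 1 - 8)      # spacing within the binade [2**(e-1), 2**e)
    return round(x / ulp) * ulp

# In [1, 2) the spacing is 1/256, so anything above 1.5 - 1/512 rounds to 1.5:
assert quantize_paw(1.499) == 1.5
assert quantize_paw(1.497) == 1.49609375   # 383/256, just below the jump
```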

Because Bear’s models are piecewise constant in feature space
anyway, this quantization of the thresholds between adjacent pieces
*usually* has no
significant practical ramifications.

Note, however, that if you have a feature that is not an
*extensive*
quantity—for example, temperatures,
or Unix timestamps that are
arbitrarily referenced to 1970, or positions in space referenced to some
arbitrary origin—then this limited precision could quantize all
the values of such a feature to the same quantized value.
In such cases you should transform the feature into a sensible range;
for example, timestamps relative to the earliest time in the data.
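Here is a sketch of why that transformation helps, using 8-bit mantissa rounding as a stand-in for `paw` precision and hypothetical timestamps a minute apart:

```python
import math

def round_8bit(x):
    """Round to 8 explicit mantissa bits (a stand-in for paw precision)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)          # x = m * 2**e, with 0.5 <= m < 1
    ulp = 2.0 ** (e - 1 - 8)
    return round(x / ulp) * ulp

# Three hypothetical Unix timestamps, one minute apart:
stamps = [1700000000.0, 1700000060.0, 1700000120.0]

# Quantized raw, they all collapse to the same value...
assert len({round_8bit(t) for t in stamps}) == 1

# ...but referenced to the earliest time they stay distinct:
relative = [t - min(stamps) for t in stamps]
assert sorted({round_8bit(t) for t in relative}) == [0.0, 60.0, 120.0]
```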

In completely pathological situations where even this doesn’t
work—for example, values bunched around two widely separated points, where
the small deviations around each point are likely significant—you can always
transform those values in any nonlinear but monotonically nondecreasing
way you like.
Bear only cares about the *ordering* of the feature values, not their
actual numerical values,
except when interpolating
between the “fold” (“break”) points
that it inserts between feature values in your training data
(like between 1 and 2 above).

So now that we have more than an empty model, let’s examine in more
detail what’s actually in
`linear-2.bear.gz`.
Let’s start with the *non*-verbose version of the program:

We still only have one model (labeled with the index 0), but
now it has a “weight” with that 10^{10}
upper-bound “strength” we saw above, and it has an
“assembly” which is “`e|0`”
rather than just the
“`e`” we had before.
I’ll describe these “weights” in more detail in later
tutorials, but for now just take it as the “goodness” of
a model.
The “assembly” `e|0` just tells us that
this model has used
the empty model, and then modeled its residuals with feature 0.
(This will become clearer when we have more complicated models.)

If you now run this command in verbose mode,

`
$ bear_model_details linear-2 -v
`

then you will see essentially all the internal details of this Bear model file. Without getting into the weeds of those details, if you read from the bottom you will see that Bear models the labels with the empty model, and then tries to model the residuals of that model (which are between −40 and +40) with the feature that we have supplied. In this case it succeeded in finding statistical significance in that residual modeling.

As a convenience, `memory_bear` lets you include prediction feature
values in the same input file as your training data, and it will make
predictions for those feature values after it finishes creating its model.
All you need to do is include those prediction feature rows in your input
file *with an empty label field*.
For example,

`
$ cat linear-2.csv linear-2-test-features.csv > linear-2-combined.csv
`

simply appends the prediction feature rows to the training rows. The label values in column 1 are implicitly missing for these rows (since there is no column 1), which marks them as prediction rows. Frequencies are never needed for prediction rows, so it doesn’t matter that column 2 is also missing for these rows.

We now have to specify an output filename for the predictions to be written out to. (In this mode, it is optional whether you want to save the Bear model to a file or not.) So the command

`
$ memory_bear linear-2-combined.csv 1m -l1 -fc2 -o linear-2-predictions.csv
`

trains the model and then makes predictions for our 141 prediction rows,
writing the results out to
`linear-2-predictions.csv`.
We can easily prove that the predictions are identical to those
obtained above:

`
$ cmp linear-2-predictions.csv linear-2-test-out.csv
`

Bear also allows you to specify that one or more columns in your input data
file should be simply passed through as plain text
to the corresponding row of the output
file, without playing any role in the actual modeling or predictions.
This can be useful if one of your columns is a primary key, or if multiple
columns together form a composite primary key, or even if some columns are
simply comments or other descriptive text.
For example, if we add an identifier column and a comment column to
`linear-2.csv`, and add a few prediction rows, to create
`ids.csv`:

and then specify to `memory_bear` that columns 0 and 4
are “ID” (passthrough) columns,

`
$ memory_bear ids.csv 1m --multi-id-columns='[0,4]' -l2 -fc3 -o ids-out.csv
`

then we can see that these two columns are ignored, but passed through for the prediction rows to the output:

If you specify one or more identifier columns in this way, you may not actually
need or want
to see the actual feature value(s) for those rows.
To suppress their output you can just specify
`--no-features-out`:

`
$ memory_bear ids.csv 1m -j'[0,4]' -l2 -fc3 -o ids-out-nf.csv --no-features-out
`

Now in the output you just see your ID columns and the corresponding predicted label:

Although you can specify to `memory_bear` and
`bear_predict` any arbitrary columns to be
labels or identifiers,
both programs write out predictions with all identifiers
first, followed by all features (unless specified otherwise),
followed by all labels,
in each case in the order that the columns appeared in the input data.
If you need an alternative permutation of the columns in the output file
you should use another utility to achieve that result.

We’ve played enough with our noiseless dataset
`linear-2.csv`, so let’s generate some data that at least
has some noise added to it.
You can do this yourself using whatever program you like, but
I’ll use the supplied program
`simple_bear_tutorial_data`
so that you have the same data:

`
$ simple_bear_tutorial_data linear-3.csv -r19680707
`

which creates the file
`linear-3.csv`
with 50 training rows and 250 prediction rows in it.
The final argument `-r19680707` simply ensures that you seed
the random number generator the same as I did, so that you get
exactly the same data.
If you graph the data you should see something like this:

If you now run `memory_bear` on this data,

`
$ memory_bear linear-3.csv 1m -l1 -o linear-3-predictions.csv
`

you should now see an “overall model strength”
of around 8.25.
You don’t have anything to compare this with, yet,
but at least it doesn’t sound as silly as the
10^{10} we got for the perfect model.
Graphing
`linear-3-predictions.csv`
you should see something like

Again, it is piecewise constant, as Bear’s models always are. Indeed, Bear’s model here is like a decision tree on its single feature, where it has determined all the decision points at once. When we add more features the similarities with decision trees will remain evident, but so too will be the differences with how Bear’s algorithms determine the decision points for each feature.

It would be nice to be able to see Bear’s predictions on the same
axes as the input data.
The `memory_bear` program makes that easy, by using the
`--debug` flag:

`
$ memory_bear linear-3.csv 1m -l1 -d -o linear-3-debug.csv
`

Opening
`linear-3-debug.csv`,
you should see that the first 50 rows are just the original data,
with two extra
columns that I’ll return to shortly, followed by the 250
prediction rows.
If we graph just the first two columns, we get what we wanted:

We see that Bear has done a pretty good job of extracting out some piecewise constant dependencies, given the amount of data available and the amount of noise present.

But *is* this really the best that Bear could do under these
circumstances?
Apart from simply believing me that this is about as much as can be extracted
with statistical significance,
without any other *a priori* knowledge of the dependence of
the label on the feature,
we can also look at the *residuals* of this model.
This is where the two extra columns in debug mode are useful.
The third column just provides us Bear’s predictions for the
training examples:

and the fourth column provides the residuals of the training labels over these predictions:

Visually, this looks pretty convincing: there are no clear areas where a piecewise constant model would fit these residuals with any degree of statistical confidence.

We’ve seen that Bear has done a reasonable job of modeling noisy data with a linear dependence with 50 data points. But is that specific to the particular dataset that I created above? What if we change the random seed? For example,

`
$ simple_bear_tutorial_data linear-4.csv -r19660924
`

which creates the dataset

which actually looks a little “smoother”
than `linear-3.csv`.
(Of course, this is all just due to the random noise.)
Running Bear on this dataset,

`
$ memory_bear linear-4.csv 1m -dl1 -olinear-4-predictions.csv
`

we see in
`linear-4-predictions.csv`
that it now only decided to split the feature into *two* pieces:

In effect, Bear also “saw”
the “lumpiness” of the middle
portion of `linear-3.csv`, which wasn’t repeated
in `linear-4.csv`,
and deemed it sufficiently “lumpy” to create a piece there.
Bear doesn’t know if structure that it sees in the
input data is representative of the underlying relationship or just
random noise, just like we don’t (if we don’t look at
`simple_bear_tutorial_data` to learn how the
pseudorandom data was generated, of course!),
and forms its best guess
based on the statistical significance of what it does have.

But still, looking at the scatterplot above, we might wonder if Bear might not have squeezed out a third piece, since there is such an “obvious” linear variation in each of the two pieces it has. But if we look at the actual residuals of that model,

then it becomes less clear.
Certainly, there is not enough data to split these residuals into
a statistically significant piecewise model.
But that’s based on the two pieces that Bear actually found;
our question is whether it could have alternatively found *three*
pieces.
Even doing it by eye, it is difficult to see how Bear could have done this.
Moreover, note that Bear does *not* try every possible splitting
of the feature, not only because this would not be computationally tractable,
but also because the exponential explosion in the number of decisions would
hurt Bear’s ability to find statistical significance at all, since it
keeps track of the “multiple comparisons” problem.

We can also look at what happens when we add more training data. Let’s return to the original random seed, and specify that we want 1000 rows of training data rather than the default 50:

`
$ simple_bear_tutorial_data linear-5.csv -r19680707 -t1000
`

which creates

Running `memory_bear`,

`
$ memory_bear linear-5.csv 1m -dl1 -olinear-5-predictions.csv
`

now yields

where we now have a model with six pieces. The residuals again look reasonable:

Looking at them and the modeling above, you could *almost*
imagine breaking some of the pieces in half.
But that “almost” is the point: there is just not enough
statistical significance in the amount of data we have for each piece to
overcome the inherent noise in the data.

Of course, if you add more and more data, there is more opportunity for
extra structure to be resolved despite the noise.
Using 10K data points gives you 8 pieces;
using 100K gives you 12 pieces;
using 1M gives you 25 pieces;
and using 10M gives you 54 pieces.
(This is easiest to see if you save the Bear model and inspect it
using `bear_model_details` in verbose mode.)

Incidentally, if you have looked at the help screen for
`simple_bear_tutorial_data` you will have seen that the default
underlying relationship between the label y and feature x is actually
y = 3 x + 10, which has been well modeled
by Bear.
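Data of this general shape can be sketched in a few lines. This is not the actual `simple_bear_tutorial_data` program: the noise scale and feature range below are my assumptions, and only the underlying relationship y = 3x + 10 comes from the help screen:

```python
import random

# Hypothetical stand-in for simple_bear_tutorial_data's default behavior:
# y = 3*x + 10 plus Gaussian noise (noise scale and x-range are guesses).
rng = random.Random(19680707)
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
ys = [3.0 * x + 10.0 + rng.gauss(0.0, 5.0) for x in xs]

# An ordinary least-squares fit recovers roughly slope 3 and intercept 10:
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = my - slope * mx

assert abs(slope - 3.0) < 0.5
assert abs(intercept - 10.0) < 2.0
```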

In the real world you will often be missing data for some features for some examples. Bear handles missing feature values.

To see how this works, let’s create a dataset like
`linear-3.csv`, but with around half of the rows having a
missing feature value.
We can do this using the `--missing-percentage` option to
`simple_bear_tutorial_data`:

`
$ simple_bear_tutorial_data missing.csv -r19680707 -n50 -t100
`

where `-n50` sets this “missing percentage” to 50%.
I’ve also upped the total number of training rows to 100 so that
about 50 of them will still have feature data.
Indeed, if you inspect
`missing.csv`
you will see that there are label values for the first 100 rows,
but for 55 of them there is no feature value:

Note that the label values for examples with missing features are
clustered around 110.
This is because the `--missing-bias` default is 100, which is
an extra bias added to the label of all rows with a missing feature value,
in addition to the default `--bias` of 10, so that the expectation
value of the label for examples with a missing feature is 110.
(The default `--weight` of 3 does not come into play, because there
are no feature values to be correlated with for these examples.)

Usually, after the training examples we see the prediction examples. But in this file we see a row with no values at all:

This *is* actually a prediction row
(since its label is missing), but for the case when the feature
value is missing.
After this row are the standard 250 prediction rows that the program has
given us each time.

If you graph the 45 examples in `missing.csv` that do not have a missing
feature value, you will see that they follow the same general
pattern as `linear-3.csv` and `linear-4.csv`:

Running `memory_bear` on this data,
and saving the Bear model file,

`
$ memory_bear missing.csv 1m -dl1 -omissing-predictions.csv -smissing
`

we see from
`missing-predictions.csv`,

that the prediction for a missing feature value is almost 110, and if we graph the predictions for the examples without missing feature values,

that Bear has modeled these similarly to the datasets above without missing feature values.

You might have noticed Bear quoting its “overall model strength” as over 761! Again, we haven’t yet discussed what these “strengths” and “weights” actually are, quantitatively, but 761 seems significantly better than the single-digit strengths previously noted. We can get some insight into what is going on here if we inspect the model file, in verbose mode:

`
$ bear_model_details missing -v
`

There is a fair bit of detail in the output, but if you read it from the bottom, you will see three models listed in a parent–child chain:

- The empty model, which makes a constant prediction of around 64.5. This is the mean value of all labels in the input dataset.
- A “completeness” model. This models whether each example is “complete,” i.e., does not have any missing feature values. It subtracts around 55.3 from its empty-model prediction of 64.5 for examples that are complete, yielding a prediction of around 9.2, and adds around 45.3 for examples that are incomplete, yielding a prediction of around 109.8.
- A “regular” model (i.e., neither marked as an “empty” nor a “completeness” model). This models the residuals of the completeness model above, using the feature value, for the complete examples. Its two piecewise-constant pieces are what is shown in the graph above.
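The arithmetic of that parent–child chain can be checked directly, using the rounded values quoted above (the model file stores them to full precision):

```python
# How the completeness model adjusts the empty model's prediction:
empty_prediction = 64.5

complete_prediction = empty_prediction - 55.3      # ~9.2
incomplete_prediction = empty_prediction + 45.3    # ~109.8

assert abs(complete_prediction - 9.2) < 1e-9
assert abs(incomplete_prediction - 109.8) < 1e-9

# Sanity check: since the empty model's prediction is the overall mean,
# the frequency-weighted adjustments should roughly cancel (45 complete
# and 55 incomplete examples in missing.csv; "roughly" because the
# quoted offsets are rounded to one decimal place):
assert abs(45 * (-55.3) + 55 * 45.3) < 10.0
```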

When Bear computes a “weight” or “strength,” it is always normalized by reference to that of the empty model. Here the empty model is quite bad (but the best that can be done without any features): all of the actual label values are far above or below its constant prediction of around 64.5. The completeness model actually provides most of the improvement in this particular case, and the regular model against the feature provides some further improvement, ultimately giving the complete model a “strength” of over 761, compared to the empty model.

This example shows that Bear can handle missing feature values without needing to discard either features or examples.

And that’s about all that we’re going to look at for data that can be well-modeled using linear regression!

Note that my libraries automatically
handle text files that are compressed with `gzip`.
All that you need to do is specify a filename that ends in `.gz`,
and it will all happen automagically.
The command `gzcat` is a useful analog of `cat` for
such files.
Note that Bear always saves its model file in compressed format.
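If you want to read or write such files from your own scripts, Python’s standard library makes this just as painless (this sketch uses a temporary directory and a made-up two-row CSV):

```python
import gzip
import os
import tempfile

# Write and read back a gzip-compressed text file, the same way Bear's
# libraries treat any filename ending in .gz:
path = os.path.join(tempfile.mkdtemp(), "example.csv.gz")

with gzip.open(path, "wt") as f:
    f.write("1,13\n2,23\n")

with gzip.open(path, "rt") as f:
    assert f.read() == "1,13\n2,23\n"
```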

If you have followed along with (and hopefully enjoyed) all of the above, then feel free to move on to the intermediate tutorial.

© 2022–2023 Dr. John P. Costella