Simple Bear tutorial

You can learn how Bear works by reading or working through this simple tutorial.

How can I work through this tutorial?

You have two options:

  1. build and run Bear using this guide; or
  2. don’t build or run Bear: just download the output I provide, if you trust me.

All the commands I use below are listed here.

Running Bear

If you have built Bear and added it to your path, then execute this command:

$ memory_bear

You should see output that looks something like this (details of all screenshots may vary):

The help screen that you should get from memory_bear if you run it with no arguments

You can see from the “Arguments:” section at the bottom of this screenshot that memory_bear has two mandatory arguments: INPUT_FILENAME and TIME_BUDGET. Let’s just create an empty input file,

$ touch empty.csv

and run memory_bear on it, specifying a time budget of, say, one minute:
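
That is, something like:

$ memory_bear empty.csv 1m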

Running Bear with only the mandatory arguments

Okay, so we learn that it’s mandatory to specify the label column(s) using one of these four options. Let’s just specify it as column 0:
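
For example, using the -l option:

$ memory_bear empty.csv 1m -l 0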

After specifying a label column

So we also need to specify a filename for saving the Bear model or writing an output file with predictions (or both). Let’s just specify a Bear model filename:
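
For example (the model stem empty here is just a hypothetical choice):

$ memory_bear empty.csv 1m -l 0 -s empty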

After specifying a model save filename as well

Now we’re getting somewhere! Bear fired up with a welcome message, and then gave us some feedback on how it parsed what we asked of it. At the left side of each log line you will always see the local time (to the minute) and the time that has elapsed since the last log line. (The first log line tells you the local date when execution started.)

We can see that Bear did a first pass over the input file, but then it told us that an empty data file isn’t allowed!

The simplest dataset

So let’s create the simplest possible dataset: a single example (row), with no features, and just a label value, in single.csv:
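
Since the model details below show that the only label value is 42, you can create this file with something like:

$ echo 42 > single.csv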

File with just a single label value

We can now run memory_bear on this dataset successfully:
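
The command is something like the following (the -s stem single matches the single.bear.gz output we see below):

$ memory_bear single.csv 1m -l 0 -s single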

Running memory_bear on single.csv

All of these steps will become clearer as we work through these tutorials, but in the end we see that Bear built a model and saved it to single.bear.gz. If we take a look at the decompressed bytes in that file,

The bytes in single.bear.gz

we can see that it consists of binary data within plain-text tags; this is the general way that Bear saves objects. Bear then runs the result through gzip, compressing these 179 bytes down to 73.
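
If you want to reproduce that dump yourself, something like this should work (xxd is just one choice of hex-dump tool):

$ gzcat single.bear.gz | xxd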

We can get an overview of what is in this model file using the supplied program bear_model_details:

Getting an overview of single.bear.gz

That’s not too enlightening in this case! We can get a bit more detail using the --verbose option:
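
i.e., something like:

$ bear_model_details single -v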

Getting some more detail of single.bear.gz

We see that there is just an “empty” model, which is the model that Bear creates without using any features at all. This makes sense, because there were no features! This empty model records that the minimum and maximum allowed label values are 42, because this was the only label value in the input data (and Bear never extrapolates), and its single prediction for the label is likewise 42.

We can stream feature data through this model and get it to make predictions using the supplied program bear_predict. Its command-line options are similar to those of memory_bear:

The help screen for bear_predict

As the argument specifications at the bottom of this help screen show, we can run it in “interactive” mode by specifying stdin and stdout for the input and output files, although we need to specify the filetypes for each:

Running bear_predict on single.bear.gz in interactive mode

At this point, the program is waiting for us to specify feature values for an example. In this case there are no features, so if we hit the return key, it spits out its prediction:

After hitting the return key

We can do this as many times as we want:

After hitting the return key again

If we’ve taken more than five seconds to do this, we'll even be given a “progress update” on the number of rows processed so far:

After hitting the return key a third time, more than five seconds after starting

After doing this a fourth time, the fun has probably worn off, and we can finish our input by pressing control-D and return:

Finishing our fun with control-D

More than one example

Let’s make things slightly more interesting by having more than one example in our dataset. For example, 10-labels.csv:

The data file 10-labels.csv containing 10 label-only examples

If we run memory_bear on this dataset, now using short option names,

$ memory_bear 10-labels.csv 1m -l 0 -s 10-labels

and look at the details of the model created, 10-labels.bear.gz,

Details of 10-labels.bear.gz

we can see that the empty model now records a minimum allowed label of −370, a maximum allowed label of 120, and a constant prediction of −22.6. The first two of these are just the bounds of the 10 input labels; Bear never extrapolates beyond the data it is given. Likewise, the prediction of −22.6 is just the mean value of those 10 input labels, which minimizes the MSE loss (the default for Bear) if the empirical probabilities are taken as the best estimate of the true probability distribution.

As before, we can run bear_predict in interactive mode to stream example feature data (again, here we have no features) through the model:

Running bear_predict on 10-labels.bear.gz

where this time our fun was expended after two hits of the return key, after which I hit control-D and return to end the input datastream.

Changing the loss function

I noted above that Bear’s default loss function is MSE, which is minimized if the predictions are the mean expectation values. Let’s change the loss function to MAE:

$ memory_bear 10-labels.csv 1m -l0 -nMAE -s10-labels-mae

and run bear_predict on it:

Running bear_predict on 10-labels-mae.bear.gz

The prediction is now totally different: 3.5. This is because MAE loss is minimized when the prediction is the median expectation value, rather than the mean; if there is an even number of data points, any value in the closed interval between the two middle values minimizes it. If you work it through, the two middle values for 10-labels.csv are 3 and 4, and Bear has followed the normal practice of breaking this arbitrariness by taking the mean of those two values.
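
Here is a minimal sketch, in Python, of the two points just made: the mean minimizes MSE, while the median minimizes MAE. The sample used here is hypothetical (it is not the actual contents of 10-labels.csv), chosen only so that its two middle values are 3 and 4:

from statistics import mean, median

labels = [-370, -15, 1, 3, 4, 120]   # hypothetical label values whose two middle values are 3 and 4

def mse(c):
    return mean((y - c) ** 2 for y in labels)

def mae(c):
    return mean(abs(y - c) for y in labels)

print(mean(labels), median(labels))               # about -42.83, and 3.5 (the midpoint of 3 and 4)
print(mse(mean(labels)) < mse(median(labels)))    # True: the mean gives the lower MSE
print(mae(median(labels)) < mae(mean(labels)))    # True: the median gives the lower MAE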

However, if we now look at the details of the model created, 10-labels-mae.bear.gz,

Details of 10-labels-mae.bear.gz

we see that things look a little strange. The “empty label estimate” is just our median 3.5, but there is also an “empty label prediction” of 0, and the minimum and maximum allowed label values don’t match our dataset. What’s going on here?

The answer is that to support the MAE loss function, Bear performs a trick by transforming the label values under the hood into a quantity linearly related to their cumulative frequencies, which is what you need to use to compute the median. These are the numbers that look wrong above. At the end of the process Bear transforms back to actual values, calling the final prediction the “label estimate.” You don’t need to worry about “how it makes the sausage” unless you are inspecting the details of the model file.

Specifying frequencies

Let’s now add frequencies (counts of the number of examples having the given label value), to create frequencies.csv:

Adding a frequency column

This just means that we have 4 examples with a label value of −1.3, one example with 4.7, and so on. This is completely equivalent to having a data file with four rows with label value −1.3, etc.

We can specify that our input file has a frequency column by using the --has-frequency-column and --frequency-column options (here in their short forms -f and -c):

$ memory_bear frequencies.csv 1m -l0 -f -c1 -sfrequencies

which Bear parses and includes in its feedback to us:

Bear’s feedback of our frequency column specifications

We now see that the model, frequencies.bear.gz, is similar to the previous one,

Details of frequencies.bear.gz

except that the prediction is now −33.321875. This is just the weighted mean of the input label values, where each weight is the relative frequency; e.g., for the first label value of −1.3 the weight is 4 / 32, since the total frequency is 32; and so on. Note that my codebase automatically includes digit-group separators (here, a space within the decimal digits) in its logging, but these are never added to output files. We can confirm this by running bear_predict on the model:

Running bear_predict on frequencies.bear.gz
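
Here is a minimal sketch, in Python, of that frequency-weighted mean, assuming (as the commands above suggest) that frequencies.csv has no header row, with the label in column 0 and the frequency in column 1:

import csv

with open("frequencies.csv") as f:
    # Read (label, frequency) pairs from the two columns.
    rows = [(float(label), float(frequency)) for label, frequency in csv.reader(f)]

total_frequency = sum(frequency for _, frequency in rows)                    # 32 for this file
prediction = sum(label * frequency for label, frequency in rows) / total_frequency
print(prediction)                                                            # -33.321875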

Again, if we switch to the MAE loss function,

$ memory_bear frequencies.csv 1m -l0 -fc1 -nMAE -sfrequencies-mae

and inspect the model file,

$ bear_model_details frequencies-mae -v

then we find that the “empty label estimate” is now 3, which is just the median of the values in frequencies.csv.

Adding a feature

OK, enough with datasets with just labels and no features. Let’s add a feature! Here is a simple dataset linear-1.csv where the second (label) column is obviously linearly dependent on the first (feature) column:

A simple linear dataset

We know how to run Bear on this, where we now just have to specify that the label column is column 1 (i.e., the second column):

$ memory_bear linear-1.csv 1m -l1 -slinear-1

We see Bear doing a lot more than it did for the empty models. Skipping these details for the moment, we excitedly run bear_predict to see the results of our linear regression, now typing a feature value before hitting return:

Running bear_predict on linear-1.bear.gz

Well that was disappointing! No matter what feature value we entered, the model gave us a label prediction of 53. It even did this if we didn’t specify a feature value at all!

Why didn’t we get any linear regression? We can again examine the model linear-1.bear.gz, to try to debug this:

Details of linear-1.bear.gz

This is just the empty model again! Its constant prediction of 53 is simply the mean value of the input labels. But why did Bear only give us the empty model?

The answer is that Bear only gives us statistically significant structure that it finds in the data. In this case it decided that these 9 data points didn’t give it any statistically significant signal of anything more than just the empty model. And that sounds fair enough: without any other information about what sort of relationship we are expecting to find, in general it would be difficult to draw any concrete conclusions from just 9 data points.

Adding more data

So let’s give Bear more of our linear data, so that it might have a chance of finding something statistically significant. One easy way to do that is to add a frequency column to our dataset, and set the frequency of each of our nine examples to, say, 20:

Adding more frequency in linear-2.csv

If we run Bear on this,

$ memory_bear linear-2.csv 1m -l1 -fc2 -slinear-2

and then run bear_predict on the created model linear-2.bear.gz,

Running bear_predict on linear-2.bear.gz

we see that Bear has modeled the 9 data points exactly! Of course, that’s only because our dataset had no noise at all: every feature value mapped exactly to a single label value for every one of its 20 examples, and Bear decided that each of these mappings was statistically significant in itself. This perfect modeling is reflected in Bear declaring the “construction weight” (which I will describe later) to be 10,000,000,000, which is an arbitrary upper bound that I apply in the code. Real datasets will not generally be both noiseless and statistically significant.

Looking more closely at my play with bear_predict above, you can see that if we specify a feature value between two values in our original dataset—here 1.4 and 1.6—Bear doesn’t linearly interpolate, as you might expect from a linear regression; rather, it gives us the prediction 13 of feature value 1 for the former, and the prediction 23 of feature value 2 for the latter. It seems to be using the nearest feature value in the original dataset. Moreover, if we specify the feature to be less than 1, it predicts the smallest label, 13; if we specify the feature to be greater than 9, it predicts the largest label, 93, so it doesn’t extrapolate either. These are general features of Bear: its predictions are piecewise constant, and never exceed the bounds of the input label data. In this case there are 9 of these pieces, one surrounding each of the 9 feature values in the input dataset, with the pieces on the ends continuing on to negative and positive infinity.
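
Here is a minimal sketch, in Python, of that piecewise-constant, non-extrapolating behavior. It is purely illustrative (it is not Bear’s actual algorithm, and it ignores the paw quantization of the thresholds discussed in a later section); the training values are inferred from the predictions quoted above (labels 13, 23, ..., 93 for feature values 1 through 9):

xs = list(range(1, 10))          # the 9 training feature values: 1, 2, ..., 9
ys = [10 * x + 3 for x in xs]    # the corresponding labels: 13, 23, ..., 93

def predict(x):
    # The nearest training feature value determines the constant prediction,
    # so values beyond either end clamp to the end pieces: no extrapolation.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i]

print(predict(1.4))   # 13 (nearest feature value is 1)
print(predict(1.6))   # 23 (nearest feature value is 2)
print(predict(-5))    # 13 (below all of the data: the smallest label)
print(predict(12))    # 93 (above all of the data: the largest label)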

To show this more explicitly, I have created a file linear-2-test-features.csv of feature values spanning the interval from −2 to +12, stepping by 0.1 each time. We can pass those into bear_predict, and ask it to write its predictions out to the file linear-2-test-out.csv:

Running a file of test feature values through bear_predict

You can graph the results using whatever program you like; for simplicity, I have just used Microsoft Excel:

Scatterplot of linear-2-test-out.csv using Microsoft Excel

This shows you visually that Bear has created its “perfect” model of this noiseless data as piecewise constant.

Quantized features and the paw floating point representation

If you play around with bear_predict some more, you will find that the prediction does indeed jump up to 23 at a feature value of 1.5, or half-way between the two input feature values of 1 and 2. But if you bisect even more, you might be surprised that it actually jumps up at around 1.498046875. What’s going on here?

The answer is that Bear internally uses a custom 16-bit floating-point representation, which I dubbed “paw,” in the core engine that does the statistical modeling. The paw format is very similar to Google Brain’s bfloat16 format, except that paw has 7 bits of exponent and 8 bits of mantissa, whereas bfloat16 has 8 bits of exponent and 7 bits of mantissa. Google chose one less bit of precision than I did for Bear because they had competing design goals: a legacy codebase made it advantageous for bfloat16 to have the same dynamic range as the standard 32-bit float. I had no such constraints, and could let paw have one extra bit of precision, since paw’s dynamic range of around 10^±19 is more than sufficient for all practical purposes, compared to around 10^±38 for bfloat16.

The result is that feature values greater than 1.5 − 1 / 512 round up to 1.5 in this core modeling.
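
Here is a minimal sketch, in Python, of rounding to 8 explicit mantissa bits, which is all that this example depends on; it is only an approximation of paw (it ignores the 7-bit exponent’s range limits and other details of the format):

import math

def round_to_mantissa_bits(x, bits=8):
    # Round x to the nearest value representable with `bits` explicit mantissa
    # bits (plus the implicit leading bit).
    if x == 0.0:
        return 0.0
    exponent = math.floor(math.log2(abs(x)))   # exponent of the leading bit
    spacing = 2.0 ** (exponent - bits)         # gap between representable values near x
    return round(x / spacing) * spacing

print(round_to_mantissa_bits(1.4979))   # 1.49609375: below 1.5 - 1/512 = 1.498046875, so it rounds down
print(round_to_mantissa_bits(1.4981))   # 1.5: above 1.5 - 1/512, so it rounds up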

Because Bear’s models are piecewise constant in feature space anyway, you would assume that this quantization of the thresholds between adjacent pieces would usually have no significant practical ramifications. But what if we were to shift all of these feature values to the right, by, say, 1,000,000?

The file linear-2-offset.csv, which is a shift of linear-2.csv to the right by 1,000,000.

You might guess that all of these feature values would be quantized to the same paw value. But if you run Bear on this dataset, you find that it produces identical results, just shifted to the right by 1,000,000. How did it manage this?

If you examine the model created, you will see that it now has a “Last part feature offsets” section: evidently, Bear has stored an offset for the feature that it removes from the raw feature values before quantizing them, so that the variation that matters is not lost to paw’s limited precision.

You might think that this is an artificially contrived example. But actually it’s not: what if one of your features is a Unix time? Those values will in most applications be in the billions, but will likely only vary by millions or less. With paw precision no better than about 0.1%, the quantization would in many cases round every such Unix time to the same paw value, and you would lose it as a potential feature.

Examining the model file

So now that we have more than an empty model, let’s examine in more detail what’s actually in linear-2.bear.gz. Let’s start with the non-verbose version of the program:

Details of linear-2.bear.gz, not in verbose mode

We still only have one model (labeled with the index 0), but now it has a “weight” with that 10^10 upper bound we saw above, and it has an “assembly” of “e|0” rather than just the “e” we had before. I’ll describe these “weights” in more detail in later tutorials, but for now just take a weight as a measure of the “goodness” of a model. The “assembly” e|0 tells us that this model has used the empty model, and then modeled its residuals with feature 0. (This will become clearer when we have more complicated models.)

If you now run this command in verbose mode,

$ bear_model_details linear-2 -v

then you will see essentially all the internal details of this Bear model file. Without getting into the weeds of those details, if you read from the bottom you will see that Bear models the labels with the empty model, and then tries to model the residuals of that model (which are between −40 and +40) with the feature that we have supplied. In this case it succeeded in finding statistical significance in that residual modeling.

Training and prediction in a single run

As a convenience, memory_bear lets you include prediction feature values in the same input file as your training data, and it will make predictions for those feature values after it finishes creating its model. All you need to do is include those prediction feature rows in your input file with an empty label field. For example,

$ cat linear-2.csv linear-2-test-features.csv > linear-2-combined.csv

simply appends the prediction feature rows to the training rows. The label values in column 1 are implicitly missing for these rows (since there is no column 1), which marks them as prediction rows. Frequencies are never needed for prediction rows, so it doesn’t matter that column 2 is also missing for these rows.

We now have to specify an output filename for the predictions to be written out to. (In this mode, it is optional whether you want to save the Bear model to a file or not.) So the command

$ memory_bear linear-2-combined.csv 1m -l1 -fc2 -o linear-2-predictions.csv

trains the model and then makes predictions for our 141 prediction rows, writing the results out to linear-2-predictions.csv. We can easily prove that the predictions are identical to those obtained above:

$ cmp linear-2-predictions.csv linear-2-test-out.csv

Including identifier or other text passthrough columns

Bear also allows you to specify that one or more columns in your input data file should be simply passed through as plain text to the corresponding row of the output file, without playing any role in the actual modeling or predictions. This can be useful if one of your columns is a primary key, or if multiple columns together form a composite primary key, or even if some columns are simply comments or other descriptive text. For example, if we add an identifier column and a comment column to linear-2.csv, and add a few prediction rows, to create ids.csv:

The data file ids.csv

and then specify to memory_bear that columns 0 and 4 are “ID” (passthrough) columns,

$ memory_bear ids.csv 1m --multi-id-columns='[0,4]' -l2 -fc3 -o ids-out.csv

then we can see that these two columns are ignored, but passed through for the prediction rows to the output:

The output file ids-out.csv

If you specify one or more identifier columns in this way, you may not need or want to see the feature value(s) for those rows. To suppress their output you can just specify --no-features-out:

$ memory_bear ids.csv 1m -j'[0,4]' -l2 -fc3 -o ids-out-nf.csv --no-features-out

Now in the output you just see your ID columns and the corresponding predicted label:

The output file ids-out-nf.csv

Note on column ordering for output files

Although you can specify to memory_bear and bear_predict any arbitrary columns to be labels or identifiers, both programs write out predictions with all identifiers first, followed by all features (unless specified otherwise), followed by all labels, in each case in the order that the columns appeared in the input data. If you need an alternative permutation of the columns in the output file you should use another utility to achieve that result.
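
For example, a naive way to swap the first two columns of ids-out.csv with awk (the output filename here is just a hypothetical choice, and this takes no care over quoted fields):

$ awk -F, -v OFS=, '{ t = $1; $1 = $2; $2 = t; print }' ids-out.csv > ids-out-reordered.csv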

A more realistic dataset

We’ve played enough with our noiseless dataset linear-2.csv, so let’s generate some data that at least has some noise added to it. You can do this yourself using whatever program you like, but I’ll use the supplied program simple_bear_tutorial_data so that you have the same data:

$ simple_bear_tutorial_data linear-3.csv -r19680707

which creates the file linear-3.csv with 50 training rows and 250 prediction rows in it. The final argument -r19680707 simply ensures that you seed the random number generator the same as I did, so that you get exactly the same data. If you graph the data you should see something like this:

Scatterplot of linear-3.csv

If you now run memory_bear on this data,

$ memory_bear linear-3.csv 1m -l1 -o linear-3-predictions.csv

you should now see a “construction weight” of around 8.25. You don’t have anything to compare this with, yet, but at least it doesn’t sound as silly as the 10^10 we got for the perfect model. Graphing linear-3-predictions.csv you should see something like

Scatterplot of linear-3-predictions.csv

Again, it is piecewise constant, as Bear’s models always are. Indeed, Bear’s model here is like a decision tree on its single feature, where it has determined all the decision points at once. When we add more features, the similarities with decision trees will remain evident, but so too will the differences in how Bear’s algorithms determine the decision points for each feature.

It would be nice to be able to see Bear’s predictions on the same axes as the input data. The memory_bear program makes that easy, by using the --debug flag:

$ memory_bear linear-3.csv 1m -l1 -d -o linear-3-debug.csv

Opening linear-3-debug.csv, you should see that the first 50 rows are just the original data, with two extra columns that I’ll return to shortly, followed by the 250 prediction rows. If we graph just the first two columns, we get what we wanted:

Scatterplot of the first two columns of linear-3-debug.csv

We see that Bear has done a pretty good job of extracting out some piecewise constant dependencies, given the amount of data available and the amount of noise present.

But is this really the best that Bear could do under these circumstances? Apart from simply believing me that this is about as much as can be extracted with statistical significance, without any other a priori knowledge of the dependence of the label on the feature, we can also look at the residuals of this model. This is where the two extra columns in debug mode are useful. The third column provides Bear’s predictions for the training examples:

Bear’s predictions for the training examples

and the fourth column provides the residuals of the training labels over these predictions:

The residuals for Bear’s modeling of linear-3.csv

Visually, this looks pretty convincing: there are no clear areas where a piecewise constant model would fit these residuals with any degree of statistical confidence.

Changing the random seed

We’ve seen that Bear does a reasonable job of modeling noisy data with a linear dependence, given 50 data points. But is that specific to the particular dataset that I created above? What if we change the random seed? For example,

$ simple_bear_tutorial_data linear-4.csv -r19660924

which creates the dataset

Scatterplot of linear-4.csv

which actually looks a little “smoother” than linear-3.csv. (Of course, this is all just due to the random noise.) Running Bear on this dataset,

$ memory_bear linear-4.csv 1m -dl1 -olinear-4-predictions.csv

we see in linear-4-predictions.csv that it now only decided to split the feature into two pieces:

Scatterplot of linear-4-predictions.csv

In effect, Bear also “saw” the “lumpiness” of the middle portion of linear-3.csv, which wasn’t repeated in linear-4.csv, and deemed it sufficiently “lumpy” to create a piece there. Bear doesn’t know whether structure that it sees in the input data is representative of the underlying relationship or is just random noise, just as we don’t (unless we look at simple_bear_tutorial_data to learn how the pseudorandom data was generated, of course!), and it forms its best guess based on the statistical significance of what it does have.

But still, looking at the scatterplot above, we might wonder whether Bear could have squeezed out a third piece, since there is such an “obvious” linear variation in each of the two pieces it has. But if we look at the actual residuals of that model,

The residuals of linear-4-predictions.csv

then it becomes less clear. Certainly, there is not enough data to split these residuals into a statistically significant piecewise model. But that’s based on the two pieces that Bear actually found; our question is whether it could have alternatively found three pieces. Even doing it by eye, it is difficult to see how Bear could have done this. Moreover, note that Bear does not try every possible splitting of the feature, not only because this would not be computationally tractable, but also because the exponential explosion in the number of decisions would hurt Bear’s ability to find statistical significance at all, since it keeps track of the “multiple comparisons” problem.

Adding more training data

We can also look at what happens when we add more training data. Let’s return to the original random seed, and specify that we want 1000 rows of training data rather than the default 50:

$ simple_bear_tutorial_data linear-5.csv -r19680707 -t1000

which creates

Scatterplot of linear-5.csv

Running memory_bear,

$ memory_bear linear-5.csv 1m -dl1 -olinear-5-predictions.csv

now yields

Scatterplot of linear-5-predictions.csv

where we now have a model with six pieces. The residuals again look reasonable:

The residuals of linear-5-predictions.csv

Looking at them and the modeling above, you could almost imagine breaking some of the pieces in half. But that “almost” is the point: there is just not enough statistical significance in the amount of data we have for each piece to overcome the inherent noise in the data.

Of course, if you add more and more data, there is more opportunity for extra structure to be resolved despite the noise. Using 10K data points gives you 8 pieces; using 100K gives you 12 pieces; using 1M gives you 25 pieces; and using 10M gives you 54 pieces. (This is easiest to see if you save the Bear model and inspect it using bear_model_details in verbose mode.)

Incidentally, if you have looked at the help screen for simple_bear_tutorial_data you will have seen that the default underlying relationship between the label y and feature x is actually y = 3 x + 10, which has been well modeled by Bear.

Missing data

In the real world you will often be missing data for some features for some examples. Bear handles missing feature values.

To see how this works, let’s create a dataset like linear-3.csv, but with around half of the rows having a missing feature value. We can do this using the --missing-percentage option to simple_bear_tutorial_data:

$ simple_bear_tutorial_data missing-1.csv -r19680707 -n50 -t100

where -n50 sets this “missing percentage” to 50%. I’ve also upped the total number of training rows to 100 so that about 50 of them will still have feature data. Indeed, if you inspect missing-1.csv you will see that there are label values for the first 100 rows, but for 55 of them there is no feature value:

The first 10 rows of missing-1.csv

Note that the label values for examples with missing features are clustered around 110. This is because the --missing-bias default is 100, which is an extra bias added to the label of all rows with a missing feature value, in addition to the default --bias of 10, so that the expectation value of the label for examples with a missing feature is 110. (The default --weight of 3 does not come into play, because there are no feature values to be correlated with for these examples.)

Usually, the prediction examples come straight after the training examples. But in this file, the first row after the 100 training examples has no values at all:

The 91st through 110th row of missing-1.csv

This is actually a prediction row (since its label is missing), but for the case when the feature value is missing. After this row are the standard 250 prediction rows that the program has given us each time.

If you graph the 45 training examples in missing-1.csv that do not have a missing feature value, you will see that they follow the same general pattern as linear-3.csv and linear-4.csv:

The 45 training examples in missing-1.csv that do not have a missing feature value

Running memory_bear on this data, and saving the Bear model file,

$ memory_bear missing-1.csv -dl1 1m -omissing-1-predictions.csv -smissing-1

we see from missing-1-predictions.csv,

The 91st through 110th row of missing-1-predictions.csv

that the prediction for a missing feature value is almost 110, and if we graph the predictions for the examples without missing feature values,

Training data and predictions for the examples without a missing feature value

that Bear has modeled these similarly to the datasets above without missing feature values.

You might have noticed Bear quoting its “construction weight” as over 761! Again, we haven’t yet discussed what these “weights” actually are, quantitatively, but 761 seems significantly better than the single-digit weights previously noted. We can get some insight into what is going on here if we inspect the model file, in verbose mode:

$ bear_model_details missing-1 -v

There is a fair bit of detail in the output, but if you read it from the bottom, you will see two models listed in a parent–child chain.

First, there is the empty model, which makes a constant prediction of around 64.5. This is the mean value of all labels in the input dataset.

Next is a model with feature 0. Bear says the “completeness” model is nontrivial. This models whether each example is “complete,” i.e., does not have any missing feature values. In this case, the labels for examples with missing values were found to be statistically significantly different from those with supplied values. It predicts around 45.3 on top of the empty-model prediction of around 64.5 for examples that are incomplete, yielding an overall prediction of around 109.8, and subtracts around 55.3 from the empty-model prediction of 64.5 for examples that are complete, yielding a prediction of around 9.2.

After that, Bear tells us that the “complete model” is nontrivial. This models the residuals of the completeness model above, using the feature value, for just the complete examples (because it doesn’t have any feature value for the incomplete examples!). Its two piecewise-constant pieces are what is shown in the graph above.

When Bear computes a “weight,” it is always normalized by reference to that of the empty model. Here the empty model is quite bad (but the best that can be done without any features): all of the actual label values are far above or below its constant prediction of around 64.5. The completeness model actually provides most of the improvement in this particular case, and the complete model against the feature provides some further improvement, ultimately giving the construction a “weight” of over 761, compared to the empty model.

Simpler examples with missing values

The example above shows that Bear can handle missing feature values without needing to discard either features or examples. Let’s simplify the dataset so that we can see more clearly what Bear is doing. The dataset missing-2.csv has statistically significant frequencies like we had in linear-2.csv, but with just three feature values and corresponding label values, plus a missing feature:

The dataset missing-2.csv

Running Bear on this dataset,

$ memory_bear missing-2.csv -dHLlabel -fCfrequency 1m -omissing-2-predictions.csv -smissing-2

we see that the predictions are perfect, so that the residuals are all zero:

The predictions for missing-2.csv, in debug mode

Looking at the model, we see that the empty label prediction is 70 (the mean of the labels), the completeness model is nontrivial, predicting an additional 50 for incomplete examples and −50 for complete examples, and the nontrivial complete model predicts an additional −10, 0, or +10 for the complete examples.

Let’s now change the missing example label:

The dataset missing-3.csv

Running Bear on this,

$ memory_bear missing-3.csv -dHLlabel -fCfrequency 1m -omissing-3-predictions.csv -smissing-3

the predictions are still perfect. The model file shows that the completeness model is still nontrivial, but the incomplete and complete predictions are both zero. How can that be?

The answer is that the distribution of residuals is statistically significantly different for incomplete and complete examples, but it just so happens that the mean of each is zero. If we change the distribution of label values for the incomplete examples to exactly match that of the complete examples:

The dataset missing-4.csv

and run Bear on it:

$ memory_bear missing-4.csv -dHLlabel -fCfrequency 1m -omissing-4-predictions.csv -smissing-4

then we see that not only is the model no longer perfect for the incomplete examples:

The predictions for missing-4.csv, in debug mode

but the model file now indicates that the completeness model is trivial.

Now consider the dataset

The dataset missing-5.csv

In this case the completeness model is nontrivial, but the complete model is trivial (as it must be, since there is only one distinct feature value). Finally,

The dataset missing-6.csv

doesn’t have enough statistical significance for a nontrivial complete model, and its completeness model is trivial because the distribution of the incomplete examples is identical to that of the complete examples. Bear therefore found the model using the feature to be trivial overall, and hence discarded it, leaving just the empty model.

Note on compressed files

Note that my libraries automatically handle text files that are compressed with gzip: all you need to do is specify a filename that ends in .gz, and it will all happen automagically. The command gzcat is a useful analog of cat for such files. Bear itself always saves its model files in this compressed format.

Intermediate tutorial

If you have followed along with (and hopefully enjoyed) all of the above, then feel free to move on to the intermediate tutorial.