You can learn how Bear works by reading this simple tutorial, or by working through it yourself. All the commands I use below are listed here.
If you have built Bear and added it to your path, then execute this command:
You should see output that looks something like this (details of all screenshots may vary):
You can see from the “Arguments:” section at the bottom of this screenshot that memory_bear has two mandatory arguments: INPUT_FILENAME and TIME_BUDGET. Let’s just create an empty input file,
$ touch empty.csv
and run memory_bear on it, specifying a time budget of, say, one minute:
Okay, so we learn that it’s mandatory to specify the label column(s) using one of these four options. Let’s just specify it as column 0:
So we also need to specify a filename for saving the Bear model or writing an output file with predictions (or both). Let’s just specify a Bear model filename:
Now we’re getting somewhere! Bear fired up with a welcome message, and then some feedback to us of its parsing of what we have asked of it. At the left side of each log line you will always see the local time (to the minute) and the time that has elapsed since the last log line. (The first log line tells you the local date when execution started.)
We can see that Bear did a first pass over the input file, but then it told us that an empty data file isn’t allowed!
So let’s create the simplest possible dataset: a single example (row), with no features, and just a label value, in single.csv:
We can now run memory_bear on this dataset successfully:
All of these steps will become clearer as we work through these tutorials, but in the end we see that Bear built a model and saved it to single.bear.gz. If we take a look at the decompressed bytes in that file,
we can see that it consists of binary data within plain-text tags; this is the general way that Bear saves objects. Bear then runs this through gzip, compressing these 179 bytes down to 73.
We can get an overview of what is in this model file using the supplied program bear_model_details:
That’s not too enlightening in this case! We can get a bit more detail using the --verbose option:
We see that there is just an “empty” model, which is the model that Bear creates without using any features at all. This makes sense, because there were no features! This empty model records that the minimum and maximum allowed label values are 42, because this was the only label value in the input data (and Bear never extrapolates), and its single prediction for the label is likewise 42.
We can stream feature data through this model and get it to make predictions using the supplied program bear_predict. Its command-line options are similar to those of memory_bear:
As the argument specifications at the bottom of this help screen show, we can run it in “interactive” mode by specifying stdin and stdout for the input and output files, although we need to specify the filetypes for each:
At this point, the program is waiting for us to specify feature values for an example. In this case there are no features, so if we hit the return key, it spits out its prediction:
We can do this as many times as we want:
If we’ve taken more than five seconds to do this, we'll even be given a “progress update” on the number of rows processed so far:
After doing this a fourth time, the fun has probably worn off, and we can finish our input by pressing control-D and return:
Let’s make things slightly more interesting by having more than one example in our dataset. For example, 10-labels.csv:
If we run memory_bear on this dataset, now using short option names,
$ memory_bear 10-labels.csv 1m -l 0 -s 10-labels
and look at the details of the model created, 10-labels.bear.gz,
we can see that the empty model now records a minimum allowed label of −370, a maximum allowed label of 120, and a constant prediction of −23.2. The first two of these are just the bounds of the 10 input labels; Bear never extrapolates beyond the data it is given. Likewise, the prediction of −23.2 is just the mean value of those 10 input labels, which minimizes the MSE loss (the default for Bear) if the empirical probabilities are taken as the best estimate of the true probability distribution.
As before, we can run bear_predict in interactive mode to stream example feature data (again, here we have no features) through the model:
where this time our fun was expended after two hits of the return key, after which I hit control-D and return to end the input datastream.
Let’s now add a frequency column to our 10 examples, to create frequencies.csv:
This just means that we have 4 examples with a label value of −1.3, one example with 4.7, and so on. This is completely equivalent to having a data file with four rows with label value −1.3, etc.
We can specify that our input file has a frequency column by using the --has-frequency-column and --frequency-column options (here in their short forms -f and -c):
$ memory_bear frequencies.csv 1m -l0 -f -c1 -sfrequencies
which Bear parses and includes in its feedback to us:
We now see that the model, frequencies.bear.gz, is similar to the previous one,
except that the prediction is now −35.84838709677419. Note that my codebase automatically includes separators in its logging, but these are never added in output files. We can confirm both of these points if we run bear_predict on the model:
This is just the weighted average of the input label values, where each weight is just the relative frequency; e.g., for the first label value of −1.3 it is 4 / 31, since the total frequency is 31; and so on.
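The same weighted average is easy to reproduce in a few lines of Python. Only the (−1.3, 4) and (4.7, 1) pairs and the total frequency of 31 come from the tutorial data; the other pairs are hypothetical:

```python
# Frequency-weighted mean for the empty model (a sketch, not Bear's code).
# Only (-1.3, 4), (4.7, 1), and the total frequency of 31 match the
# tutorial's data; the remaining (label, frequency) pairs are hypothetical.
data = [(-1.3, 4), (4.7, 1), (-370.0, 2), (120.0, 1), (-50.0, 4),
        (10.0, 3), (-20.0, 5), (30.0, 2), (59.6, 3), (-15.0, 6)]

total = sum(f for _, f in data)                    # total frequency: 31
prediction = sum(y * f for y, f in data) / total   # weighted mean
```

As noted earlier, a frequency column is completely equivalent to repeating each row that many times, and this weighted mean equals the plain mean of the expanded rows.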
OK, enough with datasets with just labels and no features. Let’s add a feature! Here is a simple dataset linear-1.csv where the second (label) column is obviously linearly dependent on the first (feature) column:
We know how to run Bear on this, where we now just have to specify that the label column is column 1 (i.e., the second column):
$ memory_bear linear-1.csv 1m -l1 -slinear-1
We see Bear doing a lot more than it did for the empty models. Skipping these details for the moment, we excitedly run bear_predict to see the results of our linear regression, now typing a feature value before hitting return:
Well that was disappointing! No matter what feature value we entered, the model gave us a label prediction of 53. It even did this if we didn’t specify a feature value at all!
Why didn’t we get any linear regression? We can again examine the model linear-1.bear.gz, to try to debug this:
This is just the empty model again! Its constant prediction of 53 is just the mean value of the input labels. But why did Bear just give us the empty model?
The answer is that Bear only gives us statistically significant structure that it finds in the data. In this case it decided that these 9 data points didn’t give it any statistically significant signal of anything more than just the empty model. And that sounds fair enough: without any other information about what sort of relationship we are expecting to find, in general it would be difficult to draw any concrete conclusions from just 9 data points.
So let’s give Bear more of our linear data, so that it might have a chance of finding something statistically significant. One easy way to do that is to add a frequency column to our dataset, and set the frequency of each of our nine examples to, say, 20:
If we run Bear on this,
$ memory_bear linear-2.csv 1m -l1 -fc2 -slinear-2
and then run bear_predict on the created model linear-2.bear.gz,
we see that Bear has modeled the 9 data points exactly! Of course, that’s only because our dataset had no noise at all: every feature value mapped exactly to a single label value for every one of its 20 examples, and Bear decided that each of these mappings was statistically significant in itself. This perfect modeling is reflected in Bear declaring the “overall model strength” (which I will describe later) to be 10,000,000,000, which is an arbitrary upper bound that I apply in the code. Real datasets will not generally be both noiseless and statistically significant.
Looking more closely at the bear_predict session above, you can see that if we specify a feature value between two values in our original dataset—here 1.4 and 1.6—Bear doesn’t linearly interpolate, as you might expect from linear regression; rather, it gives us the prediction 13 of feature value 1 for the former, and the prediction 23 of feature value 2 for the latter. It seems to be using the nearest feature value in the original dataset. Moreover, if we specify the feature to be less than 1, it predicts the smallest label, 13; if we specify the feature to be greater than 9, it predicts the largest label, 93; so it doesn’t extrapolate either. These are general features of Bear: its predictions are piecewise constant, and never exceed the bounds of the input label data. In this case there are 9 of these pieces, one surrounding each of the 9 feature values in the input dataset, with the pieces on the ends continuing on to negative and positive infinity.
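This behavior is easy to mimic. Here is a Python sketch (my reading of the behavior described above, not Bear's internals) of a piecewise-constant model for linear-2.csv, with fold points half-way between adjacent training feature values:

```python
import bisect

# Piecewise-constant prediction over the 9 training points of
# linear-2.csv (feature 1 -> label 13, ..., feature 9 -> label 93).
# This is a sketch of the assumed behavior, not Bear's actual engine.
features = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels   = [13, 23, 33, 43, 53, 63, 73, 83, 93]

# Fold points half-way between adjacent feature values: 1.5, 2.5, ..., 8.5.
folds = [(a + b) / 2 for a, b in zip(features, features[1:])]

def predict(x):
    # bisect_right picks the upper piece at exactly 1.5, matching the
    # observed jump up to 23 there; the end pieces extend to infinity.
    return labels[bisect.bisect_right(folds, x)]
```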
To show this more explicitly, I have created a file linear-2-test-features.csv of feature values spanning the interval from −2 to +12, stepping by 0.1 each time. We can pass those into bear_predict, and ask it to write its predictions out to the file linear-2-test-out.csv:
You can graph the results using whatever program you like; for simplicity, I have just used Microsoft Excel:
This shows you visually that Bear has created its “perfect” model of this noiseless data as piecewise constant.
If you play around with bear_predict some more, you will find that the prediction does indeed jump up to 23 at a feature value of 1.5, or half-way between the two input feature values of 1 and 2. But if you bisect even more, you might be surprised that it actually jumps up at around 1.498046875. What’s going on here?
The answer is that Bear internally uses a custom 16-bit floating-point representation, which I dubbed “paw,” in the core engine that does the statistical modeling. The paw format is very similar to Google Brain’s bfloat16 format, except that paw has 7 bits of exponent and 8 bits of mantissa, whereas bfloat16 has 8 bits of exponent and 7 bits of mantissa. Google chose one less bit of precision than I did for Bear because they had competing design goals due to a legacy codebase that made it advantageous for bfloat16 to have the same dynamic range as the standard 32-bit float. I had no such constraints, and could let paw have one extra bit of precision, since paw’s dynamic range of around 10^±19 is more than sufficient for all practical purposes, compared to around 10^±38 for bfloat16.
The result is that feature values greater than 1.5 − 1 / 512 round up to 1.5 in this core modeling.
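You can check this threshold yourself. The following Python sketch rounds to the nearest multiple of 1/256, the spacing of an 8-bit mantissa for values in [1, 2); it ignores paw's exponent handling entirely:

```python
# Round-to-nearest at paw's 8-bit mantissa precision, for values in
# [1, 2) where the representable spacing is 2**-8 = 1/256. This only
# illustrates the rounding threshold; it is not paw's implementation.
def quantize_1_to_2(x):
    assert 1.0 <= x < 2.0
    return round(x * 256) / 256

# The representable values nearest 1.5 are 383/256 and 384/256 = 1.5, so
# the round-up threshold sits half-way between them, at 1.5 - 1/512.
```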
Because Bear’s models are piecewise constant in feature space anyway, this quantization of the thresholds between adjacent pieces usually has no significant practical ramifications.
Note, however, that if you have a feature that is not an extensive quantity—for example, temperatures, or Unix timestamps that are arbitrarily referenced to 1970, or positions in space referenced to some arbitrary origin—then this limited precision could quantize all the values of such a feature to the same quantized value. In such cases you should transform the feature into a sensible range; for example, timestamps relative to the earliest time in the data.
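For example, a timestamp feature can be rebased like this (a sketch with hypothetical values):

```python
# Rebase an interval-scaled feature (here hypothetical Unix timestamps)
# to the earliest time in the data, so that its variation is not swamped
# by a large arbitrary offset at 16-bit precision.
timestamps = [1700000000, 1700000060, 1700003600, 1700086400]

origin = min(timestamps)
rebased = [t - origin for t in timestamps]   # seconds since earliest time
```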
In completely pathological situations where even this doesn’t work—for example, where the values are bunched around two particular values, and the deviations around those values are likely to be significant—you can always transform those values in any nonlinear but monotonically nondecreasing way you like. Bear only cares about the ordering of the feature values, not their actual numerical values, except when interpolating between the “fold” (“break”) points that it inserts between feature values in your training data (like between 1 and 2 above).
So now that we have more than an empty model, let’s examine in more details what’s actually in linear-2.bear.gz. Let’s start with the non-verbose version of the program:
We still only have one model (labeled with the index 0), but now it has a “weight” with that 10^10 upper-bound “strength” we saw above, and it has an “assembly” which is “e|0” rather than just the “e” we had before. I’ll describe these “weights” in more detail in later tutorials, but for now just take it as the “goodness” of a model. The “assembly” e|0 just tells us that this model has used the empty model, and then modeled its residuals with feature 0. (This will become clearer when we have more complicated models.)
If you now run this command in verbose mode,
$ bear_model_details linear-2 -v
then you will see essentially all the internal details of this Bear model file. Without getting into the weeds of those details, if you read from the bottom you will see that Bear models the labels with the empty model, and then tries to model the residuals of that model (which are between −40 and +40) with the feature that we have supplied. In this case it succeeded in finding statistical significance in that residual modeling.
As a convenience, memory_bear lets you include prediction feature values in the same input file as your training data, and it will make predictions for those feature values after it finishes creating its model. All you need to do is include those prediction feature rows in your input file with an empty label field. For example,
$ cat linear-2.csv linear-2-test-features.csv > linear-2-combined.csv
simply appends the prediction feature rows to the training rows. The label values in column 1 are implicitly missing for these rows (since there is no column 1), which marks them as prediction rows. Frequencies are never needed for prediction rows, so it doesn’t matter that column 2 is also missing for these rows.
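The rule for telling the two kinds of rows apart can be sketched in a few lines of Python (this is my reading of the convention described above, not memory_bear's actual parser):

```python
import csv, io

# A training row has a non-empty label in column 1; a row whose label
# field is missing or empty is a prediction row. The miniature "file"
# below is hypothetical, in the feature,label,frequency layout of
# linear-2.csv, with two bare prediction feature rows appended.
combined = "1,13,20\n2,23,20\n1.55\n7.2\n"

training, prediction = [], []
for row in csv.reader(io.StringIO(combined)):
    if len(row) > 1 and row[1] != "":
        training.append(row)
    else:
        prediction.append(row)
```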
We now have to specify an output filename for the predictions to be written out to. (In this mode, it is optional whether you want to save the Bear model to a file or not.) So the command
$ memory_bear linear-2-combined.csv 1m -l1 -fc2 -o linear-2-predictions.csv
trains the model and then makes predictions for our 141 prediction rows, writing the results out to linear-2-predictions.csv. We can easily prove that the predictions are identical to those obtained above:
$ cmp linear-2-predictions.csv linear-2-test-out.csv
Bear also allows you to specify that one or more columns in your input data file should be simply passed through as plain text to the corresponding row of the output file, without playing any role in the actual modeling or predictions. This can be useful if one of your columns is a primary key, or if multiple columns together form a composite primary key, or even if some columns are simply comments or other descriptive text. For example, if we add an identifier column and a comment column to linear-2.csv, and add a few prediction rows, to create ids.csv:
and then specify to memory_bear that columns 0 and 4 are “ID” (passthrough) columns,
$ memory_bear ids.csv 1m --multi-id-columns='[0,4]' -l2 -fc3 -o ids-out.csv
then we can see that these two columns are ignored, but passed through for the prediction rows to the output:
If you specify one or more identifier columns in this way, you may not actually need or want to see the actual feature value(s) for those rows. To suppress their output you can just specify --no-features-out:
$ memory_bear ids.csv 1m -j'[0,4]' -l2 -fc3 -o ids-out-nf.csv --no-features-out
Now in the output you just see your ID columns and the corresponding predicted label:
Although you can specify to memory_bear and bear_predict any arbitrary columns to be labels or identifiers, both programs write out predictions with all identifiers first, followed by all features (unless specified otherwise), followed by all labels, in each case in the order that the columns appeared in the input data. If you need an alternative permutation of the columns in the output file you should use another utility to achieve that result.
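Any CSV-aware tool will do for that final permutation; for example, a small Python sketch (the column names and ordering shown are hypothetical):

```python
import csv, io

# Permute the columns of CSV text into a new order: a sketch of the
# post-processing step suggested above, not part of Bear itself.
def permute_columns(text, order):
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([row[i] for i in order])
    return out.getvalue()

# e.g. put the predicted label column before the identifier column:
# permute_columns("id,pred\n7,13\n", [1, 0])
```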
We’ve played enough with our noiseless dataset linear-2.csv, so let’s generate some data that at least has some noise added to it. You can do this yourself using whatever program you like, but I’ll use the supplied program simple_bear_tutorial_data so that you have the same data:
$ simple_bear_tutorial_data linear-3.csv -r19680707
which creates the file linear-3.csv with 50 training rows and 250 prediction rows in it. The final argument -r19680707 simply ensures that you seed the random number generator the same as I did, so that you get exactly the same data. If you graph the data you should see something like this:
If you now run memory_bear on this data,
$ memory_bear linear-3.csv 1m -l1 -o linear-3-predictions.csv
you should now see an “overall model strength” of around 8.25. You don’t have anything to compare this with, yet, but at least it doesn’t sound as silly as the 10^10 we got for the perfect model. Graphing linear-3-predictions.csv you should see something like
Again, it is piecewise constant, as Bear’s models always are. Indeed, Bear’s model here is like a decision tree on its single feature, where it has determined all the decision points at once. When we add more features the similarities with decision trees will remain evident, but so too will be the differences with how Bear’s algorithms determine the decision points for each feature.
It would be nice to be able to see Bear’s predictions on the same axes as the input data. The memory_bear program makes that easy, by using the --debug flag:
$ memory_bear linear-3.csv 1m -l1 -d -o linear-3-debug.csv
Opening linear-3-debug.csv, you should see that the first 50 rows are just the original data, with two extra columns that I’ll return to shortly, followed by the 250 prediction rows. If we graph just the first two columns, we get what we wanted:
We see that Bear has done a pretty good job of extracting out some piecewise constant dependencies, given the amount of data available and the amount of noise present.
But is this really the best that Bear could do under these circumstances? Apart from simply believing me that this is about as much as can be extracted with statistical significance, without any other a priori knowledge of the dependence of the label on the feature, we can also look at the residuals of this model. This is where the two extra columns in debug mode are useful. The third column just provides us with Bear’s predictions for the training examples:
and the fourth column provides the residuals of the training labels over these predictions:
Visually, this looks pretty convincing: there are no clear areas where a piecewise constant model would fit these residuals with any degree of statistical confidence.
We’ve seen that Bear has done a reasonable job of modeling noisy data with a linear dependence with 50 data points. But is that specific to the particular dataset that I created above? What if we change the random seed? For example,
$ simple_bear_tutorial_data linear-4.csv -r19660924
which creates the dataset
which actually looks a little “smoother” than linear-3.csv. (Of course, this is all just due to the random noise.) Running Bear on this dataset,
$ memory_bear linear-4.csv 1m -dl1 -olinear-4-predictions.csv
we see in linear-4-predictions.csv that it now only decided to split the feature into two pieces:
In effect, Bear also “saw” the “lumpiness” of the middle portion of linear-3.csv, which wasn’t repeated in linear-4.csv, and deemed it sufficiently “lumpy” to create a piece there. Bear doesn’t know if structure that it sees in the input data is representative of the underlying relationship or just random noise, just like we don’t (if we don’t look at simple_bear_tutorial_data to learn how the pseudorandom data was generated, of course!), and forms its best guess based on the statistical significance of what it does have.
But still, looking at the scatterplot above, we might wonder whether Bear could have squeezed out a third piece, since there is such an “obvious” linear variation in each of the two pieces it has. But if we look at the actual residuals of that model,
then it becomes less clear. Certainly, there is not enough data to split these residuals into a statistically significant piecewise model. But that’s based on the two pieces that Bear actually found; our question is whether it could have alternatively found three pieces. Even doing it by eye, it is difficult to see how Bear could have done this. Moreover, note that Bear does not try every possible splitting of the feature, not only because this would not be computationally tractable, but also because the exponential explosion in the number of decisions would hurt Bear’s ability to find statistical significance at all, since it keeps track of the “multiple comparisons” problem.
We can also look at what happens when we add more training data. Let’s return to the original random seed, and specify that we want 1000 rows of training data rather than the default 50:
$ simple_bear_tutorial_data linear-5.csv -r19680707 -t1000
$ memory_bear linear-5.csv 1m -dl1 -olinear-5-predictions.csv
where we now have a model with six pieces. The residuals again look reasonable:
Looking at them and the modeling above, you could almost imagine breaking some of the pieces in half. But that “almost” is the point: there is just not enough statistical significance in the amount of data we have for each piece to overcome the inherent noise in the data.
Of course, if you add more and more data, there is more opportunity for extra structure to be resolved despite the noise. Using 10K data points gives you 8 pieces; using 100K gives you 12 pieces; using 1M gives you 25 pieces; and using 10M gives you 54 pieces. (This is easiest to see if you save the Bear model and inspect it using bear_model_details in verbose mode.)
Incidentally, if you have looked at the help screen for simple_bear_tutorial_data you will have seen that the default underlying relationship between the label y and feature x is actually y = 3 x + 10, which has been well modeled by Bear.
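If you'd rather generate similar data without the supplied program, here is a rough Python sketch of that default relationship. The feature range and the noise distribution below are my guesses, not the actual parameters of simple_bear_tutorial_data:

```python
import random

# A rough generator of data like simple_bear_tutorial_data's default
# relationship y = 3*x + 10, plus noise. The feature range and noise
# level here are guesses, not the actual program's parameters.
rng = random.Random(19680707)   # seed, as with the -r option above

def make_rows(n):
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 3.0 * x + 10.0 + rng.gauss(0.0, 5.0)
        rows.append((x, y))
    return rows
```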
In the real world you will often be missing data for some features for some examples. Bear handles missing feature values.
To see how this works, let’s create a dataset like linear-3.csv, but with around half of the rows having a missing feature value. We can do this using the --missing-percentage option to simple_bear_tutorial_data:
$ simple_bear_tutorial_data missing.csv -r19680707 -n50 -t100
where -n50 sets this “missing percentage” to 50%. I’ve also upped the total number of training rows to 100 so that about 50 of them will still have feature data. Indeed, if you inspect missing.csv you will see that there are label values for the first 100 rows, but for 55 of them there is no feature value:
Note that the label values for examples with missing features are clustered around 110. This is because the --missing-bias default is 100, which is an extra bias added to the label of all rows with a missing feature value, in addition to the default --bias of 10, so that the expectation value of the label for examples with a missing feature is 110. (The default --weight of 3 does not come into play, because there are no feature values to be correlated with for these examples.)
Usually, after the training examples we see the prediction examples. But in this file we see a row with no values at all:
This is actually a prediction row (since its label is missing), but for the case when the feature value is missing. After this row are the standard 250 prediction rows that the program has given us each time.
If you graph the 45 examples in missing.csv that do not have a missing feature value, you will see that they follow the same general pattern as linear-3.csv and linear-4.csv:
Running memory_bear on this data, and saving the Bear model file,
$ memory_bear missing.csv 1m -dl1 -omissing-predictions.csv -smissing
we see from missing-predictions.csv,
that the prediction for a missing feature value is almost 110, and if we graph the predictions for the examples without missing feature values,
that Bear has modeled these similarly to the datasets above without missing feature values.
You might have noticed Bear quoting its “overall model strength” as over 761! Again, we haven’t yet discussed what these “strengths” and “weights” actually are, quantitatively, but 761 seems significantly better than the single-digit strengths previously noted. We can get some insight into what is going on here if we inspect the model file, in verbose mode:
$ bear_model_details missing -v
There is a fair bit of detail in the output, but if you read it from the bottom, you will see three models listed in a parent–child chain:
When Bear computes a “weight” or “strength,” it is always normalized by reference to that of the empty model. Here the empty model is quite bad (but the best that can be done without any features): all of the actual label values are far above or below its constant prediction of around 64.5. The completeness model actually provides most of the improvement in this particular case, and the regular model against the feature provides some further improvement, ultimately giving the complete model a “strength” of over 761, compared to the empty model.
This example shows that Bear can handle missing feature values without needing to discard either features or examples.
And that’s about all that we’re going to look at for data that can be well-modeled using linear regression!
Note that my libraries automatically handle text files that are compressed with gzip. All that you need to do is specify a filename that ends in .gz, and it will all happen automagically. The command gzcat is a useful analog of cat for such files. Note that Bear always saves its model file in compressed format.
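The convention is simply keyed off the filename; a sketch of the idea (not my library's actual code):

```python
import gzip

# Open a text file for reading or writing, transparently (de)compressing
# it when the filename ends in .gz. A sketch of the convention described
# above, not the actual library code.
def open_text(filename, mode="rt"):
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    return open(filename, mode)
```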
If you have followed along with (and hopefully enjoyed) all of the above, then feel free to move on to the intermediate tutorial.
© 2022–2023 Dr. John P. Costella