Intermediate Bear tutorial

If you have not worked through the simple tutorial, where I introduce you to Bear step by step with very simple examples, then I strongly recommend that you at least read through it first.

All the commands I use below are listed here.

Nonlinear dependence

The datasets I showed you in the simple tutorial were all either trivial, or else based on a linear dependence of the label on the feature. Let’s immediately move to a dataset that is manifestly nonlinear, by modifying our command for generating linear-1k.csv by removing the linear weight term with --weight=0 and adding a quadratic term with --quadratic=-2:

$ simple_bear_tutorial_data quadratic.csv -r19680707 -t1000 --weight=0 --quadratic=-2

This yields quadratic.csv:

Scatterplot of quadratic.csv

If we just run memory_bear on this dataset in the same way that we did in the simple tutorial,

$ memory_bear quadratic.csv 10s -dl1 -oquadratic-predictions.csv

then we find that Bear's model has the same properties as it did when the underlying relationship was linear, but now it follows the parabolic shape:

Scatterplot of quadratic-predictions.csv

So Bear models nonlinear data without us having to tell it what to do. In statistical terms, Bear is “nonparametric”: it makes no assumptions about the underlying joint distribution between the features and the label.

Of course, a parametric method like linear or quadratic regression will always yield a more accurate model, if you know in advance that the underlying distribution is linear or quadratic. In those situations you would be silly to use Bear. But for many real-world machine learning problems we don’t a priori know how the label might depend on any of the features; this is what the “machine” is “learning.” In such cases, Bear may be a good tool for you to use.
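To see what "knowing the form in advance" buys you, here is a sketch of a parametric fit on data like quadratic.csv. The sampling range, noise level, and closed-form estimator are illustrative assumptions for this sketch, not anything Bear itself does:

```python
import random

rng = random.Random(19680707)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [-2.0 * x * x + rng.gauss(0.0, 0.1) for x in xs]

# If we know in advance that y = a * x^2, least squares gives the
# closed-form estimate a = sum(x^2 * y) / sum(x^4), which recovers
# the true coefficient -2 to within the noise.
a = sum(x * x * y for x, y in zip(xs, ys)) / sum(x ** 4 for x in xs)
print(round(a, 1))
```

A nonparametric method like Bear has to discover this shape from the data alone, which is exactly why it remains useful when you don't know the form in advance.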

Nominal fields

Bear 0.6 handles nominal fields automatically. The tutorials for this will be posted here shortly.

Trivariate nonlinear data

We have seen above that Bear can directly model continuous but nonlinear bivariate data, which we can easily visualize on a scatterplot, and binary classification data with either a categorical feature or multiple one-hot features, which we can easily view in tabular form. What about continuous nonlinear data with more than one feature?

As a simple but nontrivial example, let’s assume that we have trivariate data with one label z that depends on two features x and y. For simplicity, imagine that the true z is equal to 2.5 within an annulus (ring) of outer diameter 5 and inner diameter 3 in x–y space, and zero otherwise:

The true value of z is 2.5 in the blue annulus and zero in the white areas

We can make our visualization of z(x, y) concrete by cutting off a 2.5 mm length of brass tube that has a 5 mm outer diameter and 3 mm inner diameter:

The brass tube that helps us visualize z(x, y)

We place this piece of brass tube on the ground somewhere:

The piece of brass tube, sitting on the ground

The value of z at any (x, y) position is either the height of the top of the piece of tube, 2.5 mm, or the ground level, z = 0 mm, at those (x, y) positions with no brass above them.
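In code, the noiseless height function is just a check on the radius. Here is a minimal sketch; the radii follow from the 5 mm outer and 3 mm inner diameters:

```python
import math

def true_z(x, y):
    """Height of the tube at (x, y): 2.5 mm over the annulus, 0 elsewhere.

    The annulus is centered at the origin with outer radius 2.5 mm and
    inner radius 1.5 mm (outer diameter 5 mm, inner diameter 3 mm).
    """
    r = math.hypot(x, y)
    return 2.5 if 1.5 <= r <= 2.5 else 0.0

print(true_z(2.0, 0.0))  # on the ring: 2.5
print(true_z(0.0, 0.0))  # inside the hole: 0.0
```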

The piece of brass tube actually looks similar to the Hirshhorn Museum:

The Hirshhorn Museum

except that it doesn’t have the Hirshhorn’s “stilts” at its base. You can use either physical object to help you visualize z(x, y).

To make our mathematical representation of this piece of brass tube visualizable without special software, we will use x and y coordinates restricted to a square grid. Bear doesn’t require this; it just means that we can use a program like Excel (or any other program you choose to use) to create a surface map for us. On the other hand, we can still add some Gaussian noise to the height z to represent real-world measurement noise.

If you have built Bear, then you have a program that will generate this data file for you:

$ intermediate_bear_tutorial_data tube.csv 255 tube-x.csv 50 -r 19680707

The first argument, tube.csv, is the name of the file that will be created with this trivariate data in it. The second argument, 255, specifies how many samples will be on each side of the regular x–y grid; I have chosen 255 because that’s the maximum side length that Excel can handle for a surface chart. Thus the file tube.csv contains 255 × 255 = 65,025 rows of trivariate training data representing the height of the piece of tube or the ground, as the case may be, plus some Gaussian noise. It then contains another 65,025 rows with the same (x, y) values as testing data: for simplicity, we are only asking Bear for its predictions on the same x–y grid (although of course in practice we could ask for predictions at any values of x and y).

In addition to this, the program creates a dataset representing the x–z projection of the full (x, y, z) dataset, that I will use below. (By rotational symmetry, the y–z projection is fundamentally the same.) The third argument, tube-x.csv, specifies the filename of this projection, and the fourth argument, 50, the number of samples per side for it; it will be useful for this to be both smaller than and greater than what we use for the full trivariate dataset. Because we don’t have the restrictions of Excel surface charts for this bivariate x–z projection data, the x values are randomly dithered, which will also help with the visualizations below.
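To make the row counts concrete, here is a sketch of building the 255 × 255 training grid. The coordinate range, noise level, and row layout here are assumptions for illustration; the real generator’s may differ:

```python
import math
import random

def make_tube_grid(side=255, noise=0.1, seed=19680707):
    """Build side*side (x, y, z) training rows on a regular grid.

    z is the tube height (2.5 mm over the annulus of outer radius 2.5 mm
    and inner radius 1.5 mm, 0 mm elsewhere) plus Gaussian noise.
    The coordinate range [-3, 3] is an assumption for this sketch.
    """
    rng = random.Random(seed)
    coords = [-3.0 + 6.0 * i / (side - 1) for i in range(side)]
    rows = []
    for y in coords:
        for x in coords:
            z = 2.5 if 1.5 <= math.hypot(x, y) <= 2.5 else 0.0
            rows.append((x, y, z + rng.gauss(0.0, noise)))
    return rows

rows = make_tube_grid()
print(len(rows))  # 255 * 255 = 65025
```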

Let us start with the bivariate x–z projection dataset tube-x.csv. Opening it in Excel,

The data in tube-x.csv

we can see immediately why this dataset would be a challenge for machine learning. Its symmetry in x means that linear regression would have been useless from the outset. But it’s clearly also bifurcated according to some “hidden” variable—which we know is just the depth, y. If we run Bear on this dataset in debug mode,

$ memory_bear tube-x.csv -dl1 10s -otube-x-predictions.csv

and look at its predictions, in tube-x-predictions.csv,

Bear’s predictions for tube-x.csv, with the training data

we see that it has figured out that the “edges” of this “brass tire” (or “wagon wheel”) are denser, when viewed side-on, than the middle (and that this density increases as you move from the “outside wall” of the tire to the “inside wall”). This model represents, at higher resolution, the mean value of z as a function of x.

By circular symmetry, we know that the y–z elementary model is going to look the same.

Let’s now turn to the full trivariate dataset, tube.csv. I supply another program that reformats this data into a form that we can directly graph using Excel:

$ intermediate_bear_tutorial_reformat tube.csv 255 tube-excel.csv

This converts the 255 × 255 = 65,025 training rows into a file tube-excel.csv with 255 rows and 255 columns, where each cell contains a z value. This file can be visualized as a surface chart in Excel (or you can use whichever graphing program you like):

Excel’s surface chart of tube-excel.csv

Rotating the view, you can look down into the tube:

A rotated view of tube-excel.csv
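If you want to do this reshaping yourself rather than with intermediate_bear_tutorial_reformat, it is just a long-to-wide pivot. Here is a sketch that assumes each row’s first three columns are x, y, and z, with no header; the real file layout may differ:

```python
import csv

def long_to_grid(in_path, side, out_path):
    """Pivot the first side*side (x, y, z) rows into a side-by-side grid.

    Each distinct x value becomes a column and each distinct y value a
    row, so every grid cell holds the z value for one grid point.
    """
    with open(in_path, newline="") as f:
        rows = [(float(x), float(y), float(z))
                for x, y, z in list(csv.reader(f))[:side * side]]
    col = {x: j for j, x in enumerate(sorted({x for x, _, _ in rows}))}
    row = {y: i for i, y in enumerate(sorted({y for _, y, _ in rows}))}
    grid = [[0.0] * side for _ in range(side)]
    for x, y, z in rows:
        grid[row[y]][col[x]] = z
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(grid)
```

Charting the resulting square of cells as a surface then gives the same kind of picture as above.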

Let’s run Bear on this dataset:

$ memory_bear tube.csv -l2 2m -otube-predictions.csv -stube

We can reformat the predictions file tube-predictions.csv using the same program as above:

$ intermediate_bear_tutorial_reformat tube-predictions.csv 255 tube-predictions-excel.csv

Opening tube-predictions-excel.csv in Excel (or whatever you are using) and creating a surface chart of it, you should see something like

Excel’s surface chart of tube-predictions-excel.csv

Again rotating, we can look down into the tube:

A rotated view of tube-predictions-excel.csv

That’s arguably a decent model of the real tube, one that hasn’t overfit to the noise that we added to the training dataset.

A noiseless tube

In the above we added noise to the representation of the tube, to reflect the real world. Let’s instead create a dataset with no noise:

$ intermediate_bear_tutorial_data tube-nn.csv 255 /dev/null 50 -r 19680707 -e0
$ intermediate_bear_tutorial_reformat tube-nn.csv 255 tube-nn-excel.csv

As expected, the tube is now noiseless:

Excel’s surface chart of tube-nn-excel.csv

A rotated view of tube-nn-excel.csv

If we run Bear on this noiseless data,

$ memory_bear tube-nn.csv -l2 2m -otube-nn-predictions.csv -stube-nn
$ intermediate_bear_tutorial_reformat tube-nn-predictions.csv 255 tube-nn-predictions-excel.csv

and look at the results,

Excel’s surface chart of tube-nn-predictions-excel.csv

A rotated view of tube-nn-predictions-excel.csv

we see that Bear’s modeling is qualitatively the same as it was for the noisy data, but sharper. This again highlights that Bear does not overfit to noise in the data.

The spiral classification problem

It was asked on Bear’s Facebook Page whether Bear could successfully solve the spiral classification problem, without being explicitly programmed for spiral shapes. I answered that it indeed could, provided that it was given enough data to infer statistically significant shapes.

To show this, you can use the program spiral_classification_data to create such a dataset with a default 100,000 examples:

$ spiral_classification_data spiral.csv

The resulting dataset has the same shape as the input data for Google’s neural net tutorial, which was the example given:

The data in spiral.csv

It also includes prediction rows, just as the tube datasets above did.
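spiral_classification_data does all of this for you, but the classic construction of such a dataset is two interleaved spirals with positional noise. Here is a sketch; the number of turns, noise level, and radius scaling are illustrative assumptions, not necessarily what the program uses:

```python
import math
import random

def make_spirals(n=100000, turns=3.0, noise=0.1, seed=19680707):
    """Generate n labelled points along two interleaved spirals.

    Class 0 and class 1 lie on the same Archimedean spiral (radius
    proportional to angle), rotated 180 degrees from each other, with
    Gaussian positional noise added to each point.
    """
    rng = random.Random(seed)
    points = []
    for i in range(n):
        label = i % 2
        t = rng.uniform(0.0, 1.0)
        angle = turns * 2.0 * math.pi * t + math.pi * label
        x = t * math.cos(angle) + rng.gauss(0.0, noise)
        y = t * math.sin(angle) + rng.gauss(0.0, noise)
        points.append((x, y, label))
    return points

pts = make_spirals(n=1000)
```

With enough such rows, the two arms become statistically distinguishable from noise, which is exactly the condition under which Bear can infer the shape.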

You can run Bear on this dataset,

$ memory_bear spiral.csv 5m -sspiral -ospiral-predictions.csv -l2

and transform the predictions into an Excel-friendly format using the same program as above:

$ intermediate_bear_tutorial_reformat spiral-predictions.csv 255 spiral-excel.csv

yielding

Bear’s predictions for spiral.csv

Note that Bear had essentially no data in the corners, so its predictions there just default to the overall mean probability across the entire square, namely, one-half.

Advanced tutorial

Now that you have mastered the simple and intermediate tutorials, you may as well take a glance at the advanced tutorial, right? (It will be beefed up soon.)