Thursday, October 25, 2012

One-hot/One-of-k Data Encoder for Machine Learning

I've been submitting bug reports and feature requests to the scikit-learn project, a great machine learning library for Python. Andreas Mueller just wrote a one-hot, or one-of-k, encoder for transforming labelled features into binary features. I decided to write a version in Factor and blog about it!

Theory

Let's assume we have a classification problem with a set of features, where each feature is a set of labels (or classes), and we have data observed for those classes. Maybe it's a Saturday, you don't have to go to work, and you live in an apartment in New York. You're trying to decide whether you want to go outside based on what you can see from your window. If the conditions are right, you will dress accordingly and go outside. Looking outside, you can see several things (our feature set):

It looks like it's { "hot" "cold" } and { "cloudy" "raining" "snowing" "sunny" }. People are wearing { "light" "heavy" } { "bright-colored" "dark-colored" "neutral" } clothes and walking { "quickly" "slowly" }. You feel { "well" "sick" "tired" }. If you pick one label from each set above, then you will have one set of datapoints and you can decide whether or not to go outside.

A day we might go outside:
{ "hot" "sunny" "light" "neutral" "slowly" "well" }

A day we would stay inside:
{ "cold" "snowing" "heavy" "dark-colored" "quickly" "sick" }

You could assign integral numbers (ordinals) to each of these features based on their index in their respective arrays:
{ 0 3 0 2 1 0 } ! e.g. the "3" here is the base-0 index into the weather feature above
{ 1 2 1 1 0 1 }
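
As a quick sketch of how you might compute those indices in Factor (just for illustration; index and 2map come from the sequences vocabulary), you can look up each observed label in its class array:

{ "hot" "sunny" "light" "neutral" "slowly" "well" }
{
    { "hot" "cold" }
    { "cloudy" "raining" "snowing" "sunny" }
    { "light" "heavy" }
    { "bright-colored" "dark-colored" "neutral" }
    { "quickly" "slowly" }
    { "well" "sick" "tired" }
} [ index ] 2map .
{ 0 3 0 2 1 0 }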

However, some machine learning algorithms would treat these integers as ordered or continuous data, which they clearly are not. (Note: Factor does not currently have any machine learning algorithms. Doh!)

The lengths of our class arrays give the shape of our output array. For the feature sets above, that shape is:
{ 2 4 2 3 2 3 }

What we want, then, is an output array whose values are 0 or 1 and whose shape is as above. The 1 goes in the spot given by the index.

{ 0 3 0 2 1 0 } -> { { 1 0 } { 0 0 0 1 } { 1 0 } { 0 0 1 } { 0 1 } { 1 0 0 } }
{ 1 2 1 1 0 1 } -> { { 0 1 } { 0 0 1 0 } { 0 1 } { 0 1 0 } { 1 0 } { 0 1 0 } }
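
As a throwaway sketch (not the word we'll end up with), you could build this nested form by pairing each index with its class count, making a zero array, and setting the hot position:

{ 0 3 0 2 1 0 } { 2 4 2 3 2 3 }
[ 0 <array> [ 1 -rot set-nth ] keep ] 2map .
{ { 1 0 } { 0 0 0 1 } { 1 0 } { 0 0 1 } { 0 1 } { 1 0 0 } }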

Now, you can see why it's "one hot"--only one of the possible values is set for each feature! (I don't know where the name came from, anyone?) The other name is "one-of-k", which also makes sense.

Finally, we flatten these arrays to turn the data into something the machine learning algorithm can handle.

Flattened versions:

{ 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 }
{ 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 }
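
In Factor, concat does exactly this flattening. For instance, taking the first nested array from above:

{ { 1 0 } { 0 0 0 1 } { 1 0 } { 0 0 1 } { 0 1 } { 1 0 0 } } concat .
{ 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 }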

Algorithm

Now to express this in code. The inputs will be the indices that express which class we observed for each feature, and an array of arrays containing all of our classes.

Indices, ready for input to one-hot:
{ 0 3 0 2 1 0 }

Features as an array-of-arrays for inputs:

{
    { "hot" "cold" }
    { "cloudy" "raining" "snowing" "sunny" }
    { "light" "heavy" }
    { "bright-colored" "dark-colored" "neutral" }
    { "quickly" "slowly" }
    { "well" "sick" "tired" }
}


The strategy will be to sum up the lengths of all of the class arrays, make an output array of zeros of that length, and then set a 1 at each computed offset.

: one-hot ( indices features -- array )
    [ 1 ] 2dip
    [ length ] map
    [ cum-sum0 v+ ]
    [ nip sum 0 <array> ] 2bi [ set-nths ] keep ;

The first line tucks a 1 underneath the two inputs on the stack. That's the value that the set-nths word will use later. Next, we want the shape array from above, so we map over the features array, taking the length of each class array.

Now, the stack looks like:
1 indices lengths
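
With the first example's data, that is concretely:

1 { 0 3 0 2 1 0 } { 2 4 2 3 2 3 }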

We are going to do a 2bi, which will give the indices and lengths to the next two code blocks. We want to take a cumulative sum of the lengths and then vector-add the indices to it, but the cumulative sum should start at zero instead of the first feature length. For this, we can use the accumulate word, but I made a special version of cum-sum that does this, called cum-sum0.
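
Here's one way cum-sum0 could be written, in terms of cum-sum, prefix, and but-last (just a sketch; the actual word may be implemented differently):

: cum-sum0 ( seq -- seq' )
    cum-sum 0 prefix but-last ;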

The result of cum-sum0 is:
{ 0 2 6 8 11 13 }

Doing a vector addition with the indices gives the offsets into the output array:
{ 0 3 0 2 1 0 }    ! indices
{ 0 2 6 8 11 13 }  ! cum-sum0
v+ .
{ 0 5 6 10 12 13 } ! offsets

In the second code block, [ nip sum 0 <array> ], we don't need the indices, so we nip them and make an array of zeros whose length is the sum of all the class lengths.

Now the stack looks like:
1 offsets zeros-array

Finally, we call set-nths and keep the array, since otherwise we'd lose it--all that work for nothing!
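
Pulled out on its own, that last step looks like this (using the offsets computed above and a fresh 16-element array of zeros):

1 { 0 5 6 10 12 13 } 16 0 <array> [ set-nths ] keep .
{ 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 }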

Here are the two examples from above in action:

{ 0 3 0 2 1 0 }
{
    { "hot" "cold" }
    { "cloudy" "raining" "snowing" "sunny" }
    { "light" "heavy" }
    { "bright-colored" "dark-colored" "neutral" }
    { "quickly" "slowly" }
    { "well" "sick" "tired" }
} one-hot .
{ 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 }


{ 1 2 1 1 0 1 }
{
    { "hot" "cold" }
    { "cloudy" "raining" "snowing" "sunny" }
    { "light" "heavy" }
    { "bright-colored" "dark-colored" "neutral" }
    { "quickly" "slowly" }
    { "well" "sick" "tired" }
} one-hot .
{ 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 }

Conclusion

We didn't actually do any machine learning because the example is somewhat contrived, I didn't generate a complete dataset, and the focus was on preparing the data anyway.

One difference between this and the scikit-learn code is that scikit-learn is object-oriented and makes transformer objects with a specified interface. If I were using this in a larger machine learning library, I'd make an object that wraps the algorithm and precomputes the lengths of the features. Then the algorithm could basically start from the v+ onwards, and also avoid the call to sum; repeated calls would be marginally faster.
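
For example, a tuple-based version might look something like this (the names one-hot-encoder, <one-hot-encoder>, and encode are made up for this sketch):

TUPLE: one-hot-encoder offsets size ;

: <one-hot-encoder> ( features -- encoder )
    [ length ] map [ cum-sum0 ] [ sum ] bi one-hot-encoder boa ;

: encode ( indices encoder -- array )
    [ 1 ] 2dip
    [ offsets>> v+ ] [ nip size>> 0 <array> ] 2bi
    [ set-nths ] keep ;

The constructor does the [ length ] map, cum-sum0, and sum work once, so encode only has to do the v+ and set-nths steps on each call.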

The code is here.