# The Awesome Factor

## Thursday, October 25, 2012

### One-hot/One-of-k Data Encoder for Machine Learning

I've been submitting bug reports and feature requests to the scikit-learn project, a great machine learning library for Python. Andreas Mueller just wrote a one-hot, or one-of-k, encoder for transforming labelled features into binary features. I decided to write a version in Factor and blog about it!

### Theory

Let's assume we have a classification problem with a set of features, where each feature is a set of labels (or classes), along with data about these classes. Maybe it's a Saturday, you don't have to go to work, and you live in an apartment in New York. You're trying to decide if you want to go outside based on looking outside your window. If the conditions are right, you will dress accordingly and go outside. Looking outside, you can see several things (our feature set):

It looks like it's { "hot" "cold" } and { "cloudy" "raining" "snowing" "sunny" }. People are wearing { "light" "heavy" } { "bright-colored" "dark-colored" "neutral" } clothes and walking { "quickly" "slowly" }. You feel { "well" "sick" "tired" }. If you pick one label from each set above, then you will have one set of datapoints and you can decide whether or not to go outside.

A day we might go outside:
{ "hot" "sunny" "light" "neutral" "slowly" "well" }

A day we would stay inside:
{ "cold" "snowing" "heavy" "dark-colored" "quickly" "sick" }

You could assign integral numbers (ordinals) to each of these features based on their index in their respective arrays:
{ 0 3 0 2 1 0 } --> e.g. "3" here is the base-0 index into the weather feature above
{ 1 2 1 1 0 1 }

However, some machine learning algorithms would treat these integers as ordered or continuous data, which clearly, they are not. (Note: Factor does not currently have any machine learning algorithms. Doh!)

The shape of our classes is going to be the shape of our output array. The above shape is:
{ 2 4 2 3 2 3 }

What we want, then, is an output array where the values are 0 or 1 and where the shape is as above. The 1 value goes in the spot that our index states.

{ 0 3 0 2 1 0 } -> { { 1 0 } { 0 0 0 1 } { 1 0 } { 0 0 1 } { 0 1 } { 1 0 0 } }
{ 1 2 1 1 0 1 } -> { { 0 1 } { 0 0 1 0 } { 0 1 } { 0 1 0 } { 1 0 } { 0 1 0 } }

Now, you can see why it's "one hot"--only one of the possible values is set for each feature! (I don't know where the name came from, anyone?) The other name is "one-of-k", which also makes sense.

Finally, we will eventually flatten these arrays to change the data into something the machine learning algorithm can handle.

Flattened versions:

{ 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 }
{ 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 }
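Since this post started with scikit-learn, here is the same transformation sketched in plain Python (the helper names `one_hot_nested` and `flatten` are mine, not from either library):

```python
def one_hot_nested(indices, shape):
    """Encode each (index, feature-size) pair as a one-hot sub-array."""
    return [[1 if i == idx else 0 for i in range(size)]
            for idx, size in zip(indices, shape)]

def flatten(nested):
    """Concatenate the sub-arrays into a single binary vector."""
    return [bit for sub in nested for bit in sub]
```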

### Algorithm

Now to express this in code. The inputs will be an array of arrays containing our classes, and the indices into these arrays that express which class we have observed.

Indices, ready for input to one-hot:
{ 0 3 0 2 1 0 }

Features as an array-of-arrays for inputs:

{
{ "hot" "cold" }
{ "cloudy" "raining" "snowing" "sunny" }
{ "light" "heavy" }
{ "bright-colored" "dark-colored" "neutral" }
{ "quickly" "slowly" }
{ "well" "sick" "tired" }
}

The strategy will be to sum up the length of all of the arrays of classes, make an output array of zeros of that length, and then to set the 1 in each array.

```
: one-hot ( indices features -- array )
    [ 1 ] 2dip
    [ length ] map
    [ cum-sum0 v+ ]
    [ nip sum 0 <array> ] 2bi
    [ set-nths ] keep ;
```

The first line dips the 1 to the lowest position on the current stack. That's the value that the set-nths word will use later. Next, we want to get the shape array from above, so we map over the features array, taking the length.

Now, the stack looks like:
1 indices lengths

We are going to do a 2bi, which will give the indices and lengths to the next two code blocks. We want to take a cumulative sum of the lengths and then vector-add the indices to it, but the cumulative sum should start at zero instead of the first feature length. For this, we can use the accumulate word, but I made a special version of cum-sum that does this, called cum-sum0.

The result of cum-sum0 is:
{ 0 2 6 8 11 13 }

Doing a vector addition with the indices gives the offsets into the output array:
```
{ 0 3 0 2 1 0 }    ! indices
{ 0 2 6 8 11 13 }  ! cum-sum0
v+ .
{ 0 5 6 10 12 13 } ! offsets
```
In the second code block, [ nip sum 0 <array> ], we don't need the indices, so we nip them and make our array that's the sum of all the lengths.

Now the stack looks like:
1 indices zeros-array

Finally, we call set-nths and keep the array, since otherwise we'd lose it--all that work for nothing!
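For comparison, the same offset-based strategy reads naturally in Python (a sketch; `cum_sum0` mirrors the Factor word of the same name):

```python
def cum_sum0(lengths):
    """Exclusive prefix sum: the cumulative sum shifted right, starting at 0."""
    total, out = 0, []
    for n in lengths:
        out.append(total)
        total += n
    return out

def one_hot(indices, features):
    """Each offset into the flat output is index + start-of-feature."""
    lengths = [len(f) for f in features]
    offsets = [i + o for i, o in zip(indices, cum_sum0(lengths))]
    out = [0] * sum(lengths)
    for o in offsets:
        out[o] = 1              # the "hot" position for each feature
    return out
```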

Here's the two examples from above in action:

{ 0 3 0 2 1 0 }
{
{ "hot" "cold" }
{ "cloudy" "raining" "snowing" "sunny" }
{ "light" "heavy" }
{ "bright-colored" "dark-colored" "neutral" }
{ "quickly" "slowly" }
{ "well" "sick" "tired" }
} one-hot .
{ 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 }

{ 1 2 1 1 0 1 }
{
{ "hot" "cold" }
{ "cloudy" "raining" "snowing" "sunny" }
{ "light" "heavy" }
{ "bright-colored" "dark-colored" "neutral" }
{ "quickly" "slowly" }
{ "well" "sick" "tired" }
} one-hot .
{ 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 }

### Conclusion

We didn't actually do any machine learning because the example is somewhat contrived, I didn't generate a complete dataset, and the focus was on preparing the data anyway.

One difference between this and the scikit code is that scikit is object-oriented and makes transformer objects that have a specified interface. If I were using this in a larger machine learning library, I'd make an object that wraps the algorithm and precomputes the lengths of the features. Then the algorithm could basically start from v+ onwards, and also avoid the call to sum; repeated calls would be marginally faster.

The code is here.

## Motivation for writing a coverage tool

In Factor, the old code-is-data adage is true and can be used to our advantage when writing runtime tools. Code is stored in what we call a quotation, which is just a sequence of functions (words) and data literals and can be called at runtime. A function (or word) in Factor is simply a named quotation, and typing that word name causes the quotation to run.

Since quotations in Factor do not terminate early unless an exception is thrown, we know that if each quotation gets run, then we have run all of the code. While this won't tell us if there are logic errors, and while we can still have code fail due to unexpected inputs, at least we can be sure that all of the code is getting exercise and there are no paths that never get run. We can use the results of this tool to write better unit tests.

## Demo of the coverage tool

As a simple demo, there's a fun algorithm that produces a sequence of numbers from a starting number, where the sequence always seems to end at one. For any integer greater than zero, the Collatz conjecture states that this sequence will eventually reach the number one. The algorithm goes as follows: take any counting number (1, 2, ...) and either multiply by three and add 1, or divide by two, if the number is odd or even, respectively, and record the sequence of numbers until you reach the number one. This conjecture has been verified experimentally for every number up to 10^18, but no proof exists that it's true for all possible inputs.

For example, the Collatz sequence for 3 is: { 3 10 5 16 8 4 2 1 }
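Before looking at the Factor version, here's the same algorithm as a short Python sketch with the same input validation:

```python
def next_collatz(n):
    """One Collatz step: halve if even, 3n+1 if odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def collatz(n):
    """Return the Collatz sequence from n down to 1, validating the input."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("invalid collatz input: %r" % (n,))
    seq = [n]
    while n > 1:
        n = next_collatz(n)
        seq.append(n)
    return seq
```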

We can come up with a solution pretty easily (partially taken from the Project Euler #14, in extra/).

Run the command: `"collatz" scaffold-work`

Click on the link, edit the file ~/factor/extra/collatz/collatz.factor
```
USING: combinators.short-circuit kernel make math ;
IN: collatz

: next-collatz ( n -- n )
    dup even? [ 2 / ] [ 3 * 1 + ] if ;

: collatz-unsafe ( n -- seq )
    [ [ dup 1 > ] [ dup , next-collatz ] while , ] { } make ;

ERROR: invalid-collatz-input n ;

: collatz ( n -- seq )
    dup { [ integer? ] [ 1 >= ] } 1&&
    [ collatz-unsafe ]
    [ invalid-collatz-input ] if ;
```
We're going to be extra careful here to demonstrate the code coverage tool, so I added error checking to make sure it's an integer greater than or equal to one.

Let's write some unit tests to make sure it works.

Run the command: `"collatz" scaffold-tests`
Now click on the link to edit the file it created, ~/factor/extra/collatz/collatz-tests.factor
```
USING: tools.test collatz ;
IN: collatz.tests

[ { 1 } ] [ 1 collatz ] unit-test
[ { 2 1 } ] [ 2 collatz ] unit-test
[ { 3 10 5 16 8 4 2 1 } ] [ 3 collatz ] unit-test
```
If we run `"collatz" test`, we see that all tests pass.

Now, let's see how well we did with code coverage.

We run the command:
```
IN: scratchpad USE: tools.coverage "collatz" test-coverage .
Loading resource:work/collatz/collatz-tests.factor
Unit Test: { [ { 1 } ] [ 1 collatz ] }
Unit Test: { [ { 2 1 } ] [ 2 collatz ] }
Unit Test: { [ { 3 10 5 16 8 4 2 1 } ] [ 3 collatz ] }
{
    { next-collatz { } }
    { collatz { [ invalid-collatz-input ] } }
    {
        invalid-collatz-input
        { [ \ invalid-collatz-input boa throw ] }
    }
    { collatz-unsafe { } }
}
```
What this tells us is that the `collatz` and `invalid-collatz-input` words contain quotations that never got called. Of course -- we never passed anything other than valid inputs. How about passing a string or a negative integer?

After adding two unit tests that must fail with `invalid-collatz-input`, the result looks better:
```
IN: scratchpad "collatz" test-coverage .
Loading resource:work/collatz/collatz-tests.factor
Unit Test: { [ { 1 } ] [ 1 collatz ] }
Unit Test: { [ { 2 1 } ] [ 2 collatz ] }
Unit Test: { [ { 3 10 5 16 8 4 2 1 } ] [ 3 collatz ] }
Must Fail With: { [ "hello world" collatz ] [ invalid-collatz-input? ] }
Must Fail With: { [ -50 collatz ] [ invalid-collatz-input? ] }
{
    { next-collatz { } }
    { collatz { } }
    { invalid-collatz-input { } }
    { collatz-unsafe { } }
}
```
We can even get a number for how well tests cover a vocabulary:
```
"collatz" %coverage .
1.0
```

## The implementation of the coverage tool

Every word in Factor stores its definition in the `def` slot. If we examine this slot, we see that it's a quotation that may contain other quotations. Using the annotations vocabulary, we can add code that executes before and after the code in a quotation. What the coverage tool does is add a container storing a boolean flag at the beginning of each quotation; when the quotation gets run, the flag is set to true. The tool can be turned on and off, independent of the containers being in place, with the `coverage-on` and `coverage-off` words.
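The general technique -- planting a mutable covered-flag at the front of each piece of code -- can be sketched in Python with a decorator (names invented for illustration; the real tool annotates quotations, not functions):

```python
coverage_flags = {}

def instrument(fn):
    """Wrap fn so that running it marks its flag as covered."""
    coverage_flags[fn.__name__] = False
    def wrapper(*args, **kwargs):
        coverage_flags[fn.__name__] = True   # flipped on first execution
        return fn(*args, **kwargs)
    return wrapper

@instrument
def greet():
    return "hi"
```

Collecting every name whose flag is still false gives the report of code that never ran.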

Here's a word that turns a Roman numeral into an integer:
```
\ roman> def>> .
[ >lower [ roman>= ] monotonic-split [ (roman>) ] map-sum ]
```
After annotating it, the new definition looks like:
```
\ roman> add-coverage
\ roman> def>> .
[
    T{ coverage } flag-covered >lower
    [ T{ coverage } flag-covered roman>= ] monotonic-split
    [ T{ coverage } flag-covered (roman>) ] map-sum
]
```
The flags are all set to false right now. After turning on coverage (which enables the instrumented code) and running the word, we see a change:
```
coverage-on
"iii" roman> drop
\ roman> def>> .
[
    T{ coverage { executed? t } } flag-covered >lower
    [ T{ coverage { executed? t } } flag-covered roman>= ]
    monotonic-split
    [ T{ coverage { executed? t } } flag-covered (roman>) ]
    map-sum
]
```
Notice that all of the coverage containers now have their `executed?` flags set. To generate the report, we simply iterate over each word and collect all of the quotations where this flag has not been set -- these quotations never ran.

In writing this article, I realized that each quotation should have a flag on exit as well, in case an exception gets thrown in the middle of executing this quotation and control never reaches the end. Partially-executed quotations will soon be reported by the tool, after I make this fix.

I hope you can use this tool to improve your Factor code. Have fun!

## Saturday, September 04, 2010

### Filling a file with zeros

In this blog post I'll demonstrate a few ways to fill a file with zeros in Factor. The goal is to write some number of bytes to a file in the least amount of time, using only a small amount of RAM; writing a large file should not fail.

## Filling a file with zeros by seeking

The best way of writing a file full of zeros is to seek to one byte from the end of the file, write a zero, and close the file. Here's the code:
```
: (zero-file) ( n path -- )
    binary
    [ 1 - seek-absolute seek-output 0 write1 ] with-file-writer ;

ERROR: invalid-file-size n ;

: zero-file ( n path -- )
    {
        { [ over 0 < ] [ invalid-file-size ] }
        { [ over 0 = ] [ nip touch-file ] }
        [ (zero-file) ]
    } cond ;
```
The first thing you'll notice about `zero-file` is that we special-case negative and zero file sizes. Special-casing a zero file length is necessary to avoid seeking to -1, which does everything correctly but throws an error in the process instead of returning normally. Special-casing negative file sizes is important because a negative size is always an error, and even though the operation fails overall, the file-system could otherwise become littered with zero-length files created before the exception is thrown.
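The seek-based trick is not Factor-specific; here's a Python sketch of the same idea (a hypothetical `zero_file` helper):

```python
def zero_file(n, path):
    """Create an n-byte file of zeros by seeking near the end and writing one byte."""
    if n < 0:
        raise ValueError("invalid file size: %d" % n)
    with open(path, "wb") as f:
        if n > 0:
            f.seek(n - 1)    # one byte before the end of the file-to-be
            f.write(b"\0")   # writing here extends the file to n bytes
```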

To call the new word:
```
123,456,789 "/Users/erg/zeros.bin" zero-file
"/Users/erg/zeros.bin" file-info size>> .
123456789
```

## Copying a zero-stream

With Factor's stream protocol, you can write new kinds of streams that, when read from or written to, do whatever you want. I wrote a read-only `zero-stream` below that returns zeros whenever you read from it. Wrapping a `limit-stream` around it, you can give the inexhaustible `zero-stream` an artificial length, so that copying it reaches an end and terminates.
```
TUPLE: zero-stream ;
C: <zero-stream> zero-stream

M: zero-stream stream-read drop <byte-array> ;
M: zero-stream stream-read1 drop 0 ;
M: zero-stream stream-read-partial stream-read ;
M: zero-stream dispose drop ;

:: zero-file2 ( n path -- )
    <zero-stream> n limit-stream
    path binary <file-writer> stream-copy ;
```
The drawback to this approach is that it creates 8kb byte-arrays in memory that it immediately writes to disk.
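In Python, the same shape might look like this (class names are my own; limit-stream and stream-copy are the real Factor counterparts):

```python
class ZeroStream:
    """An inexhaustible read-only stream of zero bytes."""
    def read(self, n):
        return bytes(n)

class LimitStream:
    """Wrap a stream so that reads stop after `limit` total bytes."""
    def __init__(self, stream, limit):
        self.stream, self.remaining = stream, limit
    def read(self, n):
        n = min(n, self.remaining)
        self.remaining -= n
        return self.stream.read(n)

def copy_stream(src, dst, chunk=8192):
    """Copy src to dst in fixed-size chunks until src is exhausted."""
    while True:
        data = src.read(chunk)
        if not data:
            break
        dst.write(data)
```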

## Setting the contents of a file directly

Using the `set-file-contents` word, you can just assign a file's contents to be a sequence. However, this sequence has to fit into memory, so this solution is not as good for our use case.
```
:: zero-file3 ( n path -- )
    n <byte-array> path binary set-file-contents ;
```

## Bonus: writing random data to a file

The canonical way of copying random data to a file in Unix systems is to use the dd tool to read from /dev/urandom and write to a file. But what about on Windows, where there is no /dev/urandom? We can come up with a cross-platform solution that uses method number two from above, but instead of a `zero-stream`, we have a `random-stream`. But then what about efficiency? Well, it turns out that Factor's Mersenne Twister implementation generates random numbers faster than /dev/urandom on my Macbook -- writing a 100MB file from /dev/urandom is about twice as slow as a Factor-only solution. So not only is the Factor solution cross-platform, it's also more efficient.
```
TUPLE: random-stream ;
C: <random-stream> random-stream

M: random-stream stream-read drop random-bytes ;
M: random-stream stream-read1 drop 256 random ;
M: random-stream stream-read-partial stream-read ;
M: random-stream dispose drop ;

:: stream-copy-n ( from to n -- )
    from n limit-stream to stream-copy ;

:: random-file ( n path -- )
    <random-stream>
    path binary <file-writer> n stream-copy-n ;

! Read from /dev/urandom
:: random-file-urandom ( n path -- )
    [
        <random-stream>
        path binary <file-writer> n stream-copy-n
    ] with-system-random ;
```
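A portable Python sketch of the same approach, reading random bytes in fixed-size chunks so memory use stays bounded (helper name invented):

```python
import os

def random_file(n, path, chunk=8192):
    """Write n random bytes to path without holding them all in memory."""
    with open(path, "wb") as f:
        remaining = n
        while remaining > 0:
            size = min(chunk, remaining)
            f.write(os.urandom(size))  # cross-platform random bytes
            remaining -= size
```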
Here are the results:
```
$ dd if=/dev/urandom of=here.bin bs=100000000 count=1
1+0 records in
1+0 records out
100000000 bytes transferred in 17.384370 secs (5752294 bytes/sec)

100,000,000 "there.bin" random-file
Running time: 5.623136439 seconds
```

## Conclusion

Since Factor has high-level libraries that wrap the low-level libc and system calls used for nonblocking i/o, we don't have to deal with platform-specific quirks like handling EINTR, error codes, or resource cleanup at the operating system level. When a call gets interrupted and errno is set to EINTR after the call returns, the i/o operation is simply retried behind the scenes, and only serious i/o errors get thrown. There are many options for correct resource cleanup should an error occur, but the error handling we need here is already incorporated into the `stream-copy` and `with-file-writer` words--resources are cleaned up regardless of what happens. We also demonstrated that a Factor word is preferable to a shell script or the dd command for making files full of random data because it's more portable and faster, and that custom streams are easy to define.

Finally, there's actually a faster way to create huge files full of zeros, and that's by using sparse files. Sparse files can start off using virtually no file-system blocks, yet appear to be as large as you wish, and only consume more blocks as parts of the file are written. However, support for this is file-system dependent and, overall, sparse files are of questionable use. On Unix file-systems that support sparse files, the first method above should automatically create them with no extra work. Note that on MacOSX, sparse files are supported but not enabled by default. On Windows, however, you have to make a call to `DeviceIoControl`. If someone wants to make a small contribution to the Factor project, they are welcome to implement creation of sparse files for Windows.

Edit: Thanks to one of the commenters, I rediscovered that there's a Unix syscall `truncate` that creates zero-length files in constant time on my Mac. This is indeed the best solution for making files full of zeros, and although unportable, a Factor library would have no problem using a hook on the OS variable to call truncate on Unix and another method on Windows.
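In Python the truncate approach is a one-liner around `os.ftruncate`, which extends a file with zeros in constant time on file-systems that support it:

```python
import os

def zero_file_truncate(n, path):
    """Create (or replace) path so it contains n zero bytes."""
    with open(path, "wb") as f:
        os.ftruncate(f.fileno(), n)   # extend the empty file to n bytes
```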

## Thursday, November 19, 2009

### Monotonic timers

Factor has had a calendar library for several years now. While it's great for converting timestamps to human-readable formats, calculating holidays, and finding the number of days between two dates, it's the wrong concept to use for timing code, alarms, and thread switching. In such cases where you don't need an actual date, you should use monotonic timers, which are counters that always increment from an unspecified time in the past and aren't affected by changes to the system time. Even if the user changes the clock, these monotonic timers don't go back in time -- they keep increasing. Let's look at the implementation.

## Implementation of monotonic timers

Although I originally implemented monotonic timers as a Factor library, I moved the code into the C++ VM as a primitive called `nano-count`. To distinguish the usage of this word from the word formerly known as `micros`, I renamed `micros` to `system-micros`. Having the word "system" in the name of one time-returning word, and having "count" in the other, hopefully leads to less confusion on the user's part.

### Windows

The code I came up with for Windows looks like this:
```
u64 nano_count()
{
    static double scale_factor;
    static u32 hi = 0;
    static u32 lo = 0;
    LARGE_INTEGER count;
    BOOL ret = QueryPerformanceCounter(&count);
    if(ret == 0)
        fatal_error("QueryPerformanceCounter", 0);

    if(scale_factor == 0.0)
    {
        LARGE_INTEGER frequency;
        BOOL ret = QueryPerformanceFrequency(&frequency);
        if(ret == 0)
            fatal_error("QueryPerformanceFrequency", 0);
        scale_factor = (1000000000.0 / frequency.QuadPart);
    }

#ifdef FACTOR_64
    hi = count.HighPart;
#else
    /* On VirtualBox, QueryPerformanceCounter does not increment
    the high part every time the low part overflows.  Workaround. */
    if(lo > count.LowPart)
        hi++;
#endif
    lo = count.LowPart;

    return (u64)((((u64)hi << 32) | (u64)lo) * scale_factor);
}
```
It could probably be optimized by only calling `QueryPerformanceFrequency` once, but I don't set the processor affinity yet, so I'm not convinced it will work in every case. As you can see, it's pretty simple: the performance counter is queried and returns a number of clock cycles since some arbitrary beginning epoch, and then that time is scaled by the clock frequency to get nanoseconds.
Edit: This code contains a workaround for a VirtualBox counter bug.

### Some Unix systems

```
u64 nano_count()
{
    struct timespec t;
    int ret;
    ret = clock_gettime(CLOCK_MONOTONIC,&t);
    if(ret != 0)
        fatal_error("clock_gettime failed", 0);
    return t.tv_sec * 1000000000 + t.tv_nsec;
}
```
Calling `clock_gettime` from the librt library or, on some platforms, as a system call, gives you the number of nanoseconds since an arbitrary start point in the past. The timespec struct has seconds and nanoseconds slots, while the timeval struct (used by `system-micros`) has seconds and microseconds.

### Mac OSX

```
u64 nano_count()
{
    u64 time = mach_absolute_time();
    static u64 scaling_factor = 0;
    if(!scaling_factor)
    {
        mach_timebase_info_data_t info;
        kern_return_t ret = mach_timebase_info(&info);
        if(ret != 0)
            fatal_error("mach_timebase_info failed",ret);
        scaling_factor = info.numer/info.denom;
    }
    return time * scaling_factor;
}
```
The MacOSX code is a bit different because Apple didn't implement `clock_gettime`. Instead, they have a couple of Mach functions that work just like the Windows code: one returns a count and the other returns clock frequency information.

The alarms vocabulary now uses monotonic timers instead of system time for scheduling alarms. Previously, the API for scheduling an alarm was the following, where passing `f` as the last input parameter would schedule a one-time alarm.
`add-alarm ( quot start-timestamp interval-duration/f -- alarm )`
However, this design is bad because the system time could change, resulting in a huge backlog of alarms to run. Also, most alarms were scheduled for less than a second into the future, which makes timestamps pretty useless since no date calculations are being performed. The new API takes a duration:
`add-alarm ( quot start-duration interval-duration/f -- alarm )`
Note that duration can be things like
• 300 milliseconds
• 5 seconds
• 200 nanoseconds

## Using monotonic timers

### Mouse drag alarm

Here's an example of using an alarm from the mouse handling code:
```
: start-drag-timer ( -- )
    hand-buttons get-global empty? [
        [ drag-gesture ] 300 milliseconds 100 milliseconds
        add-alarm drag-timer get-global >box
    ] when ;
```
The `drag-gesture` word gets called 300 milliseconds after a mouse button has been clicked, and again every 100 milliseconds afterwards until the alarm gets cancelled when the user releases a mouse button. The alarm is put into a global box because storing into a full box throws an error, which in this case would flag the impossible situation of the user dragging two things at once. Once dragging stops, the alarm gets cancelled with a call to `cancel-alarm`. You can look at the full source here.

### Benchmark word

The `benchmark` word times a quotation and returns the number of nanoseconds that its execution took. Its implementation follows:
```
: benchmark ( quot -- runtime )
    nano-count [ call nano-count ] dip - ; inline
```
This word simply gets the count from the monotonic timer, calls the quotation, gets a new count, and finds the elapsed time by subtraction.
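A Python analogue, using the monotonic clock for the same reason -- system-time changes can't skew the result:

```python
import time

def benchmark(quot):
    """Return how long quot() takes, in nanoseconds, using a monotonic counter."""
    start = time.monotonic_ns()
    quot()
    return time.monotonic_ns() - start
```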

### Rescheduling alarms

After repeated alarms execute, they must be rescheduled to run again.
```
: reschedule-alarm ( alarm -- )
    dup interval>> nano-count + >>start register-alarm ;
```
The alarm gets rescheduled `interval>>` nanoseconds into the future.

## Remaining issues

Putting the computer to sleep on Snow Leopard in the middle of bootstrap and then resuming does not affect timing. However, is this the case with other operating systems such as Snow Vista or Linux? If not, it might not be worth worrying about. If someone wanted to test, just start a Factor bootstrap and then put the computer to sleep for a while and see if bootstrap time increases. Otherwise, I'll get to it eventually.
Update: Someone on the Factor mailing list reported that putting the computer to sleep on bootstrap in Linux did not mess up the timing. Thank you!

## Monday, August 17, 2009

### Twin Cities Lisp Factor Presentation

At the next TC Lisp meeting, I'll be explaining as much as I can about Factor in a one-hour presentation. Arrive before 6pm for the happy hour beer prices. All are welcome!

Place: Common Roots Cafe
Date: August 18, 2009
Time: 6pm

View Larger Map

## Friday, May 29, 2009

### A new library to manage client connections, and a proof of concept chat server in Factor

Some thirty seconds into play-testing Joe Groff's terrain demo game, I realized there were no NPCs, no double-barreled shotguns or hand-cannons, and most disturbingly, no other players. Sure you can fly around, but you can't gib your friends! So I decided to start doing something about it and write a managed-server vocabulary with a generic protocol that you can extend for writing your own servers. Hey, I'm not much of an artist -- the guns will have to wait.

## Managed-server infrastructure

The HTTP server already uses a library called io.servers.connections which implements a threaded-server with SSL support in 164 lines of code. A threaded-server listens for a set number of clients on an open port and handles each one individually; no client needs to know about any other. To get the code so concise, it uses libraries for concurrency, logging, sockets, and SSL that are themselves reused elsewhere. Out of the box, a threaded-server gives you:
• the best nonblocking I/O code on every platform (completion ports, kqueue, epoll, not select())
• connection/error logging, log rotation
• correct error handling and resource cleanup
• SSL support on Unix platforms
• IPv4 and IPv6 support by default
For an HTTP or FTP server, handling each connection individually is what you want. However, for games or chat servers, you really want your users to interact. Building on top of this thread-server, I made a new tuple called a managed-server that tracks a list of connecting and disconnecting clients. You still get all of the features threaded-server implements, but now there's a new client handler that maintains a list of connected clients keyed by a username and utility words to send data to all clients.
You can also use this code to make custom binary protocols, and I'm mostly done implementing an SRP6 library to allow secure logins over unencrypted connections after you create an account through an SSL connection. UDP support for first-person shooters and faster-paced games will also be added when someone needs it.

## The implementation of managed-server

A managed-server inherits from threaded-server class and adds a new slot called `clients` to store connections. Each connection's state -- the input/output streams, the local/remote socket addresses, username, a slot for passing quit messages, and a quit flag -- is wrapped inside a `managed-client` tuple and stored into the clients hashtable with the username as the key. In this way, it's easy to look up another client's stream and send it a message:
`"wally" "hi wally!" send-client`
You can also send a message to all connected clients with `send-everyone`, or to all but yourself:
`"This one goes out to all the ladies." send-everyone-else`
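The clients table plus the send words can be sketched in Python as a dictionary keyed by username (everything here is illustrative, not the real io.servers API):

```python
class ManagedServer:
    def __init__(self):
        self.clients = {}   # username -> anything with a write() method

    def send_client(self, username, message):
        """Look up one client's stream by username and write to it."""
        self.clients[username].write(message + "\n")

    def send_everyone(self, message):
        for username in self.clients:
            self.send_client(username, message)

    def send_everyone_else(self, sender, message):
        for username in self.clients:
            if username != sender:
                self.send_client(username, message)
```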
Here's what the tuple classes code looks like:
```
TUPLE: managed-server < threaded-server clients ;

TUPLE: managed-client
input-stream output-stream local-address remote-address
username object quit? ;
```

### The managed-server protocol

A managed-server has some generics in place to guide you in creating your own servers. The first two generics are required, but the others default to no-ops unless you want to handle these events. Of course, the clients are still tracked no matter what your method does on the client-join or client-disconnect generics. The default method for `handle-already-logged-in` throws an error to prevent a new client from taking over the other client's session or logging in multiple times. You can override this behavior with your own perversions.
Here's the protocol:
```
HOOK: handle-login threaded-server ( -- username )
HOOK: handle-managed-client* managed-server ( -- )
HOOK: handle-already-logged-in managed-server ( -- )
HOOK: handle-client-join managed-server ( -- )
HOOK: handle-client-disconnect managed-server ( -- )
```

## The implementation of a chat server using managed-server

Eventually someone will use managed-server for the networking code in a game, but until then I've implemented a simple chat server. Writing the chat server was fun and helped me to iron out a couple of bugs, which I wrote about below.

### A walkthrough of the chat server protocol

The chat server code begins by inheriting from the managed-server tuple:
`TUPLE: chat-server < managed-server ;`
From here you go about implementing required parts of the protocol, `handle-login` and `handle-managed-client*`, so let's start there.
```
M: chat-server handle-login
    "Username: " write flush
    readln ;
```
The current input/output streams are bound to the client connection, so calling `write` will send them the login prompt. To read back the username, `readln` reads until a newline is sent. If you were to connect with telnet at this point, you would see the prompt and could send back a username. Then the server would kick you off because there's no implementation of `handle-managed-client*`.
```
M: chat-server handle-managed-client*
    readln dup f = [ t client (>>quit?) ] when
    [
        "/" ?head [ handle-command ] [ handle-chat ] if
    ] unless-empty ;
```
This word handles every other message the client sends apart from the login code. Calling `readln` reads the client's message one line at a time and returns false when the stream closes. The quit flag is set in such a case and will be explained later. For now, suffice to say that you're quitting if `readln` returns false. Next, the message is checked for any content -- both false and an empty string can be safely ignored here by the `unless-empty` combinator. Inside the quotation, the leading slash is stripped from the input, if any, and a boolean returned by `?head` decides if the message was intended for the server or the chat room.
```
: handle-command ( string -- )
    dup " " split1 swap >lower commands get at* [
        call( string -- ) drop
    ] [
        2drop "Unknown command: " prepend print flush
    ] if ;
```
Commands sent to the server are normalized by converting to lower case and then looked up in the commands table. If you send a valid command such as /who or /nick, it gets executed; if not, you get the generic "Unknown command" error.
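The dispatch is the familiar table-of-commands pattern; a quick Python sketch (names hypothetical):

```python
commands = {}

def handle_command(line):
    """Split off the command name, normalize case, and dispatch or complain."""
    name, _, rest = line.partition(" ")
    handler = commands.get(name.lower())
    if handler is None:
        return "Unknown command: " + name
    return handler(rest)

commands["echo"] = lambda rest: rest   # a toy command for illustration
```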
```
: handle-chat ( string -- )
    [
        [ username ": " ] dip
    ] "" append-outputs-as send-everyone ;
```
Sending a message to the chat room is the alternative to server commands. I'm using `append-outputs-as` here to append together a bunch of strings, although I could easily have used `3append` instead. I left this in because it's easier to change the look of the chat if you don't have to keep track of how many strings you're appending and you just let the compiler infer. Please take note: smart combinators in Factor are analogous to applying a function to a list of parameters in Lisp, in that you don't need to know the number of parameters. The following two snippets demonstrate what I mean:
`(+ 1 2 10 1200)`
`[ 1 2 10 1200 ] sum-outputs`
That's pretty much the essence of the chat server since everything else was just added for fun.

### Fun with default encodings

Default encodings are terrible! Of course, you can change the encoding of a stream whenever you want, but the encoding for threaded-servers defaulted to ASCII until I changed it this evening. When I made my chat server yesterday, I forgot to set the encoding to what I wanted -- UTF8. Sending a character above 127 caused the server to throw an exception since ASCII is only 7 bits wide, and the sender would get disconnected. The FTP server I wrote started out with this bug as well, before I changed it to latin1. But now that threaded-server takes an encoding on the stack, this bug can never happen again.
So what's wrong with picking a different default encoding, maybe UTF8? Well, if I'm making a binary server, the UTF8 decoder will replace bad bit sequences with replacement characters -- another latent bug! What about binary as the default encoding, i.e. no encoding? Binary is the best option for a default, but then people who need to use UTF8 or latin1 might not know that the stream protocol supports encodings at all, and will end up doing a lot of work by hand which should be handled by the stream implementation. So not having a default encoding 1) prevents latent bugs and 2) forces the programmer to think about what they really want in each situation -- surely a good idea.
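To see the latent bug concretely, here is a quick illustration in Python (not Factor -- just to show how a replacement-based decoder behaves on binary data): invalid byte sequences are silently replaced with U+FFFD, so the original bytes are unrecoverable.

```python
# Illustration: decoding binary data as UTF-8 with replacement silently
# destroys information instead of signaling an error.
raw = bytes([0x89, 0x50, 0x4E, 0x47])  # PNG magic bytes: not valid UTF-8
text = raw.decode("utf-8", errors="replace")
print(text)                          # the 0x89 byte became U+FFFD
print(text.encode("utf-8") == raw)   # False: the round trip loses data
```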

### Quitting gracefully with quit flag

My first thought was just to throw an exception when I wanted to disconnect a client and cause the error handler to clean up the resources. Hopefully it's common knowledge that control flow implemented with exceptions is inefficient and bad design, in the general case. Maybe just this once? Nope, in my case the logging framework logs all exceptions, so the majority of the logs would be filled up with useless disconnect error messages. Clearly something better was needed -- the quit flag. Managed clients have a quit flag slot that is checked every time the server processes some data. Clients can quit gracefully by setting this flag to true and returning control back to the stream reader loop, and quits caused by exceptions are logged and worthy of further investigation.
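The pattern can be sketched in Python (a hypothetical analogue, not the actual Factor code): the handler sets a flag and returns normally, and the reader loop checks the flag on every pass instead of unwinding through an exception.

```python
# Hypothetical sketch of the quit-flag pattern: a graceful disconnect
# never raises; the loop just checks the flag after each message.
class Client:
    def __init__(self):
        self.quit = False  # set to True to request a graceful disconnect

def reader_loop(client, lines):
    handled = []
    for line in lines:          # stand-in for reading from the stream
        if line == "/quit":
            client.quit = True  # the handler sets the flag and returns
        else:
            handled.append(line)
        if client.quit:         # checked every time data is processed
            break
    return handled
```

Here `reader_loop(Client(), ["hi", "/quit", "bye"])` stops after the quit command without an exception ever being thrown.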

## Live coding on the server

After the chat server was up and running, I could add features without restarting. One of the first requested features was "/help", which required a redesign of how slash commands were handled. Instead of a case statement, now there's a word `add-command` that takes the implementation, the documentation, and the name of the command you want to add. Adding a command stores the code and the docs in symbols holding hashtables, indexed by the name of the command.
```
SYMBOL: commands
commands [ H{ } clone ] initialize

SYMBOL: chat-docs
chat-docs [ H{ } clone ] initialize

:: add-command ( quot docs key -- )
    quot key commands get set-at
    docs key chat-docs get set-at ;
```
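For readers who don't speak Factor, the same design can be sketched in Python (all names here are hypothetical): two dictionaries keyed by command name, one holding the implementation and one holding the documentation for /help.

```python
# Hypothetical Python sketch of the command registry: code and docs
# live in two dicts, both keyed by the command name.
commands, chat_docs = {}, {}

def add_command(quot, docs, key):
    commands[key] = quot      # the implementation (a callable)
    chat_docs[key] = docs     # the text shown by /help

add_command(lambda arg: "pong", "Syntax: /ping\nReplies with pong.", "ping")

def handle_command(message):
    name, _, arg = message.partition(" ")
    handler = commands.get(name.lower())  # normalize, then look up
    return handler(arg) if handler else "Unknown command: " + name
```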
I added a time command for fun:
```
[ drop gmt timestamp>rfc822 print flush ]
<" Syntax: /time
Returns the current GMT time.">
"time" add-command
```
Someone else wanted a "/who" command -- easy enough.
```
[ drop clients keys [ "``" "''" surround ] map ", " join print flush ]
<" Syntax: /who
Shows the list of connected users.">
"who" add-command
```
The last feature I implemented was a way to change your nickname without reconnecting:
```
: handle-nick ( string -- )
    [
        "nick" usage
    ] [
        dup clients key? [
            username-taken-string print flush
        ] [
            [ username swap warn-name-changed ]
            [ username clients rename-at ]
            [ client (>>username) ] tri
        ] if
    ] if-empty ;

[ handle-nick ]
<" Syntax: /nick nickname
Changes your nickname.">
"nick" add-command
```
Changing your nickname is straightforward but takes the most steps of any command I implemented. Try to follow the code: "string" in the stack effect is the requested nickname, and the `clients` word returns a hashtable of connections indexed by nickname. If the user didn't supply a nickname, remind them of the syntax for /nick. If they did supply one, check whether it's already in use and, if so, refuse to change their name. Otherwise the nick change succeeds: tell all the users about the change, apply it in the clients hashtable, and set the new nickname in the client.

## Chat server running live

```
USING: managed-server.chat io.servers.connection ;
8889 <chat-server> start-server
```
Or you can connect to my running chat server:
`telnet trifocus.net 8889`
It's just a demo and I didn't implement any limits on your nickname or what you can send, though it would be easy enough to do so. Have fun, and please let me know if you can find any bugs.

## Sunday, May 24, 2009

### An ID3 Parser in Factor

A new contributor, Tim Wawrzynczak, wrote an ID3 parser as his first Factor program a couple of months back. The code looked pretty good, so it was easy to refactor the way ID3v1 and ID3v2 tags are represented and to add some utility words for managing directories of MP3s. The library still needs some work, but now it can take a directory tree and recursively parse all of the ID3 headers. I also realized that Factor's mmap implementation always tried to map a file read/write, so for ID3 parsing I added a read-only mmap. The finished code is here.

## ID3v1 format

ID3 tags come in two flavors -- the old ID3v1 format, and the newer ID3v2 one. ID3v1 has a fixed length of 128 bytes and, if present, occupies the last 128 bytes of an MP3 file. The bytes begin with "TAG" and are followed by the song title (30 bytes), artist (30 bytes), album (30 bytes), year (4 bytes), comment (30 bytes), and genre (1 byte). The problem with this approach is that every field has a fixed maximum length and only a fixed set of data can be represented.
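The fixed layout makes parsing trivial. Here's a sketch in Python (the Factor library works on an mmapped file, but the slicing is the same idea):

```python
def parse_id3v1(data: bytes):
    """Parse the fixed 128-byte ID3v1 layout: "TAG", then title (30),
    artist (30), album (30), year (4), comment (30), genre (1)."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if not tag.startswith(b"TAG"):
        return None
    def field(lo, hi):  # strip NUL padding from a fixed-width field
        return tag[lo:hi].split(b"\x00")[0].decode("latin-1").rstrip()
    return {
        "title":   field(3, 33),
        "artist":  field(33, 63),
        "album":   field(63, 93),
        "year":    field(93, 97),
        "comment": field(97, 127),
        "genre":   tag[127],
    }
```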

To implement the parser, we use the Memory-mapped files vocabulary to open the file. You can treat the file as an array of bytes that obeys the sequence protocol. Checking if a file has ID3v1 headers becomes:
```
: id3v1? ( mmap -- ? )
    { [ length 128 >= ] [ 128 tail-slice* "TAG" head? ] } 1&& ;
```
The logic here is straightforward: the sequence (the mmapped file) is checked to make sure it's long enough to contain a header, and then the last 128 bytes are checked for the magic bytes "TAG". Since `1&&` is a short-circuit combinator, if the length test fails the computation ends early. In this way, you can string together long computations with short-circuit `and` and `or`. Factor's usual `and`/`or` words take two already-evaluated objects, so the short-circuit behavior is implemented as a library of macros instead.
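If you're more at home in Python than Factor, `1&&` behaves like applying a list of predicates to one value with a lazy `all` (a sketch of the idea, not the real implementation):

```python
# Sketch of Factor's 1&& in Python: apply each predicate to one value,
# short-circuiting at the first failure (the generator is lazy).
def one_and(value, predicates):
    return all(p(value) for p in predicates)

def is_id3v1(data):
    return one_and(data, [
        lambda d: len(d) >= 128,                # long enough for a header?
        lambda d: d[-128:].startswith(b"TAG"),  # magic bytes present?
    ])
```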

## ID3v2 format

The newer standard, ID3v2, has more room for metadata and can store up to 256 MB of it. Its use is indicated when the MP3 begins with "ID3", or when the bytes "3DI" are present 10 bytes from the end of the file or 10 bytes before the ID3v1 header. In the header, the first two bytes are the version, then a flag byte, and finally four bytes for the size of the header. The size bytes are encoded as a synchsafe number, which means the top bit of each byte is discarded and the lower seven bits carry the data. Four synchsafe bytes thus yield 28 bits of length, which is why the maximum size is 2^28 bytes = 256 MB.
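Decoding a synchsafe number is easy to sketch (Python here, for illustration): drop the top bit of each byte and concatenate the remaining seven.

```python
def synchsafe_to_int(data: bytes) -> int:
    """Concatenate the low 7 bits of each byte, ignoring the top bit."""
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
    return n

# Four synchsafe bytes carry 4 * 7 = 28 bits, hence the 256 MB ceiling:
print(synchsafe_to_int(b"\x7f\x7f\x7f\x7f") == 2**28 - 1)  # True
```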

The actual data we want to parse is stored in frames. Each frame has a tag, which is four ascii or utf16 bytes, followed by another 4-byte synchsafe integer for the length, two bytes of flag data, and the frame data. There are many different types of frames, but some of the more common ones are tagged TALB (album title), TIT2 (title), TRCK (track), COMM (comment), and TPE1 (performer). All of the frames are added to a hashtable keyed by the tag. Looking up the title frame becomes as easy as `"TIT2" find-id3-frame` on an ID3 tuple.
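A simplified frame walk, again sketched in Python (real ID3v2 has version differences and extended headers this ignores):

```python
def parse_frames(data: bytes) -> dict:
    """Walk simplified ID3v2.4 frames: a 4-byte tag, a 4-byte synchsafe
    size, 2 flag bytes, then the frame body. Stops at zero padding."""
    frames, pos = {}, 0
    while pos + 10 <= len(data) and data[pos] != 0:
        tag = data[pos:pos + 4].decode("ascii")
        size = 0
        for b in data[pos + 4:pos + 8]:   # synchsafe frame length
            size = (size << 7) | (b & 0x7F)
        frames[tag] = data[pos + 10:pos + 10 + size]
        pos += 10 + size
    return frames
```

With the frames in a dictionary, looking up the title is just `parse_frames(body).get("TIT2")`, which mirrors what `find-id3-frame` does on the ID3 tuple.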

## Future work

The ID3v1 "TAG+" format is not supported yet, but apparently it's hardly ever used anyway. ID3v2 tags are only looked for at the beginning of an MP3, though as of version four they may be present at the end too. Most of the ID3 frames are not parsed into meaningful data beyond the raw bytes. Lastly, writing out ID3 tags is not implemented yet and would be a good first program to write in Factor. Volunteers?