The Awesome Factor

Friday, May 29, 2009

A new library to manage client connections, and a proof of concept chat server in Factor

Some thirty seconds into play-testing Joe Groff's terrain demo game, I realized there were no NPCs, no double-barreled shotguns or hand-cannons, and most disturbingly, no other players. Sure you can fly around, but you can't gib your friends! So I decided to do start to do something about it and write a managed server vocabulary with a generic protocol that you can extend for writing your own servers. Hey, I'm not much of an artist -- the guns will have to wait.

Managed-server infrastructure

The HTTP server already uses a library called io.servers.connections which implements a threaded-server with SSL support in 164 lines of code. A threaded-server listens for a set number of clients on an open port and handles each one individually; no client needs to know about any other. To get the code so concise, it uses libraries for concurrency, logging, sockets, and SSL that are themselves reused elsewhere.
Features of the threaded-server vocabulary:
  • the best nonblocking I/O code on every platform (completion ports, kqueue, epoll, not select())
  • connection/error logging, log rotation
  • correct error handling and resource cleanup
  • SSL support on Unix platforms
  • IPv4 and IPv6 support by default
For an HTTP or FTP server, handling each connection individually is what you want. However, for games or chat servers, you really want your users to interact. Building on top of this thread-server, I made a new tuple called a managed-server that tracks a list of connecting and disconnecting clients. You still get all of the features threaded-server implements, but now there's a new client handler that maintains a list of connected clients keyed by a username and utility words to send data to all clients.
You can also use this code to make custom binary protocols, and I'm mostly through implemented an SRP6 library to allow secure unencrypted logins after you create an account through an SSL connection. UDP support for first-person shooter and faster-paced games will also be supported when someone needs it.

The implementation of managed-server

A managed-server inherits from threaded-server class and adds a new slot called clients to store connections. Each connection's state -- the input/output streams, the local/remote socket addresses, username, a slot for passing quit messages, and a quit flag -- is wrapped inside a managed-client tuple and stored into the clients hashtable with the username as the key. In this way, it's easy to look up another client's stream and send it a message:
"wally" "hi wally!" send-client
You can also send a message to all connected clients,send-everyone, or to all but yourself:
"This one goes out to all the ladies." send-everyone-else
Here's what the tuple classes code looks like:
TUPLE: managed-server < threaded-server clients ;

TUPLE: managed-client
input-stream output-stream local-address remote-address
username object quit? ;

The managed-server protocol

A managed-server has some generics in place to guide you in creating your own servers. The first two generics are required, but the others default to no-ops unless you want to handle these events. Of course, the clients are still tracked no matter what your method does on the client-join or client-disconnect generics. The default method for handle-already-logged-in throws an error to prevent a new client from taking over the other client's session or logging in multiple times. You can override this behavior with your own perversions.
Here's the protocol:

HOOK: handle-login threaded-server ( -- username )
HOOK: handle-managed-client* managed-server ( -- )
HOOK: handle-already-logged-in managed-server ( -- )
HOOK: handle-client-join managed-server ( -- )
HOOK: handle-client-disconnect managed-server ( -- )

The implementation of a chat server using managed-server

Eventually someone will use managed-server for the networking code in a game, but until then I've implemented a simple chat server. Writing the chat server was fun and helped me to iron out a couple of bugs, which I wrote about below.

A walkthrough of the chat server protocol


The chat server code begins by inheriting from the managed-server tuple:
TUPLE: chat-server < managed-server ;
From here you go about implementing required parts of the protocol, handle-login and handle-managed-client*, so let's start there.
M: chat-server handle-login
"Username: " write flush
readln ;
The current input/output streams are bound to the client connection, so calling write will send them the login prompt. To read back the username, readln reads until a newline is sent. If you were to connect with telnet at this point, you would see the prompt and could send back a username. Then the server would kick you off because there's no implementation of handle-managed-client*.
M: chat-server handle-managed-client*
readln dup f = [ t client (>>quit?) ] when
[
"/" ?head [ handle-command ] [ handle-chat ] if
] unless-empty ;
This word handles every other message the client sends apart from the login code. Calling readln reads the client's message one line at a time and returns false when the stream closes. The quit flag is set in such a case and will be explained later. For now, suffice to say that you're quitting if readln returns false. Next, the message is checked for any content -- both false and an empty string can be safely ignored here by the unless-empty combinator. Inside the quotation, the leading slash is stripped from the input, if any, and a boolean returned by ?head decides if the message was intended for the server or the chat room.
: handle-command ( string -- )
dup " " split1 swap >lower commands get at* [
call( string -- ) drop
] [
2drop "Unknown command: " prepend print flush
] if ;
Commands sent to the server are normalized by converting to lower case and then looked up in the commands table. If you send a successful command such as /who or /nick then it gets executed; if not you get the generic "Unknown command" error.
: handle-chat ( string -- )
[
[ username ": " ] dip
] "" append-outputs-as send-everyone ;
Sending a message to the chat room is the alternative to server commands. I'm using append-outputs-as here to append together a bunch of strings, although i could easily have used 3append instead. I left this in because it's easier to change the look of the chat if you don't have to keep track of how many strings you're appending and you just let the compiler infer. Please take note: smart combinators in Factor are analogous to applying a function to a list or parameters in Lisp in that you don't need to know the number of parameters. The following two snippets will demonstrate what I mean:
(+ 1 2 10 1200)
[ 1 2 10 1200 ] sum-outputs
That's pretty much the essence of the chat server since everything else was just added for fun.

Fun with default encodings

Default encodings are terrible! Of course, you can change the encoding of a stream whenever you want, but the encoding for threaded-servers defaulted to ASCII until I changed it this evening. When I made my chat server yesterday, I forgot to set the encoding to what I wanted -- UTF8. Sending a character above 127 caused the server to throw an exception since ASCII is only 7 bits wide, and the sender would get disconnected. The FTP server I wrote started out with this bug as well, before I changed it to latin1. But now that threaded-server takes an encoding on the stack, this bug can never happen again.
So what's wrong with picking a different default encoding, maybe UTF8? Well, if I'm making a binary server, the UTF8 decoder will replace bad bit sequences with replacement characters -- another latent bug! What about binary as the default encoding, i.e. no encoding? Binary is the best option for a default, but then people who need to use UTF8 or latin1 might not know that the stream protocol supports encodings at all, and will end up doing a lot of work by hand which should be handled by the stream implementation. So not having a default encoding 1) prevents latent bugs and 2) forces the programmer to think about what they really want in each situation -- surely a good idea.

Quitting gracefully with quit flag

My first thought was just to throw an exception when I wanted to disconnect a client and cause the error handler to clean up the resources. Hopefully it's common knowledge that control flow implemented with exceptions is inefficient and bad design, in the general case. Maybe just this once? Nope, in my case the logging framework logs all exceptions, so the majority of the logs would be filled up with useless disconnect error messages. Clearly something better was needed -- the quit flag. Managed clients have a quit flag slot that is checked every time the server processes some data. Clients can quit gracefully by setting this flag to true and returning control back to the stream reader loop, and quits caused by exceptions are logged and worthy of further investigation.

Live coding on the server

After the chat server was up and running, I could add features without restarting. One of the first requested features was "/help", which required a redesign of how slash commands were handled. Instead of a case statement, now there's a word add-command that takes the implementation, the documentation, and the name of the command you want to add. Adding a command stores the code and the docs in symbols holding hashtables, indexed by the name of the command.
SYMBOL: commands
commands [ H{ } clone ] initialize

SYMBOL: chat-docs
chat-docs [ H{ } clone ] initialize

:: add-command ( quot docs key -- )
quot key commands get set-at
docs key chat-docs get set-at ;
I added a time command for fun:
[ drop gmt timestamp>rfc822 print flush ]
<" Syntax: /time
Returns the current GMT time."> "time" add-command
Someone else wanted a "/who" command -- easy enough.
[ drop clients keys [ "``" "''" surround ] map ", " join print flush ]
<" Syntax: /who
Shows the list of connected users.">
"who" add-command
There last feature I implemented was a way to change your nickname without reconnecting:
: handle-nick ( string -- )
[
"nick" usage
] [
dup clients key? [
username-taken-string print flush
] [
[ username swap warn-name-changed ]
[ username clients rename-at ]
[ client (>>username) ] tri
] if
] if-empty ;

[ handle-nick ]
<" Syntax: /nick nickname
Changes your nickname.">
"nick" add-command
Changing your nickname is straightforward but takes the most steps of all the commands I implemented. Try to understand the code -- "string" in the stack effect is the requested nickname and the clients word returns a hashtable of connections, indexed by nicknames. If the user didn't supply a nickname, remind them of the syntax for using /nick. If they did supply a nickname, check if it's in use and, if so, refuse to change their name. Otherwise, the nick change succeeded, so tell all the users of the nickname change, apply the nick change in the clients hashtable, and set the new nickname in the client.

Chat server running live

You can try out the chat server by downloading Factor and running this command:
USING: managed-server.chat io.servers.connection ; 8889 <chat-server> start-server
Or you can connect to my running chat server:
telnet trifocus.net 8889
It's just a demo and I didn't implement any limits on your nickname or what you can send, though it would be easy enough to do so. Have fun, and please let me know if you can find any bugs.

Sunday, May 24, 2009

An ID3 Parser in Factor

A new contributor, Tim Wawrzynczak, wrote an ID3 parser as his first Factor program a couple of months back. The code looked pretty good, so it was easy to refactor the way ID3v1 and ID3v2 tags are represented and to add some utility words for managing directories of MP3s. The library still needs some work, but now it can take a directory tree and recursively parse all of the ID3 headers. I also realized that Factor's mmap implementation always tried to map a file read/write, so for ID3 parsing I added a read-only mmap. The finished code is here.

ID3v1 format


ID3 tags come in two flavors -- the old ID3v1 format, and the newer ID3v2 one. ID3v1 has a fixed length of 128 bytes and, if present, is the last 128 bytes of an MP3 file. The bytes begin with "TAG" and follow with the song title (30 bytes), artist (30 bytes), album (30 bytes), year (4 bytes), comment (30 bytes), and the genre (1 byte). The problem with this approach is that you are limited in the length of all the fields and in which data you can represent.

To implement the parser, we use the Memory-mapped files vocabulary to open the file. You can treat the file as an array of bytes that obeys the sequence protocol. Checking if a file has ID3v1 headers becomes:
: id3v1? ( mmap -- ? )
{ [ length 128 >= ] [ 128 tail-slice* "TAG" head? ] } 1&& ;
The logic here is straightforward: the sequence (mmapped file) is checked to make sure it's long enough to contain a header, and then the last 128 bytes are checked against the magic bytes "TAG". Since this is a short-circuit combinator, if the length test fails then the computation will end early. In this way, you can string together long computations that use short-circuit and and or. Factor's usual and/or words take two already-evaluated objects, so the short-circuit behavior is implemented as a library of macros instead.

ID3v2 format


The newer standard, ID3v2, has more room for metadata and can store anything up to 256 MB. Its use is indicated when the MP3 begins with "ID3" or when the bytes "3DI" are present 10 bytes from the end of the file or 10 bytes before the ID3v1 header. For parsing the header, the first two bytes are the version, then a flag byte, and finally four bytes for the size of the header. The size bytes are encoded as a synchsafe number, which means that the top bit is discarded and the lower seven bits are the data. In our case, 28 bits of data are the length, which is why the maximum length is 256MB.

The actual data we want to parse is stored in frames. Each frame has a tag, which is four ascii or utf16 bytes, followed by another 4 byte synchsafe integer for the length, two bytes of flag data, and the frame data. There are many different types of frames, but some of the more common ones are tagged TALB (album title), TIT2 (title), TRCK (track), COMM (comment), TPE1 (performer). All of the frame are added to a hashtable and keyed by the tag. Looking up the title frame becomes as easy as "TIT2" find-id3-frame on an ID3 tuple.

Future work


The ID3v1 "TAG+" format is not supported yet and apparently it's hardly ever used. ID3v2 tags are only looked for at the beginning of an MP3 but may be present at the end too as of version four. Most of the ID3 tags are not parsed into meaningful data besides the raw bytes. Lastly, writing out ID3 tags is not implemented yet and would be a good first program to write in Factor. Volunteers?

Tuesday, January 13, 2009

Files and file-systems in Factor, part 2

In my previous post I wrote about the file-info and file-system-info words. Using those words, it's possible to write a portable program for directory listing (dir or ls) and one for file-systems listing (df). Uses for directory listing include file-system utility programs and FTP servers.

File listing tool

The file-listing tool is in the Factor git repository as basis/tools/files/files.factor. Which slots you see depend on how it is configured and on which platform it's running. Directory listings can be sorted by slot and the default sort is by name.

On Unix platforms, it's configured like this:
    <listing-tool>
{ permissions nlinks user group file-size file-date file-name } >>specs
{ { directory-entry>> name>> <=> } } >>sort
Listing a directory on MacOSX:
-rw-r--r-- 1  erg staff 27185 Nov 28  2008 #factor.el#
drwxr-xr-x 6 erg staff 204 Nov 17 2008 Factor.tmbundle
-rw-r--r-- 1 erg staff 30282 Nov 30 2008 factor.el
-rw-r--r-- 1 erg staff 18037 Nov 17 2008 factor.vim
-rw-r--r-- 1 erg staff 12496 Nov 17 2008 factor.vim.fgen
drwxr-xr-x 26 erg staff 884 Jan 13 11:17 fuel
drwxr-xr-x 8 erg staff 272 Nov 17 2008 icons
Windows platforms take the look of the ``dir'' command by default:
2009-01-14 00:00:53 <DIR>                Factor.tmbundle
2009-01-14 00:00:53 30282 factor.el
2009-01-14 00:00:53 18037 factor.vim
2009-01-14 00:00:53 12496 factor.vim.fgen
2009-01-14 00:00:53 <DIR> fuel
2009-01-14 00:00:53 <DIR> icons

File-systems tool

A tool for disk usage and mounted devices was easy to write once file-system tuples worked everywhere. Once again, it's configurable for whatever you want to see. The default file-system word is:
: file-systems. ( -- )
{
device-name available-space free-space used-space
total-space percent-used mount-point
} print-file-systems ;
File-systems on a Mac:
device-name   available-space free-space   used-space   total-space  percent-used mount-point
/dev/disk0s2 118599725056 118861869056 200867090432 319728959488 62 /
fdesc 0 0 1024 1024 100 /dev
fdesc 0 0 1024 1024 100 /dev
map -hosts 0 0 0 0 0 /net
map auto_home 0 0 0 0 0 /home

On Windows:
device-name      available-space free-space used-space total-space percent-used mount-point
3814506496 3814506496 6911225856 10725732352 64 C:\
VBOXADDITIONS_2. 0 0 27983872 27983872 100 D:\
0 0 0 0 0 A:\

Friday, January 09, 2009

Files and file-systems in Factor, part 1

Factor now has an easy way access to get information about files and file-systems in a high-level way across all the platforms that it supports. The API is really simple -- pass a pathname and get information back about the file or file-system as a tuple. The second part of this post will demonstrate a clone of the Unix tools ls, for listing files, and df, for listing file-systems.

File-info

There are now words to get information about files and symlinks, using file-info and link-info which map to C system calls stat and lstat. Some slots are shared across all platforms while others are only present on a particular platform. There are symbols representing all of the file types, like +regular-file+, +directory+, and +symbolic-link+. Here are some examples.
MacOSX
( scratchpad ) "resource:license.txt" file-info .
T{ bsd-file-info
{ type +regular-file+ }
{ size 1252 }
{ permissions 33188 }
{ created
T{ timestamp
{ year 2008 }
{ month 11 }
{ day 17 }
{ hour 23 }
{ minute 34 }
{ second 5 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ modified
T{ timestamp
{ year 2008 }
{ month 11 }
{ day 17 }
{ hour 23 }
{ minute 34 }
{ second 5 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ accessed
T{ timestamp
{ year 2008 }
{ month 12 }
{ day 9 }
{ hour 12 }
{ minute 34 }
{ second 8 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ uid 501 }
{ gid 20 }
{ dev 234881026 }
{ ino 992362 }
{ nlink 1 }
{ rdev 0 }
{ blocks 8 }
{ blocksize 4096 }
{ birth-time
T{ timestamp
{ year 2008 }
{ month 11 }
{ day 17 }
{ hour 23 }
{ minute 34 }
{ second 5 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ flags 0 }
{ gen 0 }
}
Windows XP
( scratchpad ) "resource:license.txt" file-info .
T{ windows-file-info
{ type +regular-file+ }
{ size 1252 }
{ permissions 32 }
{ created
T{ timestamp
{ year 2008 }
{ month 3 }
{ day 23 }
{ hour 23 }
{ minute 28 }
{ second 12 }
}
}
{ modified
T{ timestamp
{ year 2008 }
{ month 3 }
{ day 27 }
{ hour 23 }
{ minute 24 }
{ second 12 }
}
}
{ accessed
T{ timestamp
{ year 2008 }
{ month 9 }
{ day 19 }
{ hour 23 }
{ minute 8 }
{ second 41 }
}
}
{ attributes { +archive+ } }
}
FreeBSD
( scratchpad ) "resource:license.txt" file-info .
T{ bsd-file-info
{ type +regular-file+ }
{ size 1252 }
{ permissions 33188 }
{ created
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 53 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ modified
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 53 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ accessed
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 59 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ uid 1002 }
{ gid 1002 }
{ dev 89 }
{ ino 343452 }
{ nlink 1 }
{ rdev 1348575 }
{ blocks 4 }
{ blocksize 4096 }
{ birth-time
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 53 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ flags 0 }
{ gen 0 }
}

File Systems

The file-system utility word above works on file-system tuples that contain cross-platform information like the device name, the mount point, the number of free, used, and total bytes. A file-system tuple on Unix has all of the file-system information found in both statfs and statvfs while a Windows file-system object has the device-id, volume name, and byte usage slots. There is not a single win32 API call that gives as much information as on Unix systems -- instead I call a combination of GetDiskFreeSpaceEx, FindFirstVolume, GetVolumePathNamesForVolumeName, GetVolumeInformation.
On every Unix besides Linux, there is a member of the statfs or statvfs structure that gives you the file-system that contains the file. So, I had to roll my own for it to work the same way across all platforms. The algorithm is pretty simple: the follow-links word follows links up to 10 times (configurable) and once it stops, finds the parent directory and follows the links again until the directory is a member of the directories in the /etc/mtab file. If there is circularity or a broken link, it throws an error.

FreeBSD
( scratchpad ) "/" file-system-info .
T{ freebsd-file-system-info
{ device-name "/dev/da0s1a" }
{ mount-point "/" }
{ type "ufs" }
{ available-space 368117760 }
{ free-space 409702400 }
{ used-space 110110720 }
{ total-space 519813120 }
{ block-size 2048 }
{ preferred-block-size 2048 }
{ blocks 253815 }
{ blocks-free 200050 }
{ blocks-available 179745 }
{ files 65790 }
{ files-free 61501 }
{ files-available 61501 }
{ name-max 255 }
{ flags 20480 }
{ id { 0 0 } }
{ version 537068824 }
{ io-size 16384 }
{ owner 0 }
{ syncreads 0 }
{ syncwrites 0 }
{ asyncreads 0 }
{ asyncwrites 0 }
}
Windows XP 64
( scratchpad ) "k:\\" file-system-info .
T{ win32-file-system-info
{ device-name "" }
{ mount-point "k:\\" }
{ type "NTFS" }
{ available-space 174530142208 }
{ free-space 174530142208 }
{ used-space 225547444224 }
{ total-space 400077586432 }
{ max-component 255 }
{ flags 459007 }
{ device-serial 3695676537 }
}
The Factor build farm will start using file-system-info to report when a drive fills up pretty soon.

Sorting by tuple slots

Factor's sorting has been extended to support sorting tuples by multiple slots, one after the other. In a music player application, you may wish to sort your songs first by artist, then by album, then track number. If you had a tuple representing every song, the code for such a sort is now very easy:
{ { artist>> <=> } { album>> <=> } { track>> <=> } } sort-by-slots
The main word at work is sort-by-slots ( seq sort-specs -- seq' ), where a sort-spec is an accessor paired with a comparator, one of <=>, >=<, human-<=>, human->=<. Human sort first converts consecutive digits to integers and then makes the comparison, e.g. { "a1" "a10" "a03" "a2" } sorts as { "a1" "a2" "a03" "a10" } instead of { "a03" "a1" "a10" "a2" }, as it would with natural-sort.

Sorting in reverse order is possible with the new operator >=<, which inverts the result of the <=> comparator. To sort a playlist by the most played songs in reverse order (most played first):
{ { play-count>> >=< } } sort-by-slots

Source code for sort-by-slots

: slot-comparator ( accessor comparator -- quot )
'[ [ _ execute ] bi@ _ execute dup +eq+ eq? [ drop f ] when ] ;

MACRO: compare-slots ( sort-specs -- <=> )
#! sort-spec: { accessor comparator }
[ first2 slot-comparator ] map '[ _ 2|| +eq+ or ] ;

: sort-by-slots ( seq sort-specs -- seq' )
'[ _ compare-slots ] sort ;
First, a bit about how Factor's sort ( seq quot -- sortedseq ) word works. It expects a quotation that compares two objects lexicographically, which is element by element in dictionary order. Thus, the macro compare-slots expands the sort-specs into a quotation that compares tuples slot-by-slot until there is a difference. The macro-expansion for the first example looks like this:
( scratchpad ) [ { { artist>> <=> } { album>> <=> } { track>> <=> } } compare-slots ] expand-macros .
f 2 [
\ artist>> \ <=> [ [ execute ] curry bi@ ] dip execute
dup +eq+ eq? [ drop f ] when
] [ [ drop ] dip ndup ] dip call dup
[ 2 [ ndrop ] curry dip ] [
2 [
\ album>> \ <=> [ [ execute ] curry bi@ ] dip execute
dup +eq+ eq? [ drop f ] when
] [ [ drop ] dip ndup ] dip call dup
[ 2 [ ndrop ] curry dip ] [
2 [
\ track>> \ <=> [ [ execute ] curry bi@ ] dip
execute dup +eq+ eq? [ drop f ] when
] [ [ drop ] dip ndup ] dip call dup
[ 2 [ ndrop ] curry dip ]
[ 2 [ drop ] dip ndrop t [ f ] [ no-cond ] if ] if
] if
] if +eq+ or

The macro first extracts the slots and compares with the first comparator. If the comparator returns +lt+ or +gt+ then the comparison is over, but if the comparator returns +eq+ then the next comparison is invoked and the sort continues in the same way. Notice that the slot-comparator word replaces +eq+ with f to avoid short-circuiting the iteration done by 2||. The final sort-by-slots word is a trivial call to sort with the new comparator word we just defined.

Algorithmic complexity and ideas for improvement

The sorting bound is still O(n log n) for time, but of course the more comparators that you need to break ties, the longer the algorithm will take to run.
If your data has to be sorted anyway, it's possible to sort and then split the data into related slices using a sequence of accessors like the above. You could then compute the playing time of every album, or find your most-played song from every artist. Splitting with the same accessors takes O(n) time, but the odds are that you probably won't have already sorted the data. So for unsorted data, there is a more efficient way to extract features -- you can instead partition it in O(n) time and find features on each partition, avoiding sorting altogether. This will be the subject of a future blog post.

Sunday, October 19, 2008

Moving primitives out of the Factor VM: Factor vs C

In a previous lifetime, I added environment variable primitives to the VM. Well, it turns out that a better place for them is in the Factor basis vocabulary root, so this post is about moving them again.

To move a primitive out of the VM, implement its functionality in Factor code and replace usages with your word if necessary, remove it from vm/primitives.c and core/bootstrap/primitives.factor, remove the primitive code from the VM, make a new image, recompile Factor, and bootstrap. Basically, do the inverse of the previous post.

What's more interesting, unless you have to actually remove a primitive someday and need this reference, is comparing the code for each version side-by-side. The C code has to call functions to prevent the garbage collector from moving data around when it shouldn't, but it's written without the usual ifdefs you will find in most cross-platform C code, so overall it's fairly clean. The Factor version has a high-level protocol that is implemented by both backends across separate files, with one-liners for most of the Unix definitions and high-level combinators for the Windows ones. I find the Factor version much easier to understand and I believe it's more maintainable. Factor is a better C than C.

High-level environment variable interface

Factor's high-level environment variable words let you get a single variable or all of them, set a single variable or all of them, and unset a variable. On Windows you cannot set all of the variables at once, and on Windows CE the whole concept of environment variables does not exist.

Here is the code for the main vocabulary. Notice that there are hooks on the os word, which will be a value like macosx or winnt or linux. The boilerplate at the bottom is for loading the platform-specific code.

USING: assocs combinators kernel sequences splitting system
vocabs.loader ;
IN: environment

HOOK: os-env os ( key -- value )

HOOK: set-os-env os ( value key -- )

HOOK: unset-os-env os ( key -- )

HOOK: (os-envs) os ( -- seq )

HOOK: (set-os-envs) os ( seq -- )

: os-envs ( -- assoc )
(os-envs) [ "=" split1 ] H{ } map>assoc ;

: set-os-envs ( assoc -- )
[ "=" swap 3append ] { } assoc>map (set-os-envs) ;

{
{ [ os unix? ] [ "environment.unix" require ] }
{ [ os winnt? ] [ "environment.winnt" require ] }
{ [ os wince? ] [ ] }
} cond

Unix environment variables, before and after

DEFINE_PRIMITIVE(os_env)
{
char *name = unbox_char_string();
char *value = getenv(name);
if(value == NULL)
dpush(F);
else
box_char_string(value);
}

DEFINE_PRIMITIVE(os_envs)
{
GROWABLE_ARRAY(result);
REGISTER_ROOT(result);
char **env = environ;

while(*env)
{
CELL string = tag_object(from_char_string(*env));
GROWABLE_ARRAY_ADD(result,string);
env++;
}

UNREGISTER_ROOT(result);
GROWABLE_ARRAY_TRIM(result);
dpush(result);
}

DEFINE_PRIMITIVE(set_os_env)
{
char *key = unbox_char_string();
REGISTER_C_STRING(key);
char *value = unbox_char_string();
UNREGISTER_C_STRING(key);
setenv(key, value, 1);
}

DEFINE_PRIMITIVE(unset_os_env)
{
char *key = unbox_char_string();
unsetenv(key);
}

DEFINE_PRIMITIVE(set_os_envs)
{
F_ARRAY *array = untag_array(dpop());
CELL size = array_capacity(array);

/* Memory leak */
char **env = calloc(size + 1,sizeof(CELL));

CELL i;
for(i = 0; i < size; i++)
{
F_STRING *string = untag_string(array_nth(array,i));
CELL length = to_fixnum(string->length);

char *chars = malloc(length + 1);
char_string_to_memory(string,chars);
chars[length] = '\0';
env[i] = chars;
}

environ = env;
}
Factor
USING: alien alien.c-types alien.strings alien.syntax kernel
layouts sequences system unix environment io.encodings.utf8
unix.utilities vocabs.loader combinators alien.accessors ;
IN: environment.unix

HOOK: environ os ( -- void* )

M: unix environ ( -- void* ) "environ" f dlsym ;

M: unix os-env ( key -- value ) getenv ;

M: unix set-os-env ( value key -- ) swap 1 setenv io-error ;

M: unix unset-os-env ( key -- ) unsetenv io-error ;

M: unix (os-envs) ( -- seq )
environ *void* utf8 alien>strings ;

: set-void* ( value alien -- ) 0 set-alien-cell ;

M: unix (set-os-envs) ( seq -- )
utf8 strings>alien malloc-byte-array environ set-void* ;

os {
{ macosx [ "environment.unix.macosx" require ] }
[ drop ]
} case

MacOSX environment variables, before and after

On OSX, we have to use a function to access the environment variable.

#ifndef environ
extern char ***_NSGetEnviron(void);
#define environ (*_NSGetEnviron())
#endif
Factor
USING: alien.syntax system environment.unix ;
IN: environment.unix.macosx

FUNCTION: void* _NSGetEnviron ( ) ;

M: macosx environ _NSGetEnviron ;

Windows NT environment variables, before and after

Draw your own conclusions.

DEFINE_PRIMITIVE(os_env) 
{
F_CHAR *key = unbox_u16_string();
F_CHAR *value = safe_malloc(MAX_UNICODE_PATH * 2);
int ret;
ret = GetEnvironmentVariable(key, value, MAX_UNICODE_PATH * 2);
if(ret == 0)
dpush(F);
else
dpush(tag_object(from_u16_string(value)));
free(value);
}

DEFINE_PRIMITIVE(os_envs)
{
GROWABLE_ARRAY(result);
REGISTER_ROOT(result);

TCHAR *env = GetEnvironmentStrings();
TCHAR *finger = env;

for(;;)
{
TCHAR *scan = finger;
while(*scan != '\0')
scan++;
if(scan == finger)
break;

CELL string = tag_object(from_u16_string(finger));
GROWABLE_ARRAY_ADD(result,string);

finger = scan + 1;
}

FreeEnvironmentStrings(env);

UNREGISTER_ROOT(result);
GROWABLE_ARRAY_TRIM(result);
dpush(result);
}

DEFINE_PRIMITIVE(set_os_env)
{
F_CHAR *key = unbox_u16_string();
REGISTER_C_STRING(key);
F_CHAR *value = unbox_u16_string();
UNREGISTER_C_STRING(key);
if(!SetEnvironmentVariable(key, value))
general_error(ERROR_IO, tag_object(get_error_message()), F, NULL);
}

DEFINE_PRIMITIVE(unset_os_env)
{
if(!SetEnvironmentVariable(unbox_u16_string(), NULL)
&& GetLastError() != ERROR_ENVVAR_NOT_FOUND)
general_error(ERROR_IO, tag_object(get_error_message()), F, NULL);
}

DEFINE_PRIMITIVE(set_os_envs)
{
not_implemented_error();
}
Factor
USING: alien.strings fry io.encodings.utf16 kernel
splitting windows windows.kernel32 ;
IN: environment.winnt

M: winnt os-env ( key -- value )
MAX_UNICODE_PATH "TCHAR" <c-array>
[ dup length GetEnvironmentVariable ] keep over 0 = [
2drop f
] [
nip utf16n alien>string
] if ;

M: winnt set-os-env ( value key -- )
swap SetEnvironmentVariable win32-error=0/f ;

M: winnt unset-os-env ( key -- )
f SetEnvironmentVariable 0 = [
GetLastError ERROR_ENVVAR_NOT_FOUND =
[ win32-error ] unless
] when ;

M: winnt (os-envs) ( -- seq )
GetEnvironmentStrings [
<memory-stream> [
utf16n decode-input
[ "\0" read-until drop dup empty? not ]
[ ] [ drop ] produce
] with-input-stream*
] [ FreeEnvironmentStrings win32-error=0/f ] bi ;

Saturday, October 18, 2008

Introducing the Factor database library

Background

One of the first libraries I wrote in Factor was a binding to PostgreSQL library in May 2005. Factor makes it incredibly easy to bind to C functions -- just add a FUNCTION: declaration and call the C function. For example,

( scratchpad ) FUNCTION: int getuid ( ) ;
( scratchpad ) getuid .
0
At about the same time, Chris Double wrote a SQLite binding and a high-level library he called tuple-db. It had some good ideas, such as query-by-example and prepared SQL statements for better security.

Earlier this year, I took both of these libraries and merged them to come up with the the current database library. The supported backends are PostgreSQL and SQLite and there are protocols for implementing other backends with relative ease. Contributions are welcome.

Overview

Briefly, Factor tuples correspond one-to-one to tables in the database through a mapping defined with the define-persistent word. Tuples are then manipulated through insert, update, delete, and select words. Upon calling the insert word, all of the filled-in slots in a tuple will be saved to the database. The process is reversible, and by filling in slots for a tuple and calling select-tuples, the libary generates a select statement from the filled-in slots (query-by-example) and returns tuples in a sequence. More advanced queries and lower level raw SQL statements are possible as well.

First, we'll look at how to connect to a database.

Connecting to a database

Every database has its own setup which Factor encapsultes in a tuple. SQLite just needs a path to a file on disk, but since PostgreSQL has a server/client model, the setup is more complex. After making your database tuple, connecting works the same for any database by using the with-db word. This word takes a quotation (a block of code) and your tuple object, then calls the database open routine and your quotation. After running your quotation, the database is closed, even if your code throws an exception. Having an interface like this means you can simply swap out your SQLite database for a networked PostgreSQL one if you suddenly need more scalability or networked database access.

You should generally make a custom combinator with your project's connection information. Here are a couple of examples.

SQLite example combinator:

USING: db.sqlite db io.files ;
: with-sqlite-db ( quot -- )
"my-database.db" temp-file <sqlite-db> swap with-db ; inline

PostgreSQL example combinator:

USING: db.postgresql db ;
: with-postgresql-db ( quot -- )
<postgresql-db>
"localhost" >>host
5432 >>port
"user" >>username
"seeecrets?" >>password
"factor-test" >>database
swap with-db ; inline
To make sure your database connection works, you can test it with an empty quotation:
 [ ] with-postgresql-db 

If that line of code doesn't throw an error, you can safely assume connecting to the database worked.

Defining persistent tuples

The highest-level database abstraction Factor offers relates tuples directly to database tables. Tuples must map one-to-one to a database table, but foreign keys to other tables are allowed.

Primary keys

Each tuple needs a primary key for indexing by the database. The primary key can be an increasing integer assigned by the database (a +db-assigned-id+), an object assigned by the user (a +user-defined-id+), a random number, or even a compound key consisting of multiple values used together as a key.

Database defined primary keys are automatically set on the tuple after insertion. While SQLite makes this feature easy to implement with the sqlite3_last_insert_rowid library call, PostgreSQL lacks such a feature and instead, in the backend, the inserts are done through a SQL function that queries the most recently inserted row.

User-assigned ids can be anything that the programmer knows will be unique for each object in the table. The same is true for compound keys, but only one of the values has to differ in this case for it to be unique.

Sometimes, a randomly generated id is useful, and the database library makes it easy to associate a tuple with its random key. By default, it generates a 64-bit random integer and in the unlikely event that it collides with an existing entry, it tries again up to ten times. The random integer is then set in the primary key slot of the tuple.

Database types

Data should have a type to allow the database to optimize its queries and storage. The supported data types are booleans, varchars, text, integers, big-integers, doubles, reals, timestamps, byte-arrays, URLs, and Factor blobs, which store arbitrary Factor objects. Types can have default values, unique qualifiers, and not-null restrictions. The database framework does all of the marshalling of objects for you transparently.

Creating and dropping tables

There are quite a few options when it comes to creating tables. The most basic is the create-table word. The problem is that it throws an exception when the table exists, which happens often. So one alternative is ensure-table -- we make sure a table exists and silently ignore errors. However, if we're just developing an application and we want to make sure the table is always the latest version of the code, we might use recreate-table, which drops and creates the table without throwing errors, and of course drops all the data with it. To simply drop a table, use the drop-table word.

Inserting tuples

Tuples are inserted one at a time with the insert-tuple word. SQL insert commands are generated directly from the Factor object and its filled-in slots. An example will follow.

Selecting tuples

Select statements are generated from exemplar tuples, or tuples with certain slots filled in with any value besides f. Passing an empty exemplar tuple will select all tuples from the table and return them as a sequence. A useful feature we support is querying by ranges, sequences, or intervals, as shown in following demonstration.

Exams demo

About the simplest example of interest is one that matches students names with their grades on a particular exam. Here's aa exam tuple, its mapping to the database, and a utility word to generate random exam objects.

USING: math.ranges random db.types ;

TUPLE: exam id name score ;

exam "EXAM" {
{ "id" "ID" +db-assigned-id+ }
{ "name" "NAME" TEXT }
{ "score" "SCORE" INTEGER }
} define-persistent

: random-exam ( -- exam )
f
6 [ CHAR: a CHAR: z [a,b] random ] replicate >string
100 [0,b] random
exam boa ;

I'll use a trick for opening a database to use it interactively without putting a with-db wrapper around every call.

"my-database.db" temp-file <sqlite-db> db-open db set

Note that the database handle should be cleaned up with db get dispose if you don't want to leak memory.

Let's create the table and add some random exams to the database:

exam create-table

25 [ random-exam insert-tuple ] times

Now to demonstrate selects. You can select all the exams:

T{ exam } select-tuples .
{
T{ exam { id 1 } { name "qklkzk" } { score 38 } }
T{ exam { id 2 } { name "feeuwv" } { score 38 } }
T{ exam { id 3 } { name "hlzwuu" } { score 51 } }
T{ exam { id 4 } { name "liiptp" } { score 52 } }
T{ exam { id 5 } { name "mwzlmv" } { score 74 } }
...
}

Ranges can be used with any datatype that makes sense, like integers or timestamps.

A range of passing exams:

T{ exam } 70 100 [a,b] >>score select-tuples .
{
T{ exam { id 5 } { name "mwzlmv" } { score 74 } }
T{ exam { id 6 } { name "ftxhuw" } { score 90 } }
T{ exam { id 11 } { name "rorbyd" } { score 83 } }
T{ exam { id 16 } { name "ttvkar" } { score 78 } }
T{ exam { id 17 } { name "nnzkvs" } { score 99 } }
T{ exam { id 21 } { name "izoiqi" } { score 83 } }
}

You can search for specific grades by using a sequence:

T{ exam } { 10 20 30 40 50 60 70 80 90 100 } >>score select-tuples .
{
T{ exam { id 6 } { name "ftxhuw" } { score 90 } }
T{ exam { id 9 } { name "bdoztd" } { score 30 } }
T{ exam { id 20 } { name "lnmxfq" } { score 60 } }
}

An interval query. Nobody should have above a 100:

USE: math.intervals T{ exam } 100 1/0. (a,b) >>score select-tuples .
{ }

Let's order by the grade:

<query> T{ exam } >>tuple "score" >>order select-tuples .
...
T{ exam { id 21 } { name "izoiqi" } { score 83 } }
T{ exam { id 6 } { name "ftxhuw" } { score 90 } }
T{ exam { id 17 } { name "nnzkvs" } { score 99 } }
}

Or by (random-generated) name:

<query> T{ exam } >>tuple "name" >>order select-tuples .
{
T{ exam { id 13 } { name "aagnof" } { score 61 } }
T{ exam { id 36 } { name "ailftz" } { score 41 } }
T{ exam { id 9 } { name "bdoztd" } { score 30 } }
...

Updates and deletes are also by example and may use sequences and intervals in the 'where' clause in the same way as the selects with the update-tuples and delete-tuples words.

Raw SQL

You can drop down into SQL if you want to do anything complex or things that are not supported by the library. A drawback is that the SQL you write may not work across SQL databases. Notice that, for a select, you have to convert the data to Factor tuples yourself since the data is returned as an array of strings.

"select * from exam where name like '%a%'" sql-query .
{
{ "8" "rxoyga" "53" }
{ "10" "fchhha" "1" }
{ "13" "aagnof" "61" }
{ "16" "ttvkar" "78" }
{ "22" "yhqkav" "28" }
}

Raw SQL that does not return results is possible too:

"delete from exam where name like '%a%'" sql-command
"select * from exam where name like '%a%'" sql-query length .
0

Another tutorial

Here is a tutorial from the database documentation that shows moref basic usage of the library. If you run this from within the Factor environment itself, you can click on the code blocks and run them one by one without copy/pasting or typing them in yourself.

Real-world usage

The Factor Wiki uses the database library, as does Factor's web framework, Furnace. Expect to see more useful websites and applications using it in the future.

Lastly, if anyone can suggest a catchy name for the database library, please let me know.

Wednesday, September 03, 2008

Vocabulary and documentation tool: tools.scaffold

The Factor project is on a documentation binge until all of the core and basis vocabularies are documented. I noticed, as anyone tasked to write a bunch of documenation would, that quite a bit of it could be automated. To begin writing docs for a vocabulary that already exists, you can just run the tool in it and edit the generated help file.

I actually ended up writing two tools -- one for the docs, and another to create new vocabularies with a simple command.

Scaffold tool to create new vocabularies


First, an aside. Factor's module system is based on a directory layout, with several vocabulary roots which are stored in the vocab-roots symbol.
( scratchpad ) USE: vocabs.loader vocab-roots get .
V{ "resource:core" "resource:basis" "resource:extra" "resource:work" }
Knowing this, you can now create a new vocabulary and boilerplate empty Factor files from inside of Factor and click the links to edit them in your favorite editor.
( scratchpad ) "extra" "alchemy" scaffold-vocab
Creating scaffolding for P" resource:extra/alchemy/alchemy.factor"
Creating scaffolding for P" resource:extra/alchemy/alchemy-tests.factor"
Creating scaffolding for P" resource:extra/alchemy/authors.txt"

( scratchpad ) "extra" "alchemy" scaffold-help
Creating scaffolding for P" resource:extra/alchemy/alchemy-docs.factor"
The scaffold tool is programmed not to overwrite files if they already exist.

Scaffold tool for documenation



Say I have a Factor word that turns lead into gold:
: lead>gold ( lead -- gold ) ... ;

The implementation of this word is left as an exercise to the reader. We don't have to understand how it works to document what it does.

Without an automation tool, you would begin by creating a new file (extra/alchemy/alchemy-docs.factor) in the same directory as your code file (extra/alchemy/alchemy.factor). What follows is the standard bolierplate for documentation. Notice that documenation goes in the same vocabulary as your actual code, but comes from different files on disk.

! Copyright (C) 2008 Doug Coleman.
! See http://factorcode.org/license.txt for BSD license.
USING: help.markup help.syntax ;
IN: alchemy
Every doc file has such a header based on the date, who you are, and the vocabulary name. With those three pieces of information, the above can be auto-generated. I added a new symbol developer-name in tools.scaffold that should be set to your name when writing docs.

Now let's look at the documenation for lead>gold:
HELP: lead>gold
{ $values { "lead" "boring lead" } { "gold" "shiny gold" } }
{ $description "Turns lead into gold." } ;


Notice that we know the word name, the first strings in each stack effect in the $values array, and the markup structure.

Here is the generated scafford:
HELP: lead>gold
{ $values { "lead" null } { "gold" null } }
{ $description } ;
This is much less wrist punishment than typing manually. The scaffold tool can even generate the documentation with their Factor types given correct code and standard type names like "quot" and "hashtable". Types that are not known are given the null class. Documenation correctness tools can later search for null objects and report them as errors, since no words should operate on this type.

I'll finish by including a snippet of the output of running the scaffold help generator on itself.
! Copyright (C) 2008 Doug Coleman.
! See http://factorcode.org/license.txt for BSD license.
USING: arrays help.markup help.syntax io.streams.string kernel strings words ;
IN: tools.scaffold

HELP: check-root
{ $values
{ "string" string }
{ "string" string } }
{ $description } ;

HELP: check-scaffold
{ $values
{ "vocab-root" "a vocabulary root string" } { "string" string }
{ "vocab-root" "a vocabulary root string" } { "string" string } }
{ $description } ;

HELP: check-vocab-name
{ $values
{ "string" string }
{ "string" string } }
{ $description } ;

HELP: developer-name
{ $description } ;

HELP: help.
{ $values
{ "word" word } }
{ $description } ;

HELP: lookup-type
{ $values
{ "string" string }
{ "object/string" null } { "?" "a boolean" } }
{ $description } ;

HELP: main-file-string
{ $values
{ "vocab" "a vocabulary specifier" }
{ "string" string } }
{ $description } ;

HELP: not-a-vocab-root
{ $values
{ "string" string } }
{ $description } ;

HELP: not-a-vocab-root?
{ $values
{ "object" object }
{ "?" "a boolean" } }
{ $description } ;

HELP: root?
{ $values
{ "string" string }
{ "?" "a boolean" } }
{ $description } ;

HELP: scaffold-authors
{ $values
{ "path" "a pathname string" } }
{ $description } ;

HELP: scaffold-copyright
{ $description } ;

HELP: tests-file-string
{ $values
{ "vocab" "a vocabulary specifier" }
{ "string" string } }
{ $description } ;

HELP: using
{ $description } ;

HELP: vocab>scaffold-path
{ $values
{ "vocab-root" "a vocabulary root string" } { "string" string }
{ "path" "a pathname string" } }
{ $description } ;

ARTICLE: "tools.scaffold" "tools.scaffold"
;

ABOUT: "tools.scaffold"
I hope this new vocabulary makes documenation a lot less tedious and error-prone. Now to go write some docs...