Tuesday, January 13, 2009

Files and file-systems in Factor, part 2

In my previous post I wrote about the file-info and file-system-info words. Using those words, it's possible to write a portable program for directory listing (dir or ls) and one for file-systems listing (df). Uses for directory listing include file-system utility programs and FTP servers.

File listing tool

The file-listing tool is in the Factor git repository as basis/tools/files/files.factor. Which slots you see depend on how it is configured and on which platform it's running. Directory listings can be sorted by slot and the default sort is by name.

On Unix platforms, it's configured like this:
    <listing-tool>
{ permissions nlinks user group file-size file-date file-name } >>specs
{ { directory-entry>> name>> <=> } } >>sort
Listing a directory on MacOSX:
-rw-r--r-- 1  erg staff 27185 Nov 28  2008 #factor.el#
drwxr-xr-x 6 erg staff 204 Nov 17 2008 Factor.tmbundle
-rw-r--r-- 1 erg staff 30282 Nov 30 2008 factor.el
-rw-r--r-- 1 erg staff 18037 Nov 17 2008 factor.vim
-rw-r--r-- 1 erg staff 12496 Nov 17 2008 factor.vim.fgen
drwxr-xr-x 26 erg staff 884 Jan 13 11:17 fuel
drwxr-xr-x 8 erg staff 272 Nov 17 2008 icons
Windows platforms take the look of the ``dir'' command by default:
2009-01-14 00:00:53 <DIR>                Factor.tmbundle
2009-01-14 00:00:53 30282 factor.el
2009-01-14 00:00:53 18037 factor.vim
2009-01-14 00:00:53 12496 factor.vim.fgen
2009-01-14 00:00:53 <DIR> fuel
2009-01-14 00:00:53 <DIR> icons

File-systems tool

A tool for disk usage and mounted devices was easy to write once file-system tuples worked everywhere. Once again, it's configurable for whatever you want to see. The default file-system word is:
: file-systems. ( -- )
{
device-name available-space free-space used-space
total-space percent-used mount-point
} print-file-systems ;
File-systems on a Mac:
device-name   available-space free-space   used-space   total-space  percent-used mount-point
/dev/disk0s2 118599725056 118861869056 200867090432 319728959488 62 /
fdesc 0 0 1024 1024 100 /dev
fdesc 0 0 1024 1024 100 /dev
map -hosts 0 0 0 0 0 /net
map auto_home 0 0 0 0 0 /home

On Windows:
device-name      available-space free-space used-space total-space percent-used mount-point
3814506496 3814506496 6911225856 10725732352 64 C:\
VBOXADDITIONS_2. 0 0 27983872 27983872 100 D:\
0 0 0 0 0 A:\

Friday, January 09, 2009

Files and file-systems in Factor, part 1

Factor now has an easy way access to get information about files and file-systems in a high-level way across all the platforms that it supports. The API is really simple -- pass a pathname and get information back about the file or file-system as a tuple. The second part of this post will demonstrate a clone of the Unix tools ls, for listing files, and df, for listing file-systems.

File-info

There are now words to get information about files and symlinks, using file-info and link-info which map to C system calls stat and lstat. Some slots are shared across all platforms while others are only present on a particular platform. There are symbols representing all of the file types, like +regular-file+, +directory+, and +symbolic-link+. Here are some examples.
MacOSX
( scratchpad ) "resource:license.txt" file-info .
T{ bsd-file-info
{ type +regular-file+ }
{ size 1252 }
{ permissions 33188 }
{ created
T{ timestamp
{ year 2008 }
{ month 11 }
{ day 17 }
{ hour 23 }
{ minute 34 }
{ second 5 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ modified
T{ timestamp
{ year 2008 }
{ month 11 }
{ day 17 }
{ hour 23 }
{ minute 34 }
{ second 5 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ accessed
T{ timestamp
{ year 2008 }
{ month 12 }
{ day 9 }
{ hour 12 }
{ minute 34 }
{ second 8 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ uid 501 }
{ gid 20 }
{ dev 234881026 }
{ ino 992362 }
{ nlink 1 }
{ rdev 0 }
{ blocks 8 }
{ blocksize 4096 }
{ birth-time
T{ timestamp
{ year 2008 }
{ month 11 }
{ day 17 }
{ hour 23 }
{ minute 34 }
{ second 5 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ flags 0 }
{ gen 0 }
}
Windows XP
( scratchpad ) "resource:license.txt" file-info .
T{ windows-file-info
{ type +regular-file+ }
{ size 1252 }
{ permissions 32 }
{ created
T{ timestamp
{ year 2008 }
{ month 3 }
{ day 23 }
{ hour 23 }
{ minute 28 }
{ second 12 }
}
}
{ modified
T{ timestamp
{ year 2008 }
{ month 3 }
{ day 27 }
{ hour 23 }
{ minute 24 }
{ second 12 }
}
}
{ accessed
T{ timestamp
{ year 2008 }
{ month 9 }
{ day 19 }
{ hour 23 }
{ minute 8 }
{ second 41 }
}
}
{ attributes { +archive+ } }
}
FreeBSD
( scratchpad ) "resource:license.txt" file-info .
T{ bsd-file-info
{ type +regular-file+ }
{ size 1252 }
{ permissions 33188 }
{ created
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 53 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ modified
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 53 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ accessed
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 59 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ uid 1002 }
{ gid 1002 }
{ dev 89 }
{ ino 343452 }
{ nlink 1 }
{ rdev 1348575 }
{ blocks 4 }
{ blocksize 4096 }
{ birth-time
T{ timestamp
{ year 2008 }
{ month 4 }
{ day 6 }
{ hour 12 }
{ minute 6 }
{ second 53 }
{ gmt-offset T{ duration { hour -6 } } }
}
}
{ flags 0 }
{ gen 0 }
}

File Systems

The file-system utility word above works on file-system tuples that contain cross-platform information like the device name, the mount point, the number of free, used, and total bytes. A file-system tuple on Unix has all of the file-system information found in both statfs and statvfs while a Windows file-system object has the device-id, volume name, and byte usage slots. There is not a single win32 API call that gives as much information as on Unix systems -- instead I call a combination of GetDiskFreeSpaceEx, FindFirstVolume, GetVolumePathNamesForVolumeName, GetVolumeInformation.
On every Unix besides Linux, there is a member of the statfs or statvfs structure that gives you the file-system that contains the file. So, I had to roll my own for it to work the same way across all platforms. The algorithm is pretty simple: the follow-links word follows links up to 10 times (configurable) and once it stops, finds the parent directory and follows the links again until the directory is a member of the directories in the /etc/mtab file. If there is circularity or a broken link, it throws an error.

FreeBSD
( scratchpad ) "/" file-system-info .
T{ freebsd-file-system-info
{ device-name "/dev/da0s1a" }
{ mount-point "/" }
{ type "ufs" }
{ available-space 368117760 }
{ free-space 409702400 }
{ used-space 110110720 }
{ total-space 519813120 }
{ block-size 2048 }
{ preferred-block-size 2048 }
{ blocks 253815 }
{ blocks-free 200050 }
{ blocks-available 179745 }
{ files 65790 }
{ files-free 61501 }
{ files-available 61501 }
{ name-max 255 }
{ flags 20480 }
{ id { 0 0 } }
{ version 537068824 }
{ io-size 16384 }
{ owner 0 }
{ syncreads 0 }
{ syncwrites 0 }
{ asyncreads 0 }
{ asyncwrites 0 }
}
Windows XP 64
( scratchpad ) "k:\\" file-system-info .
T{ win32-file-system-info
{ device-name "" }
{ mount-point "k:\\" }
{ type "NTFS" }
{ available-space 174530142208 }
{ free-space 174530142208 }
{ used-space 225547444224 }
{ total-space 400077586432 }
{ max-component 255 }
{ flags 459007 }
{ device-serial 3695676537 }
}
The Factor build farm will start using file-system-info to report when a drive fills up pretty soon.

Sorting by tuple slots

Factor's sorting has been extended to support sorting tuples by multiple slots, one after the other. In a music player application, you may wish to sort your songs first by artist, then by album, then track number. If you had a tuple representing every song, the code for such a sort is now very easy:
{ { artist>> <=> } { album>> <=> } { track>> <=> } } sort-by-slots
The main word at work is sort-by-slots ( seq sort-specs -- seq' ), where a sort-spec is an accessor paired with a comparator, one of <=>, >=<, human-<=>, human->=<. Human sort first converts consecutive digits to integers and then makes the comparison, e.g. { "a1" "a10" "a03" "a2" } sorts as { "a1" "a2" "a03" "a10" } instead of { "a03" "a1" "a10" "a2" }, as it would with natural-sort.

Sorting in reverse order is possible with the new operator >=<, which inverts the result of the <=> comparator. To sort a playlist by the most played songs in reverse order (most played first):
{ { play-count>> >=< } } sort-by-slots

Source code for sort-by-slots

: slot-comparator ( accessor comparator -- quot )
'[ [ _ execute ] bi@ _ execute dup +eq+ eq? [ drop f ] when ] ;

MACRO: compare-slots ( sort-specs -- <=> )
#! sort-spec: { accessor comparator }
[ first2 slot-comparator ] map '[ _ 2|| +eq+ or ] ;

: sort-by-slots ( seq sort-specs -- seq' )
'[ _ compare-slots ] sort ;
First, a bit about how Factor's sort ( seq quot -- sortedseq ) word works. It expects a quotation that compares two objects lexicographically, which is element by element in dictionary order. Thus, the macro compare-slots expands the sort-specs into a quotation that compares tuples slot-by-slot until there is a difference. The macro-expansion for the first example looks like this:
( scratchpad ) [ { { artist>> <=> } { album>> <=> } { track>> <=> } } compare-slots ] expand-macros .
f 2 [
\ artist>> \ <=> [ [ execute ] curry bi@ ] dip execute
dup +eq+ eq? [ drop f ] when
] [ [ drop ] dip ndup ] dip call dup
[ 2 [ ndrop ] curry dip ] [
2 [
\ album>> \ <=> [ [ execute ] curry bi@ ] dip execute
dup +eq+ eq? [ drop f ] when
] [ [ drop ] dip ndup ] dip call dup
[ 2 [ ndrop ] curry dip ] [
2 [
\ track>> \ <=> [ [ execute ] curry bi@ ] dip
execute dup +eq+ eq? [ drop f ] when
] [ [ drop ] dip ndup ] dip call dup
[ 2 [ ndrop ] curry dip ]
[ 2 [ drop ] dip ndrop t [ f ] [ no-cond ] if ] if
] if
] if +eq+ or

The macro first extracts the slots and compares with the first comparator. If the comparator returns +lt+ or +gt+ then the comparison is over, but if the comparator returns +eq+ then the next comparison is invoked and the sort continues in the same way. Notice that the slot-comparator word replaces +eq+ with f to avoid short-circuiting the iteration done by 2||. The final sort-by-slots word is a trivial call to sort with the new comparator word we just defined.

Algorithmic complexity and ideas for improvement

The sorting bound is still O(n log n) for time, but of course the more comparators that you need to break ties, the longer the algorithm will take to run.
If your data has to be sorted anyway, it's possible to sort and then split the data into related slices using a sequence of accessors like the above. You could then compute the playing time of every album, or find your most-played song from every artist. Splitting with the same accessors takes O(n) time, but the odds are that you probably won't have already sorted the data. So for unsorted data, there is a more efficient way to extract features -- you can instead partition it in O(n) time and find features on each partition, avoiding sorting altogether. This will be the subject of a future blog post.