Thursday, December 13, 2007

Sorting Filenames Containing Numbers

Jeff Atwood articulated a problem I usually run into when looking at files in Explorer. If you have a bunch of files with numbers, file "a100.txt" will come before file "a2.txt" when sorted by name. Of course, two is less than 100, so you'd expect "a2.txt" to come first.

Here's a solution in Factor:

: human-sort ( seq -- newseq )
[ [ digit? ] cut3 >r string>number r> 3array ] map natural-sort
[ first3 >r number>string r> 3append ] map ;

{ "a100.txt" "a10.txt" } human-sort .
{ "a10.txt" "a100.txt" }


The algorithm is simple: split a filename at the first number, convert that from a string to a number, and let Factor's natural-sort do the rest.

It maps over the sequence, { "a100.txt" "a10.txt" }, and cuts it at the first number inside each element so, after cut3, you get the result on the stack like "a" "100" ".txt", and "a" "10" ".txt". You dip under the top element and convert the to a number, >r string>number r>, and 3array to make a sequence of sequences { { "a" 100 ".txt" } { "a" 10 ".txt" } }. You then call natural-sort on this. Natural-sort knows how to compare element by element, so seeing the "a" "a" is equal it moves onto the next comparison, 100 10. Now that it's in the right order, you still need to turn your sequence back into strings. natural-sort returned { { "a" 10 ".txt" } { "a" 100 ".txt" } }, so map (do something to each element in the sequence) over this and do first3 >r number>string r> 3append, which explodes the array, converts back to a string, and appends three strings together to get the filenames. The dot at the end prints it.

I had to write cut3, but it's a generally useful word for partitioning a sequence that I put into sequences.lib.


: cut3 ( seq pred -- head match tail )
2dup find drop [
rot swap cut rot [ not ] compose
dupd find drop [ cut ] [ f ] if*
] [
drop nip f like f f
] if* ;


UPDATE:

I generalized this to work on { "a100b200.txt" "a100b2.txt" } with some suggestions from Slava. If you understood the previous code, this should be pretty understandable too. It keeps calling cut3 until it returns no matches and accumulates the result in a sequence, all the while applying a quotation to whatever matched the predicate. It then compares with natural-sort like before, and transforms these sequences back into filenames. cut3 is cleaned up in this version. I ended up with two general words, cut3 and cut-all that can be put into a library, sequences.lib, and used elsewhere.


: cut-find ( seq pred -- before after )
dupd find drop dup [ cut ] when ;

: cut3 ( seq pred -- first mid last )
[ cut-find ] keep [ not ] compose cut-find ;

: (cut-all) ( seq pred quot -- )
[ >r cut3 r> dip >r >r , r> [ , ] when* r> ] 2keep
pick [ (cut-all) ] [ 3drop ] if ;

: cut-all ( seq pred quot -- seq )
[ (cut-all) ] { } make ;


: human-sort ( seq -- newseq )
[ [ digit? ] [ string>number ] cut-all ] map natural-sort
[ [ dup string? [ number>string ] unless ] map concat ] map ;

{ "a100b2.txt" "a100b200.txt" } human-sort .
{ "a100b2.txt" "a100b200.txt" }


UPDATE TWO:

Here is a much better algorithm that doesn't reconstruct the original filenames and thus can handle leading zeros. It constructs keys that can be sorted by sort-values or sort-keys (sort-values is just as efficient here and it lets you use dup in the map instead of a [ ] keep, which makes the code slightly more legible).

: human-sort ( seq -- newseq )
[ dup [ digit? ] [ string>number ] cut-all ] { } map>assoc
sort-values keys ;

{ "000a.txt" "00a.txt" } human-sort .
{ "00a.txt" "000a.txt" }