Saturday, September 04, 2010

Filling a file with zeros

In this blog post I'll demonstrate a few ways to fill a file with zeros in Factor. The goal is to write some number of bytes to a file in the least amount of time while using only a small amount of RAM; writing a large file should not fail.

Filling a file with zeros by seeking

The best way of writing a file full of zeros is to seek to one byte from the end of the file, write a zero, and close the file. Here's the code:
: (zero-file) ( n path -- )
    binary
    [ 1 - seek-absolute seek-output 0 write1 ] with-file-writer ;

ERROR: invalid-file-size n path ;

: zero-file ( n path -- )
    {
        { [ over 0 < ] [ invalid-file-size ] }
        { [ over 0 = ] [ nip touch-file ] }
        [ (zero-file) ]
    } cond ;
The first thing you'll notice about zero-file is that we special-case negative and zero file sizes. Special-casing a zero file length is necessary to avoid seeking to -1, which does everything correctly but throws an error in the process instead of returning normally. Special-casing negative file sizes is important because a negative size is always an error, and although the operation fails overall, the file system can become littered with zero-length files that are created before the exception is thrown.
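
For readers who don't know Factor, the same seek-and-write-one-byte idea can be sketched in Python. This is only an illustration of the technique, not a translation of the Factor library code; the name zero_file and the choice of ValueError are assumptions made for the example.

```python
def zero_file(n, path):
    """Create a file of n zero bytes by seeking (illustrative sketch)."""
    if n < 0:
        # Reject the size before opening the file, so that no stray
        # zero-length file is left behind on the file system.
        raise ValueError(f"invalid file size: {n}")
    with open(path, "wb") as f:
        if n > 0:
            # Seek to the last byte and write a single zero; the skipped
            # range reads back as zeros.
            f.seek(n - 1)
            f.write(b"\0")
        # For n == 0, opening the file for writing already left it empty.
```

As in the Factor version, the negative case is checked before the file is opened, and the zero case never seeks to -1.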

To call the new word:
123,456,789 "/Users/erg/zeros.bin" zero-file
"/Users/erg/zeros.bin" file-info size>> .
123456789

Copying a zero-stream

With Factor's stream protocol, you can write new kinds of streams that, when read from or written to, do whatever you want. I wrote a read-only zero-stream below that returns zeros whenever you read from it. Wrapping a limit-stream around it, you can give the inexhaustible zero-stream an artificial length, so that copying it reaches an end and terminates.
TUPLE: zero-stream ;

C: <zero-stream> zero-stream

M: zero-stream stream-read drop <byte-array> ;
M: zero-stream stream-read1 drop 0 ;
M: zero-stream stream-read-partial stream-read ;
M: zero-stream dispose drop ;

:: zero-file2 ( n path -- )
    <zero-stream> n limit-stream
    path binary <file-writer> stream-copy ;
The drawback to this approach is that it creates 8 KB byte arrays in memory that it immediately writes to disk.
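
The shape of this approach, an endless source of zeros wrapped in a length limit and then copied chunk by chunk, can be sketched in Python. The classes ZeroStream and LimitedStream and the function zero_file2 are hypothetical names for this illustration; they are not part of any library, and the 8192-byte chunk size is simply chosen to match the figure mentioned above.

```python
import shutil

class ZeroStream:
    """Endless read-only source: every read returns n zero bytes."""
    def read(self, n):
        return bytes(n)

class LimitedStream:
    """Wrap a stream so that it reports end-of-stream after `limit` bytes."""
    def __init__(self, stream, limit):
        self.stream = stream
        self.remaining = limit

    def read(self, n):
        n = min(n, self.remaining)
        if n <= 0:
            return b""  # an empty read tells the copier to stop
        self.remaining -= n
        return self.stream.read(n)

def zero_file2(n, path, chunk_size=8192):
    """Copy n zero bytes into path, one chunk at a time."""
    with open(path, "wb") as out:
        shutil.copyfileobj(LimitedStream(ZeroStream(), n), out, chunk_size)
```

Memory use stays bounded by the chunk size regardless of n, at the cost of allocating a fresh buffer for each chunk.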

Setting the contents of a file directly

Using the set-file-contents word, you can just assign a file's contents to be a sequence. However, this sequence has to fit into memory, so this solution is not as good for our use case.
:: zero-file3 ( n path -- )
    n <byte-array> path binary set-file-contents ;

Bonus: writing random data to a file

The canonical way of copying random data to a file on Unix systems is to use the dd tool to read from /dev/urandom and write to a file. But what about Windows, where there is no /dev/urandom? We can come up with a cross-platform solution that uses method number two from above, but with a random-stream instead of a zero-stream. But then what about efficiency? Well, it turns out that Factor's Mersenne Twister implementation generates random numbers faster than /dev/urandom on my MacBook: in the timings below, writing a 100MB file from /dev/urandom takes about three times as long as the Factor-only solution. So not only is the Factor solution cross-platform, it's also more efficient.
TUPLE: random-stream ;

C: <random-stream> random-stream

M: random-stream stream-read drop random-bytes ;
M: random-stream stream-read1 drop 256 random ;
M: random-stream stream-read-partial stream-read ;
M: random-stream dispose drop ;

:: stream-copy-n ( from to n -- )
    from n limit-stream to stream-copy ;

:: random-file ( n path -- )
    <random-stream>
    path binary <file-writer> n stream-copy-n ;

! Read from /dev/urandom
:: random-file-urandom ( n path -- )
    [
        <random-stream> path
        binary <file-writer> n stream-copy-n
    ] with-system-random ;
Here are the results:
$ dd if=/dev/urandom of=here.bin bs=100000000 count=1
1+0 records in
1+0 records out
100000000 bytes transferred in 17.384370 secs (5752294 bytes/sec)

100,000,000 "there.bin" random-file
Running time: 5.623136439 seconds
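
The same pattern is easy to sketch outside Factor. The Python function below (random_file is a hypothetical name for this illustration) writes n random bytes in chunks, using Python's default Mersenne Twister generator unless another generator is passed in. Passing random.SystemRandom() draws from the operating system's random source instead, which is roughly the role with-system-random plays above. No timing claims are made for this sketch.

```python
import random

def random_file(n, path, chunk_size=8192, rng=None):
    """Write n random bytes to path, one chunk at a time."""
    # random.Random is a Mersenne Twister; random.SystemRandom asks the OS.
    rng = rng if rng is not None else random.Random()
    remaining = n
    with open(path, "wb") as out:
        while remaining > 0:
            step = min(chunk_size, remaining)
            # Draw 8 * step random bits and pack them into exactly step bytes.
            out.write(rng.getrandbits(8 * step).to_bytes(step, "little"))
            remaining -= step
```

For example, random_file(100_000_000, "there.bin") would produce a file the same size as the one timed above.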

Conclusion

Since Factor has high-level libraries that wrap the low-level libc and system calls used for nonblocking i/o, we don't have to deal with platform-specific quirks at this level of abstraction, such as handling EINTR, checking error codes, or cleaning up resources at the operating system level. When a call is interrupted and returns with errno set to EINTR, the i/o operation is simply retried behind the scenes, and only serious i/o errors get thrown. There are many options for correct resource cleanup should an error occur, but the error handling code we used here is built into the stream-copy and with-file-writer words: resources are cleaned up regardless of what happens. We also demonstrated that a Factor word is preferable to a shell script or the dd command for making files full of random data because it's more portable and faster, and that custom streams are easy to define.
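
To make the retry-on-EINTR behavior concrete, here is a minimal Python sketch of what such a wrapper does. The helper name retry_on_eintr is hypothetical, and this is not Factor's actual implementation. Recent Python versions already retry interrupted system calls internally, so the function only illustrates the idea.

```python
def retry_on_eintr(func, *args):
    """Call func(*args), retrying for as long as it fails with EINTR."""
    while True:
        try:
            return func(*args)
        except InterruptedError:
            # InterruptedError is the OSError subclass raised for EINTR:
            # the call was interrupted, so simply try it again.
            continue
        # Any other exception is a real error and propagates to the caller.
```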

Finally, there's actually a faster way to create huge files full of zeros, and that's by using sparse files. Sparse files can start off using virtually no file-system blocks but appear to be as large as you wish, and they only start to consume more blocks as parts of the file are written. However, support for this is file-system dependent and, overall, sparse files are of questionable use. On Unix file systems that support sparse files, the first method above should automatically create them with no extra work. Note that on Mac OS X, sparse file-systems are supported but not enabled by default. On Windows, however, you have to make a call to DeviceIoControl. If someone wants to make a small contribution to the Factor project, they are welcome to implement creation of sparse files for Windows.
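
One way to check whether a file actually came out sparse is to compare its apparent size with the space allocated for it. The Python sketch below does that; allocation_report is a hypothetical helper name, and it assumes st_blocks counts 512-byte units, as it does on Linux and Mac OS X. The field is not available on Windows, in which case the function reports None for the allocated size.

```python
import os

def allocation_report(path):
    """Return (apparent_size, allocated_bytes or None) for path."""
    st = os.stat(path)
    # st_blocks is a Unix-only field; assume 512-byte units where present.
    blocks = getattr(st, "st_blocks", None)
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated
```

If the allocated figure is far smaller than the apparent size, the file system stored the file sparsely; whether that happens depends entirely on the file system.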

Edit: Thanks to one of the commenters, I rediscovered that there's a Unix syscall, truncate, that can extend a file with zeros to a given length in constant time on my Mac. This is indeed the best solution for making files full of zeros, and although it's unportable, a Factor library would have no problem using a hook on the OS variable to call truncate on Unix and another method on Windows.
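
As a rough illustration of the truncate approach, here is a Python sketch that sets a new file's length directly. The name zero_file_truncate is hypothetical. Python's documentation notes that the contents of the extended area depend on the platform, though on most systems the new bytes read back as zeros.

```python
def zero_file_truncate(n, path):
    """Create path with length n by truncating (extending) an empty file."""
    if n < 0:
        # Check before opening so no empty file is left behind.
        raise ValueError(f"invalid file size: {n}")
    with open(path, "wb") as f:
        # Resizing an empty file to n extends it without writing n bytes.
        f.truncate(n)
```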