Filling a file with zeros by seeking
The best way of writing a file full of zeros is to seek to one byte before the desired file size, write a single zero byte, and close the file. Here's the code:

: (zero-file) ( n path -- )
    binary
    [ 1 - seek-absolute seek-output 0 write1 ] with-file-writer ;

ERROR: invalid-file-size n path ;

: zero-file ( n path -- )
    {
        { [ over 0 < ] [ invalid-file-size ] }
        { [ over 0 = ] [ nip touch-file ] }
        [ (zero-file) ]
    } cond ;

The first thing you'll notice about zero-file is that we special-case negative and zero file sizes. Special-casing a zero file size is necessary to avoid seeking to -1, which does everything correctly but throws an error in the process instead of returning normally. Special-casing negative file sizes is important because a negative size is always an error, and without the check the file system can become littered with zero-length files that are created before the exception is thrown, even though the operation fails overall.

To call the new word:
123,456,789 "/Users/erg/zeros.bin" zero-file
"/Users/erg/zeros.bin" file-info size>> .
123456789
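As a quick sanity check -- my own snippet, not from the original code, assuming the zero-file word above is in scope and using a hypothetical /tmp path -- you can read a small zero-filled file back and confirm every byte is zero:

USING: io.encodings.binary io.files prettyprint sequences ;
16 "/tmp/small-zeros.bin" zero-file
"/tmp/small-zeros.bin" binary file-contents [ 0 = ] all? .
t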
Copying a zero-stream
With Factor's stream protocol, you can write new kinds of streams that, when read from or written to, do whatever you want. I wrote a read-only zero-stream below that returns zeros whenever you read from it. Wrapping a limit-stream around it, you can give the inexhaustible zero-stream an artificial length, so that copying it reaches an end and terminates.

TUPLE: zero-stream ;

C: <zero-stream> zero-stream

! stream-read receives ( n stream ); dropping the stream leaves n,
! and a fresh <byte-array> of length n is already all zeros.
M: zero-stream stream-read drop <byte-array> ;
M: zero-stream stream-read1 drop 0 ;
M: zero-stream stream-read-partial stream-read ;
M: zero-stream dispose drop ;

:: zero-file2 ( n path -- )
    <zero-stream> n limit-stream
    path binary <file-writer> stream-copy ;

The drawback to this approach is that it creates 8kb byte-arrays in memory -- one per chunk that stream-copy requests -- that it immediately writes to disk.
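Calling it has the same shape as before (the path is again hypothetical):

123,456,789 "/Users/erg/zeros2.bin" zero-file2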
Setting the contents of a file directly
Using the set-file-contents word, you can just assign a file's contents to be a sequence. However, this sequence has to fit into memory, so this solution is not as good for our use case:

:: zero-file3 ( n path -- )
    n <byte-array> path binary set-file-contents ;
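Usage is again a one-liner (hypothetical path); just keep in mind that the whole n-byte array is allocated up front:

1,000,000 "/tmp/zeros3.bin" zero-file3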
Bonus: writing random data to a file
The canonical way of copying random data to a file on Unix systems is to use the dd tool to read from /dev/urandom and write to a file. But what about on Windows, where there is no /dev/urandom? We can come up with a cross-platform solution that uses method number two from above, but instead of a zero-stream, we have a random-stream. But then what about efficiency? Well, it turns out that Factor's Mersenne Twister implementation generates random numbers faster than /dev/urandom on my MacBook -- writing a 100MB file from /dev/urandom is about three times as slow as the Factor-only solution. So not only is the Factor solution cross-platform, it's also more efficient.

TUPLE: random-stream ;

C: <random-stream> random-stream

! Like zero-stream, but n random bytes instead of n zeros.
M: random-stream stream-read drop random-bytes ;
M: random-stream stream-read1 drop 256 random ;
M: random-stream stream-read-partial stream-read ;
M: random-stream dispose drop ;
:: stream-copy-n ( from to n -- )
    from n limit-stream to stream-copy ;

:: random-file ( n path -- )
    <random-stream> path binary <file-writer> n stream-copy-n ;
! Read from /dev/urandom by installing the system RNG as the
! current random number generator for the duration of the copy.
:: random-file-urandom ( n path -- )
    [
        <random-stream> path
        binary <file-writer> n stream-copy-n
    ] with-system-random ;

Here are the results:
$ dd if=/dev/urandom of=here.bin bs=100000000 count=1
1+0 records in
1+0 records out
100000000 bytes transferred in 17.384370 secs (5752294 bytes/sec)
100,000,000 "there.bin" random-file
Running time: 5.623136439 seconds
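For reference, Factor's time combinator from the tools.time vocabulary prints a "Running time:" line like the one above, so the measurement was presumably taken with something like:

USING: tools.time ;
[ 100,000,000 "there.bin" random-file ] time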
Conclusion
Since Factor has high-level libraries that wrap the low-level libc and system calls used for nonblocking i/o, we don't have to deal with platform-specific quirks like handling EINTR, checking error codes, or cleaning up resources at the operating-system level. When a call is interrupted and errno is set to EINTR after it returns, the i/o operation is simply retried behind the scenes, and only serious i/o errors get thrown. There are many options for correct resource cleanup should an error occur, but the error handling we need here is already incorporated into the stream-copy and with-file-writer words -- resources are cleaned up regardless of what happens. We also demonstrated that a Factor word is preferable to a shell script or the dd command for making files full of random data, because it's more portable and faster, and that custom streams are easy to define.

Finally, there's actually a faster way to create huge files full of zeros, and that's by using sparse files. Sparse files can start off using virtually no file-system blocks, but can appear to be as large as you wish, and only start to consume more blocks as parts of the file are written. However, support for this is file-system dependent and, overall, sparse files are of questionable use. On Unix file-systems that support sparse files, the first method above should automatically create them with no extra work. Note that on MacOSX, sparse file-systems are supported but not enabled by default. On Windows, however, you have to make a call to DeviceIoControl. If someone wants to make a small contribution to the Factor project, they are welcome to implement creation of sparse files for Windows.

Edit: Thanks to one of the commenters, I rediscovered that there's a Unix syscall, truncate, that creates zero-filled files in constant time on my Mac. This is indeed the best solution for making files full of zeros, and although it isn't portable by itself, a Factor library would have no problem using a hook on the os variable to call truncate on Unix and another method on Windows.
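To make that concrete, here's a minimal sketch of the hook-based approach -- my own code, not an existing Factor library; the vocabulary name, the FFI binding, the error check, and the Windows fallback to the zero-file word from the first section are all assumptions:

USING: alien.c-types alien.strings alien.syntax io.backend
io.files kernel math system ;
IN: zero-files

! Bind the libc truncate(2) call. Extending a file with truncate
! makes the new region read back as zeros in constant time.
! FUNCTION-ALIAS: sidesteps a name clash with math's truncate.
FUNCTION-ALIAS: posix-truncate
    int truncate ( c-string path, longlong length )

HOOK: zero-file-fast os ( n path -- )

! truncate(2) does not create missing files, so touch first.
M: unix zero-file-fast
    dup touch-file
    normalize-path swap posix-truncate
    0 < [ "truncate failed" throw ] when ;

! No truncate(2) on Windows; fall back to the seek-and-write
! method from the first section.
M: windows zero-file-fast zero-file ;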