Reducing the Size of Device Images

While device images can be helpful to keep, they can be a pain to store. A raw image must literally contain every byte of the original device, making it about as big as the device itself. However, storing unused disk space is pointless. By zeroing out unused filesystem space and then using fast compression on the resulting image, the compressed image will be about the same size as the used data on the filesystem. If a significant portion of the filesystem is unused, this can save a lot of space in the final image.

Making a device image is not difficult. Unix-based systems have long had the "dd" utility. For Windows, Vista introduced a built-in utility for creating image backups, and many third-party applications, like Partition Magic, provide this sort of functionality as well.

The biggest disadvantage of a raw device image is that space unused by the filesystem is still saved. If a 20 GB filesystem only has 4 GB of data on it, all 20 GB will still be saved in the image. If many images need to be stored (such as images from multiple devices, or images from multiple points in time for the same device), this isn't very space efficient. Users with space concerns often compress the images, but the unused space often contains ordinary data, since it was likely used at least once in the past as files were created, deleted, and moved. So, unfortunately, the unused portion of the filesystem usually compresses only a little better than the used portion. Since the contents of the unused portion of the filesystem are effectively arbitrary, it is undesirable to pay such high overhead to store them.

(Note that filesystem-based images do not have this problem. But these are more complicated and do not capture the full disk, which is necessary to preserve the MBR, partition boundaries, etc.)

However, if the unused space of the filesystem is filled with zeros before the image is compressed, the unused space will compress down to almost nothing. Compressing 4 GB of data plus 16 GB of zeros takes practically no more space than compressing the 4 GB of data alone.
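As a quick sanity check of that claim, here is a small standalone C++ demo (a sketch assuming zlib and its compress2/compressBound functions are available on your system) that compresses a buffer of zeros; even at the fastest level the output is a tiny fraction of the input:

    // zeros_demo.cpp -- show that a buffer of zeros compresses to almost nothing.
    // Build (assumes zlib is installed): g++ -O2 zeros_demo.cpp -lz
    #include <cstdio>
    #include <vector>
    #include <zlib.h>

    int main() {
        const uLong srcLen = 16UL * 1024 * 1024;          // 16 MB of zeros
        std::vector<Bytef> src(srcLen, 0);
        std::vector<Bytef> dest(compressBound(srcLen));   // worst-case output size
        uLongf destLen = (uLongf)dest.size();

        // Even Z_BEST_SPEED collapses long runs of zeros.
        if (compress2(dest.data(), &destLen, src.data(), srcLen, Z_BEST_SPEED) != Z_OK)
            return 1;
        std::printf("%lu bytes of zeros -> %lu bytes compressed\n",
                    (unsigned long)srcLen, (unsigned long)destLen);
        return 0;
    }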

Implementation

Zeroing out unused filesystem space is simple. First mount the filesystem to be imaged, then create a temporary file on the filesystem and fill it with zeros until either a) the filesystem runs out of space, or b) the filesystem's maximum file size limit is reached. In the latter case, continue creating and filling temporary files until the filesystem is full. Delete all the temporary files once you are finished. At this point practically all unused space has been allocated to one of the zero-filled files, and thus has had physical zeros written to it.

On a Unix/Linux system, the dd utility makes this easy. The following command:

$ dd if=/dev/zero of=/my/file/zero.tmp bs=16M

reads from the virtual device /dev/zero, which supplies an unlimited quantity of zeros, and writes the zeros to an output file, automatically terminating when the file cannot grow any more. The argument bs=16M is included to speed up the operation: by default dd reads and writes in chunks of 512 bytes, and the constant switching between read and write operations is very inefficient, making the process take tens of times longer.

I've written a quick platform-independent C++ program that will create files full of zeros until the files are as large as they can grow and no more files can be created. While "dd" is certainly more convenient, this should work on Windows systems and on filesystems that don't support sufficiently large files. Execute the program with one argument pointing to a path on the partition you want to zero-ize; with no argument it defaults to the current working directory.
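That program isn't reproduced here, but a minimal sketch of the same approach might look like this (the zeroN.tmp naming and the 16 MB write buffer are illustrative choices, not necessarily the original's):

    // zerofill.cpp -- fill free space with zero-filled temp files, then delete them.
    // Minimal sketch of the approach, not the original program.
    // Build: g++ -O2 -o zerofill zerofill.cpp
    #include <cerrno>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main(int argc, char* argv[]) {
        std::string dir = (argc > 1) ? argv[1] : ".";  // target path (default: cwd)
        std::vector<char> buf(16 * 1024 * 1024, 0);    // 16 MB of zeros per write
        std::vector<std::string> created;

        for (int i = 0; ; ++i) {
            std::string name = dir + "/zero" + std::to_string(i) + ".tmp";
            FILE* f = std::fopen(name.c_str(), "wb");
            if (!f) break;                             // can't create more files
            std::setvbuf(f, nullptr, _IONBF, 0);       // unbuffered: errors surface now
            created.push_back(name);

            bool disk_full = false;
            for (;;) {
                if (std::fwrite(buf.data(), 1, buf.size(), f) == buf.size())
                    continue;                          // full write, keep going
                // Short write: EFBIG means this file hit its size limit, so move
                // on to a new file; anything else is treated as a full filesystem.
                disk_full = (errno != EFBIG);
                break;
            }
            std::fclose(f);
            if (disk_full) break;
        }

        for (const std::string& name : created)        // free the space again
            std::remove(name.c_str());
        return 0;
    }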

Obviously, it may not be a good idea to perform this zero-ization operation on a filesystem that is in active use. After the filesystem is filled, but before the temporary file(s) are deleted, there will be almost no room to write data to disk. While this window of time will be very small, any applications (including the operating system) that need to write to disk may be denied the ability to do so, and since it is rare for applications to be denied write access to open file handles, their behavior may be unpredictable. In the majority of my own tests I have not encountered a problem, but a couple of times the system froze or slowed down noticeably until I deleted the temporary files. Just be careful; filling a live filesystem to the brim is not standard good practice.

There are at least a couple of technical issues that prevent the above "files with zeros" approach from completely zeroing out unused space on the filesystem.

  • Partially-used sectors will not have their unused portion zeroed out. But these sectors represent a negligible percentage of the total disk. They typically occur as the last sector of a file whose size isn't an integral multiple of the sector size, so each file wastes about half a sector (256 bytes for 512-byte sectors) on average. From quick empirical testing, a typical Windows 7 install will probably have fewer than 500,000 files, and 500,000 files at ~256 bytes each comes to only about 125 MB.

  • Writing to and then deleting a file does not guarantee that the data is written to disk, due to both OS/filesystem and disk-level caches. Cached data that never gets written to disk will be abandoned when the file is deleted, and the space on disk it was supposed to zero will be left untouched. But only a small portion of the data to be written can be held in cache: consumer disks rarely have more than 32 MB of cache, and the OS/filesystem will likely cache at most a gigabyte or so. This has the potential to be of non-negligible size, but even an aggressive cache would have a small total impact: since so much is being written, the caches will overflow quickly and be forced to write most of it to disk. (See the helper sketched after this list for a way to nudge the OS cache along.)

  • While it should be obvious, this is just a "good enough" approach. We probably don't care about an extra couple hundred MB of space in the image (and we're definitely not relying on this for security).
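On the caching point above: a small helper along these lines (flush_to_disk is a hypothetical name, not part of any program mentioned here) could be called on each temporary file before it is closed and deleted. It uses POSIX fsync and the Windows CRT's _commit, both of which push OS-cached pages to disk:

    // flush_hint.cpp -- hypothetical helper for the caching caveat: after writing
    // a temp file full of zeros, push the data out of the OS cache before
    // closing and deleting the file.
    #include <cstdio>
    #ifdef _WIN32
    #  include <io.h>        // _commit, _fileno (Windows CRT)
    #else
    #  include <unistd.h>    // fsync, fileno (POSIX)
    #endif

    void flush_to_disk(FILE* f) {
        std::fflush(f);          // stdio buffer -> OS
    #ifdef _WIN32
        _commit(_fileno(f));     // OS cache -> disk
    #else
        fsync(fileno(f));        // OS cache -> disk (the drive's own cache may still buffer)
    #endif
    }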

A fast compression scheme is probably better than a good compression scheme, unless time is unimportant. It's likely that the majority of used space will be binary data (executables, already-compressed media formats, etc.) that will yield very little compression no matter how hard you try. Any simple compression method, like gzip, will be able to make efficient use of the sections of zeros without wasting too much time compressing the rest of the image.
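As a rough illustration, the sketch below (again assuming zlib; the half-random buffer stands in for incompressible binary data) compresses the same mixed buffer at zlib's fastest and most thorough levels. The fast level should capture nearly all of the zero-run savings in a fraction of the time:

    // level_demo.cpp -- compare a fast vs. a thorough compression level on data
    // that is half incompressible "binary" content and half zeros.
    // Build (assumes zlib): g++ -O2 level_demo.cpp -lz
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>
    #include <zlib.h>

    static void run(int level, const std::vector<Bytef>& src) {
        std::vector<Bytef> dest(compressBound(src.size()));
        uLongf destLen = (uLongf)dest.size();
        auto t0 = std::chrono::steady_clock::now();
        compress2(dest.data(), &destLen, src.data(), src.size(), level);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("level %d: %lu -> %lu bytes in %lld ms\n", level,
                    (unsigned long)src.size(), (unsigned long)destLen, (long long)ms);
    }

    int main() {
        const size_t half = 8 * 1024 * 1024;
        std::vector<Bytef> src(2 * half, 0);       // second half stays zeroed
        std::mt19937 rng(42);
        for (size_t i = 0; i < half; ++i)          // first half: pseudo-random bytes
            src[i] = (Bytef)(rng() & 0xFF);

        run(Z_BEST_SPEED, src);                    // level 1
        run(Z_BEST_COMPRESSION, src);              // level 9
        return 0;
    }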

Concluding Notes

The difference in compressed image size between the original device contents and the zero-ized device contents will depend on the filesystem(s) involved, how full they are, how much data has been added and deleted, how long it has been since the device(s) were last wiped, and similar factors. However, since this is a fairly easy procedure, it wouldn't hurt to try it if saving device image backup space is helpful. In my personal experience, I've seen the size of the compressed image as much as halved. On a device where not much data is copied, this may only need to be applied once or twice in the lifetime of the device to keep the majority of the unused space zeroed.

Somewhat obviously, this technique should not be used on a device that requires forensic analysis, as sectors unclaimed by the filesystem may still have contents that need to be examined.