Using TrueZIP File*

Abstract

This article explains the usage of the module TrueZIP File*.

Basic Operations

In order to create a new archive file, an application can simply call TFile.mkdir().

In order to delete it, an application can call TFile.delete(). Like a plain old directory, this is only possible if the archive file is empty. Alternatively, an application could call TFile.deleteAll() in order to delete the (virtual) directory in one go, regardless of its contents.

To read an archive entry, an application can simply create a TFileInputStream or a TFileReader object. Note that an application cannot use these classes to read a valid archive file itself.

Likewise, to write an archive entry, an application can simply create a TFileOutputStream or a TFileWriter object. Note that an application cannot use these classes to write a valid archive file itself.

If an application just needs to copy data however, using one of the copy methods in the class TFile is highly recommended instead of using TFile(In|Out)putStream or TFile(Reader|Writer) directly. These methods use pooled buffers and pooled threads and do not need to decompress/recompress archive entry data when copying from one archive file to another for supported archive types. In addition, they are guaranteed to close their streams if an IOException occurs.

Note that there is no eqivalent to RandomAccessFile in this package because it's impossible to seek within compressed archive entry data.

Properly Closing Archive Entry Streams

In general, when using streams, an application should always close them in a finally-block like this:

InputStream in = new TFileInputStream(path);
try {
    TFile.cat(in, System.out);
} finally {
    in.close(); // ALWAYS close the stream!
}

This ensures that the stream is closed even if an exception occurs.

Note that the OutputStream.close() method may throw an IOException, too. Applications need to deal with this appropriately, for example by enclosing the entire block within another try-catch-block:

try {
    InputStream in = new TFileInputStream(path);
    try {
        TFile.cat(in, System.out);
    } finally {
        in.close(); // ALWAYS close the stream!
    }
} catch (IOException ouch) {
    ouch.printStackTrace();
}

This idiom is not at all specific to TrueZIP: Streams often utilize OS resources such as file descriptors or network connections. All OS resources are limited however and sometimes they are even exclusively allocated for a stream, so the stream should always be closed as soon as possible again, especially in long running server applications - relying on Object.finalize() to do this during garbage collection is unsafe.

TrueZIP is affected by open archive entry streams in the following way: When unmounting an archive file (see below), depending on the parameters, TrueZIP may choose to force the closing of any open entry streams or not. If the entry streams are not forced to close, the archive file cannot get unmounted and an FsSyncException is thrown. If the entry streams are forced to close however, the archive file is unmounted and an FsSyncWarningException is thrown to indicate that any subsequent I/O operations on these entry streams other than (In|Out)putStream.close() will fail with an InputClosedException or OutputClosedException.

In order to prevent these exceptions, TrueZIP automatically closes entry streams when they are garbage collected. However, an application should not rely on this because the delay and order in which streams are processed by the finalizer thread is not specified and any data buffered by an output stream gets lost.

Committing Unsynchronized Changes To The Contents Of Archive Files

To provide random read/write access to archive files, TrueZIP needs to associate some state for every accessed archive file on the heap and in some temporary files while the application is operating on its federated file system.

TrueZIP automatically mounts the federated file system from an archive file on the first read or write access. The application can then operate on the federated file system in arbitrary manner. Finally, an archive file must get unmounted in order to commit any unsynchronized changes to its contents or meta data.

Manual Versus Automatic Synchronization

Archive file synchronization is performed semi-automatic:

  • Manual synchronization happens when the application calls TVFS.umount(), which calls TVFS.sync(BitField<FsSyncOption>), or any of their many variants and incarnations.
  • Automatic synchronization happens when the JVM terminates (in a JVM shutdown hook) or when the application modifies an archive entry more than once and the respective archive controller is implementing a full update (which is the default update strategy unless FsOutputOption.GROW has been set).

Manual synchronization is required to support third-party access to an archive file (see below). It also enables some control over any exceptions thrown: TVFS.umount() or TVFS.sync(BitField<FsSyncOption>) may throw an FsSyncException. An application may catch these exceptions and act on them individually (see below).

However, calling TVFS.sync(BitField<FsSyncOption>) too often may increase the overall runtime: With the full update strategy, if an application is manually synchronizing a mounted archive file after each modification, then this results in an overall runtime of O(s*s), where s is the size of the target archive file in bytes (see below).

In general, avoiding manual synchronization provides best performance because automatic synchronization only happens if there's really a need to. It also works reliably because it's run in a JVM shutdown hook, too, which is always run unless the JVM crashes (note that an uncatched Throwable terminates the JVM, but does not crash it).

The disadvantage is that you cannot easily catch and deal with any exceptions thrown as a result of an automatic synchronization because it may happen on any write operation in a file system controller. If it happens within the JVM shutdown hook, you can't even catch the exception. It's stack trace is printed to standard error output instead.

In addition, synchronizing an existing archive file takes linear runtime (see below). However, using long running JVM shutdown hooks to perform an update to a large archive file is generally discouraged.

Third Party Access

Whenever a TrueZIP application accesses an archive file for reading or writing, the TrueZIP Kernel associates some state on the heap with it. This archive file is then said to be mounted. The TrueZIP Kernel manages the associated state until the archive file gets unmounted again, e.g. by a call to TVFS.umount().

Due to this strategy, the TrueZIP Kernel assumes exclusive access to any mounted archive files. THIRD PARTIES MUST NOT CONCURRENTLY ACCESS MOUNTED ARCHIVE FILES! In this context, third parties are:

  1. Instances of the class TFile which do not detect the same set of archive files in a path name due to the use of a differently configured TArchiveDetector.
  2. Instances of the class File which are not instances of the class TFile.
  3. Any TrueZIP classes which have been defined by a different class loader.
  4. Any other operating system processes.

FAILURE TO COMPLY WITH THESE CONSTRAINTS MAY RESULT IN UNPREDICTABLE BEHAVIOR AND EVEN CAUSE LOSS OF DATA!

As a rule of thumb, to ensure that all TFile objects recognize the same set of archive files in path names, it's recommended to avoid using constructors or methods of the TFile class with an explicit TArchiveDetector parameter.

If these restrictions cannot be respected for some reason, an application may call TVFS.umount(), TVFS.umount(TFile) or any of their many variants and incarnations (e.g. TVFS.sync(BitField<FsSyncOption>)). in order to commit all changes and finally unmount the archive files. A third party can then safely access them. Likewise, the application must make sure not to access an archive file or any of its entries while the third party is accessing it.

Exception Handling

TVFS.umount() and TVFS.sync(BitField) and their many variants and incarnations are guaranteed to process all archive files which are mounted or have been updated by an application. However, processing some of these archive files may fail for any I/O related reason. Hence the TrueZIP Kernel assembles a sequential I/O exception chain from any I/O exception which occurs during synchronization of the archive files. Upon finishing the processing, when this chain is not empty, it gets sorted according to (1) descending order of priority and (2) ascending order of appearance, and finally the head of the resulting exception chain gets thrown.

Note that sequential I/O exception chaining is a concept which is completely orthogonal to Java's general exception cause chaining: In a sequential I/O exception chain, each exception may still have a chain of other exceptions as its cause.

When catching a sequential I/O exception chain, an application could either use it as it is or rearrange this chain striclty in order of appearance. The following snippet demonstrates the default priority/appearance sorting order if a sequential I/O exception chain is catched and printed:

try {
    TVFS.umount(); // with or without parameters
} catch (FsSyncException ouch) {
    // Print the sequential I/O exception chain in order of descending
    // priority and ascending appearance.
    // This is the default so you wouldn't have to call sortPriority().
    ouch.sortPriority().printStackTrace();
    //ouch.printStackTrace(); // equivalent
}

Rearranging a sequential I/O exception chain strictly in order of appearance is done like this instead:

try {
    TVFS.umount(); // with or without parameters
} catch (FsSyncException ouch) {
    // Print the sequential I/O exception chain strictly in order of
    // ascending appearance instead.
    ouch.sortAppearance().printStackTrace();
}

In either case, at most the first three sequential I/O exceptions in the chain get printed by default. You can change this by calling SequentialIOException.setMaxPrintExceptions(int).

The default priority/appearance sorting order enables applications to selectively catch the following sequential I/O exception chain sub-classes:

  1. The class FsSyncException is the root of the synchronization exception chain types. An exception of this type is thrown if an archive file could not get updated and some or all of its data got lost.
  2. The class FsSyncWarningException is the root of the synchronization warning exception chain types. An exception of this type is thrown if an archive file has been updated, but some warning conditions apply. No data has been lost, however.

Now, because FsSyncWarningExceptions have a lower priority than FsSyncExceptions, they are always pushed back to the end before the exception chain is thrown, so that an application could use the following snippet to detect if only some warnings or at least one severe error condition has occured:

try {
    TVFS.umount(); // with or without parameters
} catch (FsSyncWarningException oops) {
    // Only objects of the class FsSyncWarningException exist in
    // the exception chain - we ignore this.
} catch (FsSyncException ouch) {
    // At least one exception occured which is not just an
    // FsSyncWarningException.
    // This indicates loss of data and needs to be handled.
    // Print the sequential I/O exception chain in order of
    // descending priority and ascending appearance.
    ouch.printStackTrace();
    //ouch.sortPriority().printStackTrace(); // equivalent
}

Performance Considerations

Synchronizing a mounted archive file is a linear runtime operation: If the size of the target archive file is s bytes, the operation always completes in O(s), even if only a single, small archive entry has been modified within a very large archive file. Unmounting an unmodified or newly created archive file is a constant time operation, i.e. it completes in O(1). Note that the complexity is independent of whether synchronization was performed manually or automatically.

Now if an application modifies each entry in a loop and accidentally triggers synchronizing the archive file on each iteration, then the overall runtime increases to O(s*s)! Here's an example:

String[] names = {"a", "b", "c"};
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    TFile entry = new TFile("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // O(1)
    TVFS.umount(); // O(i + 1) !!
}
// Overall: O(n*n) !!!

The bad run-time performance is because File.umount() is called within the loop. Moving it out of the loop fixes the issue:

String[] names = {"a", "b", "c"};
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    TFile entry = new TFile("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // O(1)
}
TVFS.umount(); // new file: O(1); modified: O(n)
// Overall: O(n)

In essence: If at all, an application should never call TVFS.umount() or TVFS.sync(BitField) in a loop which modifies an archive file.

The situation gets more complicated with implicit remounting: If a file entry shall get modified which already has been modified before, TrueZIP implicitly remounts the archive file in order to avoid writing duplicated entries to it (which would waste space and may even confuse other utilities). Here's an example:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    TFile entry = new TFile("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // First modification: O(1)
    entry.createNewFile(); // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!

Each call to TFile.createNewFile() is a modification operation. Hence, on the second call to this method, TrueZIP needs to do an automatic synchronization which writes all entries in the archive file created so far to its parent file system again.

Unfortunately, a modification operation is not always so easy to spot. Consider the following example to create an archive file with empty entries which all share the same last modification time:

long time = System.currentTimeMillis();
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    TFile entry = new TFile("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // First modification: O(1)
    entry.setLastModified(time); // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!

When TFile.setLastModified(long) gets called, the entry has already been written and so an implicit remount is triggered, which writes all entries in the archive file created so far to disk again.

Detail: This deficiency is caused by archive file formats: All currently supported archive types require to write an entry's meta data (including the last modification time) before its content to the archive file. So if the meta data is to be modified, the archive entry and hence the whole archive file needs to get rewritten, which is what the automatic synchronization is doing.

To avoid accidental synchronization when copying data, please consider using the advanced copy methods instead. These methods are easy to use and provide best performance.

Conclusions

Here are some guidelines to find the right balance between performance and control:

  1. When the JVM terminates, calling TVFS.umount() or TVFS.sync(BitField) is recommended in order to handle exceptions explicitly.
  2. Otherwise, in order to achieve best performance, TVFS.umount() or TVFS.sync(BitField) should not get called unless either third party access or explicit exception handling is required.
  3. For the same reason, these methods should never get called in a loop which modifies the contents of an archive file.

Atomicity of File System Operations

In general, a file system operation is either atomic or not. In its strict sense, an atomic operation meets the following conditions:

  1. The operation either completely succeeds or completely fails. If it fails, the state of the file system is not changed.
  2. Third parties can't monitor or influence the changes as they are in progress - they can only observe the result.

All reliable file system implementations meet the first condition and so does TrueZIP. However, the situation is different for the second condition:

  • TrueZIP's virtual file system implementation is running in a JVM process, so other processes could monitor or influence changes in progress.
  • TrueZIP's recognition of archive files is configurable, so other File instances could monitor or influence changes in progress.
  • TrueZIP maintains state information about archive files on the heap and in temporary files, so other definitions of the classes in this package which have been loaded by other class loaders could monitor or influence changes in progress.

This implies that TrueZIP cannot provide any operations which are atomic in its strict sense. However, many file system operations in this package are declared to be virtually atomic according to their Javadoc. A virtually atomic operation meets the following conditions:

  1. The operation either completely succeeds or completely fails. If it fails, the state of the (virtual) file system is not changed.
  2. If the path does not contain any archive files, the operation is always delegated to the real file system and third parties can't monitor or influence the changes as they are in progress - they can only observe the result.
  3. Otherwise, all TFile instances which recognize the same set of archive file types and are defined by the same class loader can't monitor or influence the changes as they are in progress - they can only observe the result.

These conditions apply regardless of whether the TFile instances are used by different threads or not. In other words, TrueZIP is thread safe as much as you could expect from a platform file system.

Miscellaneous

Virtual Directories within Archive Files

The top level entries in an archive file populate its virtual root directory. The root directory is never written to the output when an archive file is modified.

For an application, the root directory behaves like any other directory and is addressed by naming the archive file in a path: For example, an application may list its contents by calling TFile.list() or TFile.listFiles()

An archive may contain directories for which no entry is present in the file although they contain at least one member in their directory tree for which an entry is actually present in the file. Similarly, if TFile.isLenient() returns true (which is the default), an archive entry may be created in an archive file although its parent directory hasn't been explicitly created by calling TFile.mkdir() before.

Such a directory is called a ghost directory: Like the root directory, a ghost directory is not written to the output whenever an archive file is modified. This is to mimic the behavior of most archive utilities which do not create archive entries for directories.

To the application, a ghost directory behaves like a regular directory with the exception that its last modification time returned by TFile.lastModified() is 0L. If the application sets the last modification time explicitly using TFile.setLastModified(long), then the ghost directory reincarnates as a regular directory and will be output to the archive file.

Mind that a ghost directory can only exist within an archive file, but not every directory within an archive file is actually a ghost directory.

Entry Names in Archive Files

File paths may be composed of elements which either refer to regular nodes in the real file system (directories, files or special files), including top level archive files, or refer to entries within an archive file.

As usual in Java, elements in a path which refer to regular nodes may be case sensitive or not in TrueZIP's VFS, depending on the real file system and/or the platform.

However, elements in a path which refer to archive entries are always case sensitive. This enables an application to address all files in existing archive files, regardless of the operating system they've been created on.

If an entry name contains characters which have no representation in the character set of the corresponding archive file type, then all file operations to create the archive entry will fail gracefully according to the documented contract of the respective operation. This is to protect an application from creating archive entries which cannot get encoded and decoded again correctly. For example, the Euro sign (€) does not have a representation in the IBM437 character set and hence cannot be used for entry names in ordinary ZIP files unless the ZIP file system driver's configuration is customized to use another character set.

If an archive file contains entries with absolute entry names, such as /readme.txt rather than readme.txt, the application cannot address these entries using the VFS in this package. However, these entries are retained like any other entry whenever the application modifies the archive file.

If an archive file contains both a file and a directory entry with the same name it's up to the individual methods how they behave in this case. This could happen with archive files created by external tools only. Both TFile.isDirectory() and TFile.isFile() will return true in this case and in fact they are the only methods an application can rely upon to act properly in this situation: Many other methods use a combination of TFile.isDirectory() and TFile.isFile() calls and will show an undefined behavior.

The good news is that both the file and the directory coexist in the virtual archive file system implemented by this package. Thus, whenever the archive file is modified, both entries will be retained and no data gets lost. This enables an application to use another tool to fix the issue in the archive file. TrueZIP does not support to create such an archive file, however.