Search Engine (SE)

Motivation

Suppose you're writing a next-generation search engine (SE) that indexes all files in the file system, including every entry of every archive file. You would then need to support at least the following archive types:

  • TAR.GZ
  • TAR.BZ2
  • ZIP
  • JAR
  • WAR
  • EAR
  • ...

Writing the code to read and write all of these archive types would be cumbersome and tedious, because each archive type typically comes with its own API for reading and writing its archive files.

Searching Archive Files

Thanks to the module TrueZIP File*, here's your relief: The TFile class extends the java.io.File class in order to add the required functionality.

See how easy listing the contents of the top level directory of an archive file can be:

TFile[] entries = new TFile("archive.zip").listFiles();

Because this is identical to the code for reading plain directories in the file system, the required code size shrinks drastically. Now you only need to apply recursion to traverse the directory tree of any archive file.
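The recursion mentioned above can be sketched with nothing but java.io.File. Because TFile extends java.io.File, passing a TFile as the root should let the same traversal descend into archive files too; the sketch itself does not verify that, and it only exercises plain directories. The names TreeWalker and collect are made up for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class TreeWalker {

    /** Recursively collects all file entries (non-directories) below root. */
    public static List<File> collect(File root) {
        List<File> result = new ArrayList<File>();
        walk(root, result);
        return result;
    }

    private static void walk(File node, List<File> result) {
        if (node.isDirectory()) {
            // listFiles() may return null, e.g. on an I/O error.
            File[] children = node.listFiles();
            if (children != null) {
                for (File child : children) {
                    walk(child, result);
                }
            }
        } else if (node.isFile()) {
            result.add(node);
        }
    }
}
```

Under that assumption, a call such as TreeWalker.collect(new TFile("archive.zip")) would list every file entry of the archive, while a plain java.io.File root would treat archive files as ordinary files.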

Indexing the contents of any entry within an archive file works exactly the same way as it would for plain files in the file system, by using the TFileReader class:

TFile entry = ...;
Reader reader = new TFileReader(entry);
try {
    ...;
} finally {
    reader.close(); // ALWAYS close the resource here!
}

You may be glad to know that this works recursively, too. So entry may refer to a plain file, an entry in an archive file or an entry in an archive file which is contained in another archive file and so on.
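What "indexing" means is left open here. As a minimal sketch, assume it means counting how often each word occurs in an entry. The helper below accepts any java.io.Reader, so a reader created with TFileReader as shown earlier could be passed in unchanged. WordIndexer and countWords are hypothetical names, not part of TrueZIP.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordIndexer {

    /** Counts lower-cased word occurrences; the caller closes the reader. */
    public static Map<String, Integer> countWords(Reader reader) throws IOException {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
            // Split each line on runs of non-word characters.
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.length() > 0) {
                    Integer old = counts.get(word);
                    counts.put(word, old == null ? 1 : old + 1);
                }
            }
        }
        return counts;
    }
}
```

The helper deliberately leaves closing the reader to the caller, so it fits inside the try block of the try/finally pattern used for TFileReader.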

Configuring Archive Detection

You may be wondering how TrueZIP gets configured to treat a file name with a zip suffix as a ZIP file instead of a plain old file. For now, let it suffice to say that TrueZIP follows the convention-over-configuration principle as much as possible, so there are reasonable defaults for everything in order to relieve you from typical configuration tasks.

For the previous examples to work, the JARs of the driver module TrueZIP Driver ZIP need to be present on the run-time class path. You can achieve this by adding the Maven artifactId truezip-driver-zip as a dependency to the POM of your Maven build.

For more information about configuring the client APIs, please refer to the article Configuring TrueZIP File*.

False Positive Archive Files

Sometimes a file system entry has a suffix which is configured to get recognized as an archive file, but the entry is not an archive file or not even a file, e.g. a directory. This is called a false positive archive file, or false positive for short.

TrueZIP safely detects any false positives and treats them according to their true state, that is, like a plain file or directory. This finding will be remembered until the next call to TVFS.umount(), so the performance impact is minimal.

Committing Changes / Cleaning Up

If your application has created or changed one or more archive files, then these changes need to get committed at some point. Even if your application has only done read-only access to the virtual file system, some temporary files may have been created to speed up random access. This depends on the driver implementation.

If your application is only short-running, then there is actually nothing to do because the TrueZIP Kernel automatically registers and de-registers a JVM shutdown hook which will commit all changes when it runs. Note that shutdown hooks are run even if the application terminates due to a Throwable.

However, if your application is long-running or needs to handle any exceptions itself, then you may want to call this operation manually. Here's how to do this:

TVFS.umount();

As a side effect, once this operation has succeeded, third parties (e.g. other processes) can safely access the processed archive files until your application starts to operate on them again.

Performance Considerations

Take care not to call TVFS.umount() in a loop which updates the same set of archive files, because this would result in poor performance on the order of O(n*n) instead of just O(n), where n is the total number of archive entries.
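To see where the quadratic behavior comes from, here is a toy cost model rather than real TrueZIP code. It assumes that every commit rewrites all entries currently stored in the archive. That assumption, and the names UmountCost, commitPerEntry and commitOnce, are illustrative only.

```java
public class UmountCost {

    /** Entry writes when committing after each of n added entries. */
    public static long commitPerEntry(int n) {
        long writes = 0;
        for (int size = 1; size <= n; size++) {
            // Assumed: this commit rewrites all `size` entries added so far.
            writes += size;
        }
        return writes; // equals n * (n + 1) / 2, which grows like O(n*n)
    }

    /** Entry writes when committing once after all n entries were added. */
    public static long commitOnce(int n) {
        return n; // each entry is written exactly once, i.e. O(n)
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("commit per entry: " + commitPerEntry(n));
        System.out.println("commit once:      " + commitOnce(n));
    }
}
```

For n = 1000 the model yields 500500 entry writes when committing inside the loop, compared to 1000 when committing once after the loop.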

For more information, please refer to the Javadoc for TVFS.umount().