Thursday, October 22, 2009

Final Thoughts On: A Symbolic Puzzler

This is the final post in a series about a puzzler [ post containing the question ] [ post containing the answer ]. In those two posts I highlighted some bizarre Java weirdness in the java.io.File.getCanonicalPath method: canonicalized paths are cached, which means that if a symbolic link changes somewhere, the cached value goes stale.

Blatherberg

This problem really only ever crops up when your code relies on calls to getCanonicalPath and the links change while the application is running. If you expect your application to run against a filesystem with shifting symbolic links, you have to do one of these three things:
  1. Disable the cache by setting the system property sun.io.useCanonCaches to false. (I would also like to briefly nod to the pedantic point that the string useCanonCaches is icky.)

    This seems like an easy out (particularly if your application is moderately complex and already deployed, and assuming it doesn't rely on the default behavior), but there's a reason Java comes with a canonicalization cache: performance. Reading symbolic links from disk can take time, especially if you do it a lot.

    Also, you might be running your application in a Java EE container along with other applications, in which case you can't isolate the cache behavior to a single application: a system property applies to the whole JVM.

  2. Stop using getCanonicalPath (and its sugary sibling, getCanonicalFile) in your application, and rely solely on the non-canonicalized path.

    Changing your infrastructure to rely on the symlink paths themselves and not their canonicalized values sounds good, but you might not have control over that code: your application may sit on top of an application infrastructure that already relies on getCanonicalPath, and then you're kind of screwed. It's easy to say that the cache should be disabled in the name of correctness, but if you're repeatedly resolving symbolic links by the thousand, the time cost may be significant.

    This leads to the other way to look at this problem, which is the lack of accessibility and flexible control over the cache. You might want to cache calls in some circumstances yet not others. The use cases for cache control can be complex, and by hiding the complexity you get, well, surprises like this.

  3. Prevent symbolic links on the application's filesystem from being redefined while the application is running.

    If you've got that power, go for it.
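
For the first two options, here is a minimal sketch of what the code might look like. The class and method names are made up for illustration. One assumption to flag loudly: the property may be read only once, when the JVM first initializes its file system support, so setting it from code can come too late. Passing -Dsun.io.useCanonCaches=false on the command line is the safer route.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper class, not part of any library.
class CanonCacheSketch {

    // Option 1: request that the canonicalization cache be disabled.
    // Assumption: this only helps if it runs before java.io.File first
    // consults the property; prefer -Dsun.io.useCanonCaches=false.
    static void requestNoCanonCache() {
        System.setProperty("sun.io.useCanonCaches", "false");
    }

    // Option 2: keep the path as the caller wrote it. getAbsolutePath()
    // makes the path absolute but leaves symbolic links unresolved.
    static String nonCanonical(File f) {
        return f.getAbsolutePath();
    }

    public static void main(String[] args) throws IOException {
        requestNoCanonCache();
        File f = new File(args.length > 0 ? args[0] : ".");
        System.out.println("absolute : " + nonCanonical(f));
        System.out.println("canonical: " + f.getCanonicalPath());
    }
}
```

Run it against a symlink and the two printed lines should differ: the first keeps the link name, the second shows where the link points.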

Help From JSR 203

There's actually some hope for the future, and that's JSR 203: More New I/O APIs for the Java™ Platform ("NIO.2"), which is scheduled to be part of Java 7. Look back to the puzzler, which points out the use of the FileSystem and UnixFileSystem classes. In JSR 203, those ideas are explicit. The equivalent of java.io.File is java.nio.file.Path, which exposes a method getFileSystem. That's right, the file system is no longer hidden from the user, and you can read all about java.nio.file.FileSystem here. You can have a file system that represents a thin layer on top of your disk, or one that caches all sorts of metadata from your disk, or, heck, create an in-memory implementation for high-speed storage! But the real benefit is that these filesystem implementations can be injected into your classes: no more need for a single static filesystem. Whereas java.io.File objects are created through a constructor, java.nio.file.Path objects are constructed through the FileSystem's getPath method.
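
To make the injection idea concrete, here is a rough sketch. It assumes the API as it eventually shipped in Java 7 (FileSystem.getPath, Path.getFileSystem, Path.toRealPath), which may differ from the drafts available when this was written. ConfigLocator and the "etc/app.conf" path are invented for the example.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;

// Hypothetical class: it receives its FileSystem instead of assuming one.
class ConfigLocator {
    private final FileSystem fs;

    ConfigLocator(FileSystem fs) {
        this.fs = fs;
    }

    // Paths come from the injected FileSystem, not from a constructor.
    Path locate(String first, String... more) {
        return fs.getPath(first, more);
    }

    // toRealPath() with no options follows symbolic links.
    // The file must exist, otherwise an IOException is thrown.
    Path resolved(String first, String... more) throws IOException {
        return locate(first, more).toRealPath();
    }

    public static void main(String[] args) throws IOException {
        ConfigLocator locator = new ConfigLocator(FileSystems.getDefault());
        Path p = locator.locate("etc", "app.conf"); // hypothetical path
        System.out.println(p + " lives on " + p.getFileSystem());
        System.out.println("real path of '.': " + locator.resolved("."));
    }
}
```

A test could hand the same class a different FileSystem implementation without touching ConfigLocator itself. As far as I can tell, toRealPath asks the file system directly rather than going through the java.io canonicalization cache, but treat that as an assumption to verify on your JDK.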

This isn't disk I/O nirvana, unfortunately, because, like the continuing transition from java.util.Date to java.util.Calendar to something more reasonable like org.joda.time.DateTime, there's still plenty of legacy code using the old and busted APIs. But it's a good start.

If you want some more information about JSR 203, here's a write-up by Alex Miller and a link to a JavaOne talk from 2008. The video is a bit out of date (for instance, it highlights the notion of Path.get, which seems to be gone, thank goodness), but it's got lots of great information about the JSR.

The Last Word

In the end, I want to highlight something underlying this entire journey: the choice to cache the values by default in the first place is just wrong. It reminds me of the saying (which seems to be attributed to Bill Harlan): "It's easier to optimize correct code than it is to correct optimized code."

3 comments:

konberg said...

A coworker pointed me to his internal blog today, where he described a completely different problem with caching and Python's regular expressions. His summary is much better:

Caching is evil. Either don't do it, or be really careful that you get the details exactly right. (One of my CS profs once said "Half of the problems in computer science can be solved by caching. The other half are the result of caching.")

Anonymous said...

Jep, Path.get is gone. Now say hello to Paths.get

konberg said...

Hello, Paths.get!

I am pleased to see the comment below, yet I wonder how many people will understand it, and still make it utterly impossible to refactor their code:

"Note that while this method is very convenient, using it will imply an assumed reference to the default FileSystem and limit the utility of the calling code. Hence it should not be used in library code intended for flexible reuse. A more flexible alternative is to use an existing Path instance as an anchor, such as:

Path dir = ...
Path path = dir.resolve("file");"
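
A small sketch of the difference that note describes, assuming the Java 7 versions of Paths.get and Path.resolve. AnchorSketch and the "conf" directory name are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class contrasting the two ways of building a Path.
class AnchorSketch {

    // Convenient, but silently tied to the default FileSystem.
    static Path viaPathsGet(String name) {
        return Paths.get("conf", name);
    }

    // Reusable: the result lives on whatever FileSystem the anchor came from.
    static Path viaAnchor(Path dir, String name) {
        return dir.resolve(name);
    }

    public static void main(String[] args) {
        Path dir = FileSystems.getDefault().getPath("conf");
        System.out.println(viaPathsGet("app.properties"));
        System.out.println(viaAnchor(dir, "app.properties"));
    }
}
```

On the default file system both calls produce the same path. Only the second keeps working unchanged when the caller passes in a directory from some other FileSystem.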