Against Overreliance on Filenames For Metadata

Author: Alyssa Riceman

Posted: 2021-10-15

Updated: 2023-10-20

Several years ago, I was trying to extract some images out of an AZW6 container file. With some digging online, I found a script which purported to assist in doing just that, ran it against the container… and it returned an error claiming the target file wasn’t an AZW6 file. It took over an hour of haphazard debugging before I figured out what was wrong: the file’s name was <filename>.azw.res, and the script was assuming that the lack of .azw6 extension meant it wasn’t an AZW6 file, without even bothering to check its internals before throwing that error and shutting down. As soon as I renamed the file to <filename>.azw6 and re-ran the script against it, the extraction went off without a hitch.

This case is illustrative, I think, of why overreliance on filenames as a source of metadata is a bad habit which programmers should do their best to break themselves of.

Now, before I start going into more depth, I should note the scope of my claims here:

I take no issue with programs caring about precise filenames for their own internal files; there’s nothing wrong with, say, having a config.toml file next to your executable, packaged in as part of the program, and having the executable throw an error if that file is absent.
I take only minimal issue with programs which place rigid requirements on filenames for files which users will create for the express purpose of controlling those programs, as long as the required names are unlikely to collide with the users’ other files. There’s nothing egregiously wrong, for instance, with requiring that gitignore files have the name .gitignore as opposed to some other name. (What issue I do take with this is largely a matter of personal aesthetics, outside of the scope of this post’s arguments.)
I take no issue with allowing users to use filename as one method out of many for feeding important metadata to a given program. My objection isn’t to allowing the use of filenames as a source of metadata, but only to requiring it.
I do take issue with any scenario in which a program requires that files created for purposes other than reading-by-that-program be renamed to fit a particular pattern before the program can productively interact with them.
Programs like Calibre which create internal copies of input files, rigidly named in that internal context but without requiring manual renaming of the external file, are an edge case which I may or may not take issue with depending on details, but are outside of the scope of this post’s arguments.

Why Not Use Filenames For Metadata?

In short: because filenames fill many different functions in many different contexts and you shouldn’t overload them more than necessary, because different people have different conventions they prefer to follow with their file-organization and are unlikely to benefit from being forced into uniformity, and because there are alternative metadata-tagging options available which lack these flaws.

In long: well, consider a file system. There are some common conventions—you’ve got a relatively-standardized structure for system files based on your OS, and then you’ve got a user folder containing a documents subfolder, a pictures subfolder, a music subfolder, a videos subfolder, and a few others—but there’s no standardization beyond that. Users will naturally organize their files into those folders, and sort them within those folders, according to those users’ personal senses of what makes for a sensible organizational scheme. Some people just throw everything onto their desktop, some (me) come up with carefully-thought-through bucketing patterns to ensure everything is sorted somewhere intuitive, some just toss everything into an automatic library-manager like Calibre or iTunes and let that program handle file-sorting from there, et cetera.

Crucially: filenames are as much a part of that process of organization as directory structure is. Different people are going to have different file naming schemes they find intuitive. Some will end up with a bunch of files called Untitled.png and Untitled (2).png and so forth, some will keep whatever name their files had upon initial download, some will do descriptive filenames like Pretty city skyline.png and EVEN PRETTIER city skyline.png, some will put in whatever bits of relatively-conventional metadata they find most important ($ARTISTNAME - $PIECENAME for each picture), et cetera. Many will follow different such schemes in different parts of their file system, based on what’s easiest and most intuitive in those respective parts.

As soon as a program starts demanding that users name their files in specific ways in order for the program to work correctly, the possibility of following one’s own preferred naming scheme for the relevant files falls apart. Needing to rename <filename>.azw.res to <filename>.azw6 breaks the “keep original filename from download” scheme, for example.

(As an aside, this objection applies just as strongly to demanding that files be in particular directories in order for the program to work right. When files need to be in C:\Users\Alyssa\Dropbox in order for Dropbox to sync them, that forces all sorts of constraints on my ability to sort my files usefully. (Onto another hard drive, for example.) This is probably the single biggest reason I eventually switched over from Dropbox to MEGA: MEGA lets you designate arbitrary folders as sync-targets, rather than just creating a single one and forcing all syncing to be done through that folder.)

What Should Happen Instead?

Well, there are a lot of different things that different programs might try to use filenames as metadata for, and as such there are a lot of different answers here. But I’ll try to cover a few common cases, at least, and hopefully my answers here will be usefully extrapolable to use cases I haven’t covered.

1. Recognizing File Format

One common use case, as seen in the AZW6 example above, is to use file extensions in order to recognize the files’ formats. That example also neatly illustrates how that method can easily go wrong.

As another illustrative story to provide contrast here: once upon a time, I was a teenager just beginning my journey into Competence With Computers. I’d just figured out that PNG was a higher-quality image format than JPEG, and noticed that my browser let me change file extensions when I saved images. So, making some jumps of logic, I got into the habit of always saving images with .png extensions, irrespective of what extension my browser was trying to default to saving them with, on the assumption that this would lead my browser to convert the files to the higher-quality format while saving.

Despite the multiple layers of misunderstanding underlying this habit, and despite my carrying it on for several years before realizing my mistake, this has never had any negative effects on me besides occasional mild uncertainty about a given file’s true format, because all the image viewers I’ve ever used have been smart enough to see straight through the deceptive file extensions. XnView MP, Imagine, and even Windows Photo Viewer were all smart enough to look at the JPEGs labeled with .png extensions, see that their internal structure was JPEG-ish rather than PNG-ish, and display them accordingly.

The image viewers did exactly the right thing. Rather than demanding that metadata be provided purely by file extension, they actively ignored the deceptive file extensions and figured out files’ format through direct file-reading instead.
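For the curious, here is a minimal sketch in Python of what that sort of content-first detection might look like. I don't know how any of the viewers named above actually implement it; the function names here are my own invention, and the only facts the sketch leans on are the standard leading byte signatures of PNG and JPEG files.

```python
from pathlib import Path

# Well-known leading byte signatures ("magic numbers") for two image formats.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_format(path):
    """Return 'png', 'jpeg', or None based on the file's first bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def detect_image_format(path):
    """Prefer the file's contents; use the extension only as a fallback guess."""
    sniffed = sniff_image_format(path)
    if sniffed is not None:
        return sniffed
    # Hypothetical fallback table, consulted only when the contents are inconclusive.
    ext = Path(path).suffix.lower().lstrip(".")
    return {"png": "png", "jpg": "jpeg", "jpeg": "jpeg"}.get(ext)
```

The point of the ordering is that the extension never gets to overrule the contents: a JPEG saved as photo.png still comes back as a JPEG.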

But some formats won’t necessarily support this so nicely. A file whose sole content is a UTF-8 newline character, for example, is thoroughly ambiguous; there are many formats it might be (markdown, plain text, shell script, et cetera), and none that it unambiguously is. The thing to do, in cases like this, is to guess based on file extension, but to make it easy for users to correct the guess if it’s wrong, and ideally to save the correction in a database somewhere so as to get it right the first time next time that file is opened in that program.

(For a viewer like Visual Studio Code, this is relatively straightforward: just include an option in the viewer interface to change how the file is displayed, as VSCode in fact already does. For a command-line tool which takes in multiple input files, things might be a bit trickier—it’s hard to do per-file metadata-annotation when globbing up all files in a folder containing a bunch of them, for instance—but, at worst, you can always take as an extra input a user-created file identifying format for each of the other input files.)
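As a rough sketch of the "guess from the extension, but remember corrections" idea, assuming Python and a small SQLite database local to the program: the table name, the extension table, and all the function names below are hypothetical, made up for illustration rather than taken from any real tool.

```python
import sqlite3
from pathlib import Path

# Hypothetical extension-to-format guesses; a real tool would have its own table.
EXTENSION_GUESSES = {".md": "markdown", ".txt": "plaintext", ".sh": "shellscript"}


def open_override_db(db_path):
    """Open (and create if needed) the store of user corrections."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS format_override "
        "(path TEXT PRIMARY KEY, format TEXT NOT NULL)"
    )
    return conn


def remember_format(conn, path, fmt):
    """Record the user's correction for this file."""
    key = str(Path(path).resolve())
    conn.execute(
        "INSERT OR REPLACE INTO format_override (path, format) VALUES (?, ?)",
        (key, fmt),
    )
    conn.commit()


def guess_format(conn, path):
    """A saved correction wins; otherwise guess from the extension."""
    key = str(Path(path).resolve())
    row = conn.execute(
        "SELECT format FROM format_override WHERE path = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    return EXTENSION_GUESSES.get(Path(path).suffix.lower(), "plaintext")
```

One design caveat: this sketch keys corrections by absolute path, so a correction is lost if the user moves or renames the file. Keying by a content hash instead would survive renames, at the cost of losing the correction whenever the file is edited.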

2. Traditional Metadata

Some programs try to use a file’s name in order to figure out traditional metadata-type information such as author/artist, title, et cetera. This is an eminently reasonable thing to do, in the absence of file-internal metadata of the sort one can find in EPUBs or in most music formats or suchlike. But it shouldn’t be mandatory that users change filenames in order to change metadata.

Instead, much as in the ambiguous-file-format case in the prior section, use the filename as a source of initial defaults but allow users to make corrections and save those corrections somewhere such that they won’t need to be re-made every time the users reopen the files. If you’re working with a file with a robust internal metadata system, maybe save the corrected metadata straight into the file; if not, or if you’d rather interact with your users’ files on a read-only basis, save the correction in some database local to your program’s files.
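A minimal sketch of that pattern in Python, assuming the read-only variant where corrections live in a JSON file belonging to the program rather than in the user's files. The "ARTIST - TITLE" filename convention, the shape of the JSON store, and every function name here are assumptions chosen for illustration.

```python
import json
from pathlib import Path


def defaults_from_filename(path):
    """Guess artist and title from an 'ARTIST - TITLE.ext' style name."""
    stem = Path(path).stem
    artist, sep, title = stem.partition(" - ")
    if sep:
        return {"artist": artist.strip(), "title": title.strip()}
    # No separator: treat the whole stem as the title and leave artist unknown.
    return {"artist": None, "title": stem}


def load_corrections(store_path):
    """Read all saved corrections, or an empty mapping if none exist yet."""
    store = Path(store_path)
    if not store.exists():
        return {}
    return json.loads(store.read_text(encoding="utf-8"))


def save_correction(store_path, path, **fields):
    """Persist user-supplied fields for one file without touching the file."""
    corrections = load_corrections(store_path)
    corrections.setdefault(str(path), {}).update(fields)
    Path(store_path).write_text(json.dumps(corrections, indent=2), encoding="utf-8")


def metadata_for(store_path, path):
    """Filename-derived defaults, overlaid with any saved corrections."""
    meta = defaults_from_filename(path)
    meta.update(load_corrections(store_path).get(str(path), {}))
    return meta
```

With this arrangement, a file named Untitled (2).png can keep its name forever; the user corrects its artist and title once, and the correction is applied on every later lookup.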

3. Remote Database Matching

Some of the worst offenders I’ve encountered, in terms of placing egregious demands on their users file-naming-wise, are the various “run your own video streaming server” programs such as Plex, Emby, and Jellyfin. In theory, they let you match your movie and TV video files against various online databases in order to get metadata and poster art and so forth for each file and thus get a properly slick Netflix-esque streaming experience, with your shows sorted into seasons and your bonus features neatly listed next to the actual movies and shows and so forth. In practice, they let you do that if and only if you’ve named your files according to a specific rigid set of criteria; otherwise, they instead will become very confused, mislabel some files (my Plex server is convinced that the first twelve episodes of Carmilla Season 3 are actually the twelve episodes of Season 0 and has labeled them accordingly), completely ignore others (all of my bonus features, for example), and completely refuse to be corrected in these errors.

(And if you want to resolve this problem, well, here’s this third-party $6/year subscription service which will automatically rename your files to the ‘correct’ naming scheme for you! Isn’t that so convenient? Why would you ever want to do something like correcting the program’s errors directly when you could pay to correct your filenames instead? (No, I’m not joking.))

The thing to do in order to avoid making a complete ridiculous mess of things like those programs do is to allow users to fix mismatches and point out missed files manually. Let me go “here are the missing bonus features, and here are the actual identities of those episodes you mislabeled”, in the Plex case. This is, ultimately, the same basic fix as the prior case, even if the details are different.

Conclusion

Perhaps the prior section was a bit repetitive; hopefully it was at least illustrative in that repetitiveness. As a general rule, filenames might be a useful heuristic by which to get started in one’s database-matches, but they shouldn’t be the last word, in cases where they’re leading to incorrect analyses. Programs should be built accordingly, in order to avoid disrupting users’ file-system structures in the manner discussed above.

This post’s argument is, ultimately, relatively limited in its scope. I refrained from going into the security issues that can potentially result from treating filenames as a trustworthy metadata-source, or into the upsides and downsides of the “make a copy and enforce filename-as-metadata only on the copy” approach, or suchlike. But, even purely from the user-convenience angle, there’s a strong case to be made, and I hope I’ve illustrated that case reasonably well over this post’s course.

Tags: UX