Alyssa's Coding Journal


EPUB 2.0.1 Structure: A Simplified Overview

Author: Alyssa Riceman


Introduction

This is an overview of the internal structure of EPUB 2.0.1 files. (Which I’ll henceforth call just ‘EPUB’, no version number specified, for the sake of conciseness.) My goal in writing this is to provide a useful resource for programmers who want to create programs which generate well-formed EPUB files; I intend to summarize all essential information about the format’s internal structures for that purpose, while hopefully being briefer and less intimidating to read than the format’s official specifications (archive 1, archive 2, archive 3).

While I will be going into a fair amount of detail, this is an overview, not a fully detailed exposition. In particular, if you’re designing an EPUB reader, I’d recommend looking at the official specifications instead; my summary will convey the information necessary to write well-formed EPUB files, but the format has various optional frills—deprecated features, optional but not-typically-used file format support, and so forth—which a writer doesn’t need to know how to generate but which a reader does need to know how to parse, and my summary won’t cover all of those.

Basic Structure

At its highest level, an EPUB file is a ZIP file. Specifically, it’s a ZIP file with the following broad structure:

[Zip file root]/
    mimetype
    META-INF/
        container.xml
        [Optionally some other metadata files]
    [The part with the actual book content]

The other metadata files are pretty noncentral to the EPUB format; I’ll discuss them briefly, later, but not in great depth.

The part with the actual book content isn’t rigidly defined in terms of folder/file structure, but it consists of an OPF file describing the overall shape of the book content, an NCX file serving as the table of contents, plus the book content being summarized. A relatively conventional structure has the metadata files and the book content in a folder titled OEBPS, and the OPF and NCX files respectively named content.opf and toc.ncx, leading to this overall structure:

[Zip file root]/
    mimetype
    META-INF/
        container.xml
    OEBPS/
        content.opf
        toc.ncx
        [The various files making up the book, including optional subfolder structure for e.g. separating out text and images and so forth]

XML files stored within the ZIP file should all be well-formed XML 1.0 files. When they include references to one another or to other files in the ZIP, the references should always be relative, rather than absolute.

The ZIP file housing all of this has to meet a few broad format criteria in order to be a valid EPUB:

…as well as having a properly placed mimetype file. Which brings us to:

mimetype

The mimetype file needs to be the first file in the ZIP archive’s linear file-order, and it needs to be uncompressed, unencrypted, and not contain any extra fields within its ZIP header. Its contents should be the ASCII string:

application/epub+zip

Taking all of these requirements together, the mimetype serves as an easy way for readers to check whether the ZIP file they’re looking at is an EPUB file. However, it’s somewhat inconvenient for whoever is building the file, since normal use of ZIP tools doesn’t involve manually specifying the order in which files should be placed in the ZIP or whether they should be compressed. Make sure, when zipping the EPUB, that you use a tool which supports those bits of functionality and that the resulting file has its mimetype in the right place.

(If you did it right, then, looking at the file in a hex editor, you should see the string mimetypeapplication/epub+zip starting at offset 30.)
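As a rough illustration of the above, here is a minimal Python sketch using the standard zipfile module. The write_epub function and its files argument are names I made up for this example, and the container.xml body is a placeholder rather than a valid container file. The point is only that mimetype gets written first and stored uncompressed.

```python
import io
import zipfile


def write_epub(target, files):
    """Write an EPUB-shaped ZIP to `target` (a path or file-like object).

    `files` maps archive paths (e.g. "META-INF/container.xml") to bytes.
    """
    with zipfile.ZipFile(target, "w") as zf:
        # mimetype goes in first and is stored, not deflated. Passing a
        # plain file name lets zipfile write a header with no extra field.
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # Everything else can be compressed as usual.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


buf = io.BytesIO()
write_epub(buf, {"META-INF/container.xml": b"<container/>"})
raw = buf.getvalue()

# The local file header is 30 bytes long, so the file name and then the
# stored contents should begin at offset 30.
header_ok = raw[30:58] == b"mimetypeapplication/epub+zip"
```

The header_ok check at the end is the same test as the hex-editor one, just done in code.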

META-INF/container.xml

The container.xml file is an XML file letting the reader know where the main files are that describe the rest of the book. A minimal container file looks like this:

<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml" />
    </rootfiles>
</container>

The XML version declaration, container root element with the displayed version and xmlns attribute values, and rootfiles element are mandatory. The rootfiles element can contain one or more rootfile elements, each of which needs a full-path attribute and a media-type attribute. There shouldn’t be any elements or attributes besides those.

Of the rootfile elements, at least one—and, ideally, only one—should have a media type of application/oebps-package+xml. That one (or the first of them, if there’s more than one) is the core EPUB rootfile. But there can be other rootfiles besides that one, specifying different renditions of the text; for instance, you can point to an EPUB rootfile for ordinary reading use plus a PDF rendition of the book for use by printers. Whether your reader will do anything with non-core rootfiles is a different question, and the answer is probably “no”, so you mostly shouldn’t worry about this and should just define the one core rootfile.

Within each rootfile element, the full-path attribute should define a path from the ZIP file’s root (NOT from the location of container.xml) to the file defining one rendition of the text. (It can be either a standalone file, as with the PDF example above, or a file which points in turn to a bunch of others, as with the OPF file (whose structure will be discussed in greater depth below).) The media-type attribute should list the media type for that file.
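To make the rootfile rules concrete, here is a small Python sketch of how a program might pick out the core rootfile from a container.xml. The function name find_core_rootfile and the sample document (with its extra PDF rendition) are my own inventions for illustration; real readers may well do this differently.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
OPF_MEDIA_TYPE = "application/oebps-package+xml"


def find_core_rootfile(container_xml):
    """Return the full-path of the first OPF rootfile, or None if absent."""
    root = ET.fromstring(container_xml)
    for rootfile in root.iter(f"{{{CONTAINER_NS}}}rootfile"):
        if rootfile.get("media-type") == OPF_MEDIA_TYPE:
            # full-path is relative to the ZIP root, not to META-INF.
            return rootfile.get("full-path")
    return None


sample = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <rootfile full-path="book.pdf" media-type="application/pdf" />
        <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml" />
    </rootfiles>
</container>"""

core_path = find_core_rootfile(sample)
```

Running a check like this against your own generated container.xml is a cheap way to confirm that the path you wrote really is relative to the ZIP root.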

Other Files in META-INF

There are several other files which can optionally be included in META-INF, per the EPUB specification. These are:

None of these files are essential to an EPUB book, and most books will have no need of them. But, to briefly summarize what they’re each for:

manifest.xml is an OpenDocument Manifest file meeting this schema. It’s never made clear, within the EPUB specification, what it’s supposed to be useful for.

metadata.xml is very vaguely defined, but it’s an XML file, with all its elements namespaced, and it’s supposed to be used to hold some sort of metadata about the overall EPUB file. (Not about the core OPF book content; the OPF file has its own metadata section. This is for metadata for the overall EPUB file, on a higher level than just the OPF.) In practice readers will generally ignore this, given its lack of standardization, so you’re unlikely to have much use for it.

signatures.xml is an XML file listing digital signatures for files in the EPUB, for use if you want to sign your EPUB’s files. See Section 3.5.4 of the OCF specification for details, if you want them.

encryption.xml is an XML file listing off encryption information for files in the EPUB, for use if you want to encrypt your EPUB’s files. See Section 3.5.5 of the OCF specification for details, if you want them.

rights.xml is very vaguely defined, but it has to be a well-formed XML file, and it’s supposed to be used to list DRM-related information.

In all of these files, as in container.xml, any paths you might want to include should be relative to the ZIP file’s root rather than to META-INF.

Any files in META-INF other than these five and container.xml will be ignored by readers, so there’s no point putting them in.

The OPF File

The OPF file serves as the core file defining the structure of the EPUB. It should have a .opf extension. Technically, you can string together multiple XML files into a single OPF publication, with one (the main point of entry) getting the .opf extension and the rest getting ordinary .xml extensions; but, in practice, this is rarely helpful and you’re better off sticking with just a single file.

The broad structure of an OPF file is as follows:

<?xml version="1.0"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="AN_APPROPRIATE_ID">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        [Metadata content]
    </metadata>
    <manifest>
        [Manifest content]
    </manifest>
    <spine toc="ANOTHER_APPROPRIATE_ID">
        [Spine content]
    </spine>
    <guide>
        [Guide content]
    </guide>
</package>

The XML version, package element with the displayed version and xmlns attribute values and with a defined unique-identifier attribute, metadata element, manifest element, and spine element with a defined toc attribute are mandatory. The guide element is optional, but is useful often enough that in practice you’ll still usually end up including it.

Note that, while the package element needs a unique-identifier attribute and the spine element needs a toc attribute, those attributes’ values can be arbitrary strings; they don’t need to be exactly as shown in the overview here, and in fact they usually won’t. (Within the official EPUB specification, the given examples show their values as, respectively, "BookId" and "ncx".)

Each of the metadata, manifest, spine, and guide elements has a bunch of internal content which I skipped over in that structural overview for the sake of clarity. I will now go into each in turn.

(I won’t go over the tours element, which is listed in the OPF format specification and which readers support, but which is deprecated such that you probably shouldn’t use it.)

Metadata

The metadata element lists metadata for the book. The broad structure of the metadata element is as follows:

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>[The title of this book]</dc:title>
    <dc:identifier id="AN_APPROPRIATE_ID" opf:scheme="UUID">[IMAGINE A UUID HERE]</dc:identifier>
    <dc:language>[An RFC 3066/ISO639 language code; if you're writing in English, probably 'en']</dc:language>
    [Optionally a bunch of other metadata elements (which can also go before or interspersed with the above three, there are no hard limits on ordering)]
</metadata>

The metadata element, the dc:title element, the dc:identifier element with a defined id attribute, and the dc:language element are mandatory. The xmlns attributes on the metadata element, and the opf:scheme attribute on the dc:identifier element, are optional but highly recommended. (More detail on the opf:scheme attribute momentarily.)

There are fifteen metadata elements, standardized by Dublin Core, which serve as the primary basis of the EPUB metadata structure. With the exception of the previously-noted dc:title, dc:identifier, and dc:language elements, you can have zero or more of each. You need at least one of each of those three. To briefly summarize each DC element:

All of these with the exceptions of dc:identifier, dc:language, dc:date, dc:type, and dc:format can be optionally tagged with an xml:lang attribute, whose value is set to a language code of the same sort used in dc:language, in order to note that bit of metadata as being in a specific language. There’s no standardized implementation for what this means, but if your reader is smart enough to try to display content based on language settings, it lets you (for example) have multiple dc:title tags in different languages and let the reader pick the appropriate display title for the user’s language settings.

In practice, you won’t need to use most of these, especially the later ones, and most readers won’t display most of the later ones anywhere convenient. dc:creator, dc:subject, dc:description, dc:publisher, and the mandatory ones are all pretty useful; the rest are mostly skippable.

If you want to tag a bit of metadata that doesn’t fit neatly into any of the DC categories, you can do so via a meta tag with name and content attributes. For instance, the element <meta name="translated_from" content="ja" /> would indicate that your book’s translated-from metadata value is ja (or, in human terms, that the book is translated from Japanese), which none of the DC elements would neatly support listing. (Of course, you’re at the mercy of your reader when it comes to displaying these nonstandard metadata elements, so in practice it might not help very much.)
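Here is one way the minimal metadata block might be generated in Python. This is a sketch, not a reference implementation: build_metadata and its parameters are hypothetical names, the identifier is a freshly generated random UUID, and "BookId" is just the example id from the specification. Text is escaped so that titles containing characters like & stay well-formed.

```python
import uuid
from xml.sax.saxutils import escape, quoteattr


def build_metadata(title, language, book_id="BookId", extra_meta=None):
    """Return a <metadata> element as a string.

    `extra_meta` is an optional dict of name -> content for <meta> tags.
    """
    lines = [
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
        'xmlns:opf="http://www.idpf.org/2007/opf">',
        f"    <dc:title>{escape(title)}</dc:title>",
        # quoteattr returns the value already wrapped in quotes.
        f'    <dc:identifier id={quoteattr(book_id)} opf:scheme="UUID">'
        f"{uuid.uuid4()}</dc:identifier>",
        f"    <dc:language>{escape(language)}</dc:language>",
    ]
    for name, content in (extra_meta or {}).items():
        lines.append(
            f"    <meta name={quoteattr(name)} content={quoteattr(content)} />"
        )
    lines.append("</metadata>")
    return "\n".join(lines)


metadata_xml = build_metadata("Tom & Jerry", "en",
                              extra_meta={"translated_from": "ja"})
```

Whatever id you pass as book_id needs to match the unique-identifier attribute on the package element.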

Manifest

The manifest element lists the files making up your book. (Where, by ‘your book’, I mean the main EPUB body of your book, the part that the OPF file is describing. The manifest isn’t going to list your mimetype file or other higher-level structural elements, or your OPF file itself; but it lists everything below the OPF in the book structure.)

The broad structure of the manifest element is as follows:

<manifest>
    <item id="a_manifest_item" href="example.html" media-type="application/xhtml+xml" />
    <item id="another_manifest_item" href="example2.xml" media-type="application/xhtml+xml" />
    <item id="ANOTHER_APPROPRIATE_ID" href="toc.ncx" media-type="application/x-dtbncx+xml" />
    [Probably a bunch more items, for most books, although all you strictly need are the NCX and one file to go into the spine]
</manifest>

The manifest element is mandatory. It contains a list of item elements, each of which needs an id attribute (which should be unique among id attributes within the overall OPF file), an href attribute (pointing to a file, with path relative to the location of the OPF file in which the item is defined, and no fragment identifiers in the path), and a media-type attribute appropriate to the pointed-to file’s type.

There should be one item listing for each file in the book. The order in which the items are listed doesn’t matter, but no file should have more than one href pointing to it within the manifest.

Anything listed in the manifest needs to be contained within the EPUB’s ZIP archive; conversely, any files in the archive which aren’t listed in the manifest shouldn’t be referenced from anywhere in the files that are.
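The per-item rules above are easy to check mechanically before you write the OPF. The following Python sketch assumes the manifest is held as a list of (id, href, media_type) tuples; that representation and the name check_manifest are mine, not anything from the specification. It compares hrefs as plain strings, so it won’t notice two different spellings of the same path.

```python
def check_manifest(items):
    """Return a list of problems found in (id, href, media_type) tuples."""
    problems = []
    seen_ids = set()
    seen_hrefs = set()
    for item_id, href, media_type in items:
        if item_id in seen_ids:
            problems.append(f"duplicate id: {item_id}")
        seen_ids.add(item_id)
        if href in seen_hrefs:
            problems.append(f"file listed twice: {href}")
        seen_hrefs.add(href)
        # Manifest hrefs must not carry fragment identifiers.
        if "#" in href:
            problems.append(f"fragment identifier in href: {href}")
        # References inside the EPUB should be relative, not absolute.
        if href.startswith("/") or "://" in href:
            problems.append(f"href is not relative: {href}")
        if not media_type:
            problems.append(f"missing media-type: {item_id}")
    return problems


good = [
    ("a_manifest_item", "example.html", "application/xhtml+xml"),
    ("ncx", "toc.ncx", "application/x-dtbncx+xml"),
]
bad = good + [("a_manifest_item", "example.html#top", "application/xhtml+xml")]

good_problems = check_manifest(good)
bad_problems = check_manifest(bad)
```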

There’s a relatively limited list of file types assumed to be natively supported by all EPUB readers, which the readers are guaranteed to know how to deal with or fall back from. Those file types are:

(Any of these which contain text, rather than binary, data need to have the text encoded as either UTF-8 or UTF-16, incidentally.)

However, you can include files of formats other than these in the manifest. To do so, you need to define a fallback chain for that file, ending in a file of a supported format, so that readers which don’t know how to handle the unsupported formats can still display something when display of an unsupported file is called for by the book flow.

To mark the fallback option for an item in the manifest, set its fallback attribute’s value to the value of the id of the item it’s supposed to fall back on if needed. You can chain these together, so that item 1 falls back on item 2, item 2 falls back on item 3, and so forth. You can’t make the fallback chains loop, though; they need to terminate eventually, and they need to terminate on files of supported types.
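A fallback chain can be walked with a simple loop, which also makes it easy to catch the two failure cases (a loop, or a chain that ends on an unsupported type). In this Python sketch the manifest is a dict from id to a small dict of attributes, and supported_types is whatever set of media types you consider natively supported; both the data layout and the two-entry set in the example are assumptions for illustration, not the full supported list.

```python
def resolve_fallback(item_id, manifest, supported_types):
    """Follow fallback attributes from `item_id` to a supported item.

    Returns the id of the first item in the chain whose media type is
    supported. Raises ValueError if the chain loops or dead-ends.
    """
    seen = set()
    current = item_id
    while current is not None:
        if current in seen:
            raise ValueError(f"fallback loop at {current!r}")
        seen.add(current)
        item = manifest[current]
        if item["media-type"] in supported_types:
            return current
        current = item.get("fallback")
    raise ValueError(f"fallback chain from {item_id!r} ends on an unsupported type")


supported = {"application/xhtml+xml", "image/png"}  # illustrative subset only
manifest = {
    "item1": {"media-type": "application/x-example", "fallback": "item2"},
    "item2": {"media-type": "application/pdf", "fallback": "item3"},
    "item3": {"media-type": "application/xhtml+xml"},
}

resolved = resolve_fallback("item1", manifest, supported)
```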

For XML files which aren’t SVG, XHTML, DTBook, or NCX, the rules are a bit different. For an item pointing to one of those files, you need to give it a required-namespace attribute, whose value should be the XML namespace used by that file. If rendering the XML properly requires any modules beyond the defaults of the specified namespace, then you also need to give it a required-modules attribute whose content is a comma-separated list of the names of those modules (with spaces, if present, replaced with hyphens). You also have the option to, instead of or in addition to giving it a fallback attribute, give it a fallback-style attribute whose value is the id of a stylesheet which can be used to render the XML if the reader has no native knowledge of how to do so. (I don’t know how stylesheet-based XML rendering works, and thus can’t with any confidence recommend ever doing it; but the option nonetheless exists and seems worth noting, for the sake of completeness.)

Spine

The spine element lists a linear reading order for the XML files (XHTML, DTBook XML, or unsupported XML with appropriate fallbacks in place) which make up your core book content. The broad structure of the spine is as follows:

<spine toc="ANOTHER_APPROPRIATE_ID">
    <itemref idref="a_manifest_item" />
    <itemref linear="no" idref="another_manifest_item" />
    [Probably a bunch more itemrefs, for most books, although all you strictly need is a single itemref not marked as nonlinear]
</spine>

The spine element with a defined toc attribute is mandatory. The toc attribute’s value needs to be identical to the value of the id of the manifest entry for an NCX file, which will serve as the book’s table of contents. The spine needs to contain one or more itemref elements, listed in the order in which they should appear in the book; each one needs an idref attribute whose value is identical to the id of the manifest entry for one of the XML files which comprise the book’s content. The linear attribute is optional; more on it momentarily.

A given idref shouldn’t appear in the spine more than once. Any XML in the manifest which is reachable by the readers—via the table of contents, via the guide, via hyperlink from a file in the spine or reachable via one of the aforementioned methods, et cetera—has to be listed in the spine.

If an item in the spine should be outside of the main reading flow—a footnote, for example, which is linked but which readers probably don’t want to have to page through without following the link—it can be marked with the attribute linear="no"; not all readers respect this, but many do, and will take non-linear spine members out of the main reading flow and only show them when the reader follows a link to them. But the spine has to contain at least one linear element; you can’t have an entire spine full of non-linear itemrefs. If an item is marked as non-linear, there should be some sort of reference to it (hyperlink, TOC, et cetera) so that you can be confident readers will have a way to get to it at all.
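The spine rules can be checked in much the same way as the manifest ones. This Python sketch assumes the spine is a list of (idref, linear) pairs and the manifest is a dict from id to media type; those shapes and the name check_spine are hypothetical. It covers only the structural rules described above, not the reachability rule.

```python
NCX_TYPE = "application/x-dtbncx+xml"


def check_spine(toc_id, itemrefs, manifest):
    """Return a list of problems with a spine.

    `itemrefs` is a list of (idref, linear) pairs, where linear is a bool.
    `manifest` maps manifest item ids to media types.
    """
    problems = []
    # The toc attribute has to name the manifest entry for the NCX file.
    if manifest.get(toc_id) != NCX_TYPE:
        problems.append(f"toc={toc_id!r} does not name an NCX manifest item")
    idrefs = [idref for idref, _ in itemrefs]
    for idref in sorted(set(idrefs)):
        if idref not in manifest:
            problems.append(f"idref not in manifest: {idref}")
        if idrefs.count(idref) > 1:
            problems.append(f"idref appears more than once: {idref}")
    # At least one itemref must be part of the linear reading order.
    if not any(linear for _, linear in itemrefs):
        problems.append("spine has no linear itemref")
    return problems


manifest_types = {
    "a_manifest_item": "application/xhtml+xml",
    "another_manifest_item": "application/xhtml+xml",
    "ncx": NCX_TYPE,
}
spine = [("a_manifest_item", True), ("another_manifest_item", False)]

spine_problems = check_spine("ncx", spine, manifest_types)
```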

Guide

The guide element lists links to various significant parts of your book. Unlike the metadata, manifest, and spine elements, it’s optional, and you don’t need to include it. But you can, if you want.

The broad structure of the guide is as follows:

<guide>
    <reference type="text" title="Main text" href="example.html#start" />
    [Optionally a bunch more references]
</guide>

If the guide exists, it needs to contain one or more reference elements, each of which needs a type attribute and an href attribute with defined values. The title attribute is optional, and the specific values I used here for all three attributes are just examples, not mandatory.

The list of valid types is as follows:

If you want to include a type other than these ones, you can use arbitrary type names as long as they’re prefaced with other.. So you could, for example, have a reference element whose type is other.tldr to point at the TL;DR summary of your book. (Usual disclaimers apply, readers will plausibly ignore this.)

The EPUB format specification makes no mention of what happens if you include multiple guide elements of the same type. Undefined behavior is scary, so I’d recommend avoiding doing so.

The href points at the file being referenced: the cover image for the cover reference, the foreword for the foreword reference, and so forth. Unlike the href attributes in the manifest, you are allowed (although not required) to use fragment identifiers in the guide’s href attributes, as shown in the example above.

The title attribute is mentioned nowhere in the EPUB format specification except in its example code. (I have the impression the IDPF put less effort into defining the guide than the other parts, probably on account of how it’s not mandatory like the other parts are.) Nonetheless it seems to be standard to include it, and to have it be a simple human-readable description of the type.

The NCX File

The NCX file serves as the EPUB’s table of contents. As mentioned in the previous section, it has to be listed in the OPF’s manifest, and be identified in the OPF’s spine.

The NCX format is unusual in that it was originally specified for the DTBook format, rather than either for EPUB or for general web use. If you want to read the specification, it’s the eighth section of the specification here (archive). The NCX version used in the EPUB format is slightly different from the original DTBook version, but not substantially so.

A minimal NCX file would look like this:

<?xml version="1.0"?>
<ncx version="2005-1" xmlns="http://www.daisy.org/z3986/2005/ncx/">
    <head>
        <meta name="dtb:uid" content="[IMAGINE A UUID HERE]" />
    </head>
    <docTitle>
        <text>[The title of this book]</text>
    </docTitle>
    [Optionally a docAuthor here]
    <navMap>
        <navPoint>
            <navLabel>
                <text>The first TOC entry!</text>
            </navLabel>
            <content src="example.html" />
            [Optionally one or more navPoints nested under this one]
        </navPoint>
        [Optionally more navPoints]
    </navMap>
    [Optionally a pageList here]
    [Optionally a navList here]
</ncx>

…which is kind of a lot! But necessary, so let’s summarize it.

The XML version declaration is mandatory. In theory, you could put a doctype declaration under it; in practice, I actively recommend not doing so, for reasons I’ll get into shortly.

The root element is the ncx. It needs version and xmlns attributes with the values shown in that example. It contains:

As I mentioned, you probably want to not include a doctype declaration for the NCX. This is because, in the original NCX specification, it was mandatory to include a playOrder attribute in all navPoint, pageTarget, and navTarget elements, defining (in integer order, starting with 1) linear order through the book. The EPUB specification allows skipping this, since it’s redundant with the OPF’s spine and is kind of inconvenient; but compliance with the NCX DTD requires inclusion of the playOrder attributes, and the EPUB specification does require that you comply with the DTD if you include it. Thus, better not to include it.

The meta Elements

There are four natively-supported metadata names in an NCX file. They are:

If you want to include more for some reason, you can include whatever other metadata names you want, albeit without the dtb: prefix (which is reserved for those four). But in practice there’s not much reason to, since your real metadata-hub is the OPF, not the NCX.

(In the original NCX specification, all four dtb: meta names had to have defined values. But the EPUB version of NCX strips that requirement out, fortunately.)

The navMap Element

As mentioned, a navMap contains zero or more navInfo elements, zero or more navLabel elements, and one or more navPoint elements.

navInfo elements contain text elements which contain comments on their parent elements. You mostly don’t need to worry about them. navLabel elements will be important momentarily, but are ignorable within the immediate context of the navMap. So mostly the bit that matters is the part with the one or more navPoints.

Each navPoint represents an entry in your table of contents. A navPoint element contains one or more navLabel elements, a content element, and zero or more additional navPoint elements, allowing for arbitrary nesting. (This allows for nested tables of contents. I can have a navPoint for Chapter 1 and then additional navPoints nested inside that one for each individual section of Chapter 1, for example.)

Putting nesting aside: the navLabel elements contain text elements which contain whatever text you want the table of contents to associate with a given link. (<text>Chapter 1</text>, for instance, if your navPoint is pointed at Chapter 1.) The content element has no contents, but has a src attribute containing a path (relative to the NCX file) to the XML item it’s referencing. (And it should be an XML item. One listed in the OPF file’s spine, specifically.) Unlike the OPF manifest but like the OPF guide, fragment identifiers are allowed in the path.

The reason you can have multiple navLabel elements in a given navPoint is that they can, optionally, be given xml:lang attributes, same as are used in various metadata elements in the OPF, so as to offer the label in different languages. Also like the metadata elements in the OPF: there’s a good chance your reader won’t be smart enough to do anything beyond using the first one, so this option is a lot more useful in theory than in practice.
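Since navPoints nest arbitrarily, a recursive function is a natural fit for generating them. Below is a Python sketch that turns a nested list of (label, src, children) tuples into navMap markup following the minimal structure shown earlier. The tuple layout, the function name nav_points, and the file names in the example are all my own choices for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def nav_points(entries, depth=2):
    """Return lines of navPoint markup for (label, src, children) tuples."""
    pad = "    " * depth
    lines = []
    for label, src, children in entries:
        lines.append(f"{pad}<navPoint>")
        lines.append(f"{pad}    <navLabel>")
        lines.append(f"{pad}        <text>{escape(label)}</text>")
        lines.append(f"{pad}    </navLabel>")
        # src is relative to the NCX file and may carry a fragment identifier.
        lines.append(f"{pad}    <content src={quoteattr(src)} />")
        # Child entries become navPoints nested inside this one.
        lines.extend(nav_points(children, depth + 1))
        lines.append(f"{pad}</navPoint>")
    return lines


toc = [
    ("Chapter 1", "chapter1.html", [
        ("Section 1.1", "chapter1.html#section1", []),
    ]),
    ("Chapter 2", "chapter2.html", []),
]

nav_map = "\n".join(["    <navMap>", *nav_points(toc), "    </navMap>"])
```

Each src used here would also need to point at a file listed in the OPF spine.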

The pageList Element

Contains a list of numbered pages. As mentioned, this is optional and you probably don’t want to include one of these. If you do, though, it’s relatively similar to the navMap. Like the navMap, it can have navInfo and navLabel elements but you have very little reason to give it either. Instead of having one or more navPoint elements, though, it has one or more pageTarget elements.

Each pageTarget element needs an id attribute, unique within the NCX file. It also needs a type attribute, whose value can be either "front" (for roman-numeral-numbered pages in the front of the book), "normal" (for normal Arabic-numeral-numbered pages in the main book body), or "special" (for other pages). It also should have—not strictly mandatory, but highly recommended—a value attribute, containing an integer representation of the page number being targeted. Then it needs one or more navLabel elements and a content element, same as in a navPoint except minus the possibility of nesting.

The navList Element

Very much like a navMap, except instead of containing nestable navPoint elements it contains non-nestable navTarget elements which only include navLabel and content elements, no option of additional navTargets. Also, the navList itself requires a navLabel, in order to label what it’s a list of.

These are designed for use with things like lists of illustrations, which might be worth listing but which don’t belong in the main table of contents. Probably most of the time you don’t need to bother with these, but I’m including them here anyway for the sake of completeness.

Miscellaneous Notes on File Formats

So. That’s all of the book’s key structural elements summarized. But then you also need the pieces actually making up the book.

In practice, your book is probably going to consist mostly of GIF, JPEG, PNG, and SVG images, XHTML text, CSS stylesheets, and the one NCX file. Also officially supported are DTBook XML, XML 1.0 (in the marginal “must include fallback information” way described in my summary of the OPF manifest), and the deprecated OEBPS 1.2 document and stylesheet formats and XML 1.1 format; but, in practice, you’re probably not going to use those much.

(There are various formats treated as supported by the manifest which aren’t on the above list; that’s an eccentricity of the manifest. When I talk about ‘supported formats’ from here on out, I mean the twelve formats (if we count XML 1.0 and 1.1 separately) listed in the previous paragraph.)

So here’s a list of oddities in how EPUB files handle the aforementioned seven relatively-common formats:

Inline non-supported XML

As mentioned above, it’s possible to inline non-supported XML formats within your XHTML documents. An instance of this might look like:

<ops:switch>
    <ops:case required-namespace="http://www.w3.org/1998/Math/MathML">
        [MathML representing the equation "2 + 2 = 4"]
    </ops:case>
    [More cases, if you've got more unsupported XML formats you might want to fall back on]
    <ops:default>
        <p>2 + 2 = 4</p>
    </ops:default>
</ops:switch>

The ops:switch element is the root of the inline non-supported XML block. It contains zero or more ops:case elements, each with a required-namespace attribute and (if applicable) required-modules attribute of the same sorts previously described in my discussion of fallbacks for unsupported XML files in the manifest. Each ops:case element should then contain XML of whatever unsupported sort is identified by those attributes. The ops:switch also contains a single ops:default element, whose contents need to be well-formed XHTML.

The fallback chain is then implemented as: the reader looks at the ops:case elements in the order in which they’re defined. If it hits one whose namespace it knows how to display (MathML, in this example, if the reader supports it), then it displays the contents of that case. (So the first case it knows how to display is the one that gets displayed, in other words.) If it doesn’t know how to display any of the cases, then it displays the contents of the ops:default instead, as the final definitely-supported fallback.
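To illustrate that selection order from the reader’s side, here is a Python sketch of the logic. It is a model of the behavior described above, not code from any actual reader; select_branch is a made-up name, the sample markup is a toy, and I’m assuming the ops prefix is bound to the OPS namespace http://www.idpf.org/2007/ops.

```python
import xml.etree.ElementTree as ET

OPS = "http://www.idpf.org/2007/ops"  # assumed binding for the ops: prefix
MATHML = "http://www.w3.org/1998/Math/MathML"


def select_branch(switch, supported_namespaces):
    """Return the ops:case or ops:default element a reader would display."""
    for child in switch:
        # Cases are tried in document order; the first supported one wins.
        if (child.tag == f"{{{OPS}}}case"
                and child.get("required-namespace") in supported_namespaces):
            return child
    return switch.find(f"{{{OPS}}}default")


sample = """<div xmlns="http://www.w3.org/1999/xhtml" xmlns:ops="http://www.idpf.org/2007/ops">
<ops:switch>
    <ops:case required-namespace="http://www.w3.org/1998/Math/MathML">
        <math xmlns="http://www.w3.org/1998/Math/MathML"><mn>4</mn></math>
    </ops:case>
    <ops:default>
        <p>2 + 2 = 4</p>
    </ops:default>
</ops:switch>
</div>"""

switch = ET.fromstring(sample).find(f"{{{OPS}}}switch")
with_mathml = select_branch(switch, {MATHML})
without_mathml = select_branch(switch, set())
```

With MathML in the supported set the case is chosen; with an empty set the default is.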

Conclusion

…and that’s it! Writing this, I tried to make it the sort of summary which I wish I had been able to find when I was doing my own research on the EPUB format a few weeks back. At that, I think I succeeded, and I hope this post will prove as useful to others as it would have been to my past self.


Tags: EPUB