E-Book Enlightenment

Making DjVu Files

Introduction

Writing this book has been a real education for me, and I learned a few things I did not expect to learn.  The most surprising thing I learned is that DjVu does not always give a smaller file size than PDF!  Since the only reason to prefer DjVu to PDF is to get a smaller file that uses less memory, it is important to understand when PDF will give the smaller file size.  Making a DjVu is more work than making a PDF, so you need to know when it is a waste of your time.

In the chapter on creating book scans, I talk about two methods of doing them.  The first method (entitled "The Road Less Travelled") preserves the look of the original page, including the color of the paper, the margins used, etc.  The second method (entitled "The Easier Road: Scan Tailor") looks for pages with nothing but text and makes these pages have pure black letters on a pure white background.

If you do the first method, DjVu can help give you smaller file sizes.  Here is a comparison:

-rw-rw-r--. 1 jim jim  87606063 2010-05-15 14:08 BoysAviationJPGs.djvu
-rw-rw-r--. 1 jim jim 182866779 2010-05-15 16:36 BoysAviationJPGs.pdf

This is a Linux directory listing showing a DjVu file and a PDF of the same book, both made with the method that preserves the look of the original pages.  The .djvu file is less than half as large as the PDF.  Now let's look at files created with the Scan Tailor method, which preserves the content of the pages but changes their look:

-rw-rw-r--. 1 jim jim 121069444 2010-05-15 13:14 BoysAviationScanTailor.djvu
-rw-rw-r--. 1 jim jim  56796427 2010-05-15 13:11 BoysAviationScanTailor.pdf

Two things here are surprising.  The .djvu file is more than twice the size of the PDF (though still smaller than the PDF made with the first method).  Even more surprising, the PDF made using the Scan Tailor method is the smallest file of the four, roughly 30 megabytes smaller than the next smallest.
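To make the comparison concrete, here is a small Python sketch that works out the ratios from the two listings.  The byte counts are copied from the listings; the names sizes, ratio and smallest are just ones I made up for this illustration.

```python
# Byte sizes copied from the two directory listings above.
sizes = {
    "BoysAviationJPGs.djvu": 87606063,
    "BoysAviationJPGs.pdf": 182866779,
    "BoysAviationScanTailor.djvu": 121069444,
    "BoysAviationScanTailor.pdf": 56796427,
}


def ratio(a, b):
    """Return the size of file a divided by the size of file b."""
    return sizes[a] / sizes[b]


# The file with the fewest bytes is the baseline for the comparison.
smallest = min(sizes, key=sizes.get)

# Print the files from smallest to largest, with each size in
# (decimal) megabytes and as a multiple of the smallest file.
for name in sorted(sizes, key=sizes.get):
    megabytes = sizes[name] / 1e6
    multiple = sizes[name] / sizes[smallest]
    print(f"{name:30s} {megabytes:7.1f} MB  {multiple:.2f}x")
```

Calling ratio("BoysAviationScanTailor.djvu", "BoysAviationScanTailor.pdf") gives the DjVu-to-PDF ratio for the Scan Tailor pair, and swapping in the other two names does the same for the first method.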

How to explain this?  Compression looks for redundant information and replaces the raw information with a shorter description of that information.  In "lossy" encoding schemes, compression also looks for information that would not be missed and discards it to make the file smaller.  When you have pages with pure black text on pure white backgrounds that are already compressed, there is little redundancy left to remove, so an attempt to compress such a file even further might make the file larger than it was to begin with.
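You can see the same effect with any general-purpose compressor.  The sketch below uses Python's zlib module, which is not the compression that PDF or DjVu applies to page images; it is only a stand-in to show the principle.  The pseudo-random bytes are likewise an assumption of mine: they play the part of data that has already been compressed and has no redundancy left.

```python
import random
import zlib

# Highly redundant input: the same sentence repeated many times.
redundant = b"pure black text on a pure white background. " * 2000

# Stand-in for already-compressed data: pseudo-random bytes have no
# patterns left for a compressor to describe more briefly.
rng = random.Random(0)  # fixed seed so every run uses the same bytes
noise = bytes(rng.getrandbits(8) for _ in range(len(redundant)))

packed_redundant = zlib.compress(redundant)
packed_noise = zlib.compress(noise)

# Show the size of each input before and after compression.
print("redundant:", len(redundant), "->", len(packed_redundant))
print("noise:    ", len(noise), "->", len(packed_noise))
```

The repeated sentence shrinks to a small fraction of its size, while the random bytes come out slightly larger than they went in, because the compressor has to add its own bookkeeping and finds nothing to remove in exchange.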

On the other hand, a book that has lots of illustrations may produce a larger file using Scan Tailor than using the other method.  The third book I scanned had illustrations on almost every page, mixed in with the text.  Because Scan Tailor could not save such pages as pure black and white images, the resulting PDF was twice the size of the version made the other way.  (It must be said that Scan Tailor did a beautiful job of laying out the pages.  Smaller file sizes are not the only reason to use Scan Tailor.)

If this explanation doesn't make sense to you, just remember this: if you use the Scan Tailor method of preparing your page images and your book has only a few illustrations, don't bother making a DjVu file.  A PDF will do just fine.

Even if you resize and compress page images not created with Scan Tailor before making a PDF, DjVu can still give you a smaller file.  Here is an example:

-rw-rw-r--. 1 jim jim 49519200 2010-05-30 08:25 ArabianNights.djvu
-rw-rw-r--. 1 jim jim 69192729 2010-05-30 07:29 ArabianNights.pdf

The DjVu version is about 20 megabytes smaller.

DjVu Libre

To make DjVu files you need to install DjVu Libre.  This software is available in the package repositories of most Linux distributions.  Windows and Macintosh users can download their versions here:

http://djvu.sourceforge.net/index.html 

There are two command-line programs in this package that we need to use.  The first is named c44, and its job is to convert our .jpg files into .djvu files with improved compression.  You can run it on a single file like this:

c44 filename.jpg

Regrettably, there is no way to run c44 on a group of JPEGs; each invocation of the program converts just one file.  Fortunately, there is a way to run c44 on every JPEG in a directory without typing in the command over and over.  You can use a simple Python program like this one, which should be put in a file named makedjvus.py:

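#!/usr/bin/env python3
"""Run c44 on every image file named on the command line."""
import glob
import getopt
import subprocess
import sys

def make_djvus(filename):
    """This function is called
    for each image file."""

    # c44 names its output after the input file,
    # replacing the suffix with .djvu.
    subprocess.call(["c44", filename])
    print('filename', filename)

if __name__ == "__main__":
    try:
        opts, args = getopt.getopt(sys.argv[1:], "")
    except getopt.error as msg:
        print(msg)
        print("This program has no options")
        sys.exit(2)
    # The shell normally expands *.jpg before we see it.  If we get
    # a single argument (such as a quoted pattern), expand it here.
    if len(args) == 1:
        print('using glob')
        args = glob.glob(args[0])
    # Convert the files in file name order.
    for filename in sorted(args):
        make_djvus(filename)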

When you have this installed on your system, run it like this:

python makedjvus.py *.jpg

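Your current directory should be the one with the JPEGs to convert.  The c44 program must be in your system PATH, and if makedjvus.py is stored in some other directory you need to give its full path on the command line.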

When you have all the files converted, it's time to use the second command-line program, djvm, to combine the single-page .djvu files into a complete document, also with the suffix .djvu:

djvm -c BookTitle.djvu *.djvu

The -c option tells djvm to create a new document.  The first name after it is the document file to create, and everything after that is the names of the page files to include in the document.
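One thing to watch with the *.djvu wildcard is that it will also match BookTitle.djvu itself if that file is left over from an earlier run.  Here is a minimal Python sketch of one way to avoid that by building the djvm command yourself.  The function name build_djvm_command is my own invention for this example, and the sketch only prints the command rather than running it.

```python
import glob


def build_djvm_command(output_name, page_files):
    """Return the djvm argument list that bundles page_files
    into the document output_name."""
    # Sort so the pages go in file name order, and leave out the
    # output file in case an earlier run already created it here.
    pages = sorted(p for p in page_files if p != output_name)
    return ["djvm", "-c", output_name] + pages


if __name__ == "__main__":
    command = build_djvm_command("BookTitle.djvu", glob.glob("*.djvu"))
    # Show the command line that would be run.
    print(" ".join(command))
```

Passing the returned list to subprocess.call would run djvm, assuming DjVu Libre is installed.  The sort is a plain alphabetical one, so this only gives the right page order if your page files have zero-padded numbers in their names (page001.djvu, page002.djvu and so on).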

Here is my .djvu file, being viewed with DjView3 on Linux:

Boys Aviation DjVu