Donating Texts To Project Gutenberg
If I were to sum up my impressions of the Internet Archive versus Project Gutenberg, I would say that the Internet Archive focuses more on preserving as many books as possible, whereas Project Gutenberg is more focused on quality. Not only does PG require extensive proofreading of its texts, it also requires a copyright clearance before it will even consider accepting a book. It is much less work to create a submission to the Internet Archive than it is to submit a text to Project Gutenberg. If you wanted to donate the same book to both organizations, then by the time your Image Container PDF was ready to submit to the Internet Archive, the work of creating a text for Project Gutenberg would have only just begun.
Copyright Clearance
The first thing you need to do to create a submission to Project Gutenberg is to get a copyright clearance for the book by submitting a TP & V to the website using a form on the site. TP & V refers to the Title Page and Verso of your book. You'll need to scan both and submit the image files. Either one or the other should show a publication date before 1923 (unless you think you can get a Rule 6 clearance because the copyright was not renewed). Here is a title page for a book that I bought recently:
Here is the Verso, which is just the back of the Title Page. Both give a publication date of 1916.
I bought this book in case my first submission was rejected. Happily it wasn't, so I am able to avoid proofreading a geography textbook from 1916. The text I did submit is an English translation of the Pierre Louys novel Ancient Manners published in Paris as a limited edition for subscribers in 1906. Project Gutenberg already had the novel in French under the title Aphrodite, but did not yet have an English translation.
My TP & V for Ancient Manners did not show a publication date, but the Open Library website gave a publication date of 1906. The woman who processed my request told me that the Open Library website was not a good enough authority for this purpose, but she had checked the Library of Congress website and had concluded that 1906 was a plausible publication date for the book.
When you submit a book to PG that was published after 1922 but did not have its copyright renewed you must follow the procedures in the Rule 6 HOWTO:
http://copy.pglaf.org/rule6-new.txt
If your submission is cleared, the email you get from Project Gutenberg should look like this:
Project Gutenberg copyright clearance submission automated status update notification.
Title: Ancient Manners
Author1: Pierre Louys
Status: Cleared OK Clearance OK key=20100611081931louys
You're going to need the clearance key (the value after key= in the message above) to submit your book.
The next step is to do OCR on the page images you created for your book, which will create a separate text file for each page. If you want to do all the work yourself, use the proofer.py utility from the chapter on creating text files to proof one page at a time before combining them. Format the text as best you can and run gutcheck, jeebies, and a spell checker (like the one built into guiguts) on it. Make the file as error-free as you can before submitting it to PG. PG has volunteers known as whitewashers (after the best-known passage in Tom Sawyer) who check over submitted texts. The first thing they'll do is run gutcheck, jeebies, and gutspell on your file. If these utilities show lots of errors they won't even look at your submission, so be considerate of their time and run these tests yourself.
Creating a submission to PG can be a lot of work, but if you want the donation to happen as quickly as possible it's the only way. If you aren't in a hurry you can donate page images to Distributed Proofreaders instead.
Distributed Proofreaders
To meet the standards of Project Gutenberg a Plain Text file will need a lot of proofreading, preferably by more than one person. Distributed Proofreaders is a website where hundreds of volunteers proofread and correct individual text pages by comparing the text to an image file showing the page it corresponds to. There are several "rounds" of proofreading, and when those are finished a Project Manager combines the individual pages, does some final checks, and adds the current Project Gutenberg license text. It may be offered to DP volunteers for "Smooth Reading", where the volunteer reads the book for pleasure and identifies any problems he spots. It then gets submitted to Project Gutenberg. The Distributed Proofreaders site is at:
http://www.pgdp.net/
You don't need to submit your book to DP to get the book submitted to Project Gutenberg, but I think it's a good idea. As a computer programmer I know all too well that it is difficult to find flaws in your own work, and much easier to spot flaws in the work of others. As a practical matter, it isn't really necessary to remove the beam from your own eye before you look for motes in other people's eyes. If we all check each other's eyes everything will ultimately get cleaned out.
To submit your work to DP you'll need a copyright clearance from Project Gutenberg first. When you get that, contact DP using the email address given in the Content Provider's FAQ:
http://www.pgdp.net/c/faq/cp.php
You need to let them know your intention to submit a text for proofreading. Provide the copyright clearance information you got from Project Gutenberg in the email.
Once you have that, prepare individual text files corresponding to the page images in your book. DP wants blank pages to be included in the page images, starting from the inside of the front book cover and continuing to the inside of the back cover. The text files should have MS-DOS style line endings, which means that every line of text ends with two characters, a carriage return followed by a line feed, not just a line feed. You can verify that your text files have the correct line endings by trying to edit them with Notepad in Windows. If a file doesn't have the right kind of line ending, Notepad will show it as having no line endings at all, with funny characters where the line endings should be. If you find yourself with text files like this, one way to fix them is to load them into WordPad and save them back out again. WordPad can handle Unix-style line endings and will change them to MS-DOS style when you save.
If you're using Linux, a less labor-intensive way of doing the line ending conversion is to use the unix2dos command:
unix2dos *.txt
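If you aren't on Linux, or would just like to know which files need fixing before you change anything, the same check and conversion can be sketched in a few lines of Python. This is only an illustration and not a replacement for unix2dos. The function names are my own invention, and the sketch assumes the files use either Unix or MS-DOS line endings (not the lone carriage returns of old Mac files).

```python
from pathlib import Path
from typing import List


def has_crlf_endings(data: bytes) -> bool:
    """True if every line feed in data is preceded by a carriage return."""
    return data.count(b"\n") == data.count(b"\r\n")


def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    # Normalize everything to LF first, then expand every LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_directory(directory: str) -> List[str]:
    """Rewrite each .txt file that is not already CRLF; return the names changed."""
    changed = []
    for path in sorted(Path(directory).glob("*.txt")):
        data = path.read_bytes()
        if not has_crlf_endings(data):
            path.write_bytes(to_crlf(data))
            changed.append(path.name)
    return changed
```

Calling convert_directory("text") would rewrite only the files that need it and return their names, so you can see how many pages were affected.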
The page images should be in PNG or JPEG format and should be 1000 pixels wide. You can convert TIFFs to black and white JPEGs of that width using ImageMagick's mogrify command like this:
mogrify -resize 1000 -monochrome -format jpeg *.tif
Both images and text files should be named as three-digit numbers followed by the suffix. If you use the guiprep utility mentioned in the chapter on creating Plain Text files it will do the renaming of the files for you, and will run a program called pngcrush, which reduces the disk space required for your PNG files without affecting the quality of the image.
While guiprep assumes you will be submitting PNGs, JPEGs are also acceptable and will take much less disk space. There is no real advantage to using PNGs for your submission to Distributed Proofreaders.
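If you'd rather not use guiprep, the renaming itself is simple enough to sketch in Python. The sketch assumes that your existing file names already sort into page order, that numbering starts at 001, and that copying into a fresh directory is acceptable. The function names are hypothetical, and it does not run pngcrush.

```python
import shutil
from pathlib import Path


def numbered_names(filenames, start=1):
    """Map each filename, in sorted order, to a three-digit name with the same suffix."""
    mapping = {}
    # Plain sorting is alphabetical, so "page10" sorts before "page2";
    # this only works if the original names are already zero-padded or otherwise ordered.
    for number, name in enumerate(sorted(filenames), start=start):
        if number > 999:
            raise ValueError("more than 999 files; three digits are not enough")
        mapping[name] = f"{number:03d}{Path(name).suffix.lower()}"
    return mapping


def copy_numbered(source_dir, target_dir, pattern):
    """Copy the files matching pattern into target_dir under their numbered names."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    mapping = numbered_names(p.name for p in source.glob(pattern))
    for old, new in mapping.items():
        shutil.copy2(source / old, target / new)
    return mapping
```

You would call copy_numbered once for the page images (for example with the pattern "*.png") and once for the text files ("*.txt"). Page 001.txt will only match 001.png if both sets list the pages in the same order, so check a few pairs by eye afterwards.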
If your book is illustrated DP will ask you to provide high quality JPEGs of the illustrations, named to correspond to the page they appear on. These illustrations may be used to create an HTML version of the book. It is likely that you will be asked to use a flatbed scanner to create these illustrations, to avoid problems like inadequate lighting and keystoning. They will ask for color scans, even if the original is grayscale, because they want the color of the paper to come through. You should also not crop the images yourself. Leave the cropping to the person who will create the HTML version of the book.
When you have all this, put the text files in a directory named text, the page images in a directory named pages, and the illustrations in a third directory, which you can name illustrations or something similar. Then put all of these directories in a Zip file named
DPusername_ShortTitle.zip
where DPusername is the account you have on the DP site and ShortTitle is a shortened version of the book title with no spaces. You will also need to prepare a separate text file named
DPusername_ShortTitle_README.txt
which will contain notes on the book. For my own submission to Distributed Proofreaders I included the following information:
- The copyright clearance number I was given by Project Gutenberg. This is the most important piece of information in the README and must be provided.
- The pages of the original book are not correctly numbered. There are many full page illustrations and after each one the page numbering skips a page or two. I have verified that no pages are missing from my submission.
- There are several places in the text where Greek words are used in the original alphabet (as in the illustration above). Having these words rendered in the Roman alphabet or simply replaced with "(Greek words)" should not hurt the book much.
- I have made an attempt to hand correct most of the text files before submitting them.
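The packaging step described above can be sketched with Python's standard zipfile module. Treat this as one possible way to do it under my own assumptions: the three directory names are the ones suggested earlier, the account name "exampleuser" in the usage note is made up, and the function names are not part of any DP tool.

```python
import zipfile
from pathlib import Path


def package_names(dp_username, short_title):
    """Return the Zip file name and README name for a submission."""
    # The short title must contain no spaces.
    base = f"{dp_username}_{short_title.replace(' ', '')}"
    return f"{base}.zip", f"{base}_README.txt"


def build_zip(project_dir, dp_username, short_title,
              subdirs=("text", "pages", "illustrations")):
    """Zip the listed subdirectories of project_dir and return the Zip file's path."""
    project = Path(project_dir)
    zip_name, _readme_name = package_names(dp_username, short_title)
    zip_path = project / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for subdir in subdirs:
            folder = project / subdir
            if not folder.is_dir():
                continue  # e.g. a book with no illustrations
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # Store paths relative to the project so the directories are kept.
                    archive.write(path, path.relative_to(project).as_posix())
    return zip_path
```

For example, package_names("exampleuser", "Ancient Manners") gives the names exampleuser_AncientManners.zip and exampleuser_AncientManners_README.txt. The README is kept as a separate file and is not placed inside the Zip.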
Until recently the next step would have been to use an FTP client to copy these files to a directory on DP's FTP server. Just before I was ready to submit my own work, DP shut down their FTP server because of problems with computer viruses. As of this writing an ordinary Content Provider cannot upload content. You need to be a Project Manager, a privilege only granted to those who have proofread hundreds of pages. The current method to get your content uploaded is to place it on a server where a Project Manager can download it. One service popular with DP users is Dropbox.
It gives you a free website where you can post files for downloading by others. After you have your files on Dropbox, go to the wiki page at:
http://www.pgdp.net/wiki/
and start a new section for yourself and list your project. There is a special Wiki template you'll be required to copy and fill in.
When I did this, someone on the site suggested making a posting to the "Books I'd Like To See In PG" Forum topic describing my book to potential Project Managers. Proofreading books is not necessarily first-in, first-out. If your book sounds more interesting than the next one in the queue it might get a Project Manager sooner. I did this, and did manage to get a Project Manager interested. He warned me that it might still be over a year before the book made it to PG. I told him I was OK with that, and I was at the time. Less than a year later I decided to create the PG submission myself.
After you've done all that you might consider doing some proofreading of other people's books. Information on how to do that is on the site.
If your native language is not English, or if the book you're submitting is not in English, you'll want to work with Distributed Proofreaders Europe.
This is also the place to submit books that are meant for Project Gutenberg Australia.
If you have books in English or French where the author has been dead fifty years or more, you could donate them to Project Gutenberg Canada. They, too, have their own Distributed Proofreaders site:
http://www.pgdpcanada.net/c/default.php
This is the place where I submitted my Robert C. Benchley collection. You also need a copyright clearance before you can submit a book to PG Canada, but it is generally easier to get because the only thing that needs to be verified is the author's death date. For my Benchley books this meant that the author's words, drawings, and stills from some comedy shorts he appeared in were all clearable [1]. Regrettably, his illustrator Gluyas Williams outlived him by many years so those charming illustrations could not be included in my submissions.
Making A Web Page From A Plain Text File
If you're going to make a submission to Project Gutenberg directly (rather than going through Distributed Proofreaders) you'll need to create a Plain Text and an HTML version of your submission. The PG website has a pretty good article on how to convert your Plain Text file to a web page by hand. What they don't tell you, unfortunately, is that there is really no need to do that. Guiguts, the editor I recommended for creating your Plain Text file, has a dialog that will automatically insert HTML markup into your page. You'll find it under the Fixup menu, option HTML Fixup, and it looks like this:
It really is as simple as making a copy of your Plain Text file with a .html suffix, loading it into guiguts, and pressing the Autogenerate HTML button. When I did that with The Big Sleep it generated HTML that included a nice style sheet:
The resulting web page looks like this:
Notice that the style sheet justifies the text so both left and right margins are a straight line, just like so many printed books do. The first line of text automatically becomes the title and the second line is automatically the author.
While this is a great time saver, you'll still need to do two things by hand:
- Text that is italicized in the original will need to have <i> and </i> tags added (and bold text <b> and </b> tags).
- Chapter headings will need to have <h2> and </h2> tags added.
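To make the two hand edits concrete, here is a small Python sketch of what the markup looks like on individual lines. It rests on assumptions that may not hold for your book: that italics are marked with underscores in your Plain Text file, and that chapter headings are lines beginning with the word CHAPTER. The function name is made up for this example, it is not part of guiguts, and you would still need to check the results by eye.

```python
import re


def mark_up_line(line):
    """Wrap a chapter heading in <h2> tags, or turn _underscored_ text into <i> tags."""
    stripped = line.strip()
    # Assumption: a heading is a line such as "CHAPTER I" or "CHAPTER XII. THE MIRROR".
    if re.fullmatch(r"CHAPTER\s+\S+.*", stripped):
        return f"<h2>{stripped}</h2>"
    # Assumption: italics appear as _word_ or _several words_ in the Plain Text file.
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", line)


print(mark_up_line("CHAPTER I"))
print(mark_up_line("She was _very_ tired."))
```

A line that matches neither pattern comes back unchanged, so running this over text that has no such markup does no harm.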
Examples
As of now I have completed two submissions to Project Gutenberg, the first of which was Benchley Beside Himself for PG Canada. It went in on November 26, 2010 and you can check it out by looking on the site for books by Robert C. Benchley. The Big Sleep, my second submission to PG Canada, went in on January 11, 2011. The site announced the availability of The Big Sleep like this:
I have donated several other texts to Distributed Proofreaders and DP Canada. Ancient Manners went to DP, and DP Canada got two more Benchley books plus three more Raymond Chandler novels. There is a substantial queue of books waiting to be proofed ahead of these.
On my first submission to PG (Ancient Manners) I did two things right and everything else wrong. The two things I did right were:
- I chose a book published before 1923. Rule 6 clearances are very difficult to get. I have submitted several TP&V's to PG in hopes of getting Rule 6 clearance. None were cleared.
- I put the book on the Internet Archive. Granted, the book pages were dingy from insufficient lighting but having the book there did attract the interest of a Project Manager.
As for the things I did wrong:
- My original page images were poorly lit, and could not easily be converted into readable black and white PNGs.
- I did my own OCR. This is not necessarily a mistake, but since my original page images were of poor quality (due to the age of the book and inadequate lighting) it was difficult to convince anyone that ABBYY Fine Reader would not have done a better job on the OCR than Tesseract did. It is possible that it would have.
- I spent many hours correcting my text files before submitting them. Because of this my PM felt obligated to use them, whereas if I had just given him the scans he would have tossed out my text files and put the scans in the OCR Pool.
- I left out blank pages that were in the original book. I had also chosen a book where the page numbering was faulty, where missing page numbers did not always mean missing pages.
Eventually I scanned all the illustrations in color at 600 DPI using a flatbed scanner, and my PM was able to clean up my original JPEG's enough to make usable page images out of them. He also added blank pages to correspond to page numbers that did not exist.
My PM suggested the following:
"James, being the one who is working on your current project I strongly request you leave the OCR to the person that will be PMing the project.
"JPEG is the perfect format to send the person, make no changes to the photographed images. Text should be in 300 DPI color and illustrations need to be at least 600 DPI color. Do not save anything in B&W. Grayscale is ok for text pages but always use color for illustrations even if the illustration is in B&W.
"Include EVERY blank page from the first page of print to the last page of print. This is a DP requirement.
"... I ask (that) you do nothing more than scan the pages of the book in the future. I/the PM have tools that will make good use of the scans and create what is needed by DP. With a good set of scans I can do most the image work within an hour and have the text prepped and the project checked by the end of a second hour based on an average sized project. Special attention to illustrations is the only thing that takes longer."
The moral of this is to find out what your PM wants before you do the work. Creating page images and text files is not that difficult or time consuming, and if your originals are not of the best quality the PM may prefer to do this himself.
For my first submission to DP Canada I did better. I picked a book (Inside Benchley) where the author (Robert C. Benchley) had been dead more than 50 years. I used a flatbed scanner to scan the pages 2-up, 300 DPI in black and white, PNG format, then used Scan Tailor to create page images from them. I did OCR on the TIFF's that Scan Tailor had created and the text files came out needing very little correction. (The original book was in excellent condition). I made black and white PNG's 1000 pixels wide out of the TIFF's using Image Magick mogrify. I scanned pages with the illustrations that Benchley had done himself as 600 DPI color JPEG's.
My PM for DP Canada turned out to be a Benchley fan with no objection to Tesseract.
While this submission was more successful than Ancient Manners turned out to be, I can't simply recommend always using the approach I used with the Benchley book. Ancient Manners is too large to be scanned 2-up, and when you scan it as 300 DPI black and white PNGs you get pages that are not easily read by humans or OCR. Color scans would be fine, but would take days to do, and the book had some defective pages where the inner margin was so small as to make scanning impossible. Digital pictures with good lighting are what the book really needed. If I were doing everything over again I would create good digital pictures with bright and even lighting, use Scan Tailor to create page images with content in the original colors and white borders, and submit these as high quality JPEGs to the OCR pool. I would also create extra blank pages to make up for the missing page numbers. I would still do color scans at 600 DPI for the illustrations.
For Benchley Beside Himself, I decided to do the proofreading myself and create my own Plain Text and HTML files. This book is filled with short humorous articles that are easy and enjoyable to proofread.
My next submission to DP Canada was Chips Off The Old Benchley. For that one I did 2-up scans in greyscale on a flatbed scanner, used Scan Tailor to create pages with white borders, then used Image Magick to make JPEGs out of these. I submitted the JPEGs in a Zip file to DP Canada without doing any OCR on them. My PM was perfectly happy to do the OCR on these using ABBYY Fine Reader.
My latest submission to PG Canada was The Raymond Chandler Omnibus, which contains the first four Raymond Chandler novels. I gave all the page scans to DP Canada, but informed them that I would be doing The Big Sleep myself. When a book is that good, you don't need help to proofread it.
In the future I will likely only do OCR on books I intend to submit directly to PG.
This brings me back to Ancient Manners. After almost a year had passed since I submitted the pages to DP I looked on their site and could see no evidence that the book was even in the queue to be worked on. I decided to take the work I had prepared and finish it myself. The book was very difficult to work on. It had 90 illustrations, footnotes, Greek citations that needed to be transliterated, plus accents, umlauts, ligatures, and lots of OCR errors that I could only find by repeated readings of the text. Project Gutenberg estimates that it takes forty hours of work to prepare an e-book for their site. I think I took twice that long. On Saturday, June 12, 2011 the book was finally accepted. On its first day it was downloaded 21 times. By Monday it was number 19 on the Top 100 Downloads, coming in just ahead of War and Peace. It had been downloaded 405 times by then.
I created a custom EPUB with separated chapters, resized images, and other tweaks to create something that could be converted to a Kindle Store-worthy MOBI file and posted it to the Kindle Store. To date I have sold only one. A couple of weeks later I found that Amazon itself had taken my original donation to Project Gutenberg and created a book for the Kindle Store with it. Their book has no illustrations, but does have the captions for the illustrations. There is no visual cue that these are captions for missing illustrations. They just look like ordinary paragraphs, so you'll be reading along and some text you read previously is mysteriously repeated.
The e-book is free on the Kindle Store, so at least they priced it correctly. I put in a comment explaining the problem with the book and telling customers where they could get a nicely formatted version of the book. The comment was accepted. That comment may have led to my one sale.
In any case, be aware that any donations you make to Project Gutenberg may wind up on the Kindle Store as a free book.
[1] We ended up omitting the stills from the short subjects. Canadian copyright law is clear on photographs, but not so clear on stills from movies published in books.