2013-07-24

When Converting a PDF to ePub, Always Use Caution

When creating an ePub or Mobi electronic book, it is always best to start with the source material.  Unfortunately, there are times when you are given a project and the only thing you have to work with is a PDF.  As tempting as it is to click that 'convert' button to make it an ePub, you should use caution.

Like many things in life, just because it is easy does not mean it is the best way to do something.  The automatic conversion process from PDF to an ePub format can cause a variety of issues, mainly stemming from bad code that can cause your document to be rendered in ways you may not have anticipated.

And no matter what, to get your ePub to look nice and clean, be it on an old smartphone with a small screen or a shiny new iPad, you will have to do some editing in the code to clean things up.

The first issue you will run across in the conversion process is that PDFs are formatted to fit a printed page, and it is rather difficult for the program you are using (for example Calibre) to tell the end of an actual paragraph from the end of a line on the page.

So your ePub will look much like this:


And the reason for the strange spacing and end of lines becomes apparent when you take a look at the code:


Each of those <p class="calibre2"> tags starts a new paragraph (ePub uses HTML code for display).  And when you open the CSS style sheet for your ePub you will find a lot of junk entries that can clog up your publication, and possibly cause problems for people trying to read your book.

Note: Purists may prefer to use a text dumping program like PDF2Text to just get a copy of the pure, unformatted text.  (This works provided the PDF has a text layer; if it does not, you will have to do a bit of OCRing.)

When editing your ePub using Sigil (which is a free ePub editor for Win/Mac/Linux) you can press F2 to switch between normal word processing-like mode and the editing code feature.

In the sample image above, if you wanted to start cleaning up the document you could do a Find/Replace:


That will eliminate a lot of the initial clutter.  And you can always add an overriding rule to the CSS code so that every paragraph in your document shows up the same.
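If you would rather script this kind of cleanup than run Find/Replace by hand, here is a minimal Python sketch.  It is only an illustration, and it rests on some assumptions: every printed line was wrapped in its own <p class="calibre2"> tag, the input is just the run of paragraphs from one chapter, and a line that ends in sentence punctuation closes a real paragraph.  The function name is made up for this example.

```python
import re

# Matches one converter-generated paragraph; "calibre2" is an assumed class name.
PARA = re.compile(r'<p class="calibre2">(.*?)</p>', re.DOTALL)

# Assumption: a line ending with one of these closes a real paragraph.
SENTENCE_END = ('.', '!', '?', '"', '\u201d')


def merge_broken_paragraphs(html):
    """Join consecutive converter paragraphs that look like one wrapped paragraph.

    Only the matched paragraphs are returned; any other markup in the
    input is dropped, so feed it a fragment, not a whole file.
    """
    merged = []
    buffer = []
    for match in PARA.finditer(html):
        line = match.group(1).strip()
        if not line:
            continue  # skip empty paragraphs left behind by the converter
        buffer.append(line)
        if line.endswith(SENTENCE_END):
            merged.append('<p>' + ' '.join(buffer) + '</p>')
            buffer = []
    if buffer:
        # Trailing lines that never reached sentence punctuation.
        merged.append('<p>' + ' '.join(buffer) + '</p>')
    return '\n'.join(merged)
```

The heuristic will split a paragraph in two whenever a sentence happens to end exactly at the end of a printed line, and it does not rejoin words hyphenated across lines, so you will still want to read through the result.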

As you go through the file you will start noticing a pattern for italics, bold, and other common formatting that will end up, due to the conversion, with their own calibre## entry in the CSS code.
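Once you have spotted which class means what, the swap can also be scripted.  The Python sketch below is just one way to do it.  The class names calibre3 and calibre4, and the idea that they stand for italic and bold, are hypothetical; check your own style sheet to see what each calibre## entry actually does before you build the mapping.

```python
import re

# Hypothetical mapping from converter classes to semantic tags.
# Look in your own CSS to find which calibre## class is italic or bold.
CLASS_TO_TAG = {'calibre3': 'em', 'calibre4': 'strong'}

SPAN = re.compile(r'<span class="(calibre\d+)">(.*?)</span>', re.DOTALL)


def replace_span_classes(html, class_to_tag=CLASS_TO_TAG):
    """Swap <span class="calibreN">text</span> for a semantic tag when N is mapped."""

    def swap(match):
        tag = class_to_tag.get(match.group(1))
        if tag is None:
            return match.group(0)  # unknown class: leave the span untouched
        return '<{0}>{1}</{0}>'.format(tag, match.group(2))

    return SPAN.sub(swap, html)
```

A regular expression like this does not cope with spans nested inside other spans, so treat it as a first pass and fix the leftovers by hand in the code view.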

After you start editing your document, the code will become far less intimidating.  After you have cleaned it up you should end up with something that looks like this:


The good thing about clean code like this is that the text will flow perfectly no matter what program or device your end customer is using to read your book.


And in the end, the customer will remember if they had a bad experience with your ePub or Mobi document previously, and may hesitate to get your next book.  So it always pays to take some extra time to make your ePub or Mobi book with clean code.