eBooks - HTML and EPUB

Fri 16 January 2015

I've written a novel and I'm in the final stages of revising it. Reading it on the computer screen was acceptable in the first few reviewing passes, but now I feel the need to read it as I read many books - on my phone. Not only does this give me a fresh perspective on what I've written, it also forces me to avoid editing the text directly as I go, which distracts greatly from reading it as a "reader", if you follow me. If I see something I don't like (and it's still alarmingly often!), I am restricted to leaving a short note in the phone app. I was using the Cool Reader app, but due to its weird user interface, I decided to switch to Moon Reader Pro.

So, to do this I had to convert the book from the format in which I wrote it to a format suitable for a mobile device. This blog post is based on the notes I kept whilst doing that.

I wrote the text using LibreOffice Writer (though I did try the distraction-free application FocusWriter for a while) and I saved it in .odt format. LibreOffice is a free and open source alternative to Microsoft Office and .odt is the open file format that it prefers to use (though you can use .doc and .docx if you prefer). I used text formatting sparingly: a header style (large font, bold) for chapter headings and italics in a few places in the text. I broke up chapters into sections using asterisks with blank lines above and below. Paragraphs were separated with blank lines (i.e. two newline characters) and no newlines were used inside a paragraph, i.e. I let the text editor wrap lines for me. No other formatting was used.

I decided to use ePub format for reading on my phone as this standard was clear, well-documented and also widely supported. There were also good technical reasons for using it. I tried two ways to do the conversion. The first was to use an extension for LibreOffice called Writer2ePub and the other was to use a separate application called Calibre. I got both to work, but it seems now that Writer2ePub isn't working so well with LibreOffice version 4 - see this comparison page.

In both cases, the output was acceptable, in that I could read it on my phone, but the chapters weren't automatically recognised as such and so there was no useful table of contents. Also, the blank lines I'd used to separate paragraphs, which were helpful when I was writing, resulted in large, wasteful and distracting gaps when viewed on my phone. I'm sure both issues could be addressed in Calibre, and perhaps in Writer2ePub, but I decided on a different path after listening to a couple of podcasts by Jon Kulp on his experiences with eBooks - a music one and one about Gutenberg eBooks. I recommend you listen to both, but if you don't have time, then there are some good links in his show-notes.

What I learned from Jon was that instead of starting from a .odt file (or .doc or .docx file), I could make the conversion to eReader (and other) formats simpler by using HTML and some simple CSS. That said, if, unlike me, you like rich GUIs with oodles of buttons, then this way of writing may not be to your taste. But before you proclaim yourself a luddite and unlikely to be able to write HTML, do have a look at a file or two - I provide a link to an example below. The amount of HTML you need to learn is tiny, and it's much easier than writing a web-page.

The other good reason to go with HTML and CSS is that they are used inside the ePub format - this is the good technical reason I referred to above. This means that you could create the ePub files manually, but I chose to let Calibre do the conversion work because some fiddly meta-data files are required by ePub. But understanding the HTML and CSS you created means that some post-conversion tweaking is made easier, and Calibre has features to help with that too.

I should also make clear that HTML is simple enough that I am comfortable editing it directly, i.e. I could ditch the need for .odt or .doc files and full-blown word processors such as LibreOffice Writer or Microsoft Word. A simple text editor will suffice, which in my case is Kate.

So, my first task was to get my text out of .odt format and into HTML format. LibreOffice can export to an .html file, but when I used it, the HTML it produced was full of extraneous markup. I was after simplicity and having loads of <p superfluousAttribute="distractingValue"> tags in my .html file was not on. No criticism of LibreOffice here - just that automated conversion tools inevitably generate output that is formatted to be syntactically correct, but not very human-readable.

Instead, I opted to save the file as text from LibreOffice and then do some tidying up on the command line. I've included the gory details below if you want to see them. Details aside, the two main things were to enclose all paragraphs inside a pair of <p> and </p> tags, and chapter headings inside <h1> and </h1> tags. Next I created a simple CSS file based on the example provided by Jon Kulp and was able to get my first look at my eBook by opening it in my web browser (Firefox). I was quite pleased with the result. I'm not ready to share my novel yet, but you can look at this example HTML file (NOTE: it's readable but ugly in a browser, but if you right click on it you should see the option to read the source code) and also the CSS file that goes with it. The main difference between my CSS and the one I took from Jon Kulp is the removal of some styles I didn't need and the addition of the in-chapter section break:

.secbreak {
    text-align: center;
    text-indent: 0;
}

This means that this snippet of HTML will produce the asterisk that breaks sections within a chapter:

<p class="secbreak">*</p>
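
If the section breaks were typed as a line holding a single asterisk, then wrapping every line in paragraph tags leaves them as plain <p>*</p> lines. A minimal sketch of adding the class in one pass, assuming that exact form (the file names sample.html and sample_secbreak.html are made up for illustration):

```shell
# Build a tiny sample file; the file names here are hypothetical.
printf '%s\n' '<p>First section.</p>' '<p>*</p>' '<p>Second section.</p>' > sample.html

# Rewrite only lines that consist solely of <p>*</p>.
# Using | as the delimiter avoids escaping the slash in </p>.
sed 's|^<p>\*</p>$|<p class="secbreak">*</p>|' sample.html > sample_secbreak.html

cat sample_secbreak.html
```

Paragraphs that merely contain an asterisk somewhere in the text are left alone, because the pattern is anchored at both ends of the line.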

Once I had my .html file, I started up Calibre, clicked the big, red "Add books" button and selected the .html file. Calibre loaded it and (without prompting me) turned it into what it calls "ZIP" format. Next, I selected that book in the main pane and hit the "Convert books" button. By default it selected ePub as the output format. I then added my name as the author; at this point you can enter any meta-data you wish, and add a book cover graphic.

Next I went to the "Look & Feel" section (left panel of the convert window). I left all settings at their defaults, but I pasted the contents of my CSS file into the "Extra CSS" pane. There's much more you can do, as described here, but I just hit the OK button and after a few seconds it completed the conversion and a link to the "EPUB" version appeared on the right of the main Calibre window. Clicking this opened a preview of my book complete with its table of contents, which let me jump straight to any chapter. Job done!
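
Calibre also ships a command-line tool, ebook-convert, which can do a similar conversion without the GUI. The following is only a sketch: the file names book_final.html, style.css and book.epub and the author name are placeholders, and the result may not match the GUI conversion exactly.

```shell
# Placeholder names: adjust to your own files and author name.
html_in="book_final.html"
css_file="style.css"
epub_out="book.epub"
author="Your Name"

if command -v ebook-convert >/dev/null 2>&1 && [ -f "$html_in" ] && [ -f "$css_file" ]; then
    # The output format is taken from the extension of the output file.
    # --extra-css accepts a path to a stylesheet; --authors sets the author meta-data.
    ebook-convert "$html_in" "$epub_out" --extra-css "$css_file" --authors "$author"
else
    echo "ebook-convert or the input files were not found; nothing converted."
fi
```

The guard around the command just keeps the script from failing noisily on a machine where Calibre is not installed.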

Manual conversion of odt to html

Here is the sequence of steps that I took to convert an .odt file to an .html file.

Open the .odt file in LibreOffice and go to File->Save As in the menu. In the dialog window choose File type as "Text - Choose Encoding (.txt)" and check the "Edit filter settings" option and click OK. In the filter, I selected LF (line feed) to be the newline character and UTF-8 encoding. The file is saved as book.txt.

The following steps were executed on the command line in Linux. If you don't have access to the command line, or you don't like it, you may be able to perform some of these operations using the "Find and Replace" feature of a good text editor.

Replace Chapter * with <h1>Chapter *</h1>

sed 's/^Chapter.*/<h1>&<\/h1>/' book.txt > book_h1chapters.txt

Remove double newlines that I'd used to separate paragraphs

sed '/^$/d' book_h1chapters.txt > book_nodoublenewlines.txt

Insert <p></p> tags around each line:

sed 's/^.*$/<p>&<\/p>/' book_nodoublenewlines.txt > book_paratags.txt

Turn <p><h1> into <h1>

sed 's/^<p><h1>/<h1>/' book_paratags.txt > book_ph1remove.txt

Turn </h1></p> into </h1>

sed 's/<\/h1><\/p>/<\/h1>/' book_ph1remove.txt > book_h1premove.txt
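
The five sed steps above can also be folded into a single invocation that skips the intermediate files. This is a sketch under the same conventions (chapter lines start with "Chapter", paragraphs are single lines separated by blank lines); sample_book.txt and sample_book_tagged.txt are stand-in names, not files from my book.

```shell
# Small stand-in for book.txt; the file names are hypothetical.
printf '%s\n' 'Chapter 1' '' 'It was a dark night.' '' '*' '' 'Morning came.' > sample_book.txt

# 1. drop the blank lines between paragraphs
# 2. wrap lines starting with "Chapter" in <h1> tags
# 3. wrap every line that is not already a heading in <p> tags
sed -e '/^$/d' \
    -e 's/^Chapter.*/<h1>&<\/h1>/' \
    -e '/^<h1>/!s/.*/<p>&<\/p>/' sample_book.txt > sample_book_tagged.txt

cat sample_book_tagged.txt
```

Because headings are excluded from the third expression, there is no need to strip stray <p> and </p> tags from around them afterwards.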

Then I manually edited it in my text editor (loaded as UTF-8) to do the following:

  • wrapped whole lot in <html></html>
  • added <head>...</head> containing <title>...</title> at the start
  • added <body>...</body>
  • changed Prologue from p to h1 tags to make it a chapter
  • saved it as book_manual.html
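
Those manual edits could probably be scripted as well. A rough sketch, assuming the Prologue line reads exactly "Prologue"; the title "My Novel" and the sample_ file names are placeholders for illustration:

```shell
# Stand-in for the tagged text produced by the sed steps; names are hypothetical.
printf '%s\n' '<p>Prologue</p>' '<p>It began quietly.</p>' \
    '<h1>Chapter 1</h1>' '<p>It was a dark night.</p>' > sample_tagged.txt

{
    # Opening skeleton: html, head with a title, then body.
    printf '%s\n' '<html>' '<head>' '<title>My Novel</title>' '</head>' '<body>'
    # Promote the Prologue line to a heading so it counts as a chapter.
    sed 's|^<p>Prologue</p>$|<h1>Prologue</h1>|' sample_tagged.txt
    # Closing skeleton.
    printf '%s\n' '</body>' '</html>'
} > sample_manual.html

cat sample_manual.html
```

Grouping the commands in braces sends all three pieces to the output file in order with a single redirection.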

To make the HTML more readable whilst I edited it, I decided to insert blank lines after the HTML tags - these have no effect on the final formatting.

sed 's/<\/.*>/&\n/' book_manual.html > book_final.html
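
Before loading the file into Calibre, a quick count of headings and paragraphs can catch a chapter that was missed. A small sketch, assuming one tag per line as produced by the steps above (sample_final.html is a made-up stand-in for book_final.html):

```shell
# Tiny stand-in for the final file; the name is hypothetical.
printf '%s\n' '<html>' '<body>' '<h1>Chapter 1</h1>' '' \
    '<p>One.</p>' '' '<p class="secbreak">*</p>' '' '<p>Two.</p>' '' \
    '</body>' '</html>' > sample_final.html

# Count lines that open a heading, and lines that open a paragraph
# (with or without a class attribute).
chapters=$(grep -c '^<h1>' sample_final.html)
paragraphs=$(grep -c '^<p[ >]' sample_final.html)
echo "chapters: $chapters, paragraphs: $paragraphs"
```

If the chapter count does not match the number of chapters in the book, a heading probably did not start with the word "Chapter".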