Building a publishing tool for online publications

Posted on May 23, 2012


In the past months I have been working hard on writing a book on Design Patterns and refactoring.

Publishing Word and using DocumentShare

To make the book as usable as possible I decided to go for three publication forms: HTML, ePub and PDF. Each has its benefits and drawbacks, and each therefore has a specific reason to be used.

To publish the base content properly I decided to revive one of my older concepts, DocumentShare, and have it publish both HTML and ePub.

The publishing process from a Word document to HTML, PDF and ePub

Below you will find some highlights of the results from the new DocumentShare tool.

For the published result, go here. As I did not make an effort to find good sets of alternatives for the fonts I like, you might see slightly different results than shown in the screenshots below.


This was something I added only recently. As I was pasting some code in my document, I decided it would be nice to format it properly for the web.

The original text in Word

An example of the published result in HTML

The tags are needed to tell DocumentShare that the text between them needs specific formatting, in this case as code.

The code to make this formatting happen is relatively simple.

It knows some basic keywords used in code and additionally treats all text marked as italic (“MyClass” in this screenshot) as special keywords, to be presented as such.

Remarks are recognized by the “//” start and the </p> ending. I did not build in recognition for “/*” and “*/” type remarks, but that might follow when I build a specific parser for “Code to HTML”.
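As an illustration only, here is a minimal sketch of how such a formatter could look. It is not the actual DocumentShare code. The keyword list, the CSS class names (`kw`, `special`, `remark`) and the assumption that italic text arrives as `<i>…</i>` are all hypothetical, and the input is assumed to be the inner text of a single paragraph, so the remark simply runs to the end of the string.

```java
import java.util.regex.Pattern;

final class CodeFormatterSketch {
    // Hypothetical keyword list; the list used by the real tool is not shown in the post.
    private static final Pattern KEYWORDS =
            Pattern.compile("\\b(public|private|class|void|new|return)\\b");
    // Assumes italic text survives the Word export as <i>...</i>.
    private static final Pattern ITALICS = Pattern.compile("<i>(.*?)</i>");

    /** Formats the inner text of one paragraph of code. */
    static String formatLine(String line) {
        String code = line;
        String remark = "";
        // A remark starts at "//" and runs to the end of the paragraph.
        int start = line.indexOf("//");
        if (start >= 0) {
            code = line.substring(0, start);
            remark = "<span class=\"remark\">" + line.substring(start) + "</span>";
        }
        // Keywords are wrapped first, so the markup added here is never re-scanned.
        code = KEYWORDS.matcher(code).replaceAll("<span class=\"kw\">$1</span>");
        code = ITALICS.matcher(code).replaceAll("<span class=\"special\">$1</span>");
        return code + remark;
    }
}
```

Note that this naive version would also treat the “//” in a URL such as “http://” as the start of a remark, which is one more reason a dedicated parser would be the better long-term solution.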

Links and chapter names

Chapters are recognized by the “<h>” (header) tag. So all I do is take the HTML published from Word, clean out all the Word-specific tags and references to produce crisp and clear HTML, and then break that HTML up into blocks, with the “<h” tag as the breaking point.

Like this:

String[] myChapters = wordHTMLtext.split("<h");
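One thing to keep in mind with a plain split on “<h” is that the delimiter itself is removed from each block, and that tags such as <hr> or <html> would match too if they are still present. A hedged variant, assuming the HTML has already been cleaned, is to split with a zero-width lookahead so that each block keeps its own header tag. The class and method names below are made up for this sketch.

```java
final class ChapterSplitSketch {
    /** Splits cleaned HTML into blocks that each start with a heading tag. */
    static String[] splitChapters(String cleanedHtml) {
        // Zero-width lookahead: split just before <h1>..<h6> and keep the tag in the block.
        // The trailing [ >] keeps <hr>, <html> and <head> from being treated as headers.
        return cleanedHtml.split("(?=<h[1-6][ >])");
    }
}
```

Because the header tag stays in the block, the chapter level can later be read directly from the start of each block.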

To create a stable header-tag/header ID to publish the files I add an extra tag into the document. This tag is shown in red below.

Header tag in Word

The tag has two main parts: “::Tag” and the tag name, “PAT-ADP”. Additionally, you can add the alternate name of the chapter, to be shown when you create a link to that chapter. In this case: “Discussing the Adapter Pattern”.

DocumentShare will take these meta-tags and translate them to something usable, as shown below.

Showing the alternate Header name in the document header
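The exact notation of the tag is only visible in the screenshot, so the following is a sketch under an assumed layout: the marker “::Tag”, then the tag name, then an optional alternate title, all separated by whitespace. The class name, field names and the layout itself are assumptions for illustration, not the real DocumentShare format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderTagSketch {
    // Assumed layout: "::Tag", the tag name, then an optional alternate title.
    private static final Pattern TAG =
            Pattern.compile("::Tag\\s+(\\S+)(?:\\s+(.+))?");

    final String id;        // e.g. "PAT-ADP", usable as a stable header ID
    final String altTitle;  // link text, or null when none was given

    private HeaderTagSketch(String id, String altTitle) {
        this.id = id;
        this.altTitle = altTitle;
    }

    /** Returns the parsed tag, or null if the text holds no "::Tag" marker. */
    static HeaderTagSketch parse(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        String alt = m.group(2) == null ? null : m.group(2).trim();
        return new HeaderTagSketch(m.group(1), alt);
    }
}
```

With this layout, `HeaderTagSketch.parse("::Tag PAT-ADP Discussing the Adapter Pattern")` would give the ID “PAT-ADP” and the alternate title “Discussing the Adapter Pattern”.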

When you create a link to this chapter, in Word you do the following (in red):

Creating a link in Word

The link as it will show up in HTML (and ePub)


Style for tags

As we want to publish the document to PDF as well, we need a way to hide the tags we use for chapters, links, code and indents (not shown in this post). After all, it looks quite weird to have red tags that seemingly have no meaning in your PDF.

And so we use a style for the tags, which we can give a white font color when we create the PDF. While the text and tags are still there, they are hidden from the eye.

Styles: making tags invisible by making the text-color white


One of the first things I focused on was the navigation. I wanted this to be in sight all the time and also to cater to your interests. So I created three “layers”: the chapter / page you are reading now, the chapter it is part of and the entire document.

It took some fine-tuning to decide how many levels each layer would show in the document structure. In the end I settled on “less is more”.

Navigation on the page

Navigation in the chapter

Navigation on all main chapters

Deriving the navigation

The navigation is derived from the concrete chapters themselves. As I rip apart the HTML as created by Word, I create an in-memory “database” of each chapter, containing the text of that chapter and the header/title of that chapter.

As I can derive the chapter level (<h1>, <h2>, <h3> and so on), I can deduce where this chapter fits in the total structure. I assume that if a <h2> or <h3> follows a <h1> it is part of that chapter, add it to the “parent” and build the tree structure that way.
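The nesting rule described above can be sketched as follows. This is a minimal illustration, not the DocumentShare implementation: the `Chapter` class, its fields and the artificial level-0 root are assumptions, and the chapters are assumed to arrive in document order with levels of 1 or higher.

```java
import java.util.ArrayList;
import java.util.List;

final class ChapterTreeSketch {
    /** One entry of the in-memory chapter "database". */
    static final class Chapter {
        final int level;      // 1 for <h1>, 2 for <h2>, and so on
        final String title;
        final List<Chapter> children = new ArrayList<>();

        Chapter(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    /** Nests a flat, document-ordered list of chapters by heading level. */
    static Chapter buildTree(List<Chapter> flat) {
        Chapter root = new Chapter(0, "document");
        // Path from the root down to the most recently added chapter.
        List<Chapter> open = new ArrayList<>();
        open.add(root);
        for (Chapter c : flat) {
            // Close every open chapter that is at the same level or deeper.
            while (open.get(open.size() - 1).level >= c.level) {
                open.remove(open.size() - 1);
            }
            // The chapter left at the end of the path is the parent.
            open.get(open.size() - 1).children.add(c);
            open.add(c);
        }
        return root;
    }
}
```

The navigation lists for the page, the chapter and the whole document can then be read from this tree by walking the children of the relevant node.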

Building the pages and including sub-chapters

This structure and approach allows for a lot of freedom in producing navigational lists and subnavigations. It also allows me to use “include subchapters” (see the image under “Links and chapter names” above) when I want a chapter to contain all the subchapters instead of publishing these as separate pages.

By default, DocumentShare assumes I want to include all chapters from level 3 (<h3>). In some cases, however, I already want to do this at level 2, for instance in chapter 1 and part 9, where level 2 is where I start each Design Pattern.
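The decision between “own page” and “included in the parent” can be written down as a small rule. The sketch below is an assumption about how that rule might look; the method name and the boolean flag for the “include subchapters” marker are hypothetical.

```java
final class IncludeRuleSketch {
    // Default from the post: chapters at level 3 (<h3>) and deeper are included.
    static final int DEFAULT_INCLUDE_FROM_LEVEL = 3;

    /**
     * Decides whether a chapter is published as its own page.
     *
     * @param level          heading level of the chapter (1 for h1, 2 for h2, ...)
     * @param parentIncludes true when an enclosing chapter carries the
     *                       "include subchapters" marker
     */
    static boolean getsOwnPage(int level, boolean parentIncludes) {
        if (parentIncludes) {
            return false; // inlined into the enclosing chapter's page
        }
        return level < DEFAULT_INCLUDE_FROM_LEVEL;
    }
}
```

Under this rule a level 2 chapter normally gets its own page, but is folded into its parent as soon as that parent asks for its subchapters to be included.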

The HTML that comes out of the process

The HTML produced by Word is really dirty and ugly, and it does not follow any rules of logic. Bullet points and numbered lists, for instance, are not published by Word as organized lists, but as paragraphs with styling. That is also why I started to include meta-tags such as [indent].

HTML from Word

What you will find in each <p> tag are endless style-references and <span> tags taking care of specific formatting you might have selected (or not).

Most of this styling is the result of Word not cleaning up the formatting as you go and type, so when you study the HTML from Word even deeper you will find several “dead” tags enclosing nothing at all, and styling and formatting that nothing with some kind of font and color.

Instead, I take that HTML and “nuke” it with regular expressions, taking almost all tags (<img> and <a href> excluded) and stripping them of anything and everything they contain. This leads to relatively clean, almost spartan HTML, as shown below:

Cleaned up HTML from DocumentShare

Tags like <span> and <div> are completely removed, as they usually make no sense at all if you want your HTML to be clean and simple.
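To give an idea of what such a regular-expression cleanup could look like, here is a minimal sketch. It is not the actual DocumentShare code: the class name and the three patterns are assumptions, and like any regex-based approach it will trip over attribute values that contain a “>” character.

```java
final class HtmlCleanerSketch {
    /** Reduces Word HTML to bare tags, keeping <img> and <a> untouched. */
    static String clean(String wordHtml) {
        String html = wordHtml;
        // Drop <span> and <div> opening and closing tags, keeping their inner text.
        html = html.replaceAll("(?i)</?(?:span|div)\\b[^>]*>", "");
        // Drop namespaced Office tags such as <o:p> and </o:p>.
        html = html.replaceAll("(?i)</?\\w+:[^>]*>", "");
        // Strip all attributes from the remaining opening tags, except <img> and <a>.
        // Group 1 is the tag name, group 2 an optional self-closing slash.
        html = html.replaceAll(
                "(?i)<(?!img\\b|a\\b)([a-z][a-z0-9]*)\\b[^>]*?(/?)>", "<$1$2>");
        return html;
    }
}
```

Removing the <span> tags first also takes care of most of the “dead” tags, since an empty span simply disappears.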

Adding CSS

To finish the process, I add CSS to style the paragraphs and make cute headers so that everything makes sense and looks like something you might like to read.
