Cleaning Microsoft Word Generated HTML with Regular Expressions and TIDY

Authoring HTML content (copy) in Microsoft Word is generally a bad idea. Most of us have already found this out the hard way. Some of us haven't. Some of us haven't a clue.

I rarely do web development anymore, but when I did, I usually found myself working on web projects with people who didn't know the first thing about it.

The routine was far too typical:

  1. "Copy" intended for dissemination on the World Wide Web gets authored in Microsoft Word.
  2. Along the way, the author freely and inconsistently applies all sorts of annoying styles (typeface size, color, tables for layouts, etc.) and includes third-grade "clip art" that looks so bad it hurts your eyes. These styles vary from page to page, and even within a single document.
  3. Then the author tells you that he or she did you a favor and saved the document(s) as HTML using the "Save As Web Page" feature.

What a disaster! We are talking about a series of 50+ page documents here. That's not professional, and they are definitely NOT helping you out.

So what can you do?

Well, assuming you don't want to quit, the first thing is to realize that these people are clueless. Arguing with them or expecting them to learn something new is not reasonable. If they were capable of learning, they would have learned HTML like the rest of us in 1994.

So OK, maybe we can give them a WYSIWYG tool? Wrong. There are no tools to help the clueless. The problem is not that you have the wrong technology; it's that you have the wrong people. And if you take away the only remaining piece of software they actually know how to use, they will be worthless to you. It's your fault for associating with these people. You are the web developer, so it's now your problem.

So really, what can you do?

Moving from denial to acceptance and taking responsibility for the mess you have made is a good start.

I'm not going to go into what regular expressions are except to say that many times it seems as if they are the only friends you have in this business.

Okay, let's look at some regex magic. First let me say that these could be more efficient, combined, etc. Most are split up because they are easier to understand that way. And if I were processing more than one file, I'd write a script instead.

Because vim is my editor of choice, the first thing I usually do is replace all the newline characters in the offending file with space characters. I do this because HTML tags sometimes span multiple lines, and a tag split across lines will not match my regex. Don't worry, you can clean this up with tidy later.

s/\n/\ /g
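
For anyone not working in vim, the same join can be sketched in Python. This is only an illustrative equivalent, not the method used above; the function name join_lines is made up for the example, and handling Windows line endings is my own assumption about the input.

```python
import re


def join_lines(html: str) -> str:
    # Replace every line break with a single space so that a tag split
    # across lines ends up on one line.
    # \r?\n also covers Windows (CRLF) line endings, an assumption about the input.
    return re.sub(r"\r?\n", " ", html)


sample = '<p\nclass="MsoNormal">Hello\nworld</p>'
print(join_lines(sample))
# prints: <p class="MsoNormal">Hello world</p>
```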

OK, Let's Strip

Remove all <span foo="bar"> and </span> tags:

s/<span[^>]*>//g
s/<\/span>//g
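
Here is a rough Python sketch of the same span removal, in case it helps to see it outside the editor. The function name strip_spans and the sample markup are invented for illustration. The pattern uses \b so that a bare <span> is matched as well as one with attributes.

```python
import re


def strip_spans(html: str) -> str:
    # Opening tags, with or without attributes: <span> or <span ...>
    html = re.sub(r"<span\b[^>]*>", "", html)
    # Closing tags: </span>
    return re.sub(r"</span>", "", html)


print(strip_spans('<p><span lang="EN-US">Hello</span> world</p>'))
# prints: <p>Hello world</p>
```

Only the tags are removed; the text between them is left where it was.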

Strip all in-line styles:

s/\ style="[^"]*"//g

Remove all classes:

s/\ class="[^"]*"//g
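
Since the style and class patterns differ only in the attribute name, they can be folded into one helper. A minimal Python sketch, assuming double-quoted attribute values as in the patterns above; strip_attribute is a hypothetical name, not part of any library.

```python
import re


def strip_attribute(html: str, name: str) -> str:
    # Matches a leading space, the attribute name, and a double-quoted value,
    # for example:  class="MsoNormal"
    pattern = " " + re.escape(name) + r'="[^"]*"'
    return re.sub(pattern, "", html)


sample = '<p class="MsoNormal" style="margin:0in">Text</p>'
for attr in ("style", "class"):
    sample = strip_attribute(sample, attr)
print(sample)
# prints: <p>Text</p>
```

Attributes written with single quotes or with no quotes at all are not matched, just as with the patterns above, so it is worth checking the file for those before assuming it is clean.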

If a document does not contain actual tables (like those used to present data, as opposed to layout), I might choose to remove those as well. Note that this removes any tag that begins with "<t", which covers table, th, tbody, tr, td and so on, but also catches unrelated tags such as title and textarea:

s/<t[^>]*>//g
s/<\/t[^>]*>//g
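
To see how wide that net is, here is a small Python comparison. BROAD mirrors the two substitutions above; TABLE_TAGS is a narrower alternative of my own, not something from the original workflow, and its tag list is an assumption (it leaves out caption, col and colgroup, for instance).

```python
import re

# Same idea as the substitutions above: any tag whose name starts with "t".
BROAD = re.compile(r"</?t[^>]*>")

# Narrower alternative (an assumption): only common table markup.
TABLE_TAGS = re.compile(
    r"</?(?:table|thead|tbody|tfoot|tr|th|td)\b[^>]*>", re.IGNORECASE
)

sample = "<title>Report</title><table><tr><td>Cell</td></tr></table>"
print(BROAD.sub("", sample))       # prints: ReportCell
print(TABLE_TAGS.sub("", sample))  # prints: <title>Report</title>Cell
```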

Finally, I save the file and run it through tidy, which re-indents the markup and can convert HTML to XHTML:

tidy -i -asxhtml dirty.htm > clean.htm
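
For more than one file, the whole routine could be scripted. What follows is a sketch under several assumptions, not a finished tool: it uses Python rather than vim, the names clean_html and clean_file and the ".clean" output naming are hypothetical, the input is assumed to be readable as UTF-8 (Word output often is not, so adjust the encoding), and tidy must be installed and on the PATH. The optional table removal is left out on purpose.

```python
import re
import subprocess
import sys
from pathlib import Path

# (pattern, replacement) pairs mirroring the substitutions above, in order.
SUBSTITUTIONS = [
    (r"\r?\n", " "),            # join lines
    (r"<span\b[^>]*>", ""),     # opening span tags
    (r"</span>", ""),           # closing span tags
    (r' style="[^"]*"', ""),    # in-line styles
    (r' class="[^"]*"', ""),    # classes
]


def clean_html(html: str) -> str:
    for pattern, replacement in SUBSTITUTIONS:
        html = re.sub(pattern, replacement, html)
    return html


def clean_file(source: Path) -> Path:
    # Hypothetical naming scheme: dirty.htm -> dirty.clean.htm
    target = source.with_suffix(".clean" + source.suffix)
    text = source.read_text(encoding="utf-8", errors="replace")
    # tidy reads standard input when no file name is given.
    result = subprocess.run(
        ["tidy", "-i", "-asxhtml"],
        input=clean_html(text),
        capture_output=True,
        text=True,
    )
    # tidy exits with 1 for warnings and 2 for errors; only errors are fatal here.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    target.write_text(result.stdout, encoding="utf-8")
    return target


if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(clean_file(Path(name)))
```

Saved under a name of your choosing, say clean_word.py, it would be run as: python clean_word.py one.htm two.htm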