Authoring HTML content (copy) in Microsoft Word is generally a bad idea. Most of us have already found this out the hard way. Some of us haven't. Some of us haven't a clue.
Although I rarely do web development anymore, in the past I found myself, the vast majority of the time, working on web projects with people who didn't know the first thing about web development.
The routine was far too typical:
What a disaster! We are talking about a series of 50+ page documents here. That's not professional, and they are definitely NOT helping you out.
Well, assuming you don't want to quit, the first thing is to realize that these people are clueless. Arguing with them or expecting them to learn something new is not reasonable. If they were capable of learning, they would have learned HTML like the rest of us in 1994.
So OK, maybe we can give them a WYSIWYG tool? Wrong. There are no tools to help the clueless. The problem is not that you have the wrong technology; it's that you have the wrong people. And if you take away the only remaining piece of software they actually know how to use, they will be worthless to you. It's your fault for associating with these people; you are the web developer, and it's therefore now your problem.
Moving from denial to acceptance and taking responsibility for the mess you have made is a good start.
I'm not going to go into what regular expressions are except to say that many times it seems as if they are the only friends you have in this business.
Okay, let's look at some regex magic. First let me say that these could be more efficient, combined, etc. Most are split because they are easier to understand that way. And if I were processing more than one file, I'd write a script instead.
Because vim is my editor of choice, the first thing I usually do is replace all the newline characters in the offending file with space characters. I do this because HTML tags will sometimes span multiple lines and will therefore not match my regex. Don't worry, you can clean this up with tidy later. (In vim, each substitution below is run as :%s/.../.../g so that it applies to the whole buffer rather than just the current line.)
s/\n/\ /g
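To show why the newline step comes first, here is a small Python sketch. It is only an illustration, not part of my vim workflow: the function names and the sample markup are made up, and the span pattern is the one used in the next step.

```python
import re

# Same opening-span pattern as the vim substitution, in Python syntax.
SPAN_OPEN = re.compile(r"<span [^>]*>")

def strip_line_by_line(text: str) -> str:
    # Apply the pattern to each line in isolation, the way a
    # line-oriented substitution sees the file.
    return "\n".join(SPAN_OPEN.sub("", line) for line in text.split("\n"))

def join_lines(text: str) -> str:
    # Rough equivalent of replacing every newline with a space.
    return text.replace("\n", " ")

# Hypothetical Word-style markup with a tag broken across two lines.
sample = '<p><span lang="EN-US"\nstyle="color:red">Hello</span></p>'

untouched = strip_line_by_line(sample)           # the split tag survives
cleaned = SPAN_OPEN.sub("", join_lines(sample))  # the joined tag is removed
print(cleaned)  # <p>Hello</span></p>
```

Matched line by line, the broken tag is never seen as a whole, so nothing is removed. Once the lines are joined, the same pattern catches it.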
Remove all <span foo="bar"> and </span> tags:
s/<span\ [^>]*>//g
s/<\/span>//g
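For comparison, the two span substitutions can be folded into one in a script. The sketch below is a variation rather than a literal translation: it assumes you also want to catch bare <span> tags with no attributes (which the first pattern above skips, because it requires a space after "span") and upper-case tags. The name strip_spans is hypothetical.

```python
import re

# One pattern for opening and closing tags, with or without attributes.
# \b keeps it from matching other tag names that merely start with "span".
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)

def strip_spans(html: str) -> str:
    # Remove the tags themselves but keep whatever text they wrapped.
    return SPAN_TAG.sub("", html)

print(strip_spans('<p><span lang="EN-US">Hello</span> <SPAN>world</SPAN></p>'))
# <p>Hello world</p>
```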
Strip all in-line styles:
s/\ style="[^"]*"//g
Remove all classes:
s/\ class="[^"]*"//g
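The two attribute patterns above only match double-quoted values. If the input might also contain single-quoted or unquoted values (an assumption about your files, so check them first), a scripted version could look something like this sketch. The name strip_presentation_attrs is made up for the example.

```python
import re

# Matches style=... or class=... with a double-quoted, single-quoted,
# or unquoted value, including the whitespace in front of the attribute.
ATTR = re.compile(
    r"""\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)

def strip_presentation_attrs(html: str) -> str:
    return ATTR.sub("", html)

print(strip_presentation_attrs('<p class="MsoNormal" style="margin:0in">Hi</p>'))
print(strip_presentation_attrs("<p class=MsoNormal style='margin:0in'>Hi</p>"))
# both print <p>Hi</p>
```

Like the originals, this is a blunt instrument: it will also eat text such as class=foo that appears in the body copy rather than inside a tag.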
If a document does not contain actual tables (like those used to present data, as opposed to layout), I might choose to remove those as well. Note that this actually removes any tag that begins with "<t", which covers table, th, tbody, tr, etc., but also unrelated tags such as title and textarea:
s/<t[^>]*>//g
s/<\/t[^>]*>//g
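Because the "<t" pattern is so greedy about tag names, it is worth seeing what it does to a document that has a title. The sketch below compares it with a tighter pattern that names the table elements explicitly. The tighter pattern is my suggestion for a script, not what the commands above do, and the sample markup is invented.

```python
import re

# Broad: the same idea as the two substitutions above, folded into one.
BROAD = re.compile(r"</?t[^>]*>")

# Tighter: only the table-related elements that start with "t".
TABLE_TAGS = re.compile(r"</?(?:table|thead|tbody|tfoot|tr|th|td)\b[^>]*>")

sample = "<title>Doc</title><table><tr><td>Cell</td></tr></table>"

broad_result = BROAD.sub("", sample)
tight_result = TABLE_TAGS.sub("", sample)

print(broad_result)  # DocCell
print(tight_result)  # <title>Doc</title>Cell
```

The broad version is fine on a body fragment, but on a full document it takes the title tags with it.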
Finally, I save the file and run it through tidy, which indents the markup and provides features like HTML-to-XHTML translation:
tidy -i -asxhtml dirty.htm > clean.htm
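If there is more than one file to clean, the same steps fit in a short script. What follows is a minimal sketch, not a finished tool. It assumes tidy is on your PATH and that the files are in your platform's default text encoding. The function names and the ".clean" output naming are made up for the example. The optional table removal is left out, so add it to STEPS if you need it.

```python
import re
import subprocess
import sys
from pathlib import Path

# (pattern, replacement) pairs, in the same order as the substitutions above.
STEPS = [
    (re.compile(r"\n"), " "),
    (re.compile(r"<span [^>]*>"), ""),
    (re.compile(r"</span>"), ""),
    (re.compile(r' style="[^"]*"'), ""),
    (re.compile(r' class="[^"]*"'), ""),
]

def scrub(html: str) -> str:
    # Run every substitution over the whole document, one after another.
    for pattern, replacement in STEPS:
        html = pattern.sub(replacement, html)
    return html

def run_tidy(html: str) -> str:
    # Same flags as the command line above; tidy reads stdin when no
    # file is given. It exits with 1 on warnings and 2 on errors, so
    # only treat 2 and above as a failure.
    result = subprocess.run(
        ["tidy", "-i", "-asxhtml"],
        input=html, capture_output=True, text=True,
    )
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout

def main(paths):
    for name in paths:
        source = Path(name)
        # Hypothetical naming scheme: dirty.htm becomes dirty.clean.htm.
        target = source.with_name(source.stem + ".clean" + source.suffix)
        target.write_text(run_tidy(scrub(source.read_text())))

if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, scrub.py (the filename is arbitrary), it would be run as: python scrub.py chapter1.htm chapter2.htm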