Dealing with Word Compact HTML

apessos has asked for the wisdom of the Perl Monks concerning the following question:

Once again I come before the Perl Monks seeking wisdom. I have been plagued with editing large Word documents and I am looking for a simple way to grab data from the document once it's been converted to HTML. The problem lies with the non-uniform way the HTML is generated, with newlines placed seemingly at random.

The data I am trying to grab are contained within the <b> tags. Here is an example...

<p class=para><a name="watch dog"></a><b>watch dog
-</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>

<p class=para><a name="WR"></a><b>wooden round –</b> A big piece of round wood.</p>

I changed the content to protect the information, but the structure is the same. In about 3/4 of the document, the text between the <b> tags appears on a single line (bottom example). In the other 1/4, the <b> tags are spread out over multiple lines (top example).

I wrote a simple oneliner that grabbed 3/4 of the data, but I don't know how or if it is possible to easily grab the other 1/4. Here is the oneliner.

perl -e 'while(<>){print "$1\n" if /<b>(.*)<\/b>/;}' smaller.txt

Any words of advice?

Replies are listed 'Best First'.
Re: Dealing with Word Compact HTML by Fletch (Bishop) on Apr 14, 2004 at 14:45 UTC
Those attempting to parse arbitrary HTML with just a simple regexp are living in a state of sin and most likely doomed to failure. Use a proper parser (HTML::Parser and / or HTML::TreeBuilder).
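For readers who want to see what the HTML::TreeBuilder route might look like, here is a minimal sketch. It assumes the CPAN module HTML::TreeBuilder is installed; the sub name bold_terms is made up for this illustration and is not from the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every <b> element in $html.
sub bold_terms {
    my ($html) = @_;

    # Build a parse tree from the HTML held in a string.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down finds every element whose tag is 'b';
    # as_trimmed_text collapses runs of whitespace (including newlines).
    my @terms = map { $_->as_trimmed_text } $tree->look_down( _tag => 'b' );

    # Free the tree explicitly once we are done with it.
    $tree->delete;

    return @terms;
}

my $sample = qq{<p class=para><a name="watch dog"></a><b>watch dog\n-</b> A big dog.</p>\n}
           . qq{<p class=para><a name="WR"></a><b>wooden round -</b> A big piece of round wood.</p>\n};

print "$_\n" for bold_terms($sample);
```

Because the parser works on the whole document rather than line by line, it should not matter whether a <b>...</b> pair is split across lines.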
Re: Re: Dealing with Word Compact HTML by matija (Priest) on Apr 14, 2004 at 15:07 UTC
Don't forget he could use HTML::TokeParser or even HTML::TokeParser::Simple.
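A rough sketch of the HTML::TokeParser approach, for comparison. It assumes HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed; the sub name bold_text is invented for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the text between each <b> and its </b>.
sub bold_text {
    my ($html) = @_;

    # Passing a reference to a scalar makes TokeParser parse that string.
    my $p = HTML::TokeParser->new( \$html )
        or die "Could not create parser";

    my @found;

    # get_tag('b') skips ahead to the next <b> start tag (attributes are fine).
    while ( $p->get_tag('b') ) {
        # get_trimmed_text('/b') returns the text up to </b>,
        # with whitespace (newlines included) collapsed to single spaces.
        push @found, $p->get_trimmed_text('/b');
    }
    return @found;
}

my $sample = qq{<p class=para><a name="watch dog"></a><b>watch dog\n-</b> A big dog.</p>\n}
           . qq{<p class=para><a name="WR"></a><b>wooden round -</b> A big piece of round wood.</p>\n};

print "$_\n" for bold_text($sample);
```

The pull-style loop keeps the control flow in one place, which some people find easier to follow than registering callbacks with HTML::Parser.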
Re: Re: Re: Dealing with Word Compact HTML by format_c (Initiate) on Apr 14, 2004 at 22:45 UTC
I tried a bit with HTML::Parser and I hate it, because I think it's complicated to use. But parsing HTML with regexes quickly becomes more complicated than parsing with HTML::Parser. So here's my snippet, and I hope it'll help you:

# This script will extract text which is included in <b>
use strict;
use HTML::Parser;

# Slurp the whole DATA section into one string
local $/;
my $html = <DATA>;

my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&b_start_handler,"tagname,self"]
);
$p->parse($html);

# On <b>: start collecting text and watch for the end tag
sub b_start_handler {
    my ($tagname,$self) = @_;
    return unless $tagname eq 'b';
    $self->handler(text => [], '@{dtext}' );
    $self->handler(end => \&b_end_handler,"tagname,self");
}

# On the next end tag: print the collected text and reset the handlers
sub b_end_handler {
    my($tag,$self) = @_;
    my $text = join("", @{$self->handler("text")});
    print "$text\n---\n";
    $self->handler("text", undef);
    $self->handler("start", \&b_start_handler);
    $self->handler("end", undef);
}
__DATA__
<P class=para><a name="watch dog"></a><b>watch dog
-</b> A big dog that makes sure that you don't do anything that you're
not supposed to).</p>

<p class=para><a name="WR"></a><b>wooden round –</b> A big piece of ro
und wood.</p>

Greets
Alex
Re: Dealing with Word Compact HTML by b10m (Vicar) on Apr 14, 2004 at 15:01 UTC
You could first convert the massive amount of MS-HTML code to somewhat more common HTML with tools like Word Unmunger or Demoroniser (which is a Perl app) :-)

-- b10m
All code is usually tested, but rarely trusted.
Re: Dealing with Word Compact HTML by seattlejohn (Deacon) on Apr 14, 2004 at 15:20 UTC
Using a one-liner like yours to parse HTML is going to be problematic, because line breaks, carriage returns, tabs, "regular" (0x20) spaces, and a few other special characters are all considered equivalent whitespace. It's hopeless to try to predict whether the opening and closing tags will be on the same line or not. Like others, I'd strongly recommend using something like HTML::Parser.

That said, if you really don't want to parse HTML for real, you can work around the problem by slurping the whole file into a single scalar and searching for the tags using a regex with the `/s` modifier. But be careful. `<b>` tags can in fact have attributes, like this: `<b style="font-size: 200%">`. Your regex will not catch cases like that, though if you have sufficient control over the formatting of the original documents this may not be a problem.

$perlmonks{seattlejohn} = 'John Clyman';
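The slurp-and-`/s` workaround described above could be sketched roughly as follows, using only core Perl. The sub name bold_runs and the `[^>]*` attribute handling are choices made for this illustration, not something from the thread, and the pattern will still miss plenty of legal HTML (nested tags, comments, a `>` inside an attribute value).

```perl
use strict;
use warnings;

# Hypothetical helper: pull the text of each <b>...</b> pair out of a string.
sub bold_runs {
    my ($html) = @_;
    my @runs;

    # /s lets . match newlines, /i accepts <B>, /g finds every match.
    # <b\b[^>]*> tolerates attributes such as <b style="..."> but not <br>.
    while ( $html =~ m{<b\b[^>]*>(.*?)</b>}gis ) {
        my $text = $1;
        $text =~ s/\s+/ /g;          # treat any whitespace run as one space
        $text =~ s/^\s+|\s+$//g;     # trim both ends
        push @runs, $text;
    }
    return @runs;
}

# To run it against a file, slurp the file first, for example:
#   my $html = do { local $/; open my $fh, '<', $file or die $!; <$fh> };
my $sample = qq{<p class=para><b>watch dog\n-</b> A big dog.</p>\n}
           . qq{<p class=para><b style="font-size: 200%">wooden round -</b> wood.</p>\n};

print "$_\n" for bold_runs($sample);
```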
Re: Dealing with Word Compact HTML by relax99 (Monk) on Apr 14, 2004 at 15:44 UTC
One other alternative... Get OpenOffice. Open your Word documents using OpenOffice. Save them as HTML files. OpenOffice seems to do a much better job of keeping HTML files free from junk tags. I believe OpenOffice is also much easier to automate and it has a powerful and consistent API (I haven't tried it, but judging from the docs), so you could do this all automatically from your Perl program.

If you do decide to install it, keep in mind that OpenOffice will try to change your file associations for the office documents and it is quite a pain to get them back to the original state.
Re: Re: Dealing with Word Compact HTML by qq (Hermit) on Apr 14, 2004 at 19:21 UTC
OpenOffice has much less cruft in the HTML, but it still isn't very good (not XHTML, for starters). See this guide for details, and a solution.

qq
Re: Re: Re: Dealing with Word Compact HTML by relax99 (Monk) on Apr 15, 2004 at 12:42 UTC
That is probably true. However, I think it would be much easier to extract the data from the resulting HTML than from the HTML that Microsoft Office produces, which was the original problem.
Re: Dealing with Word Compact HTML by eXile (Priest) on Apr 14, 2004 at 15:14 UTC
If you'd want to commit a cardinal sin you could try something like:

perl -0777 -ne 'print "$1\n" while(/<b>((.|\n|\r)*?)<\/b>/gm);' test.html

The `-0777` will slurp the file at once. The `(.|\n|\r)*?` will match any character, including line-ending characters, and match them non-greedily. The `/g` modifier will match as many times as possible (`/m` only changes where `^` and `$` match, so it has no effect on this pattern). But for more than a quick hack I'd go with the solutions already offered.

Update: to address seattlejohn's `<b name=value>` problem you could use:

perl -0777 -ne 'print "$1\n" while(/<b\b.*?>((.|\n|\r)*?)<\/b>/gm);' test.html

I think there will be lots of other situations where this oneliner won't match, and that's exactly why you should use a decent parser.
Re: Dealing with Word Compact HTML by rje (Deacon) on Apr 14, 2004 at 15:42 UTC
At the risk of being insular, I might suggest breaking up the input based on the paragraph tags. This test script seems to do the trick:

$/ = "<p class=para>";

foreach (<DATA>)
{
   s/\n//g;
   print "$1\n" if /<b>(.*)<.b>/;
}

__DATA__
<p class=para><a name="watch dog"></a><b>watch dog
-</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>

<p class=para><a name="WR"></a><b>wooden round -</b> A big piece of round wood.</p>
Re: Dealing with Word Compact HTML by apessos (Acolyte) on Apr 14, 2004 at 18:10 UTC
I wanted to thank everyone for their input and snippets of code. It seems the simplest answer is the one that slipped my mind: use an HTML parser. While oneliners are always fun and usually quick to write, sometimes the problem calls for more than one line. Thanks again!

