PerlMonks  

Dealing with Word Compact HTML

by apessos (Acolyte)
on Apr 14, 2004 at 14:40 UTC ( [id://345072]=perlquestion )

apessos has asked for the wisdom of the Perl Monks concerning the following question:

Once again I come before the Perl Monks seeking wisdom. I have been plagued with editing large Word documents and I am looking for a simple way to grab data from a document once it has been converted to HTML. The problem lies in the non-uniform way the HTML is generated, with newlines placed seemingly at random.

The data I am trying to grab are contained within the <b> tags. Here is an example...

<p class=para><a name="watch dog"></a><b>watch dog -</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>
<p class=para><a name="WR"></a><b>wooden round –</b> A big piece of round wood.</p>

I changed the content to protect the information, but the structure is the same. The problem is that in about 3/4 of the document the text between the <b> tags appears on a single line (bottom example). In the other 1/4, the <b> tags are spread out over multiple lines (top example).

I wrote a simple oneliner that grabbed 3/4 of the data, but I don't know how or if it is possible to easily grab the other 1/4. Here is the oneliner.

perl -e 'while(<>){print "$1\n" if /<b>(.*)<\/b>/;}' smaller.txt
Any words of advice?

Replies are listed 'Best First'.
Re: Dealing with Word Compact HTML
by Fletch (Bishop) on Apr 14, 2004 at 14:45 UTC

    Those attempting to parse arbitrary HTML with just a simple regexp are living in a state of sin and most likely doomed to failure. Use a proper parser (HTML::Parser and / or HTML::TreeBuilder).

        I tried a bit with HTML::Parser and I hate it because I think it's complicated to use. But parsing HTML with regexes quickly becomes more complicated than parsing with HTML::Parser. So here's my snippet and I hope it'll help you:

        # This script extracts the text that is included in <b> elements
        use strict;
        use HTML::Parser;

        local $/;               # slurp mode: read all of DATA at once
        my $html = <DATA>;

        my $p = HTML::Parser->new(api_version => 3,
            start_h => [\&b_start_handler, "tagname,self"]
        );
        $p->parse($html);
        $p->eof;                # signal end of document

        # On <b>: start collecting text and watch for the next end tag
        sub b_start_handler {
            my ($tagname, $self) = @_;
            return unless $tagname eq 'b';
            $self->handler(text => [], '@{dtext}');
            $self->handler(end  => \&b_end_handler, "tagname,self");
        }

        # On the end tag: print the collected text and reset the handlers
        sub b_end_handler {
            my ($tag, $self) = @_;
            my $text = join("", @{$self->handler("text")});
            print "$text\n---\n";
            $self->handler("text", undef);
            $self->handler("start", \&b_start_handler);
            $self->handler("end", undef);
        }
        __DATA__
        <P class=para><a name="watch dog"></a><b>watch dog -</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>
        <p class=para><a name="WR"></a><b>wooden round –</b> A big piece of round wood.</p>
        Greets Alex
Re: Dealing with Word Compact HTML
by b10m (Vicar) on Apr 14, 2004 at 15:01 UTC

    You could first convert the massive amount of MS-HTML code to somewhat more common HTML with tools like Word Unmunger or Demoroniser (which is a Perl app) :-)

    --
    b10m

    All code is usually tested, but rarely trusted.
Re: Dealing with Word Compact HTML
by seattlejohn (Deacon) on Apr 14, 2004 at 15:20 UTC
    Using a one-liner like yours to parse HTML is going to be problematic, because line breaks, carriage returns, tabs, "regular" (0x20) spaces, and a few other special characters are all considered equivalent whitespace. It's hopeless to try to predict whether the opening and closing tags will be on the same line or not.

    Like others, I'd strongly recommend using something like HTML::Parser.

    That said, if you really don't want to parse HTML for real, you can work around the problem by slurping the whole file into a single scalar and searching for the tags using a regex with the /s modifier. But be careful. <b> tags can in fact have attributes, like this: <b style="font-size: 200%">. Your regex will not catch cases like that, though if you have sufficient control over the formatting of the original documents this may not be a problem.
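    A minimal sketch of the slurp-and-/s workaround described above, for illustration only. The sub name extract_bold and the sample text are hypothetical, and the position of the line break inside the first <b> element is an assumption. The pattern tolerates attributes on the opening tag, but it will still trip over things like a ">" inside an attribute value, so it is no substitute for a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <b>...</b> element in $html.
sub extract_bold {
    my ($html) = @_;
    my @found;
    # <b\b[^>]*>  opening tag, with or without attributes
    # (.*?)       non-greedy body; /s lets "." match newlines as well
    # /i also accepts <B>, /g moves on to the next occurrence each time
    while ($html =~ m{<b\b[^>]*>(.*?)</b\s*>}gis) {
        my $text = $1;
        $text =~ s/\s+/ /g;    # collapse line breaks and runs of whitespace
        $text =~ s/^ //;       # trim a leading space, if any
        $text =~ s/ $//;       # trim a trailing space, if any
        push @found, $text;
    }
    return @found;
}

# For a real file, slurp it first, for example:
#   my $html = do { local $/; <$fh> };
my $html = <<'HTML';
<p class=para><a name="watch dog"></a><b>watch
dog -</b> A big dog.</p>
<p class=para><a name="WR"></a><b style="font-size: 200%">wooden round -</b> A big piece of round wood.</p>
HTML

print "$_\n" for extract_bold($html);
```

    Collapsing whitespace after the match is what makes the multi-line and single-line cases come out the same.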

            $perlmonks{seattlejohn} = 'John Clyman';

Re: Dealing with Word Compact HTML
by relax99 (Monk) on Apr 14, 2004 at 15:44 UTC

    One other alternative... Get OpenOffice. Open your Word documents using OpenOffice. Save them as html files. OpenOffice seems to do a much better job of keeping html files free from junk tags. I believe OpenOffice is also much easier to automate and it has a powerful and consistent API (haven't tried it, but judging from the docs), so you could do this all automatically from your perl program.

    If you do decide to install it, keep in mind that OpenOffice will try to change your file associations for the office documents and it is quite a pain to get them back to the original state.

      OpenOffice has much less cruft in the html, but it still isn't very good (not xhtml, for starters).

      See this guide for details, and a solution.

      qq

        That is probably true. However, I seem to think that it would be much easier to extract the data from the resulting html in comparison with Microsoft Office, which was the original problem.
Re: Dealing with Word Compact HTML
by eXile (Priest) on Apr 14, 2004 at 15:14 UTC
    If you'd want to commit a cardinal sin you could try something like:
    perl -0777 -ne 'print "$1\n" while(/<b>((.|\n|\r|)*?)<\/b>/gm);' test.html
    • the -0777 will slurp the file at once
    • the ((.|\n|\r|)*?) will match any characters, including line-ending characters, and match them non-greedily.
    • the /g modifier makes the regex match as many times as possible; /m only changes how ^ and $ behave, so it is not strictly needed here.
    But for more than a quick hack I'd go with the solutions already offered.

    Update: to address seattlejohn's <b name=value> problem you could use:
    perl -0777 -ne 'print "$1\n" while(/<b\b.*?>((.|\n|\r|)*?)<\/b>/gm);' test.html
    I think there will be lots of other situations where this oneliner won't match, and that's exactly the reason why you should use a decent parser.
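    To illustrate why the non-greedy quantifier matters once the whole file sits in a single string, here is a small comparison. It uses the /s modifier, which makes "." match newlines, as an assumed simpler stand-in for the (.|\n|\r|) alternation. The sample string and the sub names are made up for the example.

```perl
use strict;
use warnings;

my $html = "<b>first\nterm</b> text <b>second</b> more";

# Greedy: .* runs on to the LAST </b>, swallowing everything in between.
sub greedy_match {
    my ($s) = @_;
    return $s =~ m{<b>(.*)</b>}s ? $1 : undef;
}

# Non-greedy: .*? stops at the nearest </b>; /g in list context
# returns the captured text of every match.
sub lazy_matches {
    my ($s) = @_;
    my @m = $s =~ m{<b>(.*?)</b>}gs;
    return @m;
}

print "greedy: ", greedy_match($html), "\n";
print "lazy:   $_\n" for lazy_matches($html);
```

    The greedy version returns one string that still contains the inner "</b> text <b>", while the non-greedy version returns the two bold runs separately.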
Re: Dealing with Word Compact HTML
by rje (Deacon) on Apr 14, 2004 at 15:42 UTC
    At the risk of being insular, I might suggest breaking up the input based on the paragraph tags. This test script seems to do the trick:
    $/ = "<p class=para>";
    foreach (<DATA>) {
        s/\n//g;
        print "$1\n" if /<b>(.*)<.b>/;
    }
    __DATA__
    <p class=para><a name="watch dog"></a><b>watch dog -</b> A big dog that makes sure that you don't do anything that you're not supposed to).</p>
    <p class=para><a name="WR"></a><b>wooden round -</b> A big piece of round wood.</p>
Re: Dealing with Word Compact HTML
by apessos (Acolyte) on Apr 14, 2004 at 18:10 UTC
    I wanted to thank everyone for their input and snippets of code. It seems the simplest answer is the one that had slipped my mind: use an HTML parser. While oneliners are always fun and usually quick to write, sometimes the problem calls for more than one line. Thanks again!

Node Type: perlquestion [id://345072]
Approved by Corion
Front-paged by pbeckingham