Very specific HTML parsing question

russmann has asked for the wisdom of the Perl Monks concerning the following question:

The job is basically to take M$ Word-generated HTML and make it useful. (The reason I have M$ Word-generated HTML is out of my control.) The script is 99% done, but I want to extract a few pieces of data from the HTML and put them in variables for later use.

The first thing I want is:
The text between the 3rd and tags. (not the first or 2nd).
This is to extract the article title.

The 2nd thing I want is:
Everything in the page past this:
Notes:
This is to extract the notes.

The 3rd thing I want is:
The text between the 5th and tags, but only if the text begins with "by" (as in by Larry Wall).
This is to extract the author line.

Replies are listed 'Best First'.

Re: Very specific HTML parsing question
by ChemBoy (Priest) on Sep 07, 2001 at 20:41 UTC

Ordinarily I'd say this is a job for HTML::TokeParser. What you describe would be very simple to implement using it, and it would be fairly easy to fix if it breaks when MSWord's HTML output changes in some bizarre way in a future edition (extremely likely to happen). There's even a tutorial here about it, which should make it that much easier to figure out.
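For what it's worth, here is a minimal sketch of how the three extractions might look with HTML::TokeParser. The tag names in the question did not survive, so `p` below is only a placeholder. The helper names `nth_tag_text` and `text_after_marker` and the sample markup are made up for illustration, so treat this as a starting point rather than a finished solution.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the whitespace-trimmed text inside the $n-th <$tag> element,
# or nothing if the document has fewer than $n such elements.
# $tag is a placeholder: substitute whatever tag Word actually emits.
sub nth_tag_text {
    my ($html, $tag, $n) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot set up parser";
    my $seen = 0;
    while ($p->get_tag($tag)) {              # matches start tags only
        next unless ++$seen == $n;
        return $p->get_trimmed_text("/$tag");
    }
    return;
}

# Return all text that follows the first occurrence of $marker,
# with markup dropped and runs of whitespace collapsed.
# Text tokens are taken as-is, so entities are not decoded here.
sub text_after_marker {
    my ($html, $marker) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot set up parser";
    my $found = 0;
    my @parts;
    while (my $token = $p->get_token) {
        next unless $token->[0] eq 'T';      # text tokens only
        my $text = $token->[1];
        if (!$found) {
            next unless $text =~ /\Q$marker\E(.*)/s;
            $found = 1;
            $text  = $1;                     # keep what follows the marker
        }
        push @parts, $text;
    }
    return unless $found;
    my $notes = join ' ', @parts;
    $notes =~ s/\s+/ /g;
    $notes =~ s/^\s+|\s+$//g;
    return $notes;
}

# Hypothetical sample document, standing in for the Word output.
my $sample = '<p>One</p><p>Two</p><p>My Title</p><p>x</p>'
           . '<p>by Larry Wall</p><p>Notes: first note</p><p>second note</p>';

my $title  = nth_tag_text($sample, 'p', 3);
my $byline = nth_tag_text($sample, 'p', 5);
my $author = (defined $byline && $byline =~ /^by\b/i) ? $byline : undef;
my $notes  = text_after_marker($sample, 'Notes:');
```

The author line is kept only when the 5th element's text starts with "by", as the question asks. If Word's output shifts in a later version, the counts and the placeholder tag are the only things that should need changing.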

On the other hand, it seems somewhat churlish to tell you to bring in the whole HTML::Parser suite just for the last 1% of your code... but actually, I'm going to. The reason is this: I could try to write the regex you need (though it would take longer than writing it with TokeParser), but I would probably fail. Several people would then respond with corrections explaining what I had missed, and that I was stupid to try to use a regex instead of TokeParser. And they would be right. Somebody might actually supply a regex that would do what you want, but it would probably end up being fairly long and painful to read, and they'd probably finish by saying you shouldn't use it, you should use HTML::TokeParser.

The upside is that you may look at TokeParser and realize that it could vastly simplify your script to use it in some other places--this is, after all, what CPAN modules are best at. :-)

You should also look at the base HTML::Parser module, just to see if the model it uses for parsing makes more sense to you--both systems have their advocates.
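To give a feel for the difference, here is a rough sketch of the same "text of the n-th element" lookup done in HTML::Parser's event-driven style, where you register callbacks instead of pulling tokens. The function name `nth_tag_text_events` is made up for illustration, and `p` in the usage line is again only a placeholder tag.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text inside the $n-th <$tag> element using callbacks.
# Returns nothing if the document has fewer than $n such elements.
# Nested elements with the same tag name are not handled.
sub nth_tag_text_events {
    my ($html, $tag, $n) = @_;
    my ($seen, $inside, $text) = (0, 0, '');

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start tag: switch collection on when the $n-th <$tag> opens.
        start_h => [ sub {
            my ($name) = @_;
            $inside = 1 if $name eq $tag && ++$seen == $n;
        }, 'tagname' ],
        # End tag: switch collection off when </$tag> closes.
        end_h => [ sub {
            my ($name) = @_;
            $inside = 0 if $name eq $tag;
        }, 'tagname' ],
        # Text: 'dtext' hands over the text with entities decoded.
        text_h => [ sub { $text .= $_[0] if $inside }, 'dtext' ],
    );

    $p->parse($html);
    $p->eof;

    return if $seen < $n;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

my $second = nth_tag_text_events('<p>a</p><p>b &amp; c</p>', 'p', 2);
```

Here the state lives in closure variables that the handlers update as events arrive, whereas TokeParser lets you write the logic as a straight-line loop. Which reads better is largely a matter of taste.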

If God had meant us to fly, he would *never* have given us the railroads.
--Michael Flanders
