Regex's on Text of HTML (using HTML::Parser)

rjahrman has asked for the wisdom of the Perl Monks concerning the following question:

I've never used HTML::Parser before, and I'm a little confused about how to use it. I've seen countless examples of removing HTML tags, which I understood to some extent, but none of them did what I was looking for.

If I have the source of an HTML file in $pagesource (which I got with LWP), how can I run a series of regexes (especially substitutions) on the visible text of the page, and then put the modified source (with the original tags intact) into $newsource? Thanks in advance!

Re: Regex's on Text of HTML (using HTML::Parser)
by PodMaster (Abbot) on May 24, 2003 at 00:16 UTC

Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?

Re: RexExp help: Highlight keywords in CGI search results, unless inside an HTML tag

Re: Regex's on Text of HTML (using HTML::Parser)
by Popcorn Dave (Abbot) on May 24, 2003 at 03:46 UTC

If I understand correctly what you're trying to do, I think HTML::TokeParser may do a better job for you. It breaks the document up into tokens, so you can run your substitutions on the text tokens only and then rebuild the HTML from the full token stream.
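
A minimal sketch of that token-by-token approach follows. The sample markup and the cat/dog substitution are made up for illustration; in the original question $pagesource would come from LWP. It assumes that "visible text" means text tokens outside script and style elements, and it copies the original source of every other token through unchanged.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input standing in for the page fetched with LWP.
my $pagesource =
    '<p class="cat">The cat sat.</p><script>var cat = 1;</script>';

my $parser = HTML::TokeParser->new(\$pagesource)
    or die "Could not create parser";

my $newsource = '';
my $skip      = 0;    # greater than 0 while inside <script> or <style>

while (my $token = $parser->get_token) {
    my $type = $token->[0];

    if ($type eq 'T') {
        # Text token: [ "T", $text, $is_data ]
        my $text = $token->[1];
        # Hypothetical substitution; put your own regexes here.
        $text =~ s/\bcat\b/dog/g unless $skip;
        $newsource .= $text;
    }
    elsif ($type eq 'S') {
        # Start tag: [ "S", $tag, $attr, $attrseq, $original_text ]
        $skip++ if $token->[1] =~ /^(?:script|style)$/;
        $newsource .= $token->[4];
    }
    elsif ($type eq 'E') {
        # End tag: [ "E", $tag, $original_text ]
        $skip-- if $skip && $token->[1] =~ /^(?:script|style)$/;
        $newsource .= $token->[2];
    }
    elsif ($type eq 'PI') {
        # Processing instruction: [ "PI", $token0, $original_text ]
        $newsource .= $token->[2];
    }
    else {
        # Comment or declaration: [ "C", $text ] or [ "D", $text ]
        $newsource .= $token->[1];
    }
}

print "$newsource\n";
```

Two caveats: text tokens are not entity-decoded, so a pattern has to match the raw form (for example &amp;amp; rather than &amp;), and a long run of text can arrive as more than one token, so a pattern that spans several words may miss a match that straddles two tokens.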

Hope that helps!

Re: Regex's on Text of HTML (using HTML::Parser)
by hacker (Priest) on May 24, 2003 at 13:44 UTC

HTML::TreeBuilder

merlyn has a recent article called "The Wrong Parser for the Right Reasons" which covers something very similar. Give it a read, and see if it suits your goal. A brief synopsis from the top of the article:

More and more these days, you get faced with a problem with angle brackets somewhere in the data. How do you find what you're looking for in HTML or XML data?
At first glance, the question has an obvious answer. If you have an HTML task, you use HTML::Parser or some derived or wrapper class. If you have an XML task, you use XML::Parser or XML::LibXML. But maybe the obvious answer isn't always the best. Let's look at a couple of cases.
