Re: Re: Re: Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

That did the trick. Now, if you'll indulge a follow-up question: using the same example, how would I change all the href and img src attributes to add http://www.thesite.com? For example, href="/my.gif" would become href="http://www.thesite.com/my.gif".


Replies are listed 'Best First'.

Re: Re: Re: Re: Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?
by BrowserUk (Patriarch) on Aug 24, 2003 at 19:51 UTC

Assuming that you mean only in the extracted snippet, and not in the rest of the original document, then I might do something like this:

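#! perl -slw
use strict;
use LWP::Simple;

# Fetch the page; get() returns undef on failure.
$_ = get( 'http://the.site.com/path' );
defined $_ or die "Couldn't fetch the page\n";

# Grab the first <table>...</table> block (non-greedy, case-insensitive).
if( m[(<table.*?</table>)]si ) {

     # Prefix the site to every double-quoted href/src value in the copy.
     # Note: this assumes every link in the snippet is root-relative.
     ($_ = $1) =~ s[(href|src)\s*=\s*"([^"]+)"] #"
            [$1="http://the.site.com$2"]sig;
     print;
     exit;
}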

Throw that in a script and redirect the output to your file.
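If you want to see what the substitution does before pointing it at the live page, something like this should do. It's only a sketch: the sample markup and the variable names are made up, and it applies the same regex to a hard-coded string instead of fetching anything.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the table extracted from the page.
my $snippet = '<table><tr><td><a href="/my.gif"><img src = "/pics/a.png"></a></td></tr></table>';

# Same substitution as in the script, applied to a copy of the sample.
( my $fixed = $snippet ) =~ s[(href|src)\s*=\s*"([^"]+)"]
                             [$1="http://the.site.com$2"]sig;

print "$fixed\n";
```

Note that the whitespace around the = sign is dropped in the rewritten attributes, which is harmless in HTML.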

But please note: this is fragile! If anything changes on the web page, you have to change the regex, and many types of change could not be accommodated easily. Given the examples of doing it 'the right way' offered above, you're probably better off using one of those as your starting point.
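To make the fragility concrete: the regex above misses single-quoted attributes, and it would also prefix links that are already absolute. A slightly more defensive variant might look like the following. Treat it as a sketch only; the sub name absolutize and the sample markup are invented for illustration, and it is still a regex, so unquoted attributes and other HTML oddities will slip through.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix $base to root-relative href/src values only.
sub absolutize {
    my( $html, $base ) = @_;
    $html =~ s{
        \b(href|src) \s* = \s*   # attribute name and the equals sign
        (["'])                   # opening quote, either style
        (/(?!/)[^"']*)           # value starting with a single slash
        \2                       # the matching closing quote
    }{$1=$2$base$3$2}xgi;
    return $html;
}

# Made-up input: one single-quoted link, one absolute link, one odd spacing.
my $in = q{<a href='/my.gif'>x</a> <img src="http://other.site/a.png"> <img SRC = "/b.png">};
print absolutize( $in, 'http://www.thesite.com' ), "\n";
```

Links that already carry a scheme, and protocol-relative ones starting with //, are left untouched because the value has to begin with exactly one slash.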

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
