coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I'm sure this has been asked before but I wouldn't even begin to know what search terms to try for :)

I am using WWW::Mechanize to scrape a site that has images with text next to them. I want to rip through the page, pull out all the images, and put them in an array. I'd then use a similar regex to slurp up all the text and place it in a second array (the images and text have to stay in the order in which they are found).

I have a regex that ripped out all useless junk in the HTML file keeping just the table that I'm looking for. I'm not sure how to loop over $page (content dump) to pull out every unique instance of an image WITHOUT using the image function within this module. Using this image function would still leave me stranded for trying to get the text to come with it.

Below is a sample of what I am working with

</a><br><br><table width="100%" cellpadding="2" cellspacing="0" border="0"><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image001.gif' alt='Image001' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>
</tr><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image002.gif' alt='Image002' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>

I want all images to be in @images and all the text next to each image to be in @text. There is surely a way to collect both in one pass, but would it be easier to use two separate regexes?

Regexes are not my strong point, and I appreciate any and all help getting the data extracted.

Replies are listed 'Best First'.
Re: Pulling all instances of a regex out
by Roy Johnson (Monsignor) on Oct 04, 2005 at 19:02 UTC
    Don't parse HTML with regexen. Use a parser. Look at HTML::Parser. HTML::TableParser may also be useful.

    Caution: Contents may have been coded under pressure.
Re: Pulling all instances of a regex out
by SamCG (Hermit) on Oct 04, 2005 at 19:00 UTC
    Have you tried a module that's intended for parsing HTML, such as HTML::TokeParser (for which I believe there is even a wrapper, HTML::TokeParser::Simple)? It pulls tokens, and I'm pretty sure would get you what you want.

    Writing your own regexes to parse HTML has been described as a bad idea by some pretty lofty monks (merlyn comes to mind). . .
Re: Pulling all instances of a regex out
by davido (Cardinal) on Oct 04, 2005 at 18:57 UTC

    This is a little fragile, to say the least. But then again, regexp approaches to parsing always are.

    use strict;
    use warnings;

    my $page = join '', <DATA>;

    my( @raw_groups ) = split /src\s*=\s*['"]/, $page;

    my( @images, @texts );
    foreach my $raw ( @raw_groups ) {
        my( $image, $text );
        next unless ( $image, $text ) =
            $raw =~ m/^
                (http:.+?)      # Capture the file URL
                ['"].+?>        # Anchors
                (.+?)           # Capture the text
                <\/td           # Final anchor
            /isx;
        print "$image => $text\n";
        push @images, $image;
        push @texts, $text;
    }

    __DATA__
    </a><br><br><table width="100%" cellpadding="2" cellspacing="0" border="0"><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image001.gif' alt='Image001' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>
    </tr><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image002.gif' alt='Image002' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>

    Dave

Re: Pulling all instances of a regex out
by GrandFather (Saint) on Oct 04, 2005 at 19:42 UTC

    For something like this I would use HTML::TreeBuilder. Use look_down to find the elements containing the stuff you want to pull out. For each element found, iterate over the sub-elements to identify the image, then pull out the image and the next sub-element, which is the text.
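
A minimal sketch of the tree-based approach described above, assuming HTML::TreeBuilder as the tree builder and a trimmed-down copy of the sample table. The variable names and the choice to look for `td` cells are illustrative assumptions, not part of the original reply.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modeled on the question's table (assumption: each cell
# holds one image followed directly by its description text).
my $html = q{<table><tr>
<td align="left"><img src='http://images.tek-tips.com/items/image001.gif' alt='Image001'> Description of image here</td>
<td align="right"></td>
</tr><tr>
<td align="left"><img src='http://images.tek-tips.com/items/image002.gif' alt='Image002'> Description of image here</td>
<td align="right"></td>
</tr></table>};

my $tree = HTML::TreeBuilder->new_from_content($html);

my ( @images, @text );

# look_down returns every element matching the criteria, in document order.
for my $cell ( $tree->look_down( _tag => 'td' ) ) {
    my @nodes = $cell->content_list;    # child elements and plain text strings
    for my $i ( 0 .. $#nodes ) {
        my $node = $nodes[$i];

        # Elements are objects; text nodes are plain (non-reference) strings.
        next unless ref $node && $node->tag eq 'img';

        # Take the node right after the image if it is a text node.
        my $next = $nodes[ $i + 1 ];
        my $desc = ( defined $next && !ref $next ) ? $next : '';
        $desc =~ s/^\s+|\s+$//g;        # trim surrounding whitespace

        push @images, $node->attr('src');
        push @text,   $desc;
    }
}
$tree->delete;    # release the tree's memory

print "$images[$_] => $text[$_]\n" for 0 .. $#images;
```

Both arrays are filled inside the same loop iteration, so index N in @images always lines up with index N in @text.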


    Perl is Huffman encoded by design.
Re: Pulling all instances of a regex out
by wfsp (Abbot) on Oct 05, 2005 at 10:25 UTC

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html;
    {
        local $/ = undef;
        $html = <DATA>
    }

    my $tp = HTML::TokeParser::Simple->new(\$html)
        or die "Couldn't parse $html: $!";

    my (@results);
    while (my $t = $tp->get_token) {
        if ($t->is_start_tag('img')){
            push @results, $t->get_attr('src');
        }
        elsif ($t->is_text){
            push @results, $t->as_is;
        }
    }
    print "*$_*\n" for @results;

    __DATA__
    <br><br>
    <table width="100%" cellpadding="2" cellspacing="0" border="0">
    <tr>
    <td align="left" valign="bottom">
    <img src='http://images.tek-tips.com/items/image001.gif' alt='Image001' width='40' height='40' border='0'>
    Description of image here
    </td>
    <td align="right" valign="bottom"></td>
    </tr>
    <tr>
    <td align="left" valign="bottom">
    <img src='http://images.tek-tips.com/items/image002.gif' alt='Image002' width='40' height='40' border='0'>
    Description of image here
    </td>
    <td align="right" valign="bottom"> </td>
    </tr>
    </table>

    The 'text' comes with a fair bit of white space which you need to remove.
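
One possible way to handle that whitespace and fill the two arrays from the question is sketched below. It uses the same HTML::TokeParser::Simple calls as the code above; the `$collecting` flag and the rule that a description ends at the closing `</td>` are assumptions made for this sample markup.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample markup modeled on the question's table (shortened attributes).
my $html = q{<table>
<tr>
<td align="left">
<img src='http://images.tek-tips.com/items/image001.gif' alt='Image001'>
Description of image here
</td>
<td align="right"></td>
</tr>
<tr>
<td align="left">
<img src='http://images.tek-tips.com/items/image002.gif' alt='Image002'>
Description of image here
</td>
<td align="right"> </td>
</tr>
</table>};

my $tp = HTML::TokeParser::Simple->new( \$html )
    or die "Couldn't create parser";

my ( @images, @text );
my $collecting = 0;    # true between an <img> and the next </td>

while ( my $t = $tp->get_token ) {
    if ( $t->is_start_tag('img') ) {
        push @images, $t->get_attr('src');
        push @text,   '';              # slot for this image's description
        $collecting = 1;
    }
    elsif ( $t->is_end_tag('td') ) {
        $collecting = 0;               # the cell is finished
    }
    elsif ( $collecting && $t->is_text ) {
        $text[-1] .= $t->as_is;        # append raw text to the current slot
    }
}

# Trim the ends and collapse internal runs of whitespace.
for (@text) {
    s/^\s+//;
    s/\s+$//;
    s/\s+/ /g;
}

print "$images[$_] => $text[$_]\n" for 0 .. $#images;
```

Pushing an empty string to @text at the same moment as the image keeps the two arrays the same length even when an image has no description.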