Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a small Perl web bot to scrape a page, but it just sits there until I run out of memory. Can anyone help me figure out how to fix it?
my @content = get("$url$cnt");
my $content = join("\n", @content);
push(@images, "http://www.imagebeaver.com/files/images/$1:$1")
    while $content =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)#g;
This is a chunk of HTML from the page. I'm just trying to match the /files/images/thumb/(imagename) of the thumbnails and put them in @images.
b>Gallery 115564:</b><br><br>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=3" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/62341412066526321.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=4" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/33641412066526322.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=5" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/5361412066526323.jpg" border="0"></a>
<a . . .

Replies are listed 'Best First'.
Re: regex hangs forever
by grep (Monsignor) on Dec 15, 2006 at 03:52 UTC
    You are strongly encouraged to reduce your code to a simple test case that demonstrates the problem. That is not just so there is less for us to read through, but also to help you solve your problem beforehand and to eliminate red herrings. If you had run a simple test case, as I illustrate below, you would have found that your test data and regex work just fine. So you need to look for something besides the regex.

    Try running the code through the Perl debugger (perl -d) to see where it really gets stuck.

    use strict;
    use warnings;
    use Data::Dumper;

    my @images;
    my $content = '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a> <a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a> <a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=3" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/62341412066526321.jpg" border="0"></a> <a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=4" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/33641412066526322.jpg" border="0"></a> <a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=5" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/5361412066526323.jpg" border="0"></a> <a ';

    push(@images, "http://www.imagebeaver.com/files/images/$1:$1")
        while $content =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)#g;

    print Dumper \@images;

    grep
    1)Gain XP 2)??? 3)Profit

Re: regex hangs forever
by Devanchya (Beadle) on Dec 15, 2006 at 04:55 UTC
    To add to what grep stated: with HTML, you should look at the helper modules on CPAN for working through the markup. One good one is HTML::Parser.

    Reasons:

  • More than one person has read the code, and they have most likely thought of more things that could appear on the HTML page.
  • The overhead is higher, but you're not reinventing the wheel.
  • What if you want to search for more than one thing in a week? In a year? Do you want to rewrite your search?
  • Did I mention re-invent the wheel thing?
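
As a rough illustration of the parser-based approach, here is a minimal sketch that pulls the thumbnail names out of <img> tags with HTML::Parser instead of matching the raw page text. The sample markup is a shortened copy of the HTML from the question, and the handler name start_handler is made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Shortened sample modelled on the HTML in the question.
my $html =
    '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a>';

my @thumbs;

# Called once per start tag; keeps the file name of every thumbnail <img>.
# (start_handler is a hypothetical name chosen for this sketch.)
sub start_handler {
    my ( $tagname, $attr ) = @_;
    return unless $tagname eq 'img' && defined $attr->{src};
    push @thumbs, $1
        if $attr->{src} =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)$#;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_handler, 'tagname, attr' ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @thumbs;
```

Because the pattern is applied only to the src attribute of <img> tags, links or text elsewhere on the page that happen to contain a similar path are ignored.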
    --
    Even smart people are dumb in most things...
      I have a related question. Coincidentally, I am looking for a way to read table rows from the content of an HTTP::Response object. I was going to fire up a touch of HTML::Parser, but I was first going to look for a built-in. Any tips?

      UPDATE: I had solved my question above once before, but couldn't find the code in my massive code-scrapbook. What I wanted was HTML::TreeBuilder. It is the bomb.

      -Paul

        If you are looking to get data from HTML tables, HTML::TableExtract can often do it with much less work than using something like HTML::Parser or HTML::TreeBuilder directly, especially if the tables are complex...
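
A minimal sketch of what that can look like, assuming a small made-up table with Name and Size columns (in practice the HTML would come from something like $response->decoded_content on the HTTP::Response object):

```perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Hypothetical table used only for illustration.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Size</th></tr>
  <tr><td>a.jpg</td><td>12</td></tr>
  <tr><td>b.jpg</td><td>34</td></tr>
</table>
HTML

# Pick the table by its column headers rather than by its position.
my $te = HTML::TableExtract->new( headers => [ 'Name', 'Size' ] );
$te->parse($html);

# Copy every extracted row into a plain array of array refs.
my @rows;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @rows, [@$row];
    }
}

print join( ', ', @$_ ), "\n" for @rows;
```

Selecting by header text means the code keeps working if the page gains extra layout tables around the one you care about.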


        We're not surrounded, we're in a target-rich environment!
Re: regex hangs forever
by vladdrak (Monk) on Dec 15, 2006 at 05:10 UTC
    Try this:
    my @content = get("$url$cnt");
    # split needs a string, so join the lines before splitting on "href"
    foreach my $beaver (split /href/, join("\n", @content)) {
        if ($beaver =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)#gx) {
            push(@images, "http://www.imagebeaver.com/files/images/$1:$1");
        }
    }
      /g makes the difference
      C:\>perl -le"print $1 while '1234' =~ /(\d)/g"
      1
      2
      3
      4

      C:\>
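
To spell out what /g changes, here is a small sketch using only core Perl and a made-up sample string. It also shows one way a loop like the one in the question could run forever and eat memory: without /g, the match restarts from the beginning of the string every time, so a while loop that pushes onto an array would never stop. The last loop is deliberately capped at three iterations to show this safely.

```perl
use strict;
use warnings;

# Made-up sample string for illustration.
my $content = 'thumb/aaa.jpg thumb/bbb.jpg thumb/ccc.jpg';

# Scalar context with /g: each evaluation resumes after the previous
# match, so the loop visits every match once and then stops.
my @one_by_one;
while ( $content =~ m#thumb/([a-z]+\.jpg)#g ) {
    push @one_by_one, $1;
}

# List context with /g: all captures are returned in one go.
my @all_at_once = $content =~ m#thumb/([a-z]+\.jpg)#g;

# Without /g the match always starts at the beginning of the string,
# so it finds the first match every time. The loop is bounded on purpose;
# an unbounded while loop here would never terminate.
my @no_g;
for ( 1 .. 3 ) {
    push @no_g, $1 if $content =~ m#thumb/([a-z]+\.jpg)#;
}

print "while with /g : @one_by_one\n";
print "list with /g  : @all_at_once\n";
print "without /g    : @no_g\n";
```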
Re: regex hangs forever
by stonecolddevin (Parson) on Dec 17, 2006 at 00:32 UTC

    Unless I'm missing something, you never tell us what you're using for your web bot. I'd assume some version of LWP, but I could be wrong. May I suggest something that's not reinventing the wheel (if that's the case), such as WWW::Robot? You pretty much just plug in whatever web page you want retrieved, and it does all the work for you. Then you can move on to HTML::TokeParser to manage your HTML. Hope this helps.
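
For the HTML::TokeParser step, a minimal sketch might look like the following. It assumes the page has already been fetched into a string; the sample markup is a shortened copy of the HTML from the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution on CPAN

# Shortened sample modelled on the HTML in the question.
my $html =
    '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a>';

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new( \$html )
    or die "could not create parser";

# get_tag('img') skips ahead to the next <img> start tag and returns
# [ tag name, attribute hash, ... ], or undef at the end of the document.
my @thumb_urls;
while ( my $tag = $p->get_tag('img') ) {
    my $src = $tag->[1]{src} or next;
    push @thumb_urls, $src if $src =~ m#/files/images/thumb/#;
}

print "$_\n" for @thumb_urls;
```

The loop ends when get_tag runs out of <img> tags, so there is no match position to manage by hand.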

    meh.