regex hangs forever

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a small Perl web bot to scrape a page, but it just sits there endlessly until I run out of memory. Can anyone help me figure out how to fix it?

   my @content = get("$url$cnt");
   my $content = join("\n", @content);

   push(@images, "http://www.imagebeaver.com/files/images/$1:$1")
       while $content =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)#g;

This is a chunk of the HTML on the page. I'm just trying to match the /files/images/thumb/(imagename) part of each thumbnail and put the matches in @images.

<b>Gallery 115564:</b><br><br>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=3" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/62341412066526321.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=4" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/33641412066526322.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=5" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/5361412066526323.jpg" border="0"></a>
<a 
.
.
.

Replies are listed 'Best First'.

Re: regex hangs forever
by grep (Monsignor) on Dec 15, 2006 at 03:52 UTC

highly encouraged

red herring

Try running the code through the Perl debugger (perl -d) to see where it really gets stuck.

use strict;
use warnings;
use Data::Dumper;
my @images;
my $content =
    '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=3" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/62341412066526321.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=4" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/33641412066526322.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=5" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/5361412066526323.jpg" border="0"></a> '
  . '<a ';

push(@images, "http://www.imagebeaver.com/files/images/$1:$1")
    while $content =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)#g;
print Dumper \@images;

grep

1)Gain XP 2)??? 3)Profit


Re: regex hangs forever
by Devanchya (Beadle) on Dec 15, 2006 at 04:55 UTC

Reasons:

More than one person has read the code; they have most likely thought of more things that could appear on the HTML page.
The overhead is higher, but you're not re-inventing the wheel.
What if you want to search for more than one thing in a week? In a year? Do you want to rewrite your search?
Did I mention the re-inventing the wheel thing?
--

Even smart people are dumb in most things...


Re^2: regex hangs forever

by jettero (Monsignor) on Dec 15, 2006 at 13:55 UTC

HTTP::Response

HTML::Parser

UPDATE: I had solved my question above once before, but couldn't find the code in my massive code-scrapbook. What I wanted was HTML::TreeBuilder. It is the bomb.
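
For anyone wondering what that looks like for the thumbnails in the original question, here is a minimal sketch. It assumes HTML::TreeBuilder is installed from CPAN; the sample markup is copied from the question, and thumbnail_names() is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modeled on the snippet in the original question.
my $html = <<'HTML';
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a>
<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2" target="_blank"><img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a>
HTML

# Hypothetical helper: return the file name of every thumbnail image.
sub thumbnail_names {
    my ($markup) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($markup);
    my @names;
    # look_down returns every element that matches all of the criteria.
    for my $img ($tree->look_down(_tag => 'img', src => qr{/files/images/thumb/})) {
        my ($name) = $img->attr('src') =~ m{/thumb/([^/]+\.jpg)$};
        push @names, $name if defined $name;
    }
    $tree->delete;    # release the parse tree explicitly
    return @names;
}

my @names = thumbnail_names($html);
print "$_\n" for @names;
```

The parser finds the img tags, so the regex only has to pull the file name out of one attribute value at a time.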

-Paul


Re^3: regex hangs forever

by jasonk (Parson) on Dec 15, 2006 at 18:32 UTC

If you are looking to get data from HTML tables, HTML::TableExtract can often do it with much less work than using something like HTML::Parser or HTML::TreeBuilder directly, especially if the tables are complex...

We're not surrounded, we're in a target-rich environment!


Re: regex hangs forever
by vladdrak (Monk) on Dec 15, 2006 at 05:10 UTC

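my @content = get("$url$cnt");

# Split the page text on "href" and test each chunk once; in the sample
# HTML each chunk holds at most one thumbnail, so /g is not needed here.
for my $beaver (split /href/, join("\n", @content)) {
   if ($beaver =~ m#/files/images/thumb/
                    ([a-zA-Z0-9]+\.jpg)#x) {
     push(@images,
     "http://www.imagebeaver.com/files/images/$1:$1");
   }
}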

Re^2: regex hangs forever

by Anonymous Monk on Dec 15, 2006 at 05:25 UTC

C:\>perl -le"print $1 while '1234' =~ /(\d)/g"
1
2
3
4

C:\>
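
The one-liner above steps through the matches one at a time with m//g in scalar context. As a small core-Perl sketch of the same idea applied to the thumbnails, m//g in list context returns every capture at once, so no loop is needed. The sample string is cut down from the HTML in the question, and @thumbs is just an illustrative variable name.

```perl
use strict;
use warnings;

# Two thumbnails taken from the sample HTML in the question.
my $content =
    '<img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"> '
  . '<img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0">';

# In list context, m//g returns the captured group from every match.
my @thumbs = $content =~ m#/files/images/thumb/([a-zA-Z0-9]+\.jpg)#g;

print "$_\n" for @thumbs;
```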

Re: regex hangs forever
by stonecolddevin (Parson) on Dec 17, 2006 at 00:32 UTC

Unless I'm missing something, you never tell us what you're using for your web bot. I'd assume some version of LWP, but I could be wrong. May I suggest something that's not reinventing the wheel (if that's the case), such as WWW::Robot? You pretty much just plug in whatever web page you want retrieved, and it does all the work for you. Then you can move on to HTML::TokeParser to manage your HTML. Hope this helps.
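
A rough sketch of the HTML::TokeParser step, assuming the module is installed (it ships with the HTML-Parser distribution on CPAN) and that the page text is already in a string. The sample markup is taken from the question, and @srcs is just an illustrative variable name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup modeled on the snippet in the original question.
my $html =
    '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=1" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/37021412066526319.jpg" border="0"></a> '
  . '<a href="http://www.imagebeaver.com/view.php?mode=gallery&g=115564&photo=2" target="_blank">'
  . '<img src="http://www.imagebeaver.com/files/images/thumb/99821412066526320.jpg" border="0"></a>';

# Passing a reference to a string makes the parser read from that string.
my $parser = HTML::TokeParser->new(\$html)
    or die "could not create parser";

my @srcs;
# get_tag('img') skips ahead to the next <img> start tag; the second
# element of the returned array ref is a hash of the tag's attributes.
while (my $tag = $parser->get_tag('img')) {
    my $src = $tag->[1]{src};
    push @srcs, $src
        if defined $src && $src =~ m{/files/images/thumb/};
}

print "$_\n" for @srcs;
```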

meh.
