Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I must have done something silly. If I run this, my whole computer locks up for a few minutes until I can kill the CMD window. Can't quite put my finger on why it won't do what I'm trying to make it do..
my $source = get("www.perlmonks.org"); # for example, can be any domain
push(@titles, $1) while $source =~ m/<title>(.+)<\/title>/i;
print join("\n", @titles);

PS: please don't comment on "Oh he's regexing HTML!". I'm not asking why the match doesn't work, just why it goes into an infinite loop, or whatever it is that makes my CPU cry when it runs.

Re: one line regex eating CPU
by GrandFather (Saint) on Jun 23, 2006 at 20:07 UTC

    while evaluates the match in scalar context. Without /g, every evaluation starts again at the beginning of the string, so if the match succeeds once it succeeds every time, and the loop runs until something external kills it.
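To see the restart behaviour without locking up the machine, here is a minimal sketch. The sample string and the `$guard` counter are assumptions added for the demonstration, and `(.+?)` is used so that the captured text is easy to read; the loop in the question has no exit of its own.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the fetched page.
my $source = "<title>first</title><title>second</title>";

my @titles;
my $guard = 0;    # safety counter so this demo terminates

# Without /g, each evaluation of the match starts again at offset 0,
# so the condition stays true and $1 holds the same text every time.
while ($source =~ m/<title>(.+?)<\/title>/i) {
    push @titles, $1;
    last if ++$guard >= 3;    # the loop in the question has no such exit
}

print join("\n", @titles), "\n";
```

The second title is never reached: every pass captures the first one again.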

    Try this to achieve what you are after:

    use strict;
    use warnings;

    my @titles;
    my $source = "<title>www.perlmonks.org</title><title>somewhere else</title>";

    push @titles, $source =~ m/<title>(.+?)<\/title>/ig;
    print join("\n", @titles);

    Prints:

    www.perlmonks.org
    somewhere else

    Note too that you need a non-greedy match, (.+?), and, as you know, this will very likely come unstuck when used on HTML. :)
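A small sketch of the greedy versus non-greedy difference, using the same two-title sample string as above. The variable names `@greedy` and `@lazy` are only illustrative.

```perl
use strict;
use warnings;

my $source = "<title>www.perlmonks.org</title><title>somewhere else</title>";

# Greedy: (.+) runs to the end of the string, then backs off only as
# far as the LAST </title>, so both elements collapse into one match.
my @greedy = $source =~ m/<title>(.+)<\/title>/ig;

# Non-greedy: (.+?) stops at the FIRST </title> it can reach.
my @lazy = $source =~ m/<title>(.+?)<\/title>/ig;

print scalar(@greedy), " greedy match(es): @greedy\n";
print scalar(@lazy),   " non-greedy match(es): @lazy\n";
```

The greedy pattern returns a single capture that still contains the inner `</title><title>` text, while the non-greedy one returns the two titles separately.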


    DWIM is Perl's answer to Gödel
      Thanks, that did fix it. But I swear I did this before where I did push(@array, $1) ... because I actually like that form better than push @array, ..

      How would this method match $1 AND $2 then (assuming we had a second capture going on)?

        Something like this may do what you want:

        use strict;
        use warnings;

        my @titles;
        my $source = <<TEXT;
        <title id='1'>www.perlmonks.org</title>
        <title id='2'>somewhere else</title>
        TEXT

        push @titles, $source =~ m/<title\s+id='(\d+)'>(.+?)<\/title>/igs;
        while (@titles) {
            my @pair = splice @titles, 0, 2;
            print "$pair[0]: $pair[1]\n";
        }

        Prints:

        1: www.perlmonks.org
        2: somewhere else
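If you prefer the push-with-$1 form, one possible alternative is a while loop with /g that stores $1 and $2 together on each pass. This is only a sketch: keeping each pair as an array reference is a choice made here, not something from the code above.

```perl
use strict;
use warnings;

my $source = "<title id='1'>www.perlmonks.org</title>\n"
           . "<title id='2'>somewhere else</title>\n";

my @titles;
# Scalar context plus /g: each pass resumes where the last match ended,
# and $1/$2 hold the captures of the current match.
while ($source =~ m/<title\s+id='(\d+)'>(.+?)<\/title>/igs) {
    push @titles, [$1, $2];    # one [id, text] pair per match
}

print "$_->[0]: $_->[1]\n" for @titles;
```

This keeps id and text together, so no splice step is needed afterwards.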

        DWIM is Perl's answer to Gödel
Re: one line regex eating CPU
by ikegami (Patriarch) on Jun 23, 2006 at 20:14 UTC

    Replace
    push(@titles, $1) while $source =~ m/<title>(.+)<\/title>/i;
    with
    push(@titles, $1) while $source =~ m/<title>(.+?)<\/title>/ig;
    to 1) avoid matching from the beginning of $source every time, and 2) avoid matching too much.

    push(@titles, $source =~ m/<title>(.+?)<\/title>/ig);
    also works. It might even be faster. However, it uses more memory.

    Update: Actually, there should be at most one title, so you want
    push(@titles, $1) if $source =~ m/<title>(.+?)<\/title>/i;
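For comparison, a short sketch that runs the scalar-context loop with /g and a list-context match with /g over the same string. The sample string and the names `@looped` and `@listed` are assumptions for illustration.

```perl
use strict;
use warnings;

my $source = "<title>one</title><title>two</title>";

# Form 1: scalar context with /g. Each evaluation resumes at
# pos($source), and the loop ends when no further match is found.
my @looped;
push(@looped, $1) while $source =~ m/<title>(.+?)<\/title>/ig;

# Form 2: list context with /g. All captures come back as one list,
# which is built in full before it is assigned.
my @listed = $source =~ m/<title>(.+?)<\/title>/ig;

print "looped: @looped\n";
print "listed: @listed\n";
```

Both forms collect the same titles. The list form holds every capture at once, which is where the extra memory comes from on large inputs.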

      ikegami's advice seems to be better than mine: don't reset the regex engine.

        Hum? Yours "resets the regex-engine". Mine doesn't. How can you say we gave the same advice?

        >perl -wle "$_='bacada'; print pos while /a/g"
        2
        4
        6

        >perl -wle "$_='bacada'; print pos while s/a//"
        Use of uninitialized value in print at -e line 1.
        Use of uninitialized value in print at -e line 1.
        Use of uninitialized value in print at -e line 1.
Re: one line regex eating CPU
by shmem (Chancellor) on Jun 23, 2006 at 20:13 UTC
    Don't m//, do s///.
    push(@titles, $1) while $source =~ s/<title>(.+)<\/title>//i;
    You must weed out what you've gathered so far, or you will get the same first match forever.
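A minimal sketch of the s/// approach, with two changes that are assumptions made here and not part of the line above: it works on a copy, because s/// modifies the string it is applied to, and it uses non-greedy `(.+?)` so that each pass removes only one element.

```perl
use strict;
use warnings;

my $source = "<title>one</title><title>two</title>";
my $copy   = $source;    # s/// edits its target, so work on a copy

my @titles;
# Each pass deletes the match it just captured, so the next pass
# starts from the beginning again but finds the following <title>.
push(@titles, $1) while $copy =~ s/<title>(.+?)<\/title>//i;

print "titles: @titles\n";
print "left over: '$copy'\n";
print "original intact: $source\n";
```

The loop ends once nothing is left to remove, at the cost of consuming the string it scans.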
    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}