hiddengeek has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. This is my first go-around with REs, so please go gentle.
open(INPUT, "ml_test.html");
while (<INPUT>) {
    $text .= $_;
}
close(INPUT);
if ($text =~ /\bclb_new>\b(.*?)\b<\/a>/g) {
    print $1;
}
In ml_test.html, there is some data that is laid out like:
...target=clb_new>DATAINEED</a></font>...
...target=clb_new>MOREDATAINEED</a></font>...
The code above gets "DATAINEED" but not "MOREDATAINEED". I have tried different /g placements, but to no avail. Thank you for your help.
ylg

(sacked) Re: global issue with an RE, I think...
by sacked (Hermit) on Nov 17, 2001 at 02:22 UTC
    An alternative approach is to use HTML::TokeParser to extract the 'linked' text from anchors:
    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new('/tmp/ml_test.html');
    while ( my $t = $p->get_tag('a')) {
        print $p->get_trimmed_text('/a'), "\n";
    }
    --sacked
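    The loop above prints the text of every anchor. A minimal sketch of narrowing it to the anchors from the question, assuming the target=clb_new attribute is what marks the wanted links; the sample markup, the href values, and the @wanted array are made up for illustration and are not taken from the real ml_test.html:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    # Hypothetical sample markup modelled on the snippet in the question.
    my $html = '<font><a href="a.html" target=clb_new>DATAINEED</a></font>'
             . '<font><a href="b.html" target=other>SKIPME</a></font>'
             . '<font><a href="c.html" target=clb_new>MOREDATAINEED</a></font>';

    # new() accepts a reference to a string as well as a file name.
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse sample markup";

    my @wanted;
    while ( my $tag = $p->get_tag('a') ) {
        my $attr = $tag->[1];    # hash ref holding the start tag's attributes
        next unless defined $attr->{target} && $attr->{target} eq 'clb_new';
        push @wanted, $p->get_trimmed_text('/a');
    }

    print "$_\n" for @wanted;
    ```

    Filtering on the parsed attribute keeps working if the attribute is quoted or the attributes appear in a different order, which is where a hand-written pattern tends to break.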
Re: global issue with an RE, I think...
by converter (Priest) on Nov 17, 2001 at 01:49 UTC

    I see two problems

    1. You're closing the input filehandle inside the loop after reading the first record from it. The while loop ends after the first record because the read on the closed filehandle returns false.
    2. $text .= $_; appends each input line to $text.

    Try something like this:

    while (<INPUT>) {
        while ( m!clb_new>([^<]+)</a>!g ) {
            print $1;
        }
    }

    conv

    Update:

    • I should change my nick to "Update"
    • I misread your code. You were appending the input to $text, then parsing it. The code I posted should work though.

Re: global issue with an RE, I think...
by tadman (Prior) on Nov 17, 2001 at 01:54 UTC
    The /g modifier when used in conjunction with an if is a little unusual. Typically you would see things like 'if (s/x/y/g)' or 'if (/x/)'. The /g means to do a global match: in list context it returns a list of all the matches, but in the scalar context of an if it finds only one match each time it is evaluated, so the if only ever sees the first one.

    Maybe you are intending to write something like this:
    open(INPUT, "ml_test.html");
    # Read in the entire file into a single string.
    my $text = join ('', <INPUT>);
    close(INPUT);
    while ($text =~ /\bclb_new>\b(.*?)\b<\/a>/g) {
        print $1;
    }
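
    A rough sketch of the two ways /g can hand back every match, using an inline string in place of ml_test.html. The sample string and the array names are assumptions for illustration, and the \b anchors from the original pattern are left out to keep the example short:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-in for the file contents, based on the layout shown in the question.
    my $text = '...target=clb_new>DATAINEED</a></font>... '
             . '...target=clb_new>MOREDATAINEED</a></font>...';

    # Scalar context: each evaluation finds one match and resumes
    # where the previous one stopped, so a while loop walks them all.
    my @one_at_a_time;
    while ( $text =~ /clb_new>(.*?)<\/a>/g ) {
        push @one_at_a_time, $1;
    }

    # List context: a single evaluation returns every captured group.
    my @all_at_once = $text =~ /clb_new>(.*?)<\/a>/g;

    print join(', ', @one_at_a_time), "\n";
    print join(', ', @all_at_once), "\n";
    ```

    Both arrays end up holding the same two strings; the loop form is handy when each match needs its own processing, the list form when you only want to collect them.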
Re: global issue with an RE, I think...
by impossiblerobot (Deacon) on Nov 17, 2001 at 01:55 UTC
    The biggest problem seems to be that the 'if' runs the match only once, so you are only printing the first match ($1). If you replace 'if' with 'while' you will get closer to what you want.

    while ($text =~ /\bclb_new>\b(.*?)\b<\/a>/g) { print $1; }


    Impossible Robot