webchalkboard has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

Can someone help me out?

I have a page of HTML and I need to grab out certain bits of it. I can identify the bits I need by the format of the HTML because it is generated from a database.

My problem is that I need a regular expression which takes all occurrences of the particular string and puts them in an array.

If there was only one occurrence I wanted to catch, I would use the following regular expression:

$line =~ m/<B><P ALIGN\=CENTER>(.+)<\/B><\/FONT>/g; my $bit_i_want=$1;

$1 holds the bit I want, but how can I get the regular expression to hold all occurrences rather than just the first? I'm sure there is a clever regex way of doing this.

Thanks,
Tom

Learning without thought is labor lost; thought without learning is perilous. - Confucius
WebChalkboard.com | For the love of art...

Replies are listed 'Best First'.
Re: Pattern Matching
by tphyahoo (Vicar) on Mar 09, 2005 at 12:05 UTC
    This should do it. I was surprised that $1 behaves differently depending on whether you loop through the list returned by /g matching or put the =~ match in the condition of a while loop. Updated with various thoughts about this.
    use warnings;
    use strict;
    use Data::Dumper;    # for debugging

    local $/;            # undefine the input record separator -> slurp mode
    my $html = <DATA>;   # read all of DATA in one go
    print "Html: $html\n\n";    # just to check that this worked... it works.

    # Don't do this if you need the $1, $2 type special variables.
    # In list context /g returns every capture of every match instead
    # (here both the outer and the inner group for each hit).
    # Use | as regex delimiter to avoid leaning toothpick syndrome.
    my @matches = $html =~ m|(<B><P ALIGN=CENTER>(.+)</B></FONT>)|g;
    print "Dumper\n" . Dumper(\@matches);    # bit o' debugging

    # Do this. Then you can access the special vars: in scalar context
    # each pass of the while loop finds the next match.
    while ($html =~ m|(<B><P ALIGN=CENTER>(.+)</B></FONT>)|g) {
        # $2 is the inner group (the text); $1 is the outer group,
        # i.e. the whole match including the tags.
        print "match $2\n";
    }
    # outputs
    # match blah
    # match foo
    # match gah

    #$line =~ m/<B><P ALIGN\=CENTER>(.+)<\/B><\/FONT>/gm;
    #my $bit_i_want=$1;

    __DATA__
    <B><P ALIGN=CENTER>blah</B></FONT>
    <B><P ALIGN=CENTER>foo</B></FONT>
    <B><P ALIGN=CENTER>gah</B></FONT>
    Maybe you should be using HTML::TokeParser to do HTML matching; regex matching on HTML doesn't scale too well.
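For comparison, here is a rough sketch of what the HTML::TokeParser route might look like. It assumes the module is installed (it comes from CPAN in the HTML-Parser distribution, not with core Perl) and that the wanted text always sits between a centered <P> and the next </B>, as in the sample posted in this thread. The names $parser and @wants are just illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module (HTML-Parser distribution), not core

# Sample markup taken from the thread, shortened to two chunks.
my $html = 'some random text<B><P ALIGN=CENTER>My Text I want to grab 1</B></FONT>'
         . '<p>some of stuff<B><P ALIGN=CENTER>My Text I want to grab 2</B></FONT>';

my $parser = HTML::TokeParser->new(\$html) or die "cannot create parser";

my @wants;
# Step through every <p> start tag, whatever its case in the source.
while (my $tag = $parser->get_tag('p')) {
    my $attr = $tag->[1];    # hash ref of attributes; names are lowercased
    next unless lc($attr->{align} // '') eq 'center';
    # Collect the text between this tag and the next </b>.
    push @wants, $parser->get_trimmed_text('/b');
}

print "$_\n" for @wants;
```

The plain <p> without ALIGN=CENTER is skipped by the attribute check, so only the two wanted chunks should end up in @wants.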
Re: Pattern Matching
by lidden (Curate) on Mar 09, 2005 at 11:49 UTC
    @array = $line =~ m/<B><P ALIGN\=CENTER>(.+)<\/B><\/FONT>/g;
    or
    push @array, $line =~ m/<B><P ALIGN\=CENTER>(.+)<\/B><\/FONT>/g;
    should work, depending on what you are doing.
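To illustrate where the two forms differ: assignment replaces whatever the array held, while push appends, which matters if the page is processed one line at a time. This is only a sketch under that assumption. The two sample lines and the names @lines, @assigned and @pushed are made up, and the capture is written as (.+?) rather than (.+) so that each capture stops at the first closing tags.

```perl
use strict;
use warnings;

# Hypothetical input: two lines of a page, made up for illustration.
my @lines = (
    'a<B><P ALIGN=CENTER>one</B></FONT>b<B><P ALIGN=CENTER>two</B></FONT>',
    'c<B><P ALIGN=CENTER>three</B></FONT>',
);

my (@assigned, @pushed);
for my $line (@lines) {
    # Assignment overwrites: after the loop only the last line's matches remain.
    @assigned = $line =~ m/<B><P ALIGN=CENTER>(.+?)<\/B><\/FONT>/g;

    # push appends: matches from every line accumulate.
    push @pushed, $line =~ m/<B><P ALIGN=CENTER>(.+?)<\/B><\/FONT>/g;
}

print "assigned: @assigned\n";    # assigned: three
print "pushed: @pushed\n";        # pushed: one two three
```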

      Thanks, that's what I already had, but those only seem to get the first occurrence... and then add the rest of the page after it. Almost like the second part of the regex isn't telling it when to stop...

      Here is the pattern I'm trying to match; have I got the regex wrong?

      some random text<B><P ALIGN=CENTER>My Text I want to grab 1</B></FONT><p>some of stuff<B><P ALIGN=CENTER>My Text I want to grab 2</B></FONT> some other random text<B><P ALIGN=CENTER>My Text I want to grab 3</B></FONT>

      Any ideas?

      Thanks again

        $_ = 'some random text<B><P ALIGN=CENTER>My Text I want to grab 1</B></FONT><p>some of stuff<B><P ALIGN=CENTER>My Text I want to grab 2</B></FONT> some other random text<B><P ALIGN=CENTER>My Text I want to grab 3</B></FONT>';
        my @wants = /<B><P ALIGN\=CENTER>(.+?)<\/B><\/FONT>/g;
        print "$_\n" for @wants;
        Note non-greedy match.

        Caution: Contents may have been coded under pressure.
        Just for the record, that's ill-formed html in your example:
        <B><P ALIGN=CENTER>My Text I want to grab 1</B></FONT><p>
        Most of the issues will be ignored by most browsers, but are you certain? If what is displayed differs from the actual markup, a regex will have a hard time.

        (and, OT: writing HTML 4.01 compliant code and using CSS IS worth the trouble -- for anything you're putting online, anyway)

        Directly on your question: note that your pattern is greedy. (.+) means one or more of anything except a newline, as many as possible. As written, $line =~ m/<B><P ALIGN\=CENTER>(.+)<\/B><\/FONT>/g; will swallow everything up to the last </B></FONT> pair on the line.
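The greedy behaviour is easy to see by running both variants against the same string. A small sketch, using a shortened, made-up version of the sample posted in this thread; @greedy and @lazy are illustrative names.

```perl
use strict;
use warnings;

# Shortened, made-up version of the sample string from the thread.
my $line = 'x<B><P ALIGN=CENTER>grab 1</B></FONT>y<B><P ALIGN=CENTER>grab 2</B></FONT>z';

# Greedy: (.+) runs to the end of the line, then backs off only as far
# as the LAST </B></FONT>, so everything in between is one capture.
my @greedy = $line =~ m/<B><P ALIGN=CENTER>(.+)<\/B><\/FONT>/g;

# Non-greedy: (.+?) stops at the FIRST </B></FONT> it can reach.
my @lazy = $line =~ m/<B><P ALIGN=CENTER>(.+?)<\/B><\/FONT>/g;

print "greedy found ", scalar(@greedy), ": ", join(' | ', @greedy), "\n";
print "non-greedy found ", scalar(@lazy), ": ", join(' | ', @lazy), "\n";
```

The greedy version should report a single capture that still contains the first closing tags and the second opening tags; the non-greedy version should report the two chunks separately.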

        And re capturing all instances: suggest you push each match onto an array inside a loop. push appends to the end of the array, so it grows by one element for each new item found.
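A sketch of the push-in-a-loop idea, written here with a while loop because m//g in scalar context hands back one match per pass and resumes where the previous one ended. The sample string and the name @found are made up, and the capture is non-greedy.

```perl
use strict;
use warnings;

# Made-up sample line with two chunks to grab.
my $line = 'a<B><P ALIGN=CENTER>first</B></FONT>b<B><P ALIGN=CENTER>second</B></FONT>';

my @found;
# Each pass finds the next match; $1 holds that match's capture.
while ($line =~ m/<B><P ALIGN=CENTER>(.+?)<\/B><\/FONT>/g) {
    push @found, $1;    # append; no subscript bookkeeping needed
}

# Show each element with its subscript.
print "$_: $found[$_]\n" for 0 .. $#found;
```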

Re: Pattern Matching
by Anonymous Monk on Mar 09, 2005 at 12:38 UTC
    Your .+ is greedy. Use:
    @matches = $line =~ m{<B><P \s+ ALIGN=CENTER> ([^<]*(?:<(?!/B></FONT>)[^<]*)*) </B></FONT>}gx;
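Spread over several lines with comments, the same pattern may be easier to follow. This is a sketch against a made-up string in which the first wanted chunk contains a nested tag, to show that the capture only stops at the full </B></FONT> sequence.

```perl
use strict;
use warnings;

# Made-up sample: the first wanted chunk contains a nested tag.
my $line = 'x<B><P ALIGN=CENTER>one <i>two</i></B></FONT>'
         . 'y<B><P ALIGN=CENTER>three</B></FONT>z';

my @matches = $line =~ m{
    <B><P \s+ ALIGN=CENTER>     # opening tags; with /x the spaces here are ignored
    (                           # capture the wanted text:
        [^<]*                   #   a run of characters other than "<"
        (?:
            < (?! /B></FONT> )  #   a "<" that does not begin the closing tags
            [^<]*               #   then another run of characters other than "<"
        )*
    )
    </B></FONT>                 # closing tags
}gx;

print "$_\n" for @matches;
```

Because the pattern never relies on .+ at all, there is nothing for greediness to overrun: the negative lookahead stops the capture exactly where the closing tags begin.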
Re: Pattern Matching
by sh1tn (Priest) on Mar 09, 2005 at 12:34 UTC
    In addition (simple example):
    use strict;

    my @patterns;
    my $pattern = qr{^(\w+):\w+};

    # read DATA line by line and collect every capture from each line
    while (<DATA>) {
        push @patterns, $1 while /$pattern/g;
    }
    print "@patterns";

    __DATA__
    user_one:unused
    user_two:unused
    user_three:unused
    __END__

    STDOUT:
    user_one user_two user_three