Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract the two words between the <td> tags: "Hello" and "Bye".

$text= "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";
if ($text=~ m/<\/td><td>([^ ]*)<\/./g) {
    print "$1\n$2\n$3";
}

but I only get the first one, "Hello", and cannot get the word "Bye".

Please help

Code and p tags added by GrandFather

Replies are listed 'Best First'.
Re: problem with parsing
by davorg (Chancellor) on Apr 27, 2006 at 09:31 UTC

    First, the standard warning that if you're parsing HTML then you should use an HTML parser rather than regular expressions.

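    Your regex has only one set of capturing brackets, so only $1 gets a value; $2 and $3 are never set. With /g in list context, the match returns the capture from every match, so you need something like this: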

    if (my @text = $text =~ m|</td><td>([^ ]*)</.|g) {
        print "@text\n";
    }

    Note also the use of alternative delimiters for m// which makes the code more readable.
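
To make the capture-group point concrete, here is a small sketch. The variable names $first, $second and @all are mine, not from the thread. It assumes nothing beyond core Perl: a pattern with one pair of parentheses only ever fills $1, while the same pattern with /g in list context returns the capture from each match.

```perl
use strict;
use warnings;

my $text = "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";

# Single match: only $1 exists because the pattern has one capture
# group, so $2 is left undefined.
my ($first, $second);
if ($text =~ m|</td><td>([^ ]*)</.|) {
    ($first, $second) = ($1, $2);
}

# List context with /g: every successful match contributes its capture.
my @all = $text =~ m|</td><td>([^ ]*)</.|g;

print "first:  $first\n";
print "second: ", (defined $second ? $second : "(undef)"), "\n";
print "all:    @all\n";
```

The first match deliberately omits /g so that it does not leave a match position on $text, which a later /g match would otherwise resume from.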

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: problem with parsing
by marto (Cardinal) on Apr 27, 2006 at 09:30 UTC
Re: problem with parsing
by GrandFather (Saint) on Apr 27, 2006 at 09:33 UTC

    Here's one way. Note though that for any significant amount of HTML parsing you really really should use one of the HTML modules such as HTML::TreeBuilder or HTML::Parser.

    use strict;
    use warnings;

    my $text= "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";
    my @matches = $text=~ m|<td>((?:(?!</td>).)*)</td>|g;
    print "@matches";

    Prints:

    Hello Bye
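
The (?:(?!</td>).)* part of that pattern consumes one character at a time, but only while </td> does not start at the current position, so the capture can never contain a closing tag. For input as simple as this snippet, a non-greedy (.*?) appears to give the same result. The sketch below compares the two; the names @tempered and @lazy are mine.

```perl
use strict;
use warnings;

my $text = "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";

# Negative lookahead checked before every character: never crosses </td>.
my @tempered = $text =~ m|<td>((?:(?!</td>).)*)</td>|g;

# Non-greedy quantifier: stops at the first </td> that lets the match succeed.
my @lazy = $text =~ m|<td>(.*?)</td>|g;

print "tempered: @tempered\n";
print "lazy:     @lazy\n";
```

Neither form matches across newlines unless the /s modifier is added, and neither copes with attributes on the <td> tag, which is one more reason to prefer a real HTML parser.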

    DWIM is Perl's answer to Gödel
      thx a lot
Re: problem with parsing
by gellyfish (Monsignor) on Apr 27, 2006 at 10:09 UTC

    As nearly everyone has pointed out, you almost certainly want to do this with a module such as HTML::Parser. The following is an example of how you would achieve this for your snippet of HTML:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $text= "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";

    use HTML::Parser;

    my $parser = HTML::Parser->new(
        start_h          => [ \&start, "self,tag,attr" ],
        start_document_h => [ \&init, "self" ],
    );

    $parser->parse($text);

    foreach my $item ( @{$parser->{_items}} ) {
        print $item,"\n";
    }

    sub init {
        my ( $self ) = @_;
        $self->{_items} = [];
    }

    sub start {
        my ( $self, $tag, $attribs) = @_;
        if ( $tag eq 'td' && !exists $attribs->{class} ) {
            $self->handler(text => \&get_text,"self,dtext" );
            $self->handler(end => \&end,"self,tag");
        }
    }

    sub get_text {
        my ( $self, $text) = @_;
        $self->{_text} .= $text;
    }

    sub end {
        my ( $self, $tag ) = @_;
        if ( $tag eq '/td' ) {
            $self->handler(text => '' );
            $self->handler(end => '');
            push @{$self->{_items}}, $self->{_text};
            $self->{_text} = '';
        }
    }

    /J\

Re: problem with parsing
by prasadbabu (Prior) on Apr 27, 2006 at 09:24 UTC

    Hi Anonymous Monk,

    Use a 'while' statement instead of an 'if' statement. An 'if' statement evaluates the match only once, so you get just the first result. With 'while' and the /g modifier, each iteration continues from where the previous match ended, so you can match more than once. Instead of a regex, you can also use one of the HTML modules to accomplish this. Also take a look at perlsyn.

    while ($text=~ m|<td>((?:(?!</td>).)*)</td>|g) {print "$1\n";}
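
As a rough illustration of why the loop advances, the sketch below records pos($text) after each successful match. In scalar context, m//g resumes from that offset on the next iteration. The @found array and the message format are my own additions.

```perl
use strict;
use warnings;

my $text = "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";

my @found;
while ($text =~ m|<td>((?:(?!</td>).)*)</td>|g) {
    # pos($text) is the offset just past the match that just succeeded;
    # the next iteration starts searching from there.
    push @found, sprintf("%s ends at offset %d", $1, pos($text));
}

print "$_\n" for @found;
```

When the match finally fails, the position is reset, the condition becomes false, and the loop ends.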

    Prasad