problem with parsing

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: problem with parsing by davorg (Chancellor) on Apr 27, 2006 at 09:31 UTC
First, the standard warning: if you're parsing HTML, you should use an HTML parser rather than regular expressions. Your regex has only one set of capturing brackets, so only $1 gets a value. To collect every match, add the /g modifier and evaluate the match in list context. You need something like this:
`if (my @text = $text =~ m|</td><td>([^<]+)</.|g) { print "@text\n"; }`
Note also the use of alternative delimiters for m//, which makes the code more readable.
-- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
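To make the list-context point concrete, here is a minimal sketch. The sample string is borrowed from another reply in this thread, since the original poster's data is not shown, and the `[^<]+` character class is my own assumption about what the cells contain (plain text with no nested tags).

```perl
use strict;
use warnings;

# Assumed sample input; the original poster's data is not shown.
my $text = '<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>';

# List context with /g: the match returns the captures from every match.
my @all = $text =~ m|</td><td>([^<]+)</|g;

# No /g: only the first match is made, and its capture lands in $1.
my $first = $text =~ m|</td><td>([^<]+)</| ? $1 : '';

print "all:   @all\n";
print "first: $first\n";
```

The only difference between the two matches is the /g modifier and the context the match is evaluated in; the pattern itself is unchanged.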
Re: problem with parsing by marto (Cardinal) on Apr 27, 2006 at 09:30 UTC
Hi Anonymous Monk, have you considered using a module such as HTML::TokeParser::Simple to complete this task? Check out the Tutorials section of this site should you need help installing modules. Hope this helps. Also, please read the PerlMonks FAQ and "How do I post a question effectively?". Martin
Re: problem with parsing by GrandFather (Saint) on Apr 27, 2006 at 09:33 UTC
Here's one way. Note, though, that for any significant amount of HTML parsing you really, really should use one of the HTML modules such as HTML::TreeBuilder or HTML::Parser.
`use strict; use warnings; my $text = "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>"; my @matches = $text =~ m|<td>((?:(?!</td>).)*)</td>|g; print "@matches";`
Prints: `Hello Bye`
DWIM is Perl's answer to Gödel
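For anyone puzzled by the `(?:(?!</td>).)*` part, here is the same pattern spelled out with /x and comments, next to a plain greedy `.*` for comparison. This is only a sketch of how I read the regex; the variable names `$cell`, `@cells` and `$greedy` are mine.

```perl
use strict;
use warnings;

my $text = '<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>';

# Same pattern as above, written with /x so each piece can be commented.
my $cell = qr{
    <td>              # an opening tag with no attributes
    (                 # start of capture: the cell body
      (?:
        (?! </td> )   # make sure a closing tag does not start here
        .             # then consume one character
      )*              # repeat for as long as that holds
    )
    </td>             # the closing tag itself
}x;

my @cells = $text =~ /$cell/g;

# For contrast: a greedy .* runs on to the LAST closing tag in the string.
my ($greedy) = $text =~ m|<td>(.*)</td>|;

print "cells:  @cells\n";
print "greedy: $greedy\n";
```

The lookahead is checked before every single character, so the capture can never run past the nearest closing tag. Note that `.` does not match a newline unless you add /s, so cells spanning several lines would need that modifier.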
Re^2: problem with parsing by Anonymous Monk on Apr 27, 2006 at 09:55 UTC
Thanks a lot.
Re: problem with parsing by gellyfish (Monsignor) on Apr 27, 2006 at 10:09 UTC
As nearly everyone has pointed out, you almost certainly want to do this with a module such as HTML::Parser. The following is an example of how you would achieve this for your snippet of HTML:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use HTML::Parser;

    my $text = "<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>";

    my $parser = HTML::Parser->new(
        start_h          => [ \&start, "self,tag,attr" ],
        start_document_h => [ \&init,  "self" ],
    );

    $parser->parse($text);
    $parser->eof;    # signal end of input so nothing stays buffered

    foreach my $item ( @{ $parser->{_items} } ) {
        print $item, "\n";
    }

    # Set up the storage for the collected cells and the current cell text.
    sub init {
        my ($self) = @_;
        $self->{_items} = [];
        $self->{_text}  = '';
    }

    # On a <td> without a class attribute, start collecting text.
    sub start {
        my ( $self, $tag, $attribs ) = @_;
        if ( $tag eq 'td' && !exists $attribs->{class} ) {
            $self->handler( text => \&get_text, "self,dtext" );
            $self->handler( end  => \&end,      "self,tag" );
        }
    }

    sub get_text {
        my ( $self, $text ) = @_;
        $self->{_text} .= $text;
    }

    # On </td>, switch the handlers off again and store what was collected.
    sub end {
        my ( $self, $tag ) = @_;
        if ( $tag eq '/td' ) {
            $self->handler( text => '' );
            $self->handler( end  => '' );
            push @{ $self->{_items} }, $self->{_text};
            $self->{_text} = '';
        }
    }

/J\
Re: problem with parsing by prasadbabu (Prior) on Apr 27, 2006 at 09:24 UTC
Hi Anonymous Monk, use a 'while' statement instead of an 'if' statement. An 'if' statement will match only once. With 'while' and the /g modifier, each pass resumes where the previous match ended, so you can match more than once. Instead of a regex, you can also use one of the HTML modules to accomplish your requirement. Also take a look at perlsyn.
`while ($text =~ m|<td>((?:(?!</td>).)*)</td>|g) { print "$1\n"; }`
Prasad
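Here is a small side-by-side sketch of the 'if' versus 'while' behaviour, using the sample string from another reply in this thread. The arrays `@once` and `@every` are just names I picked to record what each form sees.

```perl
use strict;
use warnings;

my $text = '<td class=sp></td><td>Hello</td><td class=sp></td><td>Bye</td><td class=sp>';
my $re   = qr|<td>((?:(?!</td>).)*)</td>|;

# 'if' tests the match a single time, so only the first cell is seen.
my @once;
if ( $text =~ $re ) {
    push @once, $1;
}

# 'while' with /g in scalar context: each pass resumes at pos($text),
# and the loop ends when no further match is found.
my @every;
while ( $text =~ /$re/g ) {
    push @every, $1;
}

print "if:    @once\n";
print "while: @every\n";
```

One thing to watch for: a scalar-context /g match remembers its position in the string, so if you break out of such a loop early and match again, reset it with `pos($text) = undef` first.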