Extracting HTML between comments

pnelson has asked for the wisdom of the Perl Monks concerning the following question:

I need to write a script to parse a collection of HTML pages and extract a portion of each page, identified by "start" and "end" comments. I would like to use HTML::TokeParser. How can I tell HTML::TokeParser to read the file until it comes to the "start" comment, then parse/output HTML until it reaches the "end" comment, then exit? My (probably comical) attempt, which printed all the text from the document, follows:

#!/usr/bin/perl -w
use diagnostics;
use strict;
use HTML::TokeParser;

my $filename = '/Users/peternelson/Desktop/atdstudy.html';
my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";


while  (my $token = $stream->get_token) {

LOOK: {
    if ($token->[0] eq 'C' and $token->[1] eq '<!-- InstanceBeginEditable name="article" -->') {
        goto PARSE;
        }
    else {
        next;    
    }
}#end look


PARSE: {        
    if ($token->[0] eq 'C' and $token->[1] eq '<!-- InstanceEnd -->') {
        exit;
    }
    elsif ($token->[0] eq 'T') {
        print $token->[1];
    }
    next PARSE;
}#end parse
}

Thanks!


Replies are listed 'Best First'.
Re: Extracting HTML between comments by Aristotle (Chancellor) on Sep 01, 2004 at 19:45 UTC
First of all, do yourself a favour and install HTML::TokeParser::Simple. Your code then becomes

#!/usr/bin/perl -w
use diagnostics;
use strict;
use HTML::TokeParser::Simple;

my $filename = '/Users/peternelson/Desktop/atdstudy.html';
my $stream = HTML::TokeParser::Simple->new( $filename )
  || die "Couldn't read HTML file $filename: $!";

while ( my $token = $stream->get_token ) {

LOOK: {
    if ( $token->is_comment and $token->as_is eq '<!-- InstanceBeginEditable name="article" -->' ) {
        goto PARSE;
    }
    else {
        next;
    }
}

PARSE: {
    if ( $token->is_comment and $token->as_is eq '<!-- InstanceEnd -->' ) {
        exit;
    }
    elsif ( $token->is_text ) {
        print $token->as_is;
    }
    next PARSE;
}
}

That loop doesn't work, because `next` doesn't work that way. You are using it inside a naked block, in which it skips execution of the rest of the block. Since the block only executes once, `next LABEL` is effectively the same as `last LABEL`. It doesn't at all affect execution flow in the surrounding loop, which seems to be what you hoped it'd do.

Your problem here is an ideal match for the flip-flop operator:

while ( my $token = $stream->get_token ) {
    if( ( $token->is_comment and $token->as_is eq '<!-- InstanceBeginEditable name="article" -->' )
        ..
        ( $token->is_comment and $token->as_is eq '<!-- InstanceEnd -->' )
    ) {
        print $token->as_is if $token->is_text;
    }
}

Makeshifts last the longest.
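To see the point about `next` in isolation, here is a minimal sketch that needs no HTML module. The loop items, the `ITEM` label and the `@trace` array are made up for illustration; only the control flow matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @trace;

# An unlabelled `next` inside a bare block only leaves that block.
for my $item (qw(a b)) {
    {
        next;                        # exits the bare block, not the for loop
        push @trace, "unreachable";  # never runs
    }
    push @trace, "after block: $item";   # still runs for every item
}

# Labelling the outer loop lets `next` reach it from inside the bare block.
ITEM: for my $item (qw(a b)) {
    {
        next ITEM if $item eq 'a';   # skips the rest of this iteration
    }
    push @trace, "labelled loop: $item";
}

print "$_\n" for @trace;
# after block: a
# after block: b
# labelled loop: b
```

So in the original script, both `next` and `next PARSE` merely leave their own block, and every token ends up going through the PARSE block.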
Re: Extracting HTML between comments by Eimi Metamorphoumai (Deacon) on Sep 01, 2004 at 19:49 UTC
Looks like a good place for the flip-flop `..` operator. So your code would look like this (untested).

while (my $token = $stream->get_token) {
    if (($token->[0] eq 'C' and $token->[1] eq '<!-- InstanceBeginEditable name="article" -->')
        ..
        ($token->[0] eq 'C' and $token->[1] eq '<!-- InstanceEnd -->')) {
        print $token->[1];
    }
}

Should work (although it might have off-by-one problems, and the style might be confusing). Alternately, you could use a separate variable to keep track of your state.

my $parsing=undef;
while (my $token = $stream->get_token) {
    if ($token->[0] eq 'C' and $token->[1] eq '<!-- InstanceBeginEditable name="article" -->'){
        $parsing=1;
    }
    elsif ($token->[0] eq 'C' and $token->[1] eq '<!-- InstanceEnd -->') {
        $parsing=undef;
    }
    elsif ($parsing) {
        print $token->[1];
    }
}
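The possible off-by-one can be checked without any HTML file. The sketch below runs the flip-flop over a hand-made token list; the `@tokens` data, the shortened comment strings and the two result arrays are assumptions for illustration, and each token copies only the `[type, text]` shape of HTML::TokeParser's text and comment tokens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tokens: 'T' for text, 'C' for comment, as in the thread.
my @tokens = (
    [ T => 'before' ],
    [ C => '<!-- start -->' ],
    [ T => 'inside' ],
    [ C => '<!-- end -->' ],
    [ T => 'after' ],
);

my ( @in_range, @text_only );
for my $token (@tokens) {
    # In boolean context `..` is the flip-flop: it turns true when the left
    # test matches and stays true up to and including the right match.
    if (   ( $token->[0] eq 'C' and $token->[1] eq '<!-- start -->' )
        .. ( $token->[0] eq 'C' and $token->[1] eq '<!-- end -->' ) )
    {
        push @in_range,  $token->[1];
        push @text_only, $token->[1] if $token->[0] eq 'T';
    }
}

print "in range:  @in_range\n";   # in range:  <!-- start --> inside <!-- end -->
print "text only: @text_only\n";  # text only: inside
```

The range is inclusive at both ends, so the two marker comments fall inside it. Filtering on the token type keeps them out of the output. With real HTML::TokeParser tokens the type check matters for a second reason: in start-tag and end-tag tokens the second element is the tag name, not source text.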