parsing html file

achilles82 has asked for the wisdom of the Perl Monks concerning the following question:

When I try to parse an HTML page, my code repeats the output many times. Here IR.htm can be any HTML page.

#print"content-type: text/html\n\n";
#require LWP::simple;
require HTML::TreeBuilder;
require HTML::FormatText;

#$URL = get("http://www.scriptsocket.com");
$Format = HTML::FormatText->new;
$TreeBuilder = HTML::TreeBuilder->new;
open(FILE,"IR.htm");
#$data =<FILE>;
while(<FILE>){
      chomp $_;
      $TreeBuilder->parse($_);
      $Parsed = $Format->format($TreeBuilder);
      #print "$Parsed";
      push(@word,$Parsed);
}

foreach(@word){
print $_,"\n";
}

close FILE;
#exit;

Replies are listed 'Best First'.
Re: parsing html file
by jdporter (Paladin) on Oct 09, 2008 at 00:27 UTC

What you're forgetting is that the parse method of HTML::TreeBuilder (inherited from HTML::Parser) does not process a single, whole HTML document, it processes a chunk; each time you call it, it adds to the current tree. Therefore, you should not do anything with the tree until you're completely done processing the input, after the while loop. So simply move the

$Format->format($TreeBuilder);

line to after the end of the loop.

print HTML::FormatText->new->format( HTML::TreeBuilder->new->parse_file($_)) for @ARGV;

Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
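
For reference, here is a minimal sketch of the original loop with that change applied: parse every line first, then format the finished tree once. The helper name html_file_to_text is made up for illustration, and taking file names from @ARGV is an assumption rather than part of the original script. It also skips the chomp, so line breaks in the source still count as whitespace between words.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper (name is illustrative, not from the original post):
# feed the file to the parser line by line, then format the tree once.
sub html_file_to_text {
    my ($path) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";

    my $tree = HTML::TreeBuilder->new;
    while ( my $line = <$fh> ) {
        $tree->parse($line);    # each call only adds a chunk to the tree
    }
    $tree->eof;                 # tell the parser the input is complete
    close $fh;

    # Format only now that the whole document is in the tree.
    my $text = HTML::FormatText->new->format($tree);
    $tree->delete;              # release the tree's memory
    return $text;
}

# Assumed usage: file names given on the command line.
print html_file_to_text($_) for @ARGV;
```

Saved as, say, parse.pl, this could be run as `perl parse.pl IR.htm`. Because format is called once per file rather than once per input line, each piece of text should appear in the output only once.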
