Peamasii has asked for the wisdom of the Perl Monks concerning the following question:

Hello all I'm using perl 5.8.0. and HTML::Parser 3.26. I'm parsing a file with many links and just extracting the links. The problem that occurs consistently is that some of the link titles get truncated right in the Parser subroutine. Only about 10% of the links get truncated like this. I examined the input file with a hex editor and was unable to detect any reason for this to happen. If you run the code, most of the lines printed out with EditLine contain incomplete link title output in index.xml. Here is the input file and the code (sliparser.pl):

input data (Save file)

run command:> perl sliparser.pl prodnr.html number

written output with incomplete links:> index.xml

#!/usr/bin/perl use strict; # define the subclass package ProcExternal; use base "HTML::Parser"; #specifying filename to open if (($ARGV[0] eq "?") || ($#ARGV != 1)) {die "usage $0 file_name orde +ring (name/number)\n";}; my $file_name=$ARGV[0]; my $order = $ARGV[1]; my $skip = 1; my ($product_url, $product_bigurl, $product_id, $product_name, $dir_na +me); my (@tarray, @sarray); my ($orig_text, $orig_self, $product_line); @sarray = split /\\/,$file_name; pop @sarray; $dir_name = join ('\\', @sarray); if (!open(OFILE,">index.xml")) { die "Can't open $product_name: $!"; } +; print OFILE "<Products>\n"; &proc_html($file_name); print OFILE "</Products>\n"; close OFILE; sub text { my ($self, $text) = @_; $orig_self = $self; $orig_text = $text; if (!$skip) { @tarray = split(/ /,$text); if ($order eq "name") { $product_id = pop @tarray; } elsif ($order eq "number") { $product_id = shift @tarray; } elsif ($order eq "none") { $product_id = "999"; } else { die "invalid ordering parameter\n"; } if ($product_id !~/[0-9]/) { print "number format error: $text\n"; } foreach (@tarray) { $_ =~ s/\s+//g; } $product_name = join(' ',@tarray); $product_name =~ s/^\s//g; if (($product_id) && ($product_name)) { print OFILE "<Product>\n"; if ($#tarray < 2) { print OFILE "\t<EditLine>$orig_text</EditLine>\n"; print "EditLine #: $product_id $product_name\n"; }; print OFILE "\t<Name>$product_name</Name>\n"; print OFILE "\t<PDF>$product_url</PDF>\n"; print OFILE "\t<Number>$product_id</Number>\n"; print OFILE "</Product>\n\n"; # } } $skip = 1; } } sub comment { my ($self, $comment) = @_; } sub start { my ($self, $tag, $attr, $attrseq, $origtext) = @_; if ($tag eq "a") { $skip = 0; $product_line = $origtext; $product_bigurl = $attr->{href}; @sarray = split('/',$product_bigurl); $product_url = uc(pop @sarray); } } sub end { my ($self, $tag, $origtext) = @_; if ($tag eq "a") { # print $origtext; } } sub proc_html { my $htmlcontent = shift (@_); my $p = new ProcExternal; $p->parse_file($htmlcontent); $p->eof; return; }

edit (broquaint): added <readmore>

Replies are listed 'Best First'.
Re: HTML::Parser problem
by matija (Priest) on Mar 29, 2004 at 08:40 UTC
    HTML::Parser does not guarantee that you will get the whole text in a link in a single chunk. You could try to see if enabling $p->unbroken_text(1) will solve your problem.

    The other (IMHO better) solution is to move the processing of the text from your text subroutine into your end subroutine.

    (Incidentaly, I just noticed you forgot to set skip back to 1 in your end subroutine...

      I appreciate your suggestion, especially since it fixes the problem ;-)

      Yes, the $p->unbroken_text(1) solves the problem... and I made a note about moving the processing. The "skip" is fine because each input file follows a hardcoded structure.

      As a matter of fact, after reading some snippets here, I re-wrote the code using HTML::TokeParser and it works perfectly and it's much simpler this way :-)

Re: HTML::Parser problem
by Peamasii (Sexton) on Mar 29, 2004 at 08:35 UTC

    The problem can be seen as soon in the output index.xml if you look for the line <Number>3002240</Number>. The element

    <Name>INSAVER</Name>

    should be

    <Name>INSAVER 150TOPP 10/13W 3023460</Name>

    Just as an added note, I am now re-writing this code using HTML::LinkExtractor and am not seeing the problem any longer. Very weird though, if anyone has any idea why, it would be interesting to know!

      Maybe I am over simplifying this, but all those modules and all that code looks a little overkill, if the HTML file mentioned is the only file. Something quick'n'dirty like this would do too (to start with, of course), right?

      $\="\n"; print "<Products>"; open FH, "<prodnr.html"; while(<FH>) { if($_ =~ /HREF="(.*pdf)".*(\d{7}) ([^<]+)/) { print "<Product>"; print "\t<Name>$3</Name>"; print "\t<PDF>$1</PDF>"; print "\t<Number>$2</Number>"; print "</Product>"; } } close FH; print "</Products>";
      --
      b10m

      All code is usually tested, but rarely trusted.

        You're not oversimplifying since using regex is hardly simpler to me, than using HTML:Parser or its brethren ;-)

        Point well taken though!