tinman has asked for the wisdom of the Perl Monks concerning the following question:

My application uses XML::Parser to parse a large XML file. On the first pass, all I want to do is extract the character data between <name></name> tags. A <product_data></product_data> segment contains multiple <name> tags, and I was using a hash to find the unique combinations of <name> values.

Now, intermittently, one element value (PACKAGE_QUANTITY) is split into two pieces (PACK and AGE_QUANTITY). The problem occurs with seemingly random elements.
What on earth is going on?

This is the script I use to extract the tag relationships:

#!/usr/bin/perl -w
use strict;
use XML::Parser;

# initialize hash that will hold header info
my $parser = new XML::Parser(ErrorContext => 4,
                             Handlers => {Start => \&handle_start,
                                          End   => \&handle_end,
                                          Char  => \&handle_char});

my $counter = 0;
my $ms = 0;
my $name = 0;
my @tagdesc;
my $last_prodid;
my $last_prod = 0;
my %tags;
my @struct;

# parse the file whose name we specified as a command-line parameter
$parser->parsefile(shift);
&write_out;

my $odd_cnt = 1;
open(DUMP,">dump.txt") or die "No debug file can be opened";
foreach (@struct) {
    print DUMP;
    $odd_cnt++;
    if($odd_cnt == 3) { $odd_cnt = 1; print DUMP "\n"; }
}

sub handle_start {
    my $p = shift;
    my $el = shift;
    my %attribs = @_;
    if($el eq 'product_data') { $counter ++; $ms++; }
    if($el eq 'product_id') { $last_prod = 1; }
    if($el eq 'name') { $name = 1; }
}

sub handle_char {
    my ($p, $data) = @_;
    if($name) {
        push(@tagdesc, $data);
        if($data =~ m/age_quantity/i) {
            push(@struct, $last_prodid);
            push(@struct, $data);
        }
    }
    if($last_prod) { $last_prodid = $data; }
    # print $data,"\n" if $counter;
}

sub handle_end {
    my $p = shift;
    my $el = shift;
    my %atrribs = @_;
    my $not_written = 0;
    if($el eq 'product_data') {
        $counter --;
        $not_written = 1;
        if(($ms % 1000) == 0) {
            print "$ms...\n";
            # &write_out;
        }
    }
    if($el eq 'name') { $name = 0; }
    if($el eq 'product_id') { $last_prod = 0; }
    if($not_written) {
        my $str = join(':',@tagdesc);
        if($str =~ m/age_quantity/i) {
            my $a = 0;
            open(DUMPER1,">>dumper1.txt") or die "No dumper open";
            print DUMPER1 "$last_prodid : ";
            foreach my $element(@tagdesc) {
                print DUMPER1 "$a: $element ";
                $a++;
                if($element eq 'PACK') { print $last_prodid, "\n"; }
            }
            print DUMPER1 "\n";
            print DUMPER1 "$str \n";
            close DUMPER1;
        }
        @tagdesc = ();
        if(exists $tags{$str}) {
            my $cnt = $tags{$str};
            $cnt++;
            $tags{$str} = $cnt;
        }
        else { $tags{$str} = 1; }
        $str = undef;
        $not_written = 0;
    }
}

sub write_out {
    open(OUTPUT, ">tag.desc") or die "No open";
    foreach my $keyval(keys %tags) { print OUTPUT $keyval, "\n"; }
    close OUTPUT;
}

This is one of the offending product_data segments.

<product_data>
  <product_id>100000</product_id>
  <spec> <name>Star</name> <value>Donal McCann|Saskia Reeves|Ciaran Hinds|Patrick Malahide|Brenda Bruce</value> </spec>
  <spec> <name>Street Date</name> <value>970506</value> </spec>
  <spec> <name>Year Released</name> <value>94</value> </spec>
  <spec> <name>Run Time</name> <value>90 min</value> </spec>
  <spec> <name>Director</name> <value>Thaddeus O&apos;Sullivan</value> </spec>
  <spec> <name>Originally Released</name> <value>1993</value> </spec>
  <spec> <name>Rating</name> <value>Not Rated</value> </spec>
  <spec> <name>Items</name> <value>1</value> </spec>
  <spec> <name>MuzeID</name> <value>1060749</value> </spec>
  <spec> <name>Muze PRelRefNum</name> <value>1</value> </spec>
  <spec> <name>Categories</name> <value>Dramas, Love, Triangle, Romance, Drama|Dramas|Love Triangle|Romance|Drama</value> </spec>
  <spec> <name>Title</name> <value>December Bride</value> </spec>
  <spec> <name>Format</name> <value>VHS</value> </spec>
  <spec> <name>First Star</name> <value>Donal McCann</value> </spec>
  <spec> <name>RUNTIME</name> <value>0090</value> </spec>
  <spec> <name>STREET_DATE</name> <value>970506</value> </spec>
  <spec> <name>LAST_UPDATE</name> <value>990701</value> </spec>
  <spec> <name>ATTRIBUTES</name> <value>C</value> </spec>
  <spec> <name>YEAR_RELEASED</name> <value>94</value> </spec>
  <spec> <name>PACKAGE_QUANTITY</name> <value>1</value> </spec>
  <spec> <name>PREORDER_DATE</name> <value>970415</value> </spec>
  <spec> <name>MANUFACTURER_PARTNO</name> <value>1166</value> </spec>
  <spec> <name>UPC</name> <value>720917011660</value> </spec>
  <spec> <name>SUMMARY</name> <value>SASKIA REEVES</value> </spec>
  <spec> <name>GENRE</name> <value>DRAMA</value> </spec>
  <spec> <name>PREBOOK_DATE</name> <value>1997/04/15</value> </spec>
  <spec> <name>RELEASE_DATE</name> <value>1997/05/06</value> </spec>
  <spec> <name>ITEM_TYPE</name> <value>S</value> </spec>
  <spec> <name>STAR1</name> <value>SASKIA REEVES</value> </spec>
  <spec> <name>STAR2</name> <value>DONAL MC CANN</value> </spec>
  <spec> <name>SUBTITLE</name> <value>N</value> </spec>
  <spec> <name>COLORIZED</name> <value>N</value> </spec>
  <spec> <name>ISBN</name> <value>1572520205</value> </spec>
  <spec> <name>CLASS_CODE</name> <value>11120</value> </spec>
  <spec> <name>SETUP_DATE</name> <value>1995/07/17</value> </spec>
  <spec> <name>LAST_MODIFY</name> <value>1997/11/25</value> </spec>
  <spec> <name>ITEM_NO</name> <value>FLV 1166V</value> </spec>
</product_data>

I apologize for the long code segment and the equally long sample data....

What I found: when I parsed the entire file, it always failed on one particular element. I then removed the first element of the data (because the offending element looked identical to all the others), and the error disappeared. When I put the first element back and removed the second, the error appeared elsewhere.

I'm at a loss to explain what is going on here. Any comments on how to clean up the code above are gratefully accepted too.
Thanks

Replies are listed 'Best First'.
Re: XML::Parser, hashes and lists problem
by tinman (Curate) on Apr 14, 2001 at 03:02 UTC
    Update: several hours later, I think it might be an XML::Parser (perhaps expat) bug. I'm running ActivePerl build 623, based on Perl 5.6, and my XML::Parser version is 2.27 (yes, not the most recent version, I know).

    Reason: I print the $data passed to the handle_char subroutine, and for some elements it randomly contains only a fragment such as AGE_QUANTITY or KAGE_QUANTITY.

    It would be nice to know whether anyone else has encountered this behaviour and/or whether it's a known bug.
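
    One thing worth ruling out before blaming the parser: the XML::Parser documentation notes that a single run of character data may generate more than one call to the Char handler, so each call is not guaranteed to carry a complete value. A minimal sketch of the usual workaround, assuming that is what produces the PACK / AGE_QUANTITY fragments: append to a buffer in the Char handler and only treat the value as complete in the End handler. The variable names and the inline sample string are made up for illustration and are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $in_name = 0;     # true while inside <name>...</name>
my $buffer  = '';    # text collected so far for the current <name>
my @names;           # completed <name> values, in document order

sub handle_start {
    my ($p, $el) = @_;
    if ($el eq 'name') {
        $in_name = 1;
        $buffer  = '';          # start a fresh buffer for this element
    }
}

sub handle_char {
    my ($p, $data) = @_;
    # This may be called several times for one text node,
    # so append instead of pushing each piece separately.
    $buffer .= $data if $in_name;
}

sub handle_end {
    my ($p, $el) = @_;
    if ($el eq 'name') {
        push @names, $buffer;   # the value is only complete here
        $in_name = 0;
    }
}

my $parser = XML::Parser->new(
    Handlers => {
        Start => \&handle_start,
        Char  => \&handle_char,
        End   => \&handle_end,
    },
);

# Hypothetical sample, cut down from the data in the question.
my $xml = '<product_data>'
        . '<spec><name>PACKAGE_QUANTITY</name><value>1</value></spec>'
        . '<spec><name>PREORDER_DATE</name><value>970415</value></spec>'
        . '</product_data>';

$parser->parse($xml);

print "$_\n" for @names;
```

    With this shape, however the parser chooses to chunk the text, each <name> contributes exactly one entry to @names.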

Re: XML::Parser, hashes and lists problem
by aardvark (Pilgrim) on Apr 14, 2001 at 20:46 UTC
    It would be helpful to see the error message, then it would be easier to know what the parser is thinking and where/why it is failing.

    Also, I'm a little confused where you say "one element (called PACKAGE_QUANTITY) is split into two entities (PACK and AGE_QUANTITY)". Do you mean it is split in the hash key or in the printout? You may want to look at what is going on around the line if($data =~ m/age_quantity/i). Why are you matching on "age_quantity"? Is this a typo or some debugging device?

    What you are calling an element, I think of as an element value. It might be useful to think about how you are structuring your XML. Right now you have tons of structures like this:

    <spec> <name>PACKAGE_QUANTITY</name> <value>1</value> </spec> <spec> <name>PREORDER_DATE</name> <value>970415</value> </spec>
    It might be useful to try structuring the data like this:
    <spec> <preorder_date>970415</preorder_date> <package_quantity>1</package_quantity> </spec>
    Then you can really use the power of a parser to find certain elements or element values, instead of using regexes. I suspect your problem has more to do with regexes, your counter variables or your control structures than with a parser bug.
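
    To illustrate the restructuring idea, here is a rough sketch, assuming the data were reshaped so that each field is its own element inside <spec>. The handlers key a hash on the element name, so no regex is needed to find a particular field. The hash name %spec and the inline sample are hypothetical, and text is appended in the Char handler because one text node can arrive in several pieces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my %spec;        # element name => text value
my $current;     # name of the field element being read, if any

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $el) = @_;
            return if $el eq 'spec';    # container element, not a field
            $current   = $el;
            $spec{$el} = '';
        },
        Char => sub {
            my ($p, $data) = @_;
            # Append: the text may be delivered in more than one call.
            $spec{$current} .= $data if defined $current;
        },
        End => sub {
            my ($p, $el) = @_;
            undef $current;             # text after this is not field data
        },
    },
);

# Hypothetical sample in the restructured layout.
$parser->parse('<spec><preorder_date>970415</preorder_date>'
             . '<package_quantity>1</package_quantity></spec>');

print "package_quantity = $spec{package_quantity}\n";
```

    Looking up a field then becomes a plain hash access such as $spec{package_quantity}.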

    Get Strong Together!!