My application uses XML::Parser to parse a large XML file. On the first pass, all I want to do is extract the character data between <name></name> tags. A <product_data></product_data> segment contains multiple <name> tags, and I am using a hash to find the unique combinations of <name> values.

Now, intermittently, one name value (PACKAGE_QUANTITY) is split into two pieces (PACK and AGE_QUANTITY). The problem occurs with seemingly random elements.
What on earth is going on?

This is the script I use to extract the tag relationships:

#!/usr/bin/perl -w
use strict;
use XML::Parser;

# initialize hash that will hold header info
my $parser = new XML::Parser(ErrorContext => 4,
                             Handlers => {Start => \&handle_start,
                                          End   => \&handle_end,
                                          Char  => \&handle_char});
my $counter = 0;
my $ms = 0;
my $name = 0;
my @tagdesc;
my $last_prodid;
my $last_prod = 0;
my %tags;
my @struct;

# parse the file whose name we specified as a command-line parameter
$parser->parsefile(shift);
&write_out;

my $odd_cnt = 1;
open(DUMP,">dump.txt") or die "No debug file can be opened";
foreach (@struct) {
    print DUMP;
    $odd_cnt++;
    if($odd_cnt == 3) {
        $odd_cnt = 1;
        print DUMP "\n";
    }
}

sub handle_start {
    my $p = shift;
    my $el = shift;
    my %attribs = @_;
    if($el eq 'product_data') { $counter++; $ms++; }
    if($el eq 'product_id') { $last_prod = 1; }
    if($el eq 'name') { $name = 1; }
}

sub handle_char {
    my ($p, $data) = @_;
    if($name) {
        push(@tagdesc, $data);
        if($data =~ m/age_quantity/i) {
            push(@struct, $last_prodid);
            push(@struct, $data);
        }
    }
    if($last_prod) { $last_prodid = $data; }
    # print $data,"\n" if $counter;
}

sub handle_end {
    my $p = shift;
    my $el = shift;
    my %atrribs = @_;
    my $not_written = 0;
    if($el eq 'product_data') {
        $counter--;
        $not_written = 1;
        if(($ms % 1000) == 0) {
            print "$ms...\n";
            # &write_out;
        }
    }
    if($el eq 'name') { $name = 0; }
    if($el eq 'product_id') { $last_prod = 0; }
    if($not_written) {
        my $str = join(':',@tagdesc);
        if($str =~ m/age_quantity/i) {
            my $a = 0;
            open(DUMPER1,">>dumper1.txt") or die "No dumper open";
            print DUMPER1 "$last_prodid : ";
            foreach my $element(@tagdesc) {
                print DUMPER1 "$a: $element ";
                $a++;
                if($element eq 'PACK') { print $last_prodid, "\n"; }
            }
            print DUMPER1 "\n";
            print DUMPER1 "$str \n";
            close DUMPER1;
        }
        @tagdesc = ();
        if(exists $tags{$str}) {
            my $cnt = $tags{$str};
            $cnt++;
            $tags{$str} = $cnt;
        } else {
            $tags{$str} = 1;
        }
        $str = undef;
        $not_written = 0;
    }
}

sub write_out {
    open(OUTPUT, ">tag.desc") or die "No open";
    foreach my $keyval(keys %tags) {
        print OUTPUT $keyval, "\n";
    }
    close OUTPUT;
}

This is one of the offending product_data segments.

<product_data>
    <product_id>100000</product_id>
    <spec> <name>Star</name> <value>Donal McCann|Saskia Reeves|Ciaran Hinds|Patrick Malahide|Brenda Bruce</value> </spec>
    <spec> <name>Street Date</name> <value>970506</value> </spec>
    <spec> <name>Year Released</name> <value>94</value> </spec>
    <spec> <name>Run Time</name> <value>90 min</value> </spec>
    <spec> <name>Director</name> <value>Thaddeus O&apos;Sullivan</value> </spec>
    <spec> <name>Originally Released</name> <value>1993</value> </spec>
    <spec> <name>Rating</name> <value>Not Rated</value> </spec>
    <spec> <name>Items</name> <value>1</value> </spec>
    <spec> <name>MuzeID</name> <value>1060749</value> </spec>
    <spec> <name>Muze PRelRefNum</name> <value>1</value> </spec>
    <spec> <name>Categories</name> <value>Dramas, Love, Triangle, Romance, Drama|Dramas|Love Triangle|Romance|Drama</value> </spec>
    <spec> <name>Title</name> <value>December Bride</value> </spec>
    <spec> <name>Format</name> <value>VHS</value> </spec>
    <spec> <name>First Star</name> <value>Donal McCann</value> </spec>
    <spec> <name>RUNTIME</name> <value>0090</value> </spec>
    <spec> <name>STREET_DATE</name> <value>970506</value> </spec>
    <spec> <name>LAST_UPDATE</name> <value>990701</value> </spec>
    <spec> <name>ATTRIBUTES</name> <value>C</value> </spec>
    <spec> <name>YEAR_RELEASED</name> <value>94</value> </spec>
    <spec> <name>PACKAGE_QUANTITY</name> <value>1</value> </spec>
    <spec> <name>PREORDER_DATE</name> <value>970415</value> </spec>
    <spec> <name>MANUFACTURER_PARTNO</name> <value>1166</value> </spec>
    <spec> <name>UPC</name> <value>720917011660</value> </spec>
    <spec> <name>SUMMARY</name> <value>SASKIA REEVES</value> </spec>
    <spec> <name>GENRE</name> <value>DRAMA</value> </spec>
    <spec> <name>PREBOOK_DATE</name> <value>1997/04/15</value> </spec>
    <spec> <name>RELEASE_DATE</name> <value>1997/05/06</value> </spec>
    <spec> <name>ITEM_TYPE</name> <value>S</value> </spec>
    <spec> <name>STAR1</name> <value>SASKIA REEVES</value> </spec>
    <spec> <name>STAR2</name> <value>DONAL MC CANN</value> </spec>
    <spec> <name>SUBTITLE</name> <value>N</value> </spec>
    <spec> <name>COLORIZED</name> <value>N</value> </spec>
    <spec> <name>ISBN</name> <value>1572520205</value> </spec>
    <spec> <name>CLASS_CODE</name> <value>11120</value> </spec>
    <spec> <name>SETUP_DATE</name> <value>1995/07/17</value> </spec>
    <spec> <name>LAST_MODIFY</name> <value>1997/11/25</value> </spec>
    <spec> <name>ITEM_NO</name> <value>FLV 1166V</value> </spec>
</product_data>

I apologize for the long code segment and the equally long sample data.

My findings: when I parsed the entire file, it always failed on one particular element. I then removed the first element of the data (because the offending element seemed to be identical to all the others), and the error disappeared. When I put the first element back and removed the second, the error appeared elsewhere.

I am at a loss to explain what is going on here. Any comments on how to clean up the code above are gratefully accepted too.
Thanks
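
One lead worth testing, offered as an assumption rather than a confirmed diagnosis: the XML::Parser documentation notes that a single run of character data may generate more than one call to the Char handler. If that is what happens here, pushing each $data chunk straight onto a list would produce exactly the PACK / AGE_QUANTITY split. The sketch below buffers the text between the Start and End events of <name> and only records it once the element closes. The variable names and the two-chunk input (fed through parse_start and parse_more to imitate a buffer boundary falling inside a name) are illustrative, not taken from the real data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my %tags;          # unique name combinations => number of occurrences
my @tagdesc;       # completed <name> values of the current <product_data>
my $in_name = 0;   # true while we are inside <name>...</name>
my $buffer  = '';  # text collected so far for the current <name>

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $el) = @_;
            if ($el eq 'name') { $in_name = 1; $buffer = ''; }
        },
        Char => sub {
            my ($p, $data) = @_;
            # May be called several times per text run, so append, never push.
            $buffer .= $data if $in_name;
        },
        End => sub {
            my ($p, $el) = @_;
            if ($el eq 'name') {
                push @tagdesc, $buffer;   # the name is complete only now
                $in_name = 0;
            }
            elsif ($el eq 'product_data') {
                $tags{ join ':', @tagdesc }++;
                @tagdesc = ();
            }
        },
    },
);

# Hypothetical input, deliberately cut in the middle of a name.
my $nb = $parser->parse_start;
$nb->parse_more('<product_data><spec><name>Format</name><value>VHS</value></spec>'
              . '<spec><name>PACK');
$nb->parse_more('AGE_QUANTITY</name><value>1</value></spec></product_data>');
$nb->parse_done;

print "$_\n" for sort keys %tags;
```

Because the name is assembled before it is stored, the result does not depend on how many pieces the parser delivers. The $tags{...}++ line also replaces the exists/else counting block in the original script, since incrementing a missing hash entry starts it at 1.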


In reply to XML::Parser, hashes and lists problem by tinman
