My application uses XML::Parser to parse a large XML file. On the first pass, all I want to do is extract the character data between <name></name> tags. A <product_data></product_data> segment contains multiple <name> tags, and I am using a hash to find the unique combinations of <name> values.

Now, intermittently, the text of one element (PACKAGE_QUANTITY) is split into two pieces (PACK and AGE_QUANTITY). The problem occurs with seemingly random elements.
What on earth is going on?

This is the script I use to extract the tag relationships:

#!/usr/bin/perl -w
use strict;
use XML::Parser;

# create the parser and register the event handlers
my $parser = XML::Parser->new(
    ErrorContext => 4,
    Handlers     => {
        Start => \&handle_start,
        End   => \&handle_end,
        Char  => \&handle_char,
    },
);
my $counter =0;
my $ms = 0;
my $name = 0;
my @tagdesc;
my $last_prodid;
my $last_prod = 0;
my %tags;
my @struct;
# parse the file whose name we specified as a command-line parameter
$parser->parsefile(shift);
&write_out;

my $odd_cnt = 1;
open(DUMP,">dump.txt") or die "No debug file can be opened";
foreach (@struct) {
    print DUMP;
    $odd_cnt++;
    if ($odd_cnt == 3) { $odd_cnt = 1; print DUMP "\n"; }
}

sub handle_start {
    my $p = shift;
    my $el = shift;
    my %attribs = @_;
   if($el eq 'product_data') { $counter ++; $ms++;}
   if($el eq 'product_id') { $last_prod = 1; }
   if($el eq 'name') { $name = 1; }
}

sub handle_char {
    my ($p, $data) = @_;
    if ($name) {
        push(@tagdesc, $data);
        if ($data =~ m/age_quantity/i) { push(@struct, $last_prodid); push(@struct, $data); }
    }
    if($last_prod) { $last_prodid = $data; }
    # print $data,"\n" if $counter;
}

sub handle_end {
    # End handlers receive only the parser object and the element name
    my ($p, $el) = @_;
    my $not_written = 0;
    if ($el eq 'product_data') {
        $counter--;
        $not_written = 1;
        if (($ms % 1000) == 0) { print "$ms...\n"; }    # &write_out;
    }
    if($el eq 'name') { $name = 0;  }
    if($el eq 'product_id') { $last_prod = 0; }
    if($not_written) {
        my $str = join(':',@tagdesc);
        if($str =~ m/age_quantity/i) { 
            my $a = 0;
            open(DUMPER1,">>dumper1.txt") or die "No dumper open"; 
            print DUMPER1 "$last_prodid : ";
            foreach my $element (@tagdesc) {
                print DUMPER1 "$a: $element ";
                $a++;
                if ($element eq 'PACK') { print $last_prodid, "\n"; }
            }
            print DUMPER1 "\n";
            print DUMPER1 "$str \n";
            close DUMPER1;
        }
        @tagdesc = ();
        $tags{$str}++;    # count how often this combination of names occurs
        $str = undef;
        $not_written = 0;
    }
}

sub write_out {
    open(OUTPUT, ">tag.desc") or die "No open";
    foreach my $keyval(keys %tags) { print OUTPUT $keyval, "\n"; }
    close OUTPUT;
}

This is one of the offending product_data segments.

<product_data>
  <product_id>100000</product_id>
  <spec>
    <name>Star</name>
    <value>Donal McCann|Saskia Reeves|Ciaran Hinds|Patrick Malahide|Brenda Bruce</value>
  </spec>
  <spec>
    <name>Street Date</name>
    <value>970506</value>
  </spec>
  <spec>
    <name>Year Released</name>
    <value>94</value>
  </spec>
  <spec>
    <name>Run Time</name>
    <value>90 min</value>
  </spec>
  <spec>
    <name>Director</name>
    <value>Thaddeus O&apos;Sullivan</value>
  </spec>
  <spec>
    <name>Originally Released</name>
    <value>1993</value>
  </spec>
  <spec>
    <name>Rating</name>
    <value>Not Rated</value>
  </spec>
  <spec>
    <name>Items</name>
    <value>1</value>
  </spec>
  <spec>
    <name>MuzeID</name>
    <value>1060749</value>
  </spec>
  <spec>
    <name>Muze PRelRefNum</name>
    <value>1</value>
  </spec>
  <spec>
    <name>Categories</name>
    <value>Dramas, Love, Triangle, Romance, Drama|Dramas|Love Triangle|Romance|Drama</value>
  </spec>
  <spec>
    <name>Title</name>
    <value>December Bride</value>
  </spec>
  <spec>
    <name>Format</name>
    <value>VHS</value>
  </spec>
  <spec>
    <name>First Star</name>
    <value>Donal McCann</value>
  </spec>
  <spec>
    <name>RUNTIME</name>
    <value>0090</value>
  </spec>
  <spec>
    <name>STREET_DATE</name>
    <value>970506</value>
  </spec>
  <spec>
    <name>LAST_UPDATE</name>
    <value>990701</value>
  </spec>
  <spec>
    <name>ATTRIBUTES</name>
    <value>C</value>
  </spec>
  <spec>
    <name>YEAR_RELEASED</name>
    <value>94</value>
  </spec>
  <spec>
    <name>PACKAGE_QUANTITY</name>
    <value>1</value>
  </spec>
  <spec>
    <name>PREORDER_DATE</name>
    <value>970415</value>
  </spec>
  <spec>
    <name>MANUFACTURER_PARTNO</name>
    <value>1166</value>
  </spec>
  <spec>
    <name>UPC</name>
    <value>720917011660</value>
  </spec>
  <spec>
    <name>SUMMARY</name>
    <value>SASKIA REEVES</value>
  </spec>
  <spec>
    <name>GENRE</name>
    <value>DRAMA</value>
  </spec>
  <spec>
    <name>PREBOOK_DATE</name>
    <value>1997/04/15</value>
  </spec>
  <spec>
    <name>RELEASE_DATE</name>
    <value>1997/05/06</value>
  </spec>
  <spec>
    <name>ITEM_TYPE</name>
    <value>S</value>
  </spec>
  <spec>
    <name>STAR1</name>
    <value>SASKIA REEVES</value>
  </spec>
  <spec>
    <name>STAR2</name>
    <value>DONAL MC CANN</value>
  </spec>
  <spec>
    <name>SUBTITLE</name>
    <value>N</value>
  </spec>
  <spec>
    <name>COLORIZED</name>
    <value>N</value>
  </spec>
  <spec>
    <name>ISBN</name>
    <value>1572520205</value>
  </spec>
  <spec>
    <name>CLASS_CODE</name>
    <value>11120</value>
  </spec>
  <spec>
    <name>SETUP_DATE</name>
    <value>1995/07/17</value>
  </spec>
  <spec>
    <name>LAST_MODIFY</name>
    <value>1997/11/25</value>
  </spec>
  <spec>
    <name>ITEM_NO</name>
    <value>FLV 1166V</value>
  </spec>
</product_data>

I apologize for the long code segment and the equally long sample data.

My findings: when I parsed the entire file, it always failed on one particular element. I then removed the first element of the data (because the offending element seemed to be identical to all the others), and the error disappeared. When I put the first element back and removed the second, the error appeared elsewhere.

I am at a loss to explain what is going on here. Any comments on how to clean up the code above are gratefully accepted too.
Thanks
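
One possible explanation, offered as an assumption rather than a diagnosis of this particular file: the XML::Parser documentation notes that a single run of character data may generate multiple calls to the Char handler. If that is what happens here, a handler that pushes every chunk onto a list would see PACK and AGE_QUANTITY as separate entries, and shifting the data would move the split elsewhere. A minimal sketch of the usual workaround is to append chunks to a buffer and use the text only when the End event arrives. The helper name collect_names and the inline sample string are hypothetical and not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (not from the original script): returns the text of
# every <name> element, buffering character data until the closing tag.
sub collect_names {
    my ($xml) = @_;
    my @names;           # completed <name> values
    my $in_name = 0;     # true while inside <name>...</name>
    my $buffer  = '';    # pieces of the current <name> seen so far

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($p, $el) = @_;
                if ($el eq 'name') { $in_name = 1; $buffer = ''; }
            },
            # May be called several times for one element: append, don't push.
            Char => sub {
                my ($p, $data) = @_;
                $buffer .= $data if $in_name;
            },
            # Only at the end tag is the element's text known to be complete.
            End => sub {
                my ($p, $el) = @_;
                if ($el eq 'name') { push @names, $buffer; $in_name = 0; }
            },
        },
    );
    $parser->parse($xml);    # parse() accepts a string as well as a filehandle
    return \@names;
}

my $names = collect_names(
    '<product_data><spec><name>PACKAGE_QUANTITY</name></spec>'
  . '<spec><name>O&apos;Sullivan</name></spec></product_data>'
);
print join(':', @$names), "\n";
```

The second sample name contains an entity reference, which is another place where the text can arrive in several pieces. Because the buffer is reset in the Start handler and read in the End handler, the result should not depend on how many Char calls the parser makes in between.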

In reply to XML::Parser, hashes and lists problem by tinman
