Re: Parsing generic XML

Since (according to one of your replies above) the xml input is "variable", you might be interested in the following, which I wrote a while back just to be able to summarize xml tag structures in a generic way.

I prefer "low level" XML modules like XML::Parser and XML::LibXML, because I find them easier to learn than the "refined sugar" approaches like XML::Simple and XML::Twig. I don't mind writing a few extra lines of code if it means I understand more quickly what the code is really doing.

If you need to go beyond simple summarization and actually update content, I think XML::LibXML would be the tool I'd prefer.
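To give a rough idea of what updating with XML::LibXML might look like, here is a minimal sketch. It is not part of the script below: the helper name set_attr_on, the sample XML and the "seen" attribute are made up for illustration, and it assumes XML::LibXML is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (illustration only): set attribute $attr to $value
# on every element matched by the XPath expression $xpath, and return
# the serialized root element.
sub set_attr_on {
    my ( $xml, $xpath, $attr, $value ) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    $_->setAttribute( $attr, $value ) for $doc->findnodes( $xpath );
    return $doc->documentElement->toString;
}

# Made-up sample input, loosely following the example in the POD below.
my $xml = '<root_node><level1 id="x"><level2_a/></level1><level1 id="y"/></root_node>';
print set_attr_on( $xml, '/root_node/level1', 'seen', '1' ), "\n";
```

The point is only that the whole tree is in memory, so you can locate nodes with XPath, change them, and write the document back out.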

#!/usr/bin/perl

use strict;
use XML::Parser;

my $Usage = "$0 [-r] [-a] [-b] file.xml\n";
my ( $add_root, $count_attribs, $discrete_count );
while ( @ARGV > 1 and $ARGV[0] =~ /^-([abr])$/ ) {
    if ( $1 eq 'r' ) {
        $add_root = shift;
    }
    elsif ( $1 eq 'a' ) {
        $count_attribs = shift;
    }
    else {
        $discrete_count = shift;
    }
}
die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

my %embedding;
my $key = '';
my %ehist;
my %ahist;

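my $p = XML::Parser->new( Handlers =>
    { Start => sub {
          my $newkey = "$key/$_[1]";
          # With -b, the first child seen inside an element cancels that
          # element's own count, so only childless elements are tallied.
          if ( $key and $discrete_count and !exists( $embedding{$key} )) {
              $embedding{$key}++;
              $ehist{$key}--;
          }
          $key = $newkey;
          $ehist{$key}++;
          if ( $count_attribs ) {
              # @_ is ( expat, element, attr, value, attr, value, ... )
              for ( my $i=2; $i<$#_; $i+=2 ) {
                  $ahist{$key}{$_[$i]}++;
              }
          }
      },
      End => sub {
          delete $embedding{$key} if ( $discrete_count );
          # \Q..\E: element names may contain regex metacharacters like "."
          $key =~ s{/\Q$_[1]\E\z}{};
      },
    } );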
if ( ! $add_root ) {
    $p->parsefile( $ARGV[0] );
}
else {
    my $xmlstr = "<STRUCT_HIST_ROOT_$$>\n";
    open( my $fh, '<:utf8', $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n";
    {
        local $/ = undef;    # slurp the whole file at once
        $xmlstr .= <$fh>;
    }
    close $fh;
    $xmlstr .= "</STRUCT_HIST_ROOT_$$>";
    $p->parse( $xmlstr );
}
for my $k ( sort keys %ehist ) {
    $_ = $k;
    if ( $add_root ) {
        s{/STRUCT_HIST_ROOT_$$}{};
        next unless /\S/;
    }
    next if ( $discrete_count and $ehist{$k} <= 0 );
    print "$ehist{$k}\t$_\n";
    if ( $count_attribs ) {
        print "\t$ahist{$k}{$_}\t\@$_\n" for ( sort keys %{$ahist{$k}} );
    }
}

=head1 NAME

xml-structure-hist

=head1 SYNOPSIS

 xml-structure-hist [-r] [-a] [-b] file.xml

  -r : have the program supply a root node tag
  -a : tabulate element attributes (only on raw element counts)
  -b : count only "bottom-level" paths (default: also count intermediate paths)

=head1 DESCRIPTION

For any given xml file, this tool will use a standard xml parser to
tabulate the structure of the tags and print (on STDOUT) a tally of
how many times each distinct structural element occurs in the file.

Use the "-r" option if the input file does not include its own "root"
xml tag (e.g. when multiple blocks of similar xml data are concatenated
without a wrapper tag being put around them).

For example, given an xml file with these contents:

 <root_node>
  <level1 id="x">
   <level2_a><level3 x="y">...</level3><level3>...</level3></level2_a>
   <level2_a><level3 x="z">...</level3><level3>...</level3></level2_a>
  </level1>
  <level1 id="y">
   <level2_a><level3 x="w"><level4>...</level4>...</level3></level2_a>
   <level2_b><level3 x="x">...</level3></level2_b>
  </level1>
  <level1 id="z">
   <level2_a>...</level2_a>
  </level1>
 </root_node>

the default output would be:

 1      /root_node
 3      /root_node/level1
 4      /root_node/level1/level2_a
 5      /root_node/level1/level2_a/level3
 1      /root_node/level1/level2_a/level3/level4
 1      /root_node/level1/level2_b
 1      /root_node/level1/level2_b/level3

With the "-a" option, the output would be:

 1      /root_node
 3      /root_node/level1
        3       @id
 4      /root_node/level1/level2_a
 5      /root_node/level1/level2_a/level3
        3       @x
 1      /root_node/level1/level2_a/level3/level4
 1      /root_node/level1/level2_b
 1      /root_node/level1/level2_b/level3
        1       @x

With the "-b" option, the output would be:

 1      /root_node/level1/level2_a
 4      /root_node/level1/level2_a/level3
 1      /root_node/level1/level2_a/level3/level4
 1      /root_node/level1/level2_b/level3

If the example lacked the "root_node" tags, you would use the "-r"
option, and the quantities reported for the "level*" tags would be the
same as above.

=head1 AUTHOR

David Graff <graff at ldc.upenn.edu>

=cut
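
If you just want to see the core idea without the option handling, it boils down to keeping track of the current element path in the Start and End handlers and counting each path as it is opened. Here is a stripped-down sketch, assuming XML::Parser is installed. Note that it keeps the path in an array instead of the string-and-regex approach used above, and that the helper name tally_paths and the tiny input are invented for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (illustration only): return a hash ref mapping each
# element path to the number of times it occurs in the XML string $xml.
sub tally_paths {
    my ($xml) = @_;
    my ( @stack, %count );
    my $p = XML::Parser->new(
        Handlers => {
            # $_[1] is the element name; push it and count the new path
            Start => sub { push @stack, $_[1]; $count{ '/' . join( '/', @stack ) }++ },
            # leaving an element: drop it from the current path
            End   => sub { pop @stack },
        }
    );
    $p->parse($xml);
    return \%count;
}

my $count = tally_paths('<a><b><c/></b><b/></a>');
print "$count->{$_}\t$_\n" for sort keys %$count;
```

This prints one tab-separated line per distinct path, in the same "count, then path" layout as the default output of the full script.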