Re: Retrieving a List of XML Tag Names from Given File

You might want to check out a little snippet I posted here a few months ago, "Get a structured tally of XML tags", although I'll admit it's a tad gnarly as a command-line one-liner (and suitable only for a Bourne-like shell, such as bash).

Luckily, since then I have refined it into a real script with POD, command-line options and error checking:

#!/usr/bin/env perl

=head1 NAME

xml-structure-hist

=head1 SYNOPSIS

 xml-structure-hist [-r] [-b] file.xml

  -r : have the program supply a root node tag
  -b : show break-downs of element paths (def: raw element counts)

=head1 DESCRIPTION

For any given xml file, this tool will use a standard xml parser to
tabulate the structure of the tags and print (on STDOUT) a tally of
how many times each distinct structural element occurs in the file.

Use the "-r" option if the input file does not include its own "root"
xml tag (e.g. this is typical of Gigaword-style text files, which are
just a concatenation of "<DOC>" elements, with no initial "root" tag
containing all the DOCs).

For example, given an xml file with these contents:

 <root_node>
  <level1 id="x">
   <level2_a><level3>...</level3><level3>...</level3></level2_a>
   <level2_a><level3>...</level3><level3>...</level3></level2_a>
  </level1>
  <level1 id="y">
   <level2_a><level3><level4>...</level4>...</level3></level2_a>
   <level2_b><level3>...</level3></level2_b>
  </level1>
  <level1 id="z">
   <level2_a>...</level2_a>
  </level1>
 </root_node>

the default output would be:

 1      .root_node
 3      .root_node.level1
 4      .root_node.level1.level2_a
 5      .root_node.level1.level2_a.level3
 1      .root_node.level1.level2_a.level3.level4
 1      .root_node.level1.level2_b
 1      .root_node.level1.level2_b.level3

With the "-b" option, only elements that have no child elements are
counted, so the output would be:

 1  .root_node.level1.level2_a
 4  .root_node.level1.level2_a.level3
 1  .root_node.level1.level2_a.level3.level4
 1  .root_node.level1.level2_b.level3

If the example lacked the "root_node" tags, you would use the "-r"
option, and the quantities reported for the "level*" tags would be the
same as above.

=head1 AUTHOR

David Graff <graff at ldc.upenn.edu>

=cut

use strict;
use XML::Parser;

my $Usage = "$0 [-r] [-b] file.xml\n";
my ( $add_root, $discrete_count );
while ( @ARGV > 1 and $ARGV[0] =~ /^-([rb])$/ ) {
    if ( $1 eq 'r' ) {
        $add_root = shift;
    } else {
        $discrete_count = shift;
    }
}
die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

my $counter = 0;
my %embedding;
my $key = '';
my %hist;

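my $p = XML::Parser->new(
    Handlers => {
        # On each start tag, extend the current path and count it.
        Start => sub {
            my $newkey = "$key.$_[1]";
            # With -b, the first child seen under a parent shows that the
            # parent is not a leaf, so take back the count it was given.
            if ( $key and $discrete_count and !exists( $embedding{$key} ) ) {
                $embedding{$key}++;
                $hist{$key}--;
                $counter++;
            }
            $key = $newkey;
            $hist{$key}++;
        },
        # On each end tag, drop the last component from the current path.
        End => sub {
            my ( undef, $tag ) = @_;
            delete $embedding{$key} if ( $discrete_count );
            $key =~ s/\.\Q$tag\E$//;
        },
    }
);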
if ( ! $add_root ) {
    $p->parsefile( $ARGV[0] );
}
else {
    my $xmlstr = "<STRUCT_HIST_ROOT_$$>\n";
    open( my $fh, '<:utf8', $ARGV[0] )
        or die "Unable to read $ARGV[0]: $!\n";
    {
        local $/ = undef;    # slurp mode, restored when this block ends
        $xmlstr .= <$fh>;
    }
    close $fh;
    $xmlstr .= "</STRUCT_HIST_ROOT_$$>";
    $p->parse( $xmlstr );
}
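# Print one "count<TAB>path" line per distinct path, sorted by path.
my $fake_root = "STRUCT_HIST_ROOT_$$";    # name of the wrapper added for -r
for my $k ( sort keys %hist ) {
    my $path = $k;
    if ( $add_root ) {
        # Hide the synthetic root from the reported paths.
        $path =~ s/^\.\Q$fake_root\E//;
        next unless $path =~ /\S/;
    }
    # With -b, paths that never occur without children end up at 0; skip them.
    print "$hist{$k}\t$path\n" unless ( $discrete_count and $hist{$k} <= 0 );
}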

That probably isn't exactly what you're looking for, but it should give you some ideas on how to get what you want.
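
If all you need is the flat list of distinct tag names, the same XML::Parser approach boils down to a few lines. This is only a minimal sketch: tag_names and $sample are names I made up for illustration, and it assumes the input is well-formed XML with a single root element.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Parser;

# Return the distinct element names found in an XML string, sorted.
# (tag_names is a hypothetical helper, not part of the script above.)
sub tag_names {
    my ($xml) = @_;
    my %seen;
    my $parser = XML::Parser->new(
        Handlers => {
            # Start handlers receive ( $expat, $element, %attributes ).
            Start => sub { $seen{ $_[1] }++ },
        },
    );
    $parser->parse($xml);
    return sort keys %seen;
}

# Made-up sample input, just to show the call.
my $sample = '<root_node><level1 id="x"><level2_a/></level1>'
           . '<level1 id="y"/></root_node>';
print "$_\n" for tag_names($sample);
```

To read from a file instead of a string, calling $parser->parsefile( $filename ) in place of parse should do the same job.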
