You might want to check out this little snippet I posted here a few months ago: Get a structured tally of XML tags. I'll admit it's a tad gnarly as a command-line one-liner, and it only works in a Bourne-like shell such as bash.

Luckily, since then I have refined it into a real script with POD, command-line options and error checking:

#!/usr/bin/env perl

=head1 NAME

xml-structure-hist

=head1 SYNOPSIS

 xml-structure-hist [-r] [-b] file.xml

  -r : have the program supply a root node tag
  -b : show break-downs of element paths (def: raw element counts)

=head1 DESCRIPTION

For any given xml file, this tool will use a standard xml parser to
tabulate the structure of the tags and print (on STDOUT) a tally of
how many times each distinct structural element occurs in the file.

Use the "-r" option if the input file does not include its own "root"
xml tag (e.g. this is typical of Gigaword-style text files, which are
just a concatenation of "<DOC>" elements, with no initial "root" tag
containing all the DOCs).

For example, given an xml file with these contents:

 <root_node>
  <level1 id="x">
   <level2_a><level3>...</level3><level3>...</level3></level2_a>
   <level2_a><level3>...</level3><level3>...</level3></level2_a>
  </level1>
  <level1 id="y">
   <level2_a><level3><level4>...</level4>...</level3></level2_a>
   <level2_b><level3>...</level3></level2_b>
  </level1>
  <level1 id="z">
   <level2_a>...</level2_a>
  </level1>
 </root_node>

the default output would be:

 1      .root_node
 3      .root_node.level1
 4      .root_node.level1.level2_a
 5      .root_node.level1.level2_a.level3
 1      .root_node.level1.level2_a.level3.level4
 1      .root_node.level1.level2_b
 1      .root_node.level1.level2_b.level3

With the "-b" option, the output would be:

 1  .root_node.level1.level2_a
 4  .root_node.level1.level2_a.level3
 1  .root_node.level1.level2_a.level3.level4
 1  .root_node.level1.level2_b.level3

If the example lacked the "root_node" tags, you would use the "-r"
option, and the quantities reported for the "level*" tags would be the
same as above.

=head1 AUTHOR

David Graff <graff at ldc.upenn.edu>

=cut

use strict;
use XML::Parser;

my $Usage = "$0 [-r] [-b] file.xml\n";
my ( $add_root, $discrete_count );
while ( @ARGV > 1 and $ARGV[0] =~ /^-([rb])$/ ) {
    if ( $1 eq 'r' ) {
        $add_root = shift;
    } else {
        $discrete_count = shift;
    }
}
die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

my $counter = 0;
my %embedding;
my $key = '';
my %hist;

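# Track the current element path in $key (e.g. ".root_node.level1") and
# count each distinct path in %hist as its start tag is seen.
my $p = XML::Parser->new( Handlers => {
    Start => sub {
        my $newkey = "$key.$_[1]";
        # With -b, the first child seen inside a parent instance takes
        # that instance out of the parent's own tally.
        if ( $key and $discrete_count and !exists( $embedding{$key} )) {
            $embedding{$key}++;
            $hist{$key}--;
            $counter++;
        }
        $key = $newkey;
        $hist{$key}++;
    },
    End => sub {
        delete $embedding{$key} if ( $discrete_count );
        $key =~ s/\.$_[1]$//;    # pop the closing element off the path
    },
} );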
if ( ! $add_root ) {
    $p->parsefile( $ARGV[0] );
}
else {
    my $xmlstr = "<STRUCT_HIST_ROOT_$$>\n";
    open( X, '<:utf8', $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n";
    {
        local $/ = undef;    # slurp mode, restored when the block exits
        $xmlstr .= <X>;
    }
    close X;
    $xmlstr .= "</STRUCT_HIST_ROOT_$$>";
    $p->parse( $xmlstr );
}
for my $k ( sort keys %hist ) {
    $_ = $k;
    if ( $add_root ) {
        s/^\.STRUCT_HIST_ROOT_$$//;
        next unless /\S/;
    }
    print "$hist{$k}\t$_\n" unless ( $discrete_count and $hist{$k} <= 0 );
}

That probably isn't exactly what you're looking for, but it should give you some ideas on how to get what you want.
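
If all you need is the plain list of tag names, the same XML::Parser Start handler can be cut down to something like the sketch below. This is only an illustration under a few assumptions: `tag_names` is a name I made up for this example (it is not part of xml-structure-hist), the XML is passed in as a string rather than read from a file, and the sample data is invented.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Parser;

# Return the distinct element names found in an XML string, sorted.
# (tag_names is a hypothetical helper written for this example.)
sub tag_names {
    my ($xml) = @_;
    my %seen;
    my $parser = XML::Parser->new(
        Handlers => {
            # Start handlers are called with ( $expat, $element_name, %attributes )
            Start => sub { $seen{ $_[1] }++ },
        },
    );
    $parser->parse($xml);    # dies if the input is not well-formed
    return sort keys %seen;
}

# Invented sample data, loosely modelled on the POD example above.
my $sample = '<root_node><level1 id="x"><level2_a/></level1><level1 id="y"/></root_node>';
print "$_\n" for tag_names($sample);
```

For a file on disk you would presumably call `parsefile` instead of `parse`, the way the full script does. Since `%seen` also holds the per-name counts, printing them alongside the names gives you a flat (unstructured) tally.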

In reply to Re: Retrieving a List of XML Tag Names from Given File by graff
in thread Retrieving a List of XML Tag Names from Given File by tracekill