You might want to check out a little snippet I posted here a few months ago, Get a structured tally of XML tags, although I'll admit it's a tad gnarly as a command-line one-liner (suitable only for use in a Bourne-like shell, such as bash).

Luckily, since then I have refined it into a real script with POD, command-line options and error checking:

#!/usr/bin/env perl

=head1 NAME

xml-structure-hist

=head1 SYNOPSIS

xml-structure-hist [-r] [-b] file.xml

  -r : have the program supply a root node tag
  -b : show break-downs of element paths (def: raw element counts)

=head1 DESCRIPTION

For any given xml file, this tool will use a standard xml parser to
tabulate the structure of the tags and print (on STDOUT) a tally of
how many times each distinct structural element occurs in the file.

Use the "-r" option if the input file does not include its own "root"
xml tag (e.g. this is typical of Gigaword-style text files, which are
just a concatenation of "<DOC>" elements, with no initial "root" tag
containing all the DOCs).

For example, given an xml file with these contents:

  <root_node>
   <level1 id="x">
    <level2_a><level3>...</level3><level3>...</level3></level2_a>
    <level2_a><level3>...</level3><level3>...</level3></level2_a>
   </level1>
   <level1 id="y">
    <level2_a><level3><level4>...</level4>...</level3></level2_a>
    <level2_b><level3>...</level3></level2_b>
   </level1>
   <level1 id="z">
    <level2_a>...</level2_a>
   </level1>
  </root_node>

the default output would be:

  1  .root_node
  3  .root_node.level1
  4  .root_node.level1.level2_a
  5  .root_node.level1.level2_a.level3
  1  .root_node.level1.level2_a.level3.level4
  1  .root_node.level1.level2_b
  1  .root_node.level1.level2_b.level3

With the "-b" option, the output would be:

  1  .root_node.level1.level2_a
  4  .root_node.level1.level2_a.level3
  1  .root_node.level1.level2_a.level3.level4
  1  .root_node.level1.level2_b.level3

If the example lacked the "root_node" tags, you would use the "-r"
option, and the quantities reported for the "level*" tags would be the
same as above.

=head1 AUTHOR

David Graff <graff at ldc.upenn.edu>

=cut

use strict;
use XML::Parser;

my $Usage = "$0 [-r] [-b] file.xml\n";

# Consume leading option flags; the flag string itself serves as a true value.
my ( $add_root, $discrete_count );
while ( @ARGV > 1 and $ARGV[0] =~ /^-([rb])/ ) {
    if ( $1 eq 'r' ) {
        $add_root = shift;
    }
    else {
        $discrete_count = shift;
    }
}
die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

my $counter = 0;
my %embedding;    # paths of open elements already seen to contain a child
my $key = '';     # dot-joined path of the element currently open
my %hist;         # tally per element path

my $p = XML::Parser->new(
    Handlers => {
        Start => sub {
            my $newkey = "$key.$_[1]";
            # With -b, an element that contains children is not counted
            # itself: undo its tally the first time a child shows up.
            if ( $key and $discrete_count and !exists( $embedding{$key} ) ) {
                $embedding{$key}++;
                $hist{$key}--;
                $counter++;
            }
            $key = $newkey;
            $hist{$key}++;
        },
        End => sub {
            delete $embedding{$key} if ( $discrete_count );
            # Strip the closing element's name from the end of the path.
            $key =~ s/\.\Q$_[1]\E$//;
        },
    }
);

if ( !$add_root ) {
    $p->parsefile( $ARGV[0] );
}
else {
    # Wrap the file contents in a synthetic root tag (made unique by the PID).
    my $xmlstr = "<STRUCT_HIST_ROOT_$$>\n";
    open( my $fh, '<:utf8', $ARGV[0] )
        or die "Unable to read $ARGV[0]: $!\n";
    {
        local $/;    # slurp mode, limited to this block
        $xmlstr .= <$fh>;
    }
    close $fh;
    $xmlstr .= "</STRUCT_HIST_ROOT_$$>";
    $p->parse( $xmlstr );
}

for my $k ( sort keys %hist ) {
    $_ = $k;
    if ( $add_root ) {
        # Hide the synthetic root from the report.
        s/\.STRUCT_HIST_ROOT_$$//;
        next unless /\S/;
    }
    print "$hist{$k}\t$_\n" unless ( $discrete_count and $hist{$k} <= 0 );
}

That probably isn't exactly what you're looking for, but it should give you some ideas on how to get what you want.

In reply to Re: Retrieving a List of XML Tag Names from Given File by graff
in thread Retrieving a List of XML Tag Names from Given File by tracekill
