Boss just asked me how long it would take to transform an XML file to another format, given that each XML tag will equate to a unique value in the output format. So I gave him my rate for coding a lookup table of {XML tag names => unique values} and told him to divide the number of unique tags by that rate. Of course he wanted me to count the XML tags, so I came up with this. It requires XML::Parser and its dependencies.
#!/usr/bin/perl -w
use strict;
use XML::Parser;

my %tags;
my $p = XML::Parser->new(
    Handlers => {
        Start => sub { $tags{$_[1]}++; },
    },
) or die "cannot create parser :: $!";

foreach my $file (@ARGV) {
    eval { $p->parsefile($file); };
    die $@ if $@;
}

print "The keys are\n";
print map "$_\n", sort keys %tags;
print "There are ", scalar(keys %tags), " tags in the files\n";

Re: Count number of unique tags in XML files
by mirod (Canon) on Mar 26, 2004 at 08:24 UTC

    I like using pyx (installed by XML::PYX) for that kind of quick "grab info from a file":

    pyx file.xml | perl -ln -e'if( /^\((.*)$/) { $tags{$1}++; }
        END { print foreach (sort keys %tags);
              print scalar(keys %tags), " tags in the files"; }'
Re: Count number of unique tags in XML files
by hawtin (Prior) on Mar 26, 2004 at 08:32 UTC

    Almost uniquely for XML, this is a case where a regex would also work:

    perl -n -e 'while(s/<([\w\d\:]+)//){$f{$1}++;}
        END{print map "$_\n", sort keys %f;
            print "There are ", scalar(keys %f), " tags in the files\n"}' *.xml

    Assuming well-formed XML, of course.

      Of course not:

      What if the XML includes this: <!-- <tag>this tag commented out</tag> -->? Though this might look contrived, you can actually find it in the XML for the XML recommendation itself.
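
A minimal sketch of the comment problem, using only core Perl. The helper names `count_tags_naive` and `count_tags_skip_comments` are hypothetical, not from the posts above. The first applies the same character class as the regex one-liner (with a non-destructive match instead of `s///`); the second strips comments before matching, on the assumption that the whole document fits in memory. It handles this one case only and is not a substitute for a real parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the regex approach, applied to a string
# rather than line by line via -n.
sub count_tags_naive {
    my ($xml) = @_;
    my %seen;
    $seen{$1}++ while $xml =~ /<([\w:]+)/g;
    return \%seen;
}

# Hypothetical helper: remove comments first, then count as before.
# The /s flag lets a comment span several lines.
sub count_tags_skip_comments {
    my ($xml) = @_;
    (my $stripped = $xml) =~ s/<!--.*?-->//gs;
    return count_tags_naive($stripped);
}

my $xml = <<'XML';
<doc>
  <!-- <tag>this tag commented out</tag> -->
  <para>text</para>
</doc>
XML

my $naive   = count_tags_naive($xml);
my $skipped = count_tags_skip_comments($xml);

print "naive:   ", join(", ", sort keys %$naive),   "\n";  # doc, para, tag
print "skipped: ", join(", ", sort keys %$skipped), "\n";  # doc, para
```

The naive count reports `tag` even though it only appears inside a comment; stripping comments removes that false positive but does nothing for the entity case.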

      Then what if the XML is:

      <!DOCTYPE foo SYSTEM "foo.dtd" []>
      <foo>&bar;</foo>
      You have no idea what's inside the entity: it could be just text, or it could include 278 unique tags. Note that this breaks all 3 pieces of code above, as XML::Parser (and thus pyx) does not expand external entities. The easiest solution I found uses... XML::Twig, as usual!

      perl -MXML::Twig -e'XML::Twig->new( expand_external_ents => 1)->parsefile( shift )->print'

      will expand external entities, and then the regular pyx or XML::Parser solution will work.