Retrieving a List of XML Tag Names from Given File

tracekill has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, hoping to tap the wisdom of the great perl monks. Sadly, I've found the Perl community very sparse and inaccessible for new users but hopefully this place will prove me wrong.

My issue is this: I'm developing a little applet that allows the user to select a directory full of XML files (which we assume have similar formatting but may vary in the exact tag names used). The program then gleans from the contained XML files an array of all tag names which appear in the document. The reason for this is that I then need to use this array to fill up a combo box with options as the user needs to select certain tag names to associate with our set of standardized tag names. For instance, one group of XML documents may use the XML tag name "<producer>" when our group uses "<production_lead>".

I've tried implementing modules such as HTML::Reader (the author of which has been very helpful in attempting to fix my problem but I fear may be a bit too slow for me to meet my deadline), and XML::Parser. The first spits out errors of vague and mysterious origin. The second seems light years more complicated than I need it to be for my application.

Below is a sample of my code from the most successful implementation of a solution I've managed to author using HTML::Reader. Hopefully it will allow you to see the goal I am aiming for, though using HTML::Reader is not a necessity at all for any proposed solutions. In short, I need a simple method of getting a list of XML tag names from an XML document!

    my @xmlfiles = ();
    opendir(DIR, $self->{dirtree}->GetSelectedPath()) || die "Cannot open selected path. Make sure a path is selected!";
    @xmlfiles = grep(/\.xml$/, readdir(DIR));
    closedir(DIR);

    my $xmlreader;
    my $showerr = 0;
    my @taglist = ();
    # For every XML file in our list...
    for(my $count = 0; $count < @xmlfiles; $count++){
        # Create an XML reader for that file, get all the tag data into an array then add only relevant tag data
        # to the @taglist array.
        # $xmlreader = new HTML::TagReader $self->{dirtree}->GetSelectedPath() . "\\" . $xmlfiles[$count];
        # my @tagarr = $xmlreader->gettag($showerr);
        # for(my $subcount = 0; $subcount < @tagarr; $subcount++){
        #    push(@taglist, $tagarr[$subcount*3]);
        # }
        my $infile = $self->{dirtree}->GetSelectedPath() . "\\" . $xmlfiles[$count];
        my %removedumplicate;
        my @tagarr;
        my $p = new HTML::TagReader $infile;
        while(@tagarr = $p->getbytoken(!my $opt_W)){
            my $origtag = $tagarr[0];
            if($tagarr[1] eq "" || $tagarr[1] eq "!--"){
                next;
            }
            if ($removedumplicate{$tagarr[0]}){
                next;
            }
            push(@taglist, $tagarr[0]);
            $removedumplicate{$tagarr[0]}++;
        }
    }

In the commented section is my previous implementation of an HTML::Reader solution. Uncommented is the author's suggestion of a possible solution after I contacted him with my problem. Any help is greatly appreciated and will be rewarded with over-the-top praise and adoration.


Replies are listed 'Best First'.
Re: Retrieving a List of XML Tag Names from Given File by graff (Chancellor) on Jul 21, 2009 at 05:17 UTC
You might want to check out this little snippet I posted here a few months ago: Get a structured tally of XML tags, although I'll admit that it's a tad gnarly as a one-liner on the command line (suitable only for use in a bourne-like shell, such as bash). Luckily, since then I have refined it into a real script with POD, command-line options and error checking:

    #!/usr/bin/env perl

    =head1 NAME

    xml-structure-hist

    =head1 SYNOPSIS

    xml-structure-hist [-r] [-b] file.xml

      -r : have the program supply a root node tag
      -b : show break-downs of element paths (def: raw element counts)

    =head1 DESCRIPTION

    For any given xml file, this tool will use a standard xml parser to
    tabulate the structure of the tags and print (on STDOUT) a tally of
    how many times each distinct structural element occurs in the file.

    Use the "-r" option if the input file does not include its own "root"
    xml tag (e.g. this is typical of Gigaword-style text files, which are
    just a concatenation of "<DOC>" elements, with no initial "root" tag
    containing all the DOCs).

    For example, given an xml file with these contents:

      <root_node>
       <level1 id="x">
        <level2_a><level3>...</level3><level3>...</level3></level2_a>
        <level2_a><level3>...</level3><level3>...</level3></level2_a>
       </level1>
       <level1 id="y">
        <level2_a><level3><level4>...</level4>...</level3></level2_a>
        <level2_b><level3>...</level3></level2_b>
       </level1>
       <level1 id="z">
        <level2_a>...</level2_a>
       </level1>
      </root_node>

    the default output would be:

      1 .root_node
      3 .root_node.level1
      4 .root_node.level1.level2_a
      5 .root_node.level1.level2_a.level3
      1 .root_node.level1.level2_a.level3.level4
      1 .root_node.level1.level2_b
      1 .root_node.level1.level2_b.level3

    With the "-b" option, the output would be:

      1 .root_node.level1.level2_a
      4 .root_node.level1.level2_a.level3
      1 .root_node.level1.level2_a.level3.level4
      1 .root_node.level1.level2_b.level3

    If the example lacked the "root_node" tags, you would use the "-r"
    option, and the quantities reported for the "level*" tags would be the
    same as above.

    =head1 AUTHOR

    David Graff <graff at ldc.upenn.edu>

    =cut

    use strict;
    use XML::Parser;

    my $Usage = "$0 [-r] [-b] file.xml\n";

    my ( $add_root, $discrete_count );
    while ( @ARGV > 1 and $ARGV[0] =~ /-([rb])/ ) {
        if ( $1 eq 'r' ) {
            $add_root = shift;
        }
        else {
            $discrete_count = shift;
        }
    }
    die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

    my $counter = 0;
    my %embedding;
    my $key = '';
    my %hist;

    my $p = XML::Parser->new(
        Handlers => {
            Start => sub {
                my $newkey = "$key.$_[1]";
                if ( $key and $discrete_count and !exists( $embedding{$key} ) ) {
                    $embedding{$key}++;
                    $hist{$key}--;
                    $counter++;
                }
                $key = $newkey;
                $hist{$key}++;
            },
            End => sub {
                delete $embedding{$key} if ( $discrete_count );
                $key =~ s/\.$_[1]$//
            },
        }
    );

    if ( ! $add_root ) {
        $p->parsefile( $ARGV[0] );
    }
    else {
        my $xmlstr = "<STRUCT_HIST_ROOT_$$>\n";
        open( X, '<:utf8', $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n";
        { $/ = undef; $xmlstr .= <X>; }
        close X;
        $xmlstr .= "</STRUCT_HIST_ROOT_$$>";
        $p->parse( $xmlstr );
    }

    for my $k ( sort keys %hist ) {
        $_ = $k;
        if ( $add_root ) {
            s/.STRUCT_HIST_ROOT_$$//;
            next unless /\S/;
        }
        print "$hist{$k}\t$_\n" unless ( $discrete_count and $hist{$k} <= 0 );
    }

That probably isn't exactly what you're looking for, but it should give you some ideas on how to get what you want.
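For the original question (just the distinct tag names across a directory of XML files), the same XML::Parser Start-handler idea can probably be cut down quite a bit. What follows is only a minimal sketch under some assumptions: the helper names make_tag_collector, tag_names_from_string and tag_names_in_dir are hypothetical, every file is assumed to be well-formed XML (parse and parsefile die on malformed input), and filling the combo box is left out.

```perl
use strict;
use warnings;
use XML::Parser;

# Build a parser whose Start handler records each element name in %$seen.
# (Hypothetical helper, not part of XML::Parser itself.)
sub make_tag_collector {
    my ($seen) = @_;
    return XML::Parser->new(
        Handlers => {
            # Start handlers receive the Expat object, then the element name.
            Start => sub {
                my ( $expat, $element ) = @_;
                $seen->{$element} = 1;
            },
        },
    );
}

# Sorted, de-duplicated element names from one XML string.
sub tag_names_from_string {
    my ($xml) = @_;
    my %seen;
    make_tag_collector( \%seen )->parse($xml);
    my @names = sort keys %seen;
    return @names;
}

# Sorted, de-duplicated element names from every *.xml file in a directory.
sub tag_names_in_dir {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "Cannot open $dir: $!\n";
    my @files = grep { /\.xml$/i } readdir $dh;
    closedir $dh;

    my %seen;
    for my $file (@files) {
        # A fresh parser per file; %seen is shared, so duplicates across
        # files collapse into one entry.
        make_tag_collector( \%seen )->parsefile("$dir/$file");
    }
    my @names = sort keys %seen;
    return @names;
}

# Usage: perl list-tags.pl /path/to/xml/dir
if (@ARGV) {
    print "$_\n" for tag_names_in_dir( $ARGV[0] );
}
```

Keeping the "seen" hash outside the per-file loop is the design point here: one hash for the whole directory means a tag that appears in several files is listed only once.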
Re: Retrieving a List of XML Tag Names from Given File by ikegami (Patriarch) on Jul 20, 2009 at 23:24 UTC
I posted my solution here when this was asked in the chatterbox. I know nothing about HTML::Reader (in fact, it's not on CPAN), but it seems like the wrong choice, since HTML and XML aren't compatible.
Re: Retrieving a List of XML Tag Names from Given File by grantm (Parson) on Jul 21, 2009 at 03:09 UTC
If I wanted to do this as a one-off, then I'd use XML-PYX and a shell one-liner like this:

`pyx xcard.xml | perl -nle '/^\((\S+)/ && print $1' | sort -u`

To do it programmatically, I'd probably use XML-SAX:

    #!/usr/bin/perl -w

    use warnings;
    use strict;

    use XML::SAX::ParserFactory;

    my $handler = TagListHandler->new;
    my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);

    my $filename = shift or die "No filename\n";

    my $tags = $parser->parse_uri($filename);
    print "$_\n" foreach @$tags;

    exit;

    package TagListHandler;

    use base qw(XML::SAX::Base);

    sub start_element {
        my($self, $data) = @_;
        $self->{tag}->{ $data->{Name} } = 1;
    }

    sub end_document {
        my($self) = @_;
        my @tags = sort keys %{ $self->{tag} || {} };
        return \@tags;
    }

If you install XML::SAX you'll get a pure Perl parser so you'll also want to install a faster parser module like XML-SAX-Expat.
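To make the PYX one-liner a little less cryptic: in PYX notation each line begins with a type character, with "(" opening an element, ")" closing it, "A" introducing an attribute and "-" carrying text. The pattern `/^\((\S+)/` therefore picks out only the element-open lines. Here is a small sketch of that filtering step in plain Perl; the function name tags_from_pyx and the sample lines are made up for illustration and are not real output from any file.

```perl
use strict;
use warnings;

# Extract the unique element names from PYX-formatted lines.
# (tags_from_pyx is a hypothetical helper name.)
sub tags_from_pyx {
    my (@lines) = @_;
    my %seen;
    for my $line (@lines) {
        # Same pattern as the one-liner: "(" followed by the element name.
        $seen{$1} = 1 if $line =~ /^\((\S+)/;
    }
    my @names = sort keys %seen;    # plays the role of "sort -u"
    return @names;
}

# Hand-written PYX sample for a tiny document (an assumption, not tool output).
my @pyx = (
    '(card',
    'Aid 42',
    '(name',
    '-Alice',
    ')name',
    '(name',
    '-Bob',
    ')name',
    ')card',
);

print "$_\n" for tags_from_pyx(@pyx);    # prints "card" then "name"
```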