Falantar has asked for the wisdom of the Perl Monks concerning the following question:

Hello, in the past couple of days I managed to make a script that parses a specific XML file and returns only a few values. To be more specific, the data is formatted like this:

__DATA__
<!-- Dog section -->
<object type="Dog" >
  <property name="id" value="0" />
  <property name="name" value="REX" />
  <property name="status" value="alive" />
  <property name="mode" value="owned" />
  <property name="dog_breed_id" value="0" />
  <property name="dog_breed_name" value="Husky" />
  <property name="capacity" value="105" />
  <property name="size" value="big" />
  <property name="location" value="canada" />
</object >

This is just an example and not the actual values, but you get the idea. There will also be several of these "objects" one after another. The script I made used the Tree style; I'm now trying to create one using Stream. The only info I want to extract is the name, the dog_breed_id, and the dog_breed_name. I would like the output to be simple and separated only by commas, like this:

REX,0,Husky

I'm a novice programmer but I'm a pretty quick learner so don't be shy. My main purpose is to make it run faster but also to learn how to use different methods of parsing data.

Edit: For those interested, I used XML::Rules, this is what the final code looks like.

#!/usr/bin/perl -w
use strict;
use warnings;
use vars qw/ %options /;
use Getopt::Std;
use XML::Rules;

# How to use the script
sub Usage(){
    print STDERR " Usage : $0 [-arg file] arg: -i : load information from XML file \n";
    exit 2;
}

my $opt_string = 'i:';
my $File;
my %options;

getopts("$opt_string", \%options );
Usage unless ( %options );

foreach ( (my $key) = (keys %options) ){
    $File = $options{$key};
}
Usage unless(-f $File);

my @rules = (
    object => sub {
        if( $_[1]{type} eq 'dog' ){
            print join(",", @{$_[1]}{qw(dog_brd_id dog_brd_name name)}),"\n";
            return;
        }
        elsif($_[1]{type} eq 'cat'){
            XML::Rules->return_nothing;
        }
    },
    property => sub {$_[1]->{name} => $_[1]->{value}},
);

my $xr = XML::Rules->new(
    rules => \@rules,
    stripspaces => 2
);

$xr->parsefile($File);

This obviously depends on the Dog section coming before the Cat section. Using this method, though, I was able to bring the real time down from 20s to 0.5s. Thanks to everyone who helped!!

Replies are listed 'Best First'.
Re: XML::Parser using stream
by toolic (Bishop) on Aug 17, 2011 at 16:33 UTC
    I first used XML::Parser, but then quickly switched to XML::Twig because I find its user interface easier to work with. If you are willing to entertain a Twig solution....
    use warnings;
    use strict;
    use XML::Twig;

    my $str = <<EOF;
    <foo>
    <!-- Dog section -->
    <object type="Dog" >
      <property name="id" value="0" />
      <property name="name" value="REX" />
      <property name="status" value="alive" />
      <property name="mode" value="owned" />
      <property name="dog_breed_id" value="0" />
      <property name="dog_breed_name" value="Husky" />
      <property name="capacity" value="105" />
      <property name="size" value="big" />
      <property name="location" value="canada" />
    </object >
    </foo>
    EOF

    my $t = XML::Twig->new(
        twig_handlers => { object => \&woof }
    );
    $t->parse($str);

    sub woof {
        my ($t, $obj) = @_;
        if ($obj->att('type') eq 'Dog') {
            my $name;
            my $id;
            my $bname;
            for my $prop ($obj->children('property')) {
                $name  = $prop->att('value') if $prop->att('name') eq 'name';
                $id    = $prop->att('value') if $prop->att('name') eq 'dog_breed_id';
                $bname = $prop->att('value') if $prop->att('name') eq 'dog_breed_name';
            }
            print "$name,$id,$bname\n";
        }
    }

    __END__

    REX,0,Husky

      The code works well, except now I'm getting a Segmentation Fault when I run the script. It occurs after my data has been printed: the script hangs for a while, then throws the error message. This only happens on the XML files over 10 MB. The small one doesn't get this error. Here's what the code looks like:

      use strict;
      use warnings;
      use XML::Twig;
      use vars qw/ %options /;
      use Getopt::Std;
      use Switch;

      #----
      #Code to treat the file
      #----

      my $t = XML::Twig->new(
          twig_handlers => { object => \&animal_handler }
      );
      $t->parsefile($File);

      sub animal_handler {
          my ($t, $obj) = @_;
          if ($obj->att('type') eq 'dog') {
              my $name;
              my $id;
              my $bname;
              for my $prop ($obj->children('property')) {
                  $name  = $prop->att('value') if $prop->att('name') eq 'name';
                  $id    = $prop->att('value') if $prop->att('name') eq 'dog_breed_id';
                  $bname = $prop->att('value') if $prop->att('name') eq 'dog_breed_name';
              }
              print "$id,$name,$bname\n";
          }
      }

      Any ideas why I'm getting this error? Or even a way to trace what part of the file or code it's occurring at?
        use Switch;
        The first thing to try is to get rid of all code that uses Switch.
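
Separately from the Switch suggestion, a point that often comes up with XML::Twig and large files: unless the handler discards what it has already processed, the whole document tree stays in memory. Nothing in this thread shows that memory use is behind the segfault, so treat the following only as a minimal sketch of the usual purge idiom. The helper name extract_dogs and the inline sample are hypothetical, and the type value 'Dog' is taken from the example data rather than the real file.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns one "name,breed_id,breed_name" row per Dog object.
sub extract_dogs {
    my ($xml) = @_;
    my @rows;

    my $twig = XML::Twig->new(
        twig_handlers => {
            object => sub {
                my ($t, $obj) = @_;

                if (($obj->att('type') // '') eq 'Dog') {
                    # Collect every property into a hash keyed by its name attribute.
                    my %prop;
                    $prop{ $_->att('name') } = $_->att('value')
                        for $obj->children('property');

                    push @rows, join ',',
                        map { $prop{$_} // '' } qw(name dog_breed_id dog_breed_name);
                }

                # Discard everything parsed so far so the in-memory tree stays small.
                $t->purge;
            },
        },
    );

    $twig->parse($xml);
    return @rows;
}

# Assumed sample input, modelled on the example data in the question.
my $sample = join "\n",
    '<foo>',
    '  <object type="Dog">',
    '    <property name="name" value="REX" />',
    '    <property name="dog_breed_id" value="0" />',
    '    <property name="dog_breed_name" value="Husky" />',
    '  </object>',
    '  <object type="Fish">',
    '    <property name="name" value="Ginger" />',
    '  </object>',
    '</foo>';

print "$_\n" for extract_dogs($sample);
```

For a file on disk, parsefile can be used in place of parse, as in the code above. Collecting the properties into a hash also avoids repeating the att('name') comparison for each wanted field.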
Re: XML::Parser using stream
by runrig (Abbot) on Aug 17, 2011 at 19:45 UTC
    Another option is XML::Rules:
    use strict;
    use warnings;
    use XML::Rules;
    use Data::Dumper qw(Dumper);

    my $xml = <<XML;
    <object type="Dog" >
      <property name="id" value="0" />
      <property name="name" value="REX" />
      <property name="status" value="alive" />
      <property name="mode" value="owned" />
      <property name="dog_breed_id" value="0" />
      <property name="dog_breed_name" value="Husky" />
      <property name="capacity" value="105" />
      <property name="size" value="big" />
      <property name="location" value="canada" />
    </object >
    XML

    my @rules = (
        "^object" => sub { $_[1]{type} eq 'Dog' },
        object => sub {
            print join("|", @{$_[1]}{qw(name dog_breed_id dog_breed_name)}),"\n";
            return;
        },
        property => sub {$_[1]->{name} => $_[1]->{value}},
    );

    my $xr = XML::Rules->new(
        rules => \@rules,
        stripspaces => 2
    );

    $xr->parse($xml);
      That is excellent. Seems like a great parser, wonder why I haven't heard of it before! Quick and I don't get a segfault either. Now I just need to make the script stop once all of the info has been processed. Since it's all together in the same group, it doesn't have to keep analyzing the rest of the file.
Re: XML::Parser using stream
by Perlbotics (Archbishop) on Aug 17, 2011 at 16:53 UTC

    toolic already showed you one right way to do it ™. So, for educational purposes only, here's an example from the dark side of XML parsing.

    Sometimes this approach works, but requires that you can be really sure, that your XML is at max. one tag per line and attribute order is correct and no attribute is named TYPE and case is correct and proper quoting is used and no nested quoting occurs and ...
    For each of these restrictions, there's a workaround, but you get the picture.

    So this approach often works iff the producer of the XML document adheres to the contract. Then, the benefit might be a slight speed improvement.
    Usually unexpected things happen many moons later.

    use strict;
    use warnings;

    my $record;
    while ( <DATA> ) {
        next if /^\s*<!--/;
        next if /^\s*$/;
        $record = { TYPE => $1 }, next if /<object/   and /type="([^"]*)"/;
        $record->{$1} = $2,       next if /<property/ and /name="([^"]*)"\s+value="([^"]*)"/;
        if ( /<\/object/ ) {
            # flush current record
            ( print join(',', map { $record->{$_} // 'n/a' }
                         qw(name dog_breed_id dog_breed_name)),"\n"
            ) if $record->{TYPE} eq 'Dog';
            $record = undef; # not really necessary
            next;
        }
        warn "Unexpected input (line $.): $_";
    }

    __DATA__

    <!-- Dog section -->
    <object type="Dog" >
      <property name="id" value="0" />
      <property name="name" value="REX" />
      <property name="status" value="alive" />
      <property name="mode" value="owned" />
      <property name="dog_breed_id" value="0" />
      <property name="dog_breed_name" value="Husky" />
      <property name="capacity" value="105" />
      <property name="size" value="big" />
      <property name="location" value="canada" />
    </object >

    <object type="Fish" >
      <property name="id" value="8" />
      <property name="name" value="Ginger" />
      <property name="status" value="alive" />
      <property name="mode" value="owned" />
      <property name="fish_breed_id" value="4" />
      <property name="fish_breed_name" value="Guppy" />
      <property name="capacity" value="105" />
      <property name="size" value="big" />
      <property name="location" value="glassbowl" />
    </object >

    <object type="Dog" >
      <property name="id" value="9" />
      <property name="name" value="Norbert" />
      <property name="status" value="alive" />
      <property name="mode" value="owned" />
      <property name="dog_breed_id" value="7" />
      <property name="dog_breed_name" value="Norwegian Wolf" />
      <property name="capacity" value="105" />
      <property name="size" value="big" />
      <property name="location" value="norway" />
    </object >

    <comment value="This should trigger a warning..." />

    Output:

    REX,0,Husky
    Norbert,7,Norwegian Wolf
    Unexpected input (line 39): <comment value="This should trigger a warning..." />
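
To make one of the restrictions listed above concrete (attribute order), here is a small sketch that reuses the same property pattern as the script. The helper name parse_property is hypothetical and only wraps the regex so it can be tried on single lines.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the same pattern the script uses for <property> lines.
# Returns (name, value) on a match and an empty list otherwise.
sub parse_property {
    my ($line) = @_;
    return $line =~ /name="([^"]*)"\s+value="([^"]*)"/ ? ($1, $2) : ();
}

# Attributes in the order the script expects.
my @ok = parse_property('<property name="name" value="REX" />');

# Same element, equally valid XML, but with the attributes swapped.
my @swapped = parse_property('<property value="REX" name="name" />');

print "expected order: ", (@ok      ? "@ok"      : "no match"), "\n";
print "swapped order:  ", (@swapped ? "@swapped" : "no match"), "\n";
```

The second line is the same element as far as XML is concerned, yet the pattern does not match it. In the full script such a line would fall through to the warn, which at least makes the broken contract visible.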

Re: XML::Parser using stream
by Falantar (Initiate) on Aug 17, 2011 at 17:10 UTC

    Thank you both for the quick response. Toolic, thanks for the suggestion to go with twig. I hadn't read much on it because parser did pretty much what I needed it to do until now. I'm going to run some benchmarks when I have the chance.

    Perlbotics, thanks for the info. It doesn't seem like a route I'm likely to take, because I don't even know who handles the XML file, but I'll take your code and dissect it to see what I can learn from it.