Desdinova has asked for the wisdom of the Perl Monks concerning the following question:

For the first time in almost a year of being here I am posting a question, which says a lot about how much info is already here. But anyway:
I have a "XML like" file in this format:
<post>
<jobnumber>1234</jobnumber>
<location> somecity NJ <location>
</post>
<post>
<jobnumber>87922</jobnumber>
<location> Othercity, AK <location>
</post>
And so on
What I would like to be able to do is read in the first "post" and assign the content to a hash with the elements as keys, so that I can access the data like this:

print "City: $hash{location} \n";

Then go on to the next "post" and repeat the processing of the data inside it. I have looked at the man pages for XML::Parser and XML::Simple but I can't figure out how to go through one post at a time. I figure there is something simple about all this that I am just missing, and would appreciate it if some helpful monk could point my brain in the right direction.

Replies are listed 'Best First'.
Re: Parsing XML into a simple hash
by Kanji (Parson) on Jun 21, 2001 at 09:38 UTC
    I have a "XML like" file <...>

    If you don't want to write a custom parser from scratch, or within the framework that something like Parse::RecDescent provides, you'll need to convert your file into valid XML before you can use XML::Parser or XML::Simple.

    But if you do, doing what you need is a cinch ...

    use XML::Simple;

    my $xml = XMLin(<<__XML__);
    <posts>
    <post>
    <jobnumber>1234</jobnumber>
    <location>Somecity, NJ</location>
    </post>
    <post>
    <jobnumber>87922</jobnumber>
    <location>Othercity, AK</location>
    </post>
    </posts>
    __XML__

    foreach my $post ( @{ $xml->{'post'} } ) {
        print "City: $post->{'location'}\n";
    }
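    The conversion step mentioned above can be sketched roughly as follows. This is only a minimal sketch using core Perl: repair_posts is a hypothetical helper name, and it assumes the only defects in the file are the ones visible in the question (a second <location> where </location> should be, and no single root element). If some of the closing tags are already correct, the substitution would pair up the wrong tags, so check the data first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: assumes every <location> element is "closed" by a
# second <location> tag instead of </location>, as in the question's sample.
sub repair_posts {
    my ($text) = @_;

    # Turn "<location> ... <location>" into "<location> ... </location>".
    # The non-greedy (.*?) stops at the nearest following <location>.
    $text =~ s{<location>(.*?)<location>}{<location>$1</location>}gs;

    # Wrap everything in one root element so the result is a single document.
    return "<posts>\n$text</posts>\n";
}

my $raw = <<'EOF';
<post>
<jobnumber>1234</jobnumber>
<location> somecity NJ <location>
</post>
<post>
<jobnumber>87922</jobnumber>
<location> Othercity, AK <location>
</post>
EOF

print repair_posts($raw);
```

    The repaired string could then be handed to XMLin or to one of the other parsers discussed in this thread.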


Re: Parsing XML into a simple hash
by bikeNomad (Priest) on Jun 21, 2001 at 10:03 UTC
    Or you could force it into a single element by attaching tags to both ends:
    #!/usr/bin/perl -w
    use strict;
    use XML::Parser;

    my %hash;
    my $depth = 0;
    my @tags;

    sub start {
        my ($expat, $element) = @_;
        push(@tags, $element);
        $hash{$tags[-1]} = '';
    }

    sub end {
        pop(@tags);
        if (@tags == 1) {
            delete $hash{posts};
            delete $hash{post};
            # now you have hash.
            print "Job: $hash{jobnumber}\n";
            print "City: $hash{location}\n";
            %hash = ();
        }
    }

    sub char {
        my ($expat, $string) = @_;
        $hash{$tags[-1]} .= $string;
    }

    my $text = <<'EOF';
    <post>
    <jobnumber>1234</jobnumber>
    <location> somecity NJ </location>
    </post>
    <post>
    <jobnumber>87922</jobnumber>
    <location> Othercity, AK </location>
    </post>
    EOF

    my $p1 = new XML::Parser(Handlers => {
        Start => \&start,
        End   => \&end,
        Char  => \&char
    });
    $p1->parse("<posts>$text</posts>");

Re: Parsing XML into a simple hash
by mirod (Canon) on Jun 21, 2001 at 16:00 UTC

    And here is the ObXTW (the Obligatory XML::Twig Way), once you've fixed your XML by wrapping everything into a single element:

    #!/bin/perl -w
    use strict;
    use XML::Twig;

    my $t= new XML::Twig( twig_handlers => { post => \&post });
    $t->parse( \*DATA);

    sub post {
        my( $t, $post)= @_;   # all handlers get called with those arguments

        # here is the magic!
        # gi is the element name and text is its... text!
        my %hash= map { $_->gi, $_->text} $post->children;

        # or whatever you want to do with the hash
        print "City: $hash{location} \n";

        # if your file is small enough you don't need to purge, otherwise
        # it will free the memory used so far
        $t->purge;
    }

    __DATA__
    <posts>
    <post>
    <jobnumber>1234</jobnumber>
    <location> somecity NJ </location>
    </post>
    <post>
    <jobnumber>87922</jobnumber>
    <location> Othercity, AK </location>
    </post>
    </posts>

Re: Parsing XML into a simple hash
by strredwolf (Chaplain) on Jun 21, 2001 at 10:45 UTC
    If you look at my chatterbox, you'll find an SGML parser (read: precursor to XML, will work here). It'll split on those tags, so you can plop those locations into separate posts (say, inside a @posts array).

    I gotta put it into a module...


Re: Parsing XML into a simple hash
by mattr (Curate) on Jun 21, 2001 at 11:15 UTC
    If you don't have strict XML (i.e. no closing /location tag), why not just use regular expressions? This seems to work:

    #!/usr/bin/perl
    use strict;

    open (IN,"testxml.dat");
    my @buf = <IN>;
    close IN;

    for (my $i=0; $i<=$#buf; $i++) {
        if ($buf[$i] =~ s/^\s*<jobnumber>(.*)<\/jobnumber>\s*$/$1/) {
            $buf[$i+1] =~ s/^\s*<location>\s*(.*)\s<location>\s*$/$1/; # if your tags are really like this
            &process($buf[$i],$buf[$i+1]);
        }
    }

    sub process {
        my ($jobnumber,$location) = @_;
        print "Found a job $jobnumber in $location.\n";
        # do something
    }

    On a related note, I tried to lose the spaces inside the location tags and couldn't get this kind of regex to work, anyone?
    $buf[$i+1] =~ s/^\s*<location>\s*(.?)\s*<location>\s*$/$1/; # \s?(.*)\s works though..

      Your solution does not work. Initially, it will appear to work against his data set, but XML start and end tags don't have to appear on the same line. If that happens, your regex will break because the dot metacharacter doesn't match the newline. Adding the /s modifier allows the dot to match, but then, because your match is greedy, it still breaks:

      #!/usr/bin/perl
      use strict;

      my @buf = <DATA>;

      for my $i ( 0 .. $#buf ) {
          if ($buf[$i] =~ s/^\s*<jobnumber>(.*)<\/jobnumber>\s*$/$1/s) {
              $buf[$i+1] =~ s/^\s*<location>\s*(.*)\s<location>\s*$/$1/s; # if your tags are really like this
              &process($buf[$i],$buf[$i+1]);
          }
      }

      sub process {
          my ($jobnumber,$location) = @_;
          print "Found a job $jobnumber in $location.\n";
          # do something
      }

      __DATA__
      <posts>
      <post>
      <jobnumber>
      1234
      </jobnumber>
      <location>Somecity, NJ</location>
      </post>
      <post>
      <jobnumber>87922</jobnumber>
      <location>Othercity, AK</location>
      </post>
      </posts>

      See Death to Dot Star! for the explanation of why your regex fails (and for some excellent examples of how I have screwed up regexes on delimited text).

      Use a parser for data like this. Regexes, while I love them, are for matching data, not parsing it.

      As for your 'related note', it doesn't work because you have (.?) in your code. The dot/question mark makes you match one character and have that match optional. It's equivalent to (.{0,1}).
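      To make the difference concrete, here is a small sketch (the sample line and variable names are made up for illustration) comparing (.?) with the non-greedy (.*?) on a location line in the format from the question:

```perl
use strict;
use warnings;

# Sample line in the question's format (second tag is <location>, not </location>).
my $line = "<location> Othercity, AK <location>";

# (.?) can capture at most one character, so the whole pattern cannot match
# a city name that is longer than that.
my $with_optional =
    $line =~ /^\s*<location>\s*(.?)\s*<location>\s*$/ ? "match" : "no match";

# (.*?) captures as few characters as it can, which leaves the trailing
# whitespace to the \s* that follows the capture group.
my ($city) = $line =~ /^\s*<location>\s*(.*?)\s*<location>\s*$/;

print "(.?)  : $with_optional\n";
print "(.*?) : '$city'\n";
```

      The second pattern yields the city without the surrounding spaces, which is what the 'related note' was after.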



        Thanks Ovid, you're right and I'll reread that article!
Re: Parsing XML into a simple hash
by andye (Curate) on Jun 21, 2001 at 15:47 UTC
    For a quick-and-dirty regexp solution, how about this...
    while ($text =~ m|<jobnumber>(.*?)</jobnumber>.*?<location>(.*?)</location>|sg) {
        print "Found job number $1 in location $2 \n";
    }
    NB I'm assuming the missing slashes in the data are a typo. If not then it's easy enough to modify the above.

    Of course, a regexp solution isn't the right one if you want it to work in more general cases - like if the tags are the other way round, or whatever. If there's going to be any variation in the data, then the way to go is an XML parser as described by others above.


    Some Time Later: Just For Fun, I tried to see if I could write one that /would/ work with the tags either way round... came up with this...

    my $regexp = '<post>';
    $regexp .= "(?=.*?<$_>(.*?)</$_>)" foreach qw(jobnumber location);
    $regexp .= '.*?</post>';

    while ($stuff =~ m|$regexp|sog) {
        print "Found job number $1 in location $2 \n";
    }
Re: Parsing XML into a simple hash
by Desdinova (Friar) on Jun 22, 2001 at 07:28 UTC
    Thanks for all the help, everyone. I ended up going with XML::Twig because when I looked at it, it just clicked in my brain. As a side note, the lack of the closing location tag was a typo that I didn't notice until everyone came up with how to fix that as well as do what I wanted. Thanks for all the help.