cybär has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,
I want to find a group of specific strings in an XML file.
For example, I want to find all people who are 43 years
old (the following tags):
<name>John Doe</name>
<age>43</age>
But I don't know how to do it right.
My code so far:
$pattern1 = "<name>.*</name>";
$pattern2 = "<age>43</age>";
$file = "data.xml";
$sep = "\n";
open(IN,"<$file");
chomp(@content=<IN>);
close IN;
$list = join ($sep,@content);
if ($list =~ m/$pattern1$sep$pattern2/gi){
print "\n::$&::\n";
print "\n::$1::\n";
print "\n::$2::\n";
}

It finds only one entry, but there are more.
Perhaps join is not the right function in my case, but
I didn't know how else to search the array,
because the two tags are in two separate array elements.
Please help me.
Thanks in advance,
cybär

Replies are listed 'Best First'.
Re: I want to find a group of pattern in a xml file
by Fletch (Bishop) on Sep 16, 2008 at 13:30 UTC

    a) you really don't want to use regexen to parse XML, 2) XML::Twig, III) profit!

    Update: To get you kick started, you'd want to use xpath along the lines of age[string()=~ /\A 43 \z/x] then find the name node using something like prev_sibling("name").
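
    A minimal sketch of that hint, assuming XML::Twig is installed and that the file wraps each pair in a <person> element inside <people> (the wrapper names and the sample data are assumptions, not taken from the original file). It uses get_xpath to select the matching age elements and prev_sibling to step back to the name:

```perl
use strict;
use warnings;
use XML::Twig;

# Sample data; the <people>/<person> wrappers are assumed for illustration.
my $xml = <<'EOF';
<people>
  <person><name>Jane Doe</name><age>42</age></person>
  <person><name>John Doe</name><age>43</age></person>
  <person><name>Foo Bar</name><age>43</age></person>
</people>
EOF

my $twig = XML::Twig->new;
$twig->parse($xml);

my @names;
# Select every <age> element whose text is exactly "43" ...
for my $age ( $twig->get_xpath('//age[string()="43"]') ) {
    # ... then step back to the <name> sibling that precedes it.
    my $name = $age->prev_sibling('name') or next;
    push @names, $name->text;
}

print "$_ is 43\n" for @names;
```

    For a real file, $twig->parsefile('data.xml') would replace the parse call.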

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      I tried some XML parser modules first,
      but I had no end of trouble.
      Could you give me an example of finding two consecutive tags with an
      XML parser module?
      I would be so pleased.
Re: I want to find a group of pattern in a xml file
by grinder (Bishop) on Sep 16, 2008 at 14:32 UTC

    First of all, I'll let you into a dirty little secret. 99.99% of the time, you can quite easily parse XML files with regular expressions. This is because 99.99% of the time you deal with only one external party sending you XML files, and they don't code it by hand, they wrote a program to generate it.

    And the thing is, they don't modify the program once it's in production, or they do so rarely, and seldom deeply enough for it to matter to you. This means that once you have figured out what the file looks like by empirical observation, you can write a few short patterns to pull out what you need.
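
    As a rough illustration of such a pattern, here is a sketch that assumes each <name> is followed directly by its <age>, with only whitespace in between, as in the original question (the sample data is made up). The original code found one entry because if runs the match a single time, and $1 and $2 stayed empty because the pattern had no capturing parentheses; a while loop with /g walks through every match:

```perl
use strict;
use warnings;

# Sample input; assumes <name> is immediately followed by <age>.
my $xml = <<'EOF';
<name>Jane Doe</name>
<age>42</age>
<name>John Doe</name>
<age>43</age>
<name>Foo Bar</name>
<age>43</age>
EOF

my @matches;
# In scalar context, /g resumes after the previous match on each iteration.
# [^<]* keeps the name from running across a tag; the parentheses fill $1.
while ( $xml =~ m{<name>([^<]*)</name>\s*<age>43</age>}g ) {
    push @matches, $1;
}

print "$_ is 43\n" for @matches;
```

    This only holds while the generating program keeps emitting the two elements in that order.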

    You really need to parse XML files when you have written the spec, and many people are sending you their data based on your spec. But I digress.

    When you say you want the contents of NAME and AGE elements, you probably have more context lying around in the file. Such as a PERSON element that encompasses them, otherwise you might get confused by <tree><age>437</age><name>Sequoia</name></tree> elements. To disambiguate this, you want the NAME element within the PERSON element, along with the AGE element of the PERSON element.

    Furthermore, you don't know if you'll see the NAME element first, or the AGE element first. That is, you might have <person><age>56</age><name>Alice</name></person> or <name>Bill</name><age>28</age>. So what you do is you keep track of each one you find, in a hash, and after you find another element, you check to see if you have both of them, and if so you do something with them.

    The following code uses XML::Twig to implement the above algorithm. I haven't tested to see whether it compiles, but such minor details will be cleaned up by the Chatterbox crew if you care to ask them :)

    use strict;
    use warnings;
    use XML::Twig;

    my $twig = do {
        my %seen;
        XML::Twig->new(
            twig_handlers => {
                'PERSON/NAME' => sub {
                    my ($t, $e) = @_;
                    $seen{NAME} = $e->text;
                    check(\%seen);
                },
                'PERSON/AGE' => sub {
                    my ($t, $e) = @_;
                    $seen{AGE} = $e->text;
                    check(\%seen);
                }
            }
        )
    };

    sub check {
        my $person = shift;
        return unless keys %$person == 2;
        print "$person->{NAME} is $person->{AGE} years old.\n";
        %$person = ();
    }

    for my $file (@ARGV) {
        $twig->parsefile($file);
    }

    • another intruder with the mooring in the heart of the Perl

Re: I want to find a group of pattern in a xml file
by toolic (Bishop) on Sep 16, 2008 at 13:51 UTC
    Here is a quick example using XML::Twig to get you going. Of course, there are many ways to do this, and this may not be optimal for your purposes, but it's a start:
    use strict;
    use warnings;
    use XML::Twig;

    my $xfile = <<EOF;
    <people>
    <person><name>Jane Doe</name><age>42</age></person>
    <person><name>John Doe</name><age>43</age></person>
    <person><name>Foo Bar</name><age>43</age></person>
    </people>
    EOF

    my $t= new XML::Twig();
    $t->parse($xfile);
    my $people = $t->root();
    my @persons = $people->children('person');
    for (@persons) {
        my $name = $_->first_child('name')->text();
        my $age = $_->first_child('age' )->text();
        if ($age == 43) { print "$name is $age\n" }
    }

    __END__
    John Doe is 43
    Foo Bar is 43
Re: I want to find a group of pattern in a xml file
by apl (Monsignor) on Sep 16, 2008 at 13:39 UTC
    What Fletch said. Plus, be very careful about how you match age. Your XML might contain <age> 43 </age>. (Note the embedded blanks; your original test would have failed.)
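
    One way to guard against that, sketched in plain Perl: trim the element text before comparing it. The helper name age_is is hypothetical, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the text, ignoring surrounding
# whitespace, is exactly the wanted whole number.
sub age_is {
    my ($text, $wanted) = @_;
    return 0 unless defined $text;
    (my $trimmed = $text) =~ s/^\s+|\s+$//g;   # strip leading/trailing blanks
    return $trimmed =~ /\A\d+\z/ && $trimmed == $wanted ? 1 : 0;
}

for my $text ('43', ' 43 ', '437', 'forty-three') {
    printf "%-13s %s\n", "'$text'", age_is($text, 43) ? 'match' : 'no match';
}
```

    In a regex-only approach, the same tolerance can be had by writing the pattern as <age>\s*43\s*</age>.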
Re: I want to find a group of pattern in a xml file
by dHarry (Abbot) on Sep 16, 2008 at 13:45 UTC

    Somehow the idea to “parse” an XML file with a regexp keeps popping up. In general this is not the way. Use an XML parser for this job! There are several alternatives, see CPAN :-)

    In case of the suggested XML::Twig you could use an XPath expression to filter out the people you’re looking for. Or go over all of them and test for the age element to have a text value of 43.

    Update:
    Something like (going over all people):

    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new( twig_roots => { 'name' => \&name } );
    $twig->set_pretty_print('nsgmls');    # Human readable output please
    $twig->parsefile("some_file_name.xml");

    # Handler for each 'name' element; the sub name must match \&name above.
    sub name {
        my ($twig, $name) = @_;
        # 'First' is a placeholder child element; adjust to the real structure.
        my $first = $name->first_child("First");
        if ($first && $first->text eq "43") {
            # Do stuff
        }
        else {
        }
    }
    Untested as you don't provide the exact structure of the XML file.

    Update 2
    Fixed typos

Re: I want to find a group of pattern in a xml file
by kubrat (Scribe) on Sep 16, 2008 at 14:44 UTC

    Or you could just use XML::Simple. Borrowing from toolic's example:

    use strict;
    use warnings;
    use XML::Simple;

    my $xfile = <<EOF;
    <people>
    <person><name>Jane Doe</name><age>42</age></person>
    <person><name>John Doe</name><age>43</age></person>
    <person><name>Foo Bar</name><age>43</age></person>
    </people>
    EOF

    my $people = XMLin($xfile);
    foreach (keys %{$people->{person}}) {
        my $person = $people->{person}->{$_};
        next if $person->{age} != 43;
        print "$_ is $person->{age}\n";
    }

    __END__
    John Doe is 43
    Foo Bar is 43

      But every time I use one of the XML parser modules (Twig, Simple, etc.) I get errors.
      With the code from superdoc I get the following error:
      syntax error at line 1, column 0, byte 0 at C:/Perl/site/lib/XML/Parser.pm line 187
      The only thing I changed was that instead of $xfile=text, I wrote $xfile="data.xml";
      Why the hell is it so difficult to use the Perl XML modules?
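
      A likely cause, assuming the XML::Twig example was the one being modified: parse() expects the XML text itself, so handing it the string "data.xml" makes the parser read the file name as if it were the document, and it fails at the very first byte. parsefile() is the method that takes a file name. A small sketch of the difference, using a temporary file as a stand-in for data.xml (the sample content is made up):

```perl
use strict;
use warnings;
use XML::Twig;
use File::Temp qw(tempfile);

# Write a small sample file so the sketch is self-contained;
# in the real script the path would simply be "data.xml".
my ($fh, $path) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} "<people><person><name>John Doe</name><age>43</age></person></people>\n";
close $fh or die "Cannot close $path: $!";

# parse() treats its argument as XML text, so a file name is rejected.
my $twig = XML::Twig->new;
my $ok = eval { $twig->parse($path); 1 };
print "parse() on a file name failed: $@" unless $ok;

# parsefile() opens the named file and parses its contents.
$twig = XML::Twig->new;
$twig->parsefile($path);
my $first = $twig->root->first_child('person')->first_child('name')->text;
print "First person in file: $first\n";
```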