http://qs1969.pair.com?node_id=105906

Hello XML fans, it's time to do some Prolog-like search and query on a small XML database. What is shown below is an adjacency map. It is an XML document which shows which cities are next to which other cities. The utility of such a document/data structure can be imagined to be if a person had an inter-city travel ticket and wanted to look up which cities were next to his. Which cities were two away, etc, etc.

While you could use normal nested Perl data structures to deal with this, XML is becoming en vogue and as a result we have to be just as fashionable. Actually, this isn't true, we can always use Gisle Aas' Data::XMLDumper to convert XML to-and-for Perl nested data structures. But for the purpose of this tutorial, we will act like that module doesn't exist.

So without further adieu, I present the XML document detailing the (far too windy) part of the world I currently live in (and will be escaping from as soon as Christmas is here):

<border_list> <pair><city>mountain view</city><city>sunnyvale</city></pair> <pair><city>mountain view</city><city>palo alto</city></pair> <pair><city>menlo park</city><city>palo alto</city></pair> <pair><city>atherton</city><city>menlo park</city></pair> <pair><city>atherton</city><city>redwood city</city></pair> <pair><city>san carlos</city><city>redwood city</city></pair> <pair><city>san carlos</city><city>belmont</city></pair> <pair><city>hillesdale</city><city>belmont</city></pair> <pair><city>hillesdale</city><city>san mateo</city></pair> </border_list>

Ok, so now what

So, now that I have shown the data, it is time to grok it, munge it, eat it for breakfast as a meal replacement and basically put it at it's knees to do our bidding.

Program One: find all cities next to menlo park

Ok, here is a program to grok this XML-base for all cities next menlo park:
use XML::Twig; my $t = XML::Twig->new(PrettyPrint => 'record'); $t->parsefile('adj.xml'); my $root = $t->root; # @pair has all the pairs of adjacent cities in it my @pair = $root->children; # target city we are looking for my $city = 'menlo park'; # this routine takes a search text and a list of XML elements and # searches them for the text sub candidate_generator { my ($search_text, @data) = @_; grep { grep { $_->text eq $search_text } $_->children } @data; } # take the entire XML-base and search for records which have our # target city in them my @adj = candidate_generator($city,@pair); # print them out in a human-readable form map { $_->print } @adj;
and here is the pretty output:
<pair> <city>menlo park</city> <city>palo alto</city> </pair> <pair> <city>atherton</city> <city>menlo park</city> </pair>

all done

The program was documented, so it should make sense, but let's take a closer look at candidate_generator().
# this routine takes a search text and a list of XML elements and # searches them for the text sub candidate_generator { my ($search_text, @data) = @_; grep { grep { $_->text eq $search_text } $_->children } @data; }
It consists of two nested greps and hence can be a little confusing. Depending on the way you think you might want to think about the outer grep and then the inner grep or vice versa. It is only fitting that I discuss both methods of program comprehension.

Let's do top-down first. The outer grep is basically saying: take all the XML records and only return the ones which satisfy the inner search criteria. The inner search criteria takes each individual XML record and looks at each of it's children, where each child is a city and examines its text for equality with the text to be searched for, or concretely speaking menlo park.

Ok, now bottom up. The innermost expression is  $_->text eq $search_text and what this does is take an XML element and get its text and compare it to a normal Perl string. So if $elt was an XML::Twig::Elt representing

<city>boise</city>
then $elt->text would be boise. Now we work out a bit more. And a bit more out is  grep { YADAYADA } $_->children So here we take advantage of the fact that the XML is structured so that neighboring cities are both children of the pair element, e..g:
<pair><city>mountain view</city><city>sunnyvale</city></pair>
and we are just checking to see if either child is the text we are looking for. And now we finally make it to the outer grep and the first sentence in the top-down description says what that is doing.

th-th-th-that's the first post, folks

Anyway, that was the first in a series of 3 posts. The next two will do slightly more advanced searching and in the process introduce a call or two more from the XML::Twig API.

Replies are listed 'Best First'.
Re: Adjacency List Processing in XML::Twig
by japhy (Canon) on Aug 18, 2001 at 16:55 UTC
    I'd change the double-grep (which also happens to be a nested grep) to a grep-map:
    grep { $_->text eq $search_text } map { $_->children } @data; # or grep $_->text eq $search_text, map $_->children, @data;

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Changing the nested grep to a grep map would change the semantics of the code.
      grep { grep { $_->text eq $search_text } $_->children } @data;
      This nested grep returns a list of elements for which any child contains the search text.
      grep { $_->text eq $search_text } map { $_->children } @data;
      This grep map a list of individual children that contain the search text.

       

      With a search for 'menlo park', the nested grep returns two elements; one has the children '<city>menlo park</city>' and '<city>palo alto</city>', and the other has the children '<city>atherton</city>' and '<city>menlo park</city>'.

      The grep map, on the other hand, would just return '<city>menlo park</city>' and '<city>menlo park</city>'.

        Oh, right. Sorry, princepawn, I misconstrued the purpose of that part of the code. That was silly of me.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;