Using XPath to Retrieve Remote XML Data

Spenser has asked for the wisdom of the Perl Monks concerning the following question:

Lately, I've been attempting to teach myself XML. I've almost completely read "Learning XML" (O'Reilly) and "XML Visual Quickstart" (Peachpit). I've read much of O'Reilly's new book, "Perl & XML"--a very well written and useful book on the subject, incidentally. I've been reading tutorials, articles, and postings all over the web (including PM, using SuperSearch). I'm worn out on the subject, but I am starting to put things together in my head and on my computer. However, I need a little help with one of the learning exercises I created for myself.

Below is a simple script I whipped together to experiment with the Perl module, XPath. The script reads in an XML text file which I copied from the Newest Nodes XML Generator link.

#!/usr/bin/perl

use CGI;
use XML::XPath;
use XML::XPath::XMLParser;
my $q = new CGI;

my $file = 'newmonk_perlquestions.xml';
my $xpath = XML::XPath->new(filename=>$file);

# Find only new perl questions
my $nodeset = $xpath->find('//NODE[@nodetype="perlquestion"]');

# Create web page to display matching nodes
print 
   $q->header( -type=>'text/html'),
   $q->start_html(-title=>'New Perl Questions'),
   "<b>New Perl Questions</b>", $q->br, $q->hr;

foreach my $node ($nodeset->get_nodelist) {
   # Serialize the node, then pull its node_id out of the markup
   my $nodeinfo = XML::XPath::XMLParser::as_string($node);
   next unless $nodeinfo =~ /node_id="(\d+)/;
   print 
     $q->a({-href=>"http://www.perlmonks.org/index.pl?node_id=$1"},
     $nodeinfo),
     $q->br;
}
print $q->end_html;

exit;

This works fine. However, I thought it would be cool to have $file point to the PM site address rather than a copy saved to my server. When I replace the $file declaration line with the following:

my $file = "http://www.perlmonks.org/index.pl?node=newest%20nodes%20xml%20generator";

I get a "Cannot open file..." error message. I can open it manually in my web browser, but not through this Perl script. I know I'm missing something conceptually, as well as specifically. Can anyone help me with this exercise?

-Spenser

That's Spenser, with an "s" like the detective.


Replies are listed 'Best First'.
Re: Using XPath to Retrieve Remote XML Data by the pusher robot (Monk) on Aug 26, 2002 at 22:26 UTC
XML::XPath only supports local files. Try using LWP or some such to grab the page, then use XML::XPath to process it. (I wish I could be a bit more helpful, but I haven't actually used either one before :-)
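
A minimal sketch of that suggestion, assuming LWP::Simple and XML::XPath are both installed. The helper names (`fetch_xml`, `perl_questions`) and the inline sample document are made up for illustration; the element and attribute names follow the XPath expressions used elsewhere in this thread. The important part is that the fetched text is handed to XML::XPath through the `xml` option rather than `filename`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::XPath;

# Hypothetical sample standing in for the feed; the values are made up.
my $SAMPLE_XML = <<'XML';
<NEWESTNODES>
  <NODE nodetype="perlquestion" node_id="1001">First sample question</NODE>
  <NODE nodetype="monkdiscuss" node_id="1002">Sample discussion</NODE>
  <NODE nodetype="perlquestion" node_id="1003">Second sample question</NODE>
</NEWESTNODES>
XML

# Fetch the raw XML text over HTTP; LWP::Simple::get returns undef on failure.
sub fetch_xml {
    my ($url) = @_;
    my $xml = get($url);
    die "Could not fetch $url\n" unless defined $xml;
    return $xml;
}

# Parse an XML string (not a filename) and return [node_id, title] pairs
# for every NODE whose nodetype is "perlquestion".
sub perl_questions {
    my ($xml) = @_;
    my $xp = XML::XPath->new( xml => $xml );
    return map { [ $_->getAttribute('node_id'), $_->string_value ] }
        $xp->findnodes('//NODE[@nodetype="perlquestion"]');
}

# With a URL argument, fetch live data; otherwise use the inline sample.
my $xml = @ARGV ? fetch_xml( $ARGV[0] ) : $SAMPLE_XML;
printf "%s: %s\n", @$_ for perl_questions($xml);
```

Run without arguments it only exercises the parsing step on the sample; pass the feed address as the first argument to try it against live data.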
Re: Using XPath to Retrieve Remote XML Data by lestrrat (Deacon) on Aug 26, 2002 at 22:34 UTC
If you don't mind using XML::LibXML, it supports http URIs ;)
CGI to display newest questions only by lestrrat (Deacon) on Aug 29, 2002 at 05:50 UTC
So I wrote something on my scratchpad and sent to Spenser, and he recommends posting it here for others to see. It takes a little bit of time fetching and parsing the data, but it does the trick, anyway :)

#!/usr/local/bin/perl
use strict;
use XML::LibXML;
use CGI;

use constant XML_SOURCE => 'http://www.perlmonks.org/index.pl?node_id=30175';

sub main
{
    my $parser = XML::LibXML->new();
    my $dom    = eval{ $parser->parse_file( XML_SOURCE ) };
    die if $@;

    my $q = CGI->new();
    print
        $q->header( -type => 'text/html' ),
        $q->start_html( -title => 'New Perl Questions' ),
        $q->b( "New Perl Questions" ),
        $q->br,
        $q->hr;

    foreach my $node ( $dom->findnodes( '/NEWESTNODES/NODE[ @nodetype = "perlquestion" ]' ) ) {
        print
            $q->a(
                { -href => sprintf(
                    'http://www.perlmonks.org/index.pl?node_id=%d',
                    $node->findvalue( '@node_id' ) ) },
                $node->textContent()
            ),
            $q->br;
    }
    print $q->end_html();
}

main();
Retrieving Remote XML Data by Spenser (Friar) on Aug 27, 2002 at 20:00 UTC
Since this posting, I've been playing with alternatives to XML::XPath, including some of those suggested here, like XML::LibXML. However, I'm still not getting my core problem resolved. So, let me restate my question and ask that my original code and question be set aside for the moment.

What module or modules do I use to grab data directly from the PM Newest Nodes XML Generator link, so that I can process it and pick out the elements that I want? And can someone give me a stripped-down sample of what the code for that task might look like?

It seems that I've worked out what to do with the data once I get it in my script's possession as a straight XML document. I just don't understand what I need to do to get it that way from this XML generator link. Do I need to use something like XML::Parser or some other parser module to get the data and put it in a format like the one my web browser so easily created? I'm looking for some sample code of the piece I'm missing, as well as an explanation of the concept I'm missing, to fill in the missing piece of my sanity. Thanks.

-Spenser

That's Spenser, with an "s" like the detective.
Re: Retrieving Remote XML Data by lestrrat (Deacon) on Aug 28, 2002 at 00:52 UTC
What's the problem with XML::LibXML? I don't know anything about the PM XML, but you should be able to do just about anything with XML::LibXML

## untested code
use strict;
use XML::LibXML;

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file( 'http://www.perlmonks.org/index.pl?node_id=30175' );

## find all "Discussion" section
foreach my $disc_node ( $doc->findnodes( '/NEWESTNODES/NODE[ @nodetype="monkdiscuss" ]' ) ) {
    ## do something with it...
}
Re: Retrieving Remote XML Data by Spenser (Friar) on Aug 28, 2002 at 17:01 UTC
There's an advantage to XPath in that one can easily extract specific nodes that have specific attributes and get the specific component (i.e., a node in XML format or the text within a tag) that one wants in a string. Of course, you're probably right: if I knew more about what I'm doing, I could work with just LibXML. Nevertheless, with the help of the lovely and talented Miss jarich, I managed to resolve the problem, or should I say, I patched together LibXML and XPath to accomplish what I wanted. Below is code added to the original node above that resolved the problem:

.
.
use XML::LibXML;# Plus the other modules above
.
.
my $uri = "http://www.perlmonks.org/index.pl?node_id=30175&sinceunixtime=20020820";
.
# Parse out XML using LibXML to a string
my $doc = $parser->parse_file($uri);
my $root = $doc->getDocumentElement;
my $file = $doc->toString;
my $xpath = XML::XPath->new(xml=>"$file");

[...And resume the above code.]

There are three key changes here. First, I had to change the URI address. The URI I started with was a little messy: I kept getting all kinds of extra HTML header stuff I couldn't skip over, causing it to crap out--I was probably doing it wrong, I admit. Incidentally, the unixtime value should be calculated by code--I'll tend to that later. The second key is to use the `toString` method to save the parsed XML, tags and text contents alike, to `$file`. The last and most significant key was to change the last line of the code shown in this post from the "filename" option (`XML::XPath->new(filename=>$file)`) to the "xml" option (`XML::XPath->new(xml=>"$file")`).

So, LibXML gets the XML from the PM site and saves it to a string; XPath then reads that string and extracts the desired nodes for standard Perl to work with. I know this is the hard way to do it, but I'm still learning and would love for anyone so inclined to streamline what I've done. Otherwise, thanks for all the feedback and advice.

-Spenser

That's Spenser, with an "s" like the detective.
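
On calculating the unixtime value in code, here is a rough sketch using only core Perl. It assumes the `sinceunixtime` parameter expects seconds since the epoch, which its name suggests but which is not confirmed anywhere in this thread. The `feed_url` helper and the `$BASE` variable are hypothetical names; the base address is the one used in the post above.

```perl
use strict;
use warnings;

# Base address taken from the post above.
my $BASE = 'http://www.perlmonks.org/index.pl?node_id=30175';

# Build the feed URL for nodes newer than $days days before $now.
# Assumption: sinceunixtime expects seconds since the epoch.
sub feed_url {
    my ( $days, $now ) = @_;
    $now = time unless defined $now;    # default to the current time
    my $since = $now - $days * 24 * 60 * 60;
    return "$BASE&sinceunixtime=$since";
}

# Example: everything from the last seven days
print feed_url(7), "\n";
```

The second argument exists only so the calculation can be checked against a fixed time; in normal use it would be left out.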
Re: Re: Retrieving Remote XML Data by lestrrat (Deacon) on Aug 28, 2002 at 22:10 UTC
Um, I'm highly against that solution. You're loading XML::LibXML for fetching data only??? That's the wrong module to use. If you want to do that, use one of the LWP::* modules. On the other hand, I think you fail to realize the power that XML::LibXML gives you. Namely, it supports XPath notation!!! Did you read my earlier post? I use XPath notation there to extract the "Discussion" nodes out of the XML.
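
To make that point concrete, here is a small sketch that does both the parsing and the XPath queries with XML::LibXML alone, so no second XML module is needed. It assumes XML::LibXML is installed; the sample document, its values, and the `perl_questions` helper are made up for illustration, with element and attribute names taken from the expressions earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample standing in for the feed; the values are made up.
my $SAMPLE_XML = <<'XML';
<NEWESTNODES>
  <NODE nodetype="perlquestion" node_id="1001">First sample question</NODE>
  <NODE nodetype="monkdiscuss" node_id="1002">Sample discussion</NODE>
  <NODE nodetype="perlquestion" node_id="1003">Second sample question</NODE>
</NEWESTNODES>
XML

# Return [node_id, title] pairs for each "perlquestion" node.
# findnodes selects elements by XPath; findvalue evaluates an XPath
# expression relative to the node, so no regex on serialized XML is needed.
sub perl_questions {
    my ($doc) = @_;
    return map { [ $_->findvalue('@node_id'), $_->textContent ] }
        $doc->findnodes('/NEWESTNODES/NODE[@nodetype="perlquestion"]');
}

my $parser = XML::LibXML->new;

# With an argument, let the parser load that location (per this thread it
# accepts http URIs as well as filenames); otherwise parse the inline sample.
my $doc = @ARGV
    ? $parser->parse_file( $ARGV[0] )
    : $parser->parse_string($SAMPLE_XML);

printf "%s: %s\n", @$_ for perl_questions($doc);
```

Compared with the patched-together version, this drops the `toString` round trip and the regex match on `node_id`, since the attribute is read directly from the parsed node.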