Extracting elements from an XML chunk leads to crash

Hue-Bond has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks,

It's been a long time since I last posted here :^). I'm creating an RSS 2.0 feed. The code checks a forum for new items and turns them into a proper XML file. We're planning to run it once an hour or so. Then I discovered that Google Reader (my RSS client) only polls the feed every three hours. This means that if I don't keep items between runs of the program, Google Reader will only catch the items posted to the forum in the last hour. To avoid this I designed the following plan:

1. Read previously saved items (XML chunks) into a hash and delete old ones. The hash key is each item's <title> tag, as in $h{$title} = $xml_chunk.
2. Transform the original HTML into a set of RSS <item>s. Add them to the hash, overwriting those with the same <title> (these are new messages posted to an existing thread).
3. Serialize the hash into a string of items (by just concatenating all elements).
4. Save this string for later runs.
5. Add the XML needed to change the string of <item>s into a real RSS feed (i.e. <rss>, <channel> and so on).
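
The plan above can be sketched in plain Perl as follows. This is only an illustration under stated assumptions, not the poster's actual code: the helper names (`title_of`, `merge_items`, `expire_items`, `build_feed`) and the channel metadata are hypothetical, the regexes are a shortcut that a real program would replace with an XML parser, and the expiry step assumes every <guid> carries a `@YYYYMMDD:` date as in the sample item further down.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the <title> text from one <item> chunk.
# A regex is a shortcut for this sketch; real code should parse the XML.
sub title_of {
    my ($chunk) = @_;
    my ($title) = $chunk =~ m{<title>(.*?)</title>}s;
    return $title;
}

# Steps 1-2: start from the saved hash and add the new chunks,
# overwriting any saved chunk that has the same title.
sub merge_items {
    my ($saved, @new_chunks) = @_;
    my %h = %$saved;
    for my $chunk (@new_chunks) {
        my $title = title_of($chunk);
        next unless defined $title;    # skip chunks without a title
        $h{$title} = $chunk;
    }
    return \%h;
}

# Step 1 (expiry): drop items whose guid date is older than $cutoff.
# ASSUMPTION: guids look like "name@YYYYMMDD:HHMMSS+0000".
sub expire_items {
    my ($items, $cutoff) = @_;         # $cutoff is a YYYYMMDD string
    for my $title (keys %$items) {
        my ($day) = $items->{$title} =~ m{<guid[^>]*>[^<]*\@(\d{8}):};
        delete $items->{$title} if defined $day && $day lt $cutoff;
    }
    return $items;
}

# Steps 3 and 5: concatenate the chunks and wrap them in a feed.
# The channel title, link and description are placeholders.
sub build_feed {
    my ($items) = @_;
    my $body = join '', map { $items->{$_} } sort keys %$items;
    return qq{<?xml version="1.0"?>\n}
         . qq{<rss version="2.0"><channel>}
         . qq{<title>Forum feed</title>}
         . qq{<link>http://example.invalid/</link>}
         . qq{<description>Placeholder description</description>}
         . $body
         . qq{</channel></rss>\n};
}
```

Step 4 (saving the items for the next run) is left out; writing the concatenated chunks to a file would be one option. Keying on the title means two threads with the same title would collide, which is a property of the original plan and not something this sketch addresses.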

If you think there's a better way to approach this problem, please stop reading now ;^).

I'm getting a segmentation fault when extracting the title and the date from an item. This code triggers it:

#!/usr/bin/perl
## the version of XML::LibXML installed in the system is 1.63, so I tried installing 1.65 here:
BEGIN { unshift @INC, '/tmp/pm/lib/perl/5.8.8'; }
use warnings;
use strict;
use XML::LibXML;

print "Version: ", $XML::LibXML::VERSION, "\n";
my $parser = XML::LibXML->new;
my $item = $parser->parse_balanced_chunk (<<'EOT');
<item>
<title>Insert title here</title>
<link>http://foo/bar.html</link>
<description>Insert description here</description>
<guid isPermaLink="false">foobar@19700101:000000+0000</guid>
</item>
EOT
my $title = $item->findvalue ('//title');
print "Still alive!\n";
my $date = $item->findvalue ('//guid');
print "Still alive!\n";
print "got title ($title) and date ($date)\n";

$ ./foo.pl
Version: 1.65
Still alive!
Segmentation fault (core dumped)

I discovered this on a production system with libxml2 version 2.6.27, then investigated it on my laptop with 2.6.30. Same result on both machines.

--
David Serrano


Replies are listed 'Best First'.
Re: Extracting elements from an XML chunk leads to crash by erroneousBollock (Curate) on Oct 22, 2007 at 00:47 UTC
Your sample doesn't segfault for me (WinXP, Perl 5.8.8, AS-817), but it doesn't produce any useful value from the second XPath expression either, no matter what the expression is or what the document contains.

Don't you need a call to `XML::LibXML::XPathContext->new` there somewhere, passing the fragment node created by parse_balanced_chunk? Or is there some cute shorthand for the XPath context you can use directly from the doc/frag node?

-David

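
For reference, the XPathContext approach asked about above might look like the sketch below. It binds the context to the <item> element itself (the fragment's first child) and uses paths relative to it, where the original used `//` on the fragment. Whether this avoids the crash on the affected XML::LibXML/libxml2 versions is not confirmed anywhere in this thread; treat it as something to try, not a known fix.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $frag = $parser->parse_balanced_chunk(<<'EOT');
<item>
<title>Insert title here</title>
<link>http://foo/bar.html</link>
<description>Insert description here</description>
<guid isPermaLink="false">foobar@19700101:000000+0000</guid>
</item>
EOT

# The chunk starts with <item>, so that element is the fragment's first child.
my $item = $frag->firstChild;

# Bind an XPath context to the element; expressions are then
# evaluated relative to <item>, not from a document root.
my $xpc = XML::LibXML::XPathContext->new($item);

my $title = $xpc->findvalue('title');
my $date  = $xpc->findvalue('guid');
print "got title ($title) and date ($date)\n";
```

The relative paths `title` and `guid` only look at direct children of <item>, which is all the sample needs.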
Re: Extracting elements from an XML chunk leads to crash by Cody Pendant (Prior) on Oct 22, 2007 at 04:09 UTC
I think I'm in the "if you see an easier way" camp. You seem to be working too hard. An RSS feed isn't designed to be over-written completely every time you update it. That's why items have a GUID.

Just add your new items, and leave the old ones there too, for as long as you want. Obviously the file will get too big if you leave them there for weeks, but you seem to be addressing the wrong problem -- "a certain RSS reader has a certain lookup rate and I need to match that" -- RSS is designed for agents to come and look at the feed whenever it suits them, and to figure out what's new for themselves, based on pubDates and GUIDs.

Nobody says perl looks like line-noise any more
kids today don't know what line-noise IS ...

Re^2: Extracting elements from an XML chunk leads to crash by Hue-Bond (Priest) on Oct 24, 2007 at 08:36 UTC
> Don't you need a call to `XML::LibXML::XPathContext->new` there somewhere, passing the fragment node created by parse_balanced_chunk?

Honestly I don't know, since I'm just getting started with all this `XML::LibXML` stuff.

> Just add your new items, and leave the old ones there too, for as long as you want. [...] "a certain RSS reader has a certain lookup rate and I need to match that"

Of course I wasn't going to match Google Reader's polling time, but rather take the opportunity to implement an expiry time of a couple of weeks or so.

I just discovered I can use a DOM-like interface, and quickly went down this road. However, I found another segfault condition. I'm going to notify the module's author, but first I would like to share the offending code with you monks, so I don't send them a suboptimal snippet. This is it:

use warnings;
use strict;
use XML::LibXSLT;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $xslt = XML::LibXSLT->new;

## original web page
my $xml_src = $parser->parse_html_string (<<'EOT');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>foo</title>
</head>
<body>
<p>foo</p>
</body>
</html>
EOT

## which we will "transform"
my $stylesheet = $xslt->parse_stylesheet ($parser->parse_string (<<'EOT'));
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" indent="yes" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<channel>
<item>
<title>one item</title>
</item>
</channel>
</xsl:template>
</xsl:stylesheet>
EOT

my $parsed = $stylesheet->transform ($xml_src);

## we'll move this item to another document
my $item = ($parsed->getElementsByTagName ('item'))[0];
$item->unbindNode;

## this is the other document:
my $saved = $parser->parse_string (<<'EOT');
<?xml version="1.0"?>
<channel>
<item>
<title>other item</title>
</item>
</channel>
EOT

my $ch = ($saved->findnodes ('/channel'))[0];
my $addto = ($ch->findnodes ('item'))[0];
$ch->insertBefore ($item, $addto);

print "going to boom...\n";
END {
print "unreached\n";
}

In particular I'm concerned about those `(findnodes)[0]`. I tried a couple of other approaches but didn't find anything that returned only the first element. The XSLT transformation does nothing interesting but is actually needed to crash the program.

--
David Serrano

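
On the `(findnodes)[0]` concern: a list assignment such as `my ($first) = ...` keeps only the first returned node, and `documentElement` returns the root element directly. The sketch below shows both, and also adopts the moved node into the target document explicitly before inserting it. It deliberately leaves out the XSLT step that the post says is needed to trigger the crash, so it only illustrates the idioms; it is not a confirmed workaround for the segfault.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# Stand-ins for the transformed document and the saved document.
my $src   = $parser->parse_string(
    '<channel><item><title>one item</title></item></channel>');
my $saved = $parser->parse_string(
    '<channel><item><title>other item</title></item></channel>');

# List assignment: only the first node returned by findnodes is kept.
my ($item) = $src->findnodes('/channel/item');

# documentElement is the root element, here <channel>.
my $ch = $saved->documentElement;
my ($addto) = $ch->findnodes('item');

# Detach the node from its old document and hand it to the new one
# explicitly, then insert it before the existing item.
$item->unbindNode;
$saved->adoptNode($item);
$ch->insertBefore($item, $addto);

print $saved->toString(1);
```

`getChildrenByTagName('item')` is another way to fetch direct children without XPath; it also returns a list, so the same list-assignment idiom applies.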
Re: Extracting elements from an XML chunk leads to crash by Krambambuli (Curate) on Oct 22, 2007 at 10:11 UTC
Your code should work - there is no reason AFAIK why a first call to findvalue() would pass and the second fail. Same weirdness here with libxml2 2.6.29 and XML::LibXML 1.63.

I'd say you'd better try to contact the module maintainer[s].

Krambambuli
---
Enjoying Mark Jason Dominus' Higher-Order Perl