mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I want to extract the "item" tag from an RSS feed; the feed does not contain links, it just contains content. Is there a way to do this via XML::RSS::LibXML? Is there a way to get the "item" tag together with its data? I am trying to avoid loading the document again with another module.

Thanks.


Replies are listed 'Best First'.
Re: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by ww (Archbishop) on Jun 24, 2010 at 23:22 UTC
      Here is my code.

      The TODO parts are where I have to put my code, but nothing I have read about XML::RSS::LibXML shows me how to pull out the "item" tag.

      #!/usr/bin/perl
      use File::Path;
      use Data::Dumper;
      use LWP::UserAgent;
      use XML::RSS::LibXML;
      use POSIX qw(strftime);
      use Time::HiRes qw(gettimeofday tv_interval);

      my $client = LWP::UserAgent->new();
      my ($fh, $feed, $feed_title, $count, $node);
      my $rss = XML::RSS::LibXML->new;
      my $website_name = "usnews";
      my $url = "http://www.usnews.com/rss/health-news/index.rss";
      $firstListing = 1;

      while (1) {
          if ( $website_name eq "" ) { next; }
          print "polling: $website_name url: $url\n";

          # download the feed to a temp file, then parse it
          $capture = $client->get("$url", ":content_file" => "/tmp/.rss_download_file")
              || die "$!\n";
          $rss->parsefile('/tmp/.rss_download_file');
          print "channel: $rss->{channel}->{title}\n";

          # collect the link of every item in the current feed
          @curListOfItems = ();
          foreach my $item (@{ $rss->{items} }) {
              my $node_link = $item->{link};
              if (defined $node_link) {
                  $curItem = $node_link . "\n";
                  push (@curListOfItems, $curItem);
              }
          }

          if ($#prevListOfItems != -1) {
              # @newlyAddedLinks: links present in @curListOfItems but not in @prevListOfItems
              @newlyAddedLinks = grep !${{ map { $_, 1 } @prevListOfItems }}{$_}, @curListOfItems;
              foreach my $l (@newlyAddedLinks) {
                  my $fileName = getFileName();
                  $fileName = "/tmp/.$website_name\_${fileName}";
                  my $capture = $client->get("$l", ":content_file" => "$fileName");
                  # TODO: Pull out the current Item tag ( <item> ..... </item> )
              }
              print "Getting1 $filename\n";
          }
          elsif ($firstListing == 1) {
              print "Getting2 $filename\n";
              foreach my $l (@curListOfItems) {
                  my $fileName = getFileName();
                  $fileName = "/tmp/.$website_name\_${fileName}";
                  my $capture = $client->get("$l", ":content_file" => "$fileName");
                  # TODO: Pull out the current Item tag ( <item> ..... </item> )
              }
              $firstListing = 0;
          }

          @prevListOfItems = @curListOfItems;
          open OUT_FILE, "> /tmp/.$website_name" or die "could not open file $!";
          print OUT_FILE "@prevListOfItems";
          close OUT_FILE;
          sleep 1;
      }

      # build a timestamp-based file name, e.g. 20100624231501.html
      sub getFileName {
          my ($seconds, $microseconds) = gettimeofday();
          my $padded_usecs = sprintf('%06d', $microseconds);
          my ($logType, $str1, $str2) = split('\|', $LogElement);   # $LogElement is never set in this script
          $todaysDate = strftime "%d", localtime;
          $currentDateTime = strftime "%Y:%m:%d:%H:%M:%S", localtime;
          ($Year, $Month, $Date, $Hour, $Minute, $Seconds) = split /:/, $currentDateTime;
          $curYear   = sprintf('%04d', $Year);
          $curMonth  = sprintf('%02d', $Month);
          $curHour   = sprintf('%02d', $Hour);
          $curMinute = sprintf('%02d', $Minute);
          $curDate   = sprintf('%02d', $Date);
          $curSec    = sprintf('%02d', $Seconds);
          my $fname = "${curYear}${curMonth}${curDate}${curHour}${curMinute}${curSec}.html";
          return "$fname";
      }
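      One way the TODO might be handled (a rough sketch only, not tested against this feed): since XML::RSS::LibXML sits on top of XML::LibXML, the feed file that was already downloaded can be re-parsed with XML::LibXML itself and each raw <item> element pulled out with XPath. This assumes an RSS 2.0 feed whose <item> elements carry no namespace; an RDF/RSS 1.0 feed would need a namespace-aware query instead.

      use XML::LibXML;

      # sketch: re-parse the downloaded feed and serialise each <item> as raw XML
      my $dom = XML::LibXML->load_xml( location => '/tmp/.rss_download_file' );
      foreach my $item_node ( $dom->findnodes('//item') ) {
          my $raw_item = $item_node->toString;   # "<item> ... </item>" as a string
          # e.g. save $raw_item to a file, or match $item_node->findvalue('link')
          # against the link that was just fetched
      }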

        So, you're parsing an RSS file and then, for each item, you are fetching the link. What you get back is HTML, not RSS, so no, I don't think you'll get far trying to process those fetched pages with XML::RSS::LibXML.

        I'm not sure what you plan to do with the HTML documents, but you already have XML::LibXML loaded into RAM, so you could use it to parse the HTML:

        use XML::LibXML;

        my $dom = XML::LibXML->load_html(
            location => $fileName,
            recover  => 1,    # handle marginal HTML
        );
        print $dom->toString;

        The parser options for load_html are documented in XML::LibXML::Parser.
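        Once the DOM is built, it can be queried with XPath. The expressions below are purely illustrative and not tied to the layout of any particular page:

        # illustrative only: pull the page title and paragraph text from the
        # DOM that load_html built above
        my ($title) = $dom->findnodes('//title');
        print $title->textContent, "\n" if $title;

        foreach my $p ( $dom->findnodes('//p') ) {
            print $p->textContent, "\n";
        }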

      I am very surprised no one has answered this question. Is it possible to do what I am trying to do?