mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I want to extract the "item" tag from an RSS feed; the feed does not contain links, it just contains content. Is there a way to do this via XML::RSS::LibXML? Is there a way to get the "item" tag together with its data? I am trying to avoid loading the document again with another module.

Thanks.


Replies are listed 'Best First'.
Re: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by ww (Archbishop) on Jun 24, 2010 at 23:22 UTC
      Here is my code.

      The TODO parts are where I have to put my code, but nothing I have read about XML::RSS::LibXML shows me how to pull out the "item" tag.

      #!/usr/bin/perl
      use File::Path;
      use Data::Dumper;
      use LWP::UserAgent;
      use XML::RSS::LibXML;
      use POSIX qw(strftime);
      use Time::HiRes qw(gettimeofday tv_interval);

      my $client = LWP::UserAgent->new();
      my ($fh, $feed, $feed_title, $count, $node);
      my $rss = XML::RSS::LibXML->new;
      my $website_name = "usnews";
      my $url = "http://www.usnews.com/rss/health-news/index.rss";
      $firstListing = 1;

      while (1) {
          if ( $website_name eq "" ) { next; }
          print "polling: $website_name url: $url\n";

          # download the feed to a temp file, then parse it
          $capture = $client->get("$url", ":content_file" => "/tmp/.rss_download_file")
              || die "$!\n";
          $rss->parsefile('/tmp/.rss_download_file');
          print "channel: $rss->{channel}->{title}\n";

          # collect the link of every item in the current feed
          @curListOfItems = ();
          foreach my $item (@{ $rss->{items} }) {
              my $node_link = $item->{link};
              if (defined $node_link) {
                  $curItem = $node_link . "\n";
                  push (@curListOfItems, $curItem);
              }
          }

          if ($#prevListOfItems != -1) {
              # @newlyAddedLinks: links present in @curListOfItems but not in @prevListOfItems
              @newlyAddedLinks = grep !${{ map { $_, 1 } @prevListOfItems }}{$_}, @curListOfItems;
              foreach my $l (@newlyAddedLinks) {
                  my $fileName = getFileName();
                  $fileName = "/tmp/.$website_name\_${fileName}";
                  my $capture = $client->get("$l", ":content_file" => "$fileName");
                  # TODO: Pull out the current Item tag ( <item> ..... </item> )
              }
              print "Getting1 $filename\n";
          }
          elsif ($firstListing == 1) {
              print "Getting2 $filename\n";
              foreach my $l (@curListOfItems) {
                  my $fileName = getFileName();
                  $fileName = "/tmp/.$website_name\_${fileName}";
                  my $capture = $client->get("$l", ":content_file" => "$fileName");
                  # TODO: Pull out the current Item tag ( <item> ..... </item> )
              }
              $firstListing = 0;
          }

          @prevListOfItems = @curListOfItems;
          open OUT_FILE, "> /tmp/.$website_name" or die "could not open file $!";
          print OUT_FILE "@prevListOfItems";
          close OUT_FILE;
          sleep 1;
      }

      # build a timestamp-based file name, e.g. 20100624231501.html
      sub getFileName {
          my ($seconds, $microseconds) = gettimeofday();
          my $padded_usecs = sprintf('%06d', $microseconds);
          my ($logType, $str1, $str2) = split('\|', $LogElement);   # $LogElement is never set in this script
          $todaysDate = strftime "%d", localtime;
          $currentDateTime = strftime "%Y:%m:%d:%H:%M:%S", localtime;
          ($Year, $Month, $Date, $Hour, $Minute, $Seconds) = split /:/, $currentDateTime;
          $curYear   = sprintf('%04d', $Year);
          $curMonth  = sprintf('%02d', $Month);
          $curHour   = sprintf('%02d', $Hour);
          $curMinute = sprintf('%02d', $Minute);
          $curDate   = sprintf('%02d', $Date);
          $curSec    = sprintf('%02d', $Seconds);
          my $fname = "${curYear}${curMonth}${curDate}${curHour}${curMinute}${curSec}.html";
          return "$fname";
      }
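      One way the TODO might be handled (a rough sketch only, not tested against this feed): since XML::RSS::LibXML sits on top of XML::LibXML, the feed file that was already downloaded can be re-parsed with XML::LibXML itself and each raw <item> element pulled out with XPath. This assumes an RSS 2.0 feed whose <item> elements carry no namespace; an RDF/RSS 1.0 feed would need a namespace-aware query instead.

      use XML::LibXML;

      # sketch: re-parse the downloaded feed and serialise each <item> as raw XML
      my $dom = XML::LibXML->load_xml( location => '/tmp/.rss_download_file' );
      foreach my $item_node ( $dom->findnodes('//item') ) {
          my $raw_item = $item_node->toString;   # "<item> ... </item>" as a string
          # e.g. save $raw_item to a file, or match $item_node->findvalue('link')
          # against the link that was just fetched
      }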

        So, you're parsing an RSS file and then, for each item, you are fetching the link. What you get back is HTML, not RSS, so no, I don't think you'll get far trying to process those fetched pages with XML::RSS::LibXML.

        I'm not sure what you plan to do with the HTML documents, but you already have XML::LibXML loaded into RAM, so you could use it to parse the HTML:

        use XML::LibXML;

        my $dom = XML::LibXML->load_html(
            location => $fileName,
            recover  => 1,    # handle marginal HTML
        );
        print $dom->toString;

        The parser options for load_html are documented in XML::LibXML::Parser.
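        Once the DOM is built, it can be queried with XPath. The expressions below are purely illustrative and not tied to the layout of any particular page:

        # illustrative only: pull the page title and paragraph text from the
        # DOM that load_html built above
        my ($title) = $dom->findnodes('//title');
        print $title->textContent, "\n" if $title;

        foreach my $p ( $dom->findnodes('//p') ) {
            print $p->textContent, "\n";
        }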

      I am very surprised no one has answered this question. Is it possible to do what I am trying to do?