in reply to Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

PM norms (well documented in the PerlMonks FAQ and the Guide to the Monastery, q.v.):

  1. Have you read the docs for XML::RSS::LibXML?
  2. What have you tried?
  3. Where's a (failing) subset of your code to illustrate your problem?
  4. Where's some sample data?

Again, please see: On asking for help, How do I post a question effectively?, I know what I mean. Why don't you?, and Writeup Formatting Tips.

Re^2: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by mr_p (Scribe) on Jun 25, 2010 at 17:07 UTC
    Here is my code.

    The TODO part is where my code has to go, but nothing I have read about XML::RSS::LibXML lets me pull out the "item" tag.

    #!/usr/bin/perl
    use File::Path;
    use Data::Dumper;
    use LWP::UserAgent;
    use XML::RSS::LibXML;
    use POSIX qw(strftime);
    use Time::HiRes qw(gettimeofday tv_interval);

    my $client = LWP::UserAgent->new();
    my ($fh, $feed, $feed_title, $count, $node);
    my $rss = XML::RSS::LibXML->new;
    my $website_name = "usnews";
    my $url = "http://www.usnews.com/rss/health-news/index.rss";

    $firstListing = 1;

    while (1) {
        if ( $website_name eq "" ) { next; };

        print "polling: $website_name url: $url\n";
        $capture = $client->get("$url", ":content_file" => "/tmp/.rss_download_file") || die"$!\n";
        $rss->parsefile('/tmp/.rss_download_file');
        print "channel: $rss->{channel}->{title}\n";

        @curListOfItems = ();
        foreach my $item (@{ $rss->{items} }) {
            my $node_link = $item->{link};
            if (defined $node_link) {
                $curItem=$node_link ."\n";
                push (@curListOfItems, $curItem);
            }
        }

        if ($#prevListOfItems != -1 ) {
            # @newlyAddedLinks will be latest in curListOfItems and not in @prevListOfItems
            @newlyAddedLinks=grep!${{map{$_,1}@prevListOfItems}}{$_},@curListOfItems;
            foreach my $l (@newlyAddedLinks) {
                my $fileName=getFileName();
                $fileName="/tmp/.$website_name\_${fileName}";
                my $capture = $client->get("$l", ":content_file" => "$fileName");
                # TODO: Pull out the current Item tag ( <item> .....</item> )
            }
            print "Getting1 $filename\n";
        } elsif ( $firstListing == 1) {
            print "Getting2 $filename\n";
            foreach my $l (@curListOfItems) {
                my $fileName=getFileName();
                $fileName="/tmp/.$website_name\_${fileName}";
                my $capture = $client->get("$l", ":content_file" => "$fileName");
                # TODO: Pull out the current Item tag ( <item> .....</item> )
            }
            $firstListing = 0;
        }

        @prevListOfItems = @curListOfItems;
        open OUT_FILE, "> /tmp/.$website_name" || die "could not open file $!";
        print OUT_FILE "@prevListOfItems";
        close OUT_FILE;
        sleep 1;
    }

    sub getFileName {
        my ($seconds, $microseconds) = gettimeofday();
        my $padded_usecs = sprintf ('%06d', $microseconds);
        my ($logType, $str1, $str2) = split ('\|',$LogElement);
        $todaysDate = strftime "%d", localtime;
        $currentDateTime = strftime "%Y:%m:%d:%H:%M:%S", localtime;
        ($Year,$Month,$Date,$Hour,$Minute,$Seconds) = split /:/, $currentDateTime;
        $curYear = sprintf ('%04d', $Year);
        $curMonth = sprintf ('%02d', $Month);
        $curHour = sprintf ('%02d', $Hour);
        $curMinute = sprintf ('%02d', $Minute);
        $curDate = sprintf ('%02d', $Date);
        $curSec = sprintf ('%02d', $Seconds);
        my $fname = "${curYear}${curMonth}${curMonth}${curHour}${curMinute}${curSec}.html";
        return "$fname";
    }

      So, you're parsing an RSS file and then, for each item, fetching the link. What you get back is HTML, not RSS, so no, I don't think you'll get far trying to process the links with XML::RSS::LibXML.

      I'm not sure what you plan to do with the HTML documents but you already have XML::LibXML loaded into RAM so you could use it to parse the HTML:

      use XML::LibXML;

      my $dom = XML::LibXML->load_html(
          location => $fileName,
          recover  => 1,    # handle marginal HTML
      );
      print $dom->toString;

      The parser options for load_html are documented in XML::LibXML::Parser.
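
      Once the HTML is loaded, the DOM can be queried with XPath. The following is only a minimal sketch: the inline $html string, the example.com URL and the XPath expressions are made up for illustration, and in the code above the input would be the saved $fileName instead. suppress_errors is added here on the assumption that recoverable parse warnings are not wanted on STDERR.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page (note the unclosed <p>); stands in for the saved file.
my $html = '<html><head><title>Sample page</title></head>'
         . '<body><p>Hello <a href="http://example.com/a">link</a></body></html>';

my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,    # keep going on marginal HTML
    suppress_errors => 1,    # do not print recoverable parse errors
);

# Pull a couple of values out of the parsed document with XPath.
my $title = $dom->findvalue('/html/head/title');
my @hrefs = map { $_->getAttribute('href') } $dom->findnodes('//a[@href]');

print "title: $title\n";
print "link: $_\n" for @hrefs;
```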

        I just keep polling the same RSS feed and compare the links with those from the previous poll. Whenever a new link or item has been added, I want to parse only the Item tag of the new item.
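
        One way this might be sketched is to parse the downloaded feed with XML::LibXML directly rather than through XML::RSS::LibXML, and serialize only the item elements whose link was not seen on the previous poll. Everything here is an assumption for illustration: the inline feed, the example.com links and the %seen hash are invented, and the XPath assumes an RSS 2.0 feed without namespaces (an RSS 1.0/RDF feed would need namespace handling). In the polling loop the document would presumably come from load_xml( location => '/tmp/.rss_download_file' ).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical inline feed; stands in for the downloaded RSS file.
my $feed = <<'XML';
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Old story</title><link>http://example.com/old</link></item>
    <item><title>New story</title><link>http://example.com/new</link></item>
  </channel>
</rss>
XML

# Links remembered from the previous poll (assumed data).
my %seen = map { $_ => 1 } ('http://example.com/old');

my $doc = XML::LibXML->load_xml( string => $feed );

my @new_items;
for my $item ( $doc->findnodes('/rss/channel/item') ) {
    my $link = $item->findvalue('link');
    next if $seen{$link}++;              # skip links already seen, remember new ones
    push @new_items, $item->toString;    # the serialized <item>...</item> element
}

print "$_\n" for @new_items;
```

        Keeping %seen across iterations of the loop would take the place of comparing @prevListOfItems with @curListOfItems.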
Re^2: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by mr_p (Scribe) on Jun 26, 2010 at 00:57 UTC
    I am very surprised no one has answered this question. Is it possible to do what I am trying to do?