Falstaff has asked for the wisdom of the Perl Monks concerning the following question:
First of all, I am just starting out with Perl and I am completely in love with the language.
I am working on a simple script that reads a few RSS addresses from a text file and then retrieves the items from those feeds. Most of the feeds provide their items correctly, but some throw warnings about wide characters and produce weird characters between the words.
Investigating the issue, I found that these warnings are usually caused by encoding problems. The RSS feeds are XML encoded as UTF-8, so I decided to use the decode function from the Encode module. This did not solve the problem, because the XML::RAI module performs the decoding internally.
Here is the code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTTP::Request;
    use LWP::UserAgent;
    use XML::RAI;
    use Data::Dumper;
    use Encode qw( decode );

    my $ua = LWP::UserAgent->new;

    open IN, "in.txt";
    open OUT, ">out.txt";

    while (<IN>) {
        chomp;
        my $request = HTTP::Request->new( GET => $_ );
        print "Requesting...\n";
        my $response = $ua->request( $request );
        print " Status: ", $response->status_line, "\n";
        print " Last modified: ", $response->header( 'last-modified' ), "\n";
        print " Etag: ", $response->header( 'etag' ), "\n\n";

        my $response_content = decode('UTF-8',$response->content);
        my $rai = XML::RAI->parse_string( $response->content );
        my $channel = $rai->channel;
        print "Channel:\n";
        print " Title: " . $channel->title . "\n";
        print " Link: " . $channel->link . "\n";
        print " Modified: " . $channel->modified . "\n";
        print " Publisher: " . $channel->publisher . "\n";

        for ( @{$rai->items} ) {
            #my $descriptiond = decode( 'UTF-8', $_->description );
            print OUT "Item:\n";
            print OUT " Title: " . $_->title . "\n";
            print OUT " Link: " . $_->link . "\n";
            print OUT " Description: " . $_->description . "\n";
            print OUT " Created: " . $_->created . "\n";
            print OUT "------------------------------------------------\n\n";
        }

        $request->header( 'If-Modified-Since', $response->header( 'last-modified' ) );
        $request->header( 'If-None-Match', $response->header( 'etag' ) );
    }
The feed addresses that produce items with problems are:
http://estaticos.elmundo.es/elmundo/rss/portada.xml
http://www.publico.es/rss/
What do I call weird characters? For example:
Title: Las pelĂculas que fuimos... y que seguiremos siendo
Can you give me a hint on how to solve the problem? Many thanks in advance.
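A common cause of "Wide character in print" warnings is printing already-decoded text (characters above code point 255) to a filehandle that has no encoding layer. The sketch below is not a diagnosis of the feeds above; it only illustrates the general pattern of decoding bytes once on input and encoding once on output. The sample bytes and the in-memory output handle are assumptions made for illustration. In the script above, the equivalent change would be opening out.txt with an `:encoding(UTF-8)` layer and calling `binmode` on STDOUT.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical sample: the raw UTF-8 bytes for "película" followed by
# a right double quotation mark (U+201D), as they might arrive in a
# response body. This is illustrative data, not taken from the feeds.
my $bytes = "pel\xC3\xADcula\xE2\x80\x9D";

# Decode exactly once: bytes in, Perl characters out.
my $text = decode( 'UTF-8', $bytes );

# Give STDOUT an encoding layer so characters are encoded on output.
binmode( STDOUT, ':encoding(UTF-8)' );

# Write through a handle that carries an encoding layer. An in-memory
# handle stands in for out.txt so the sketch needs no files.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out or die "Cannot close in-memory handle: $!";

# The decoded string is counted in characters, the output in bytes.
printf "characters: %d, bytes written: %d\n",
    length($text), length($buffer);
```

If the parser already returns decoded strings, as described above for XML::RAI, calling decode a second time on its results is unnecessary; only the output side would need the encoding layer.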
Re: Weird behavior with RAI and RSS.
by Khen1950fx (Canon) on Jan 17, 2011 at 01:31 UTC

Re: Weird behavior with RAI and RSS.
by Anonymous Monk on Jan 16, 2011 at 23:00 UTC

Re: Weird behavior with RAI and RSS.
by Chemtox (Initiate) on Jan 18, 2011 at 03:35 UTC
by Falstaff (Initiate) on Jan 18, 2011 at 21:50 UTC