Falstaff has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

First of all, I am just starting out in the world of Perl and I am completely in love with this language.

I am working on a simple script that reads a few RSS addresses from a text file and then retrieves the items from those addresses. Most of the feeds provide their items correctly, but some throw warnings about wide characters and spit out weird characters between the words.

Investigating the issue, I found that these warnings are usually caused by encoding problems. The RSS feeds are XML encoded in UTF-8, so I decided to use the Encode module's decode function. This was not a solution, because the RAI module performs the decoding internally.

Here is the code:

#!/usr/bin/perl
use warnings;
use strict;
use HTTP::Request;
use LWP::UserAgent;
use XML::RAI;
use Data::Dumper;
use Encode qw( decode );

my $ua = LWP::UserAgent->new;

open IN, "in.txt";
open OUT, ">out.txt";

while (<IN>) {
    chomp;
    my $request = HTTP::Request->new( GET => $_ );
    print "Requesting...\n";
    my $response = $ua->request( $request );
    print " Status: ", $response->status_line, "\n";
    print " Last modified: ", $response->header( 'last-modified' ), "\n";
    print " Etag: ", $response->header( 'etag' ), "\n\n";

    my $response_content = decode('UTF-8',$response->content);
    my $rai = XML::RAI->parse_string( $response->content );
    my $channel = $rai->channel;
    print "Channel:\n";
    print " Title: " . $channel->title . "\n";
    print " Link: " . $channel->link . "\n";
    print " Modified: " . $channel->modified . "\n";
    print " Publisher: " . $channel->publisher . "\n";

    for ( @{$rai->items} ) {
        #my $descriptiond = decode( 'UTF-8', $_->description );
        print OUT "Item:\n";
        print OUT " Title: " . $_->title . "\n";
        print OUT " Link: " . $_->link . "\n";
        print OUT " Description: " . $_->description . "\n";
        print OUT " Created: " . $_->created . "\n";
        print OUT "------------------------------------------------\n\n";
    }

    $request->header( 'If-Modified-Since', $response->header( 'last-modified' ) );
    $request->header( 'If-None-Match', $response->header( 'etag' ) );
}

The feed addresses that produce items with problems are:

http://estaticos.elmundo.es/elmundo/rss/portada.xml
http://www.publico.es/rss/

What do I call weird characters?

Title: Las pelĂ­culas que fuimos... y que seguiremos siendo

Can you give me a hint on how to solve the problem? Many thanks in advance.

Re: Weird behavior with RAI and RSS.
by Khen1950fx (Canon) on Jan 17, 2011 at 01:31 UTC
    In addition to the missing binmode, I found a couple of other problems. First, don't forget to close an open filehandle. Second, make sure that everything is initialised. In your script, "publisher" wasn't initialised; also, you were using an unopened filehandle "OUT", so don't do that :). Here's what worked for me:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(decode);
    use HTTP::Request;
    use LWP::UserAgent;
    use XML::RAI;

    my $ua  = 'LWP::UserAgent'->new;
    my $rss = shift @ARGV;

    # Redirect the standard handles first; binmode has to come after the
    # open it applies to, so the UTF-8 layer is set once STDOUT is reopened.
    open STDERR, '>', 'in.err';
    open STDOUT, '>', 'out.log';
    binmode STDOUT, ":utf8";

    my $request = 'HTTP::Request'->new( 'GET', $rss );
    print "Requesting...\n";
    my $response = $ua->request($request);
    print ' Status: ', $response->status_line, "\n";
    print ' Last modified: ', $response->header('last-modified'), "\n";
    print ' Etag: ', $response->header('etag'), "\n\n";

    my $response_content = decode( 'UTF-8', $response->content );
    my $rai = 'XML::RAI'->parse_string( $response->content );
    my $channel = $rai->channel;
    print "Channel:\n";
    print ' Title: ' . $channel->title . "\n";
    print ' Link: ' . $channel->link . "\n";
    print ' Modified: ' . $channel->modified . "\n";

    foreach $_ ( @{ $rai->items; } ) {
        print STDOUT "Item:\n";
        print STDOUT ' Title: ' . $_->title . "\n";
        print STDOUT ' Link: ' . $_->link . "\n";
        print STDOUT ' Description: ' . $_->description . "\n";
        print STDOUT ' Created: ' . $_->created . "\n";
        print STDOUT "------------------------------------------------\n\n";
    }

    $request->header( 'If-Modified-Since', $response->header('last-modified') );
    $request->header( 'If-None-Match', $response->header('etag') );

    close STDERR;
    close STDOUT;
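
    As a side note to the binmode advice, the encoding layer can also be given in the open mode itself, so the handle never exists without it. The following is only a minimal sketch: the helper name write_and_slurp_bytes and the temporary file are made up for illustration, and the sample title is written with an escape (\x{ed} for "í") so the example does not depend on how the script file itself is encoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a character string through a UTF-8 layer
# and return the raw bytes that ended up in the file.
sub write_and_slurp_bytes {
    my ($text) = @_;

    # Scratch file, removed automatically when the program exits.
    my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
    close $tmp_fh;

    # The layer is part of the open mode, so no separate binmode is needed.
    open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # Read the file back without any decoding to inspect the bytes.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp mode
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Sample title; \x{ed} is the character "í" (U+00ED).
my $title = "Las pel\x{ed}culas que fuimos";
my $bytes = write_and_slurp_bytes($title);
printf "%d characters in, %d bytes out\n", length($title), length($bytes);
```

    The byte count is one higher than the character count because "í" takes two bytes in UTF-8.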
Re: Weird behavior with RAI and RSS.
by Anonymous Monk on Jan 16, 2011 at 23:00 UTC
Re: Weird behavior with RAI and RSS.
by Chemtox (Initiate) on Jan 18, 2011 at 03:35 UTC

    Hello,

    By now you have probably realised what your problem was, but to give a more general answer: Perl's handling of encodings is not as transparent as it could be, in order to keep backwards compatibility. As long as you specify the right encoding for every link in the chain (e.g. your input, your output, and your output viewer: your terminal must have the proper locale, and your text editor or browser must know the proper encoding), the chain won't break... often. Today, that usually means "set everything to UTF-8," though many programs haven't kept up with the times.
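
    The idea of the chain can be sketched with the core Encode module: decode bytes at the input boundary, work on characters inside the program, and encode again at the output boundary. This is an illustrative sketch only; the helper names from_utf8 and to_utf8 and the sample word are made up, and the last step shows just one common way a link can break (encoding data that was never decoded), not necessarily what happened in the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input boundary: UTF-8 bytes -> Perl character string.
sub from_utf8 { my ($bytes) = @_; return decode( 'UTF-8', $bytes ); }

# Output boundary: Perl character string -> UTF-8 bytes.
sub to_utf8 { my ($chars) = @_; return encode( 'UTF-8', $chars ); }

# Sample data: "película" as raw UTF-8 bytes, e.g. as read from a socket.
my $raw  = "pel\xC3\xADcula";
my $text = from_utf8($raw);

printf "raw: %d bytes, decoded: %d characters\n",
    length($raw), length($text);

# A broken link: encoding bytes that were never decoded double-encodes
# them, which shows up as extra "weird" characters in the output.
my $double = to_utf8($raw);
printf "correct: %d bytes, double-encoded: %d bytes\n",
    length( to_utf8($text) ), length($double);
```

    Here the raw string has 9 bytes and decodes to 8 characters; encoding it properly gives 9 bytes again, while the double-encoded version grows to 11.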

    For a gentle introduction to encodings, check perlunitut and perlunifaq. Also, "use diagnostics;" gives clear explanations of many common mistakes; it is highly recommended when you're starting out.

      Many thanks for your replies.

      As you have probably noticed, my Perl coding is still erratic. This is because I am taking my first steps. When you have worked for many years with C++ and C#, you assume some things that are not always correct.

      Anyway, I am delighted to see how helpful and kind the Perl community is.