Hello all,

First of all, I am just starting out in the world of Perl and I am completely in love with this language.

I am working on a simple script that reads a few RSS addresses from a text file and then retrieves the items from those addresses. Most of the feeds provide their items correctly, but some throw warnings about wide characters and spit out weird characters between the words.

Investigating the issue, I found that these warnings are usually caused by encoding problems. The RSS feeds are XML encoded in UTF-8, so I decided to use the decode function from the Encode module. This did not solve the problem, because the RAI module performs the decoding internally.

Here is the code:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTTP::Request;
    use LWP::UserAgent;
    use XML::RAI;
    use Data::Dumper;
    use Encode qw( decode );

    my $ua = LWP::UserAgent->new;

    open IN, "in.txt";
    open OUT, ">out.txt";

    while (<IN>) {
        chomp;
        my $request = HTTP::Request->new( GET => $_ );
        print "Requesting...\n";
        my $response = $ua->request( $request );
        print " Status: ", $response->status_line, "\n";
        print " Last modified: ", $response->header( 'last-modified' ), "\n";
        print " Etag: ", $response->header( 'etag' ), "\n\n";

        my $response_content = decode('UTF-8',$response->content);
        my $rai = XML::RAI->parse_string( $response->content );
        my $channel = $rai->channel;
        print "Channel:\n";
        print " Title: " . $channel->title . "\n";
        print " Link: " . $channel->link . "\n";
        print " Modified: " . $channel->modified . "\n";
        print " Publisher: " . $channel->publisher . "\n";

        for ( @{$rai->items} ) {
            #my $descriptiond = decode( 'UTF-8', $_->description );
            print OUT "Item:\n";
            print OUT " Title: " . $_->title . "\n";
            print OUT " Link: " . $_->link . "\n";
            print OUT " Description: " . $_->description . "\n";
            print OUT " Created: " . $_->created . "\n";
            print OUT "------------------------------------------------\n\n";
        }

        $request->header( 'If-Modified-Since', $response->header( 'last-modified' ) );
        $request->header( 'If-None-Match', $response->header( 'etag' ) );
    }

The feed addresses that return items with problems are:

http://estaticos.elmundo.es/elmundo/rss/portada.xml
http://www.publico.es/rss/

What do I mean by weird characters? For example:

Title: Las pelĂ­culas que fuimos... y que seguiremos siendo
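For reference, here is a minimal sketch of the usual pattern for UTF-8 text in Perl: decode bytes into characters on input, and let an encoding layer on the output filehandle turn characters back into bytes. The sample string and the in-memory filehandle are assumptions made for illustration and are not part of the script above. It is also only a guess that a missing output layer is what produces the garbled title shown; XML::RAI is deliberately left out of the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Assumed sample: the UTF-8 bytes of "películas", as they might arrive
# in an HTTP response body (the "í" is the two bytes C3 AD).
my $bytes = "pel\xC3\xADculas";

# decode() turns the byte string into a Perl character string.
my $chars = decode( 'UTF-8', $bytes );

# An in-memory filehandle stands in for out.txt here. The :encoding(UTF-8)
# layer encodes characters back into UTF-8 bytes as they are printed.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory filehandle: $!";
print {$out} $chars;
close $out;

# Compare the lengths at each stage: bytes in, characters, bytes out.
printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length $bytes, length $chars, length $buffer;
```

In a script that writes to a real file, the equivalent would be opening the output with `open my $out, '>:encoding(UTF-8)', 'out.txt'`, or calling `binmode` on an already open handle.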

Can you give me a hint on how to solve the problem? Many thanks in advance.

In reply to Weird behavior with RAI and RSS. by Falstaff
