JukeBox has asked for the wisdom of the Perl Monks concerning the following question:

Hey Perl Monks, just looking for some advice. I have a small script that retrieves text from various websites and then consolidates it into a database. The text on one of the pages has recently changed: they now seem to be using different HTML characters instead of the standard ones. For example:
“ instead of " and ’ instead of '
When I parse this data and attempt to put it into the database, it comes out as:
It?~@~Ys going to be a scorcher with tomorrow?~@~Ys ?~@~\rainy day?~@~] parade. They?~@~Yre just....
Looks like it is some sort of encoding problem. How do I strip the string of these characters? Or how do I re-encode them? *stumped* Thanks!

Replies are listed 'Best First'.
Re: Strange Characters - Different Encoding?
by davidrw (Prior) on Nov 05, 2005 at 02:41 UTC
    Can you post your small script? Can you post the relevant parts of the raw HTML you scrape? Specifically, whether it specifies what encoding it is using, and the exact HTML of the text in question, to see how it represents those characters. Also, what database is this? I think the text you're showing us comes from taking the XXX-type characters from the HTML and shoving them into your YYY database, so to figure out the problem and see where to put the solution, we need to back up and separate out the steps...
      Thanks for your response, david. Here is some of the raw HTML.
      <br> Mraz is known for “Remedy (I Won’t Worry),” which was the first single off his first album, “Waiting for My Rocket to Come.” His second album, “Mr. A-Z,” was released in July. <br>
      I'm guessing that the special characters in the HTML are made from ASCII characters. The script is very simple; it's just an instance of LWP.
      use LWP::Simple;   # provides get()
      my $content = get $url;
      die "Couldn't get $url" unless defined $content;
      An example url from above

      The $url varies between websites. For now, the output of the script goes into a plain text file, but it will later be imported into a MySQL database (I have not got that far yet, as I'm still trying to figure this part out).

      I then simply use some regular expressions to extract the right text.

        The page in the URL above declares an encoding of ISO-8859-1, but contains some characters from the CP-1252 (Windows-1252) character set: specifically the curly quote marks, hex values 0x91-0x94, and the en and em dashes, 0x96 and 0x97. Swap those out for ASCII characters and your problem should disappear:

        tr/\x93-\x94/\x22/; tr/\x91-\x92/\x27/; tr/\x96-\x97/\x2d/;
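A minimal sketch of the substitution above applied to a sample string. The helper name cp1252_punct_to_ascii and the sample text are made up for illustration, and it assumes the fetched string still holds the single-byte values 0x91-0x97 rather than already-decoded Unicode characters.

```perl
use strict;
use warnings;

# Hypothetical helper: map CP-1252 curly quotes and dashes to plain ASCII.
sub cp1252_punct_to_ascii {
    my ($text) = @_;
    $text =~ tr/\x93-\x94/\x22/;   # curly double quotes -> "
    $text =~ tr/\x91-\x92/\x27/;   # curly single quotes -> '
    $text =~ tr/\x96-\x97/\x2d/;   # en and em dash      -> -
    return $text;
}

# Sample input modelled on the raw HTML quoted earlier in the thread.
my $raw = "known for \x93Remedy (I Won\x92t Worry),\x94 his first single";
print cp1252_punct_to_ascii($raw), "\n";
```

Working on a copy inside a sub keeps the original content untouched, which can help when comparing before and after while debugging.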

        Note: you should AFAIK be able to do this with the Encode or Text::Iconv modules instead of messing with the character values directly, but somehow this didn't work for me when trying it on the text (possibly because of the mixed encoding).
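For reference, the Encode route might look roughly like the sketch below. It assumes the page is treated as CP-1252 throughout (ignoring the ISO-8859-1 declaration) and that the input is the raw, undecoded byte string; the helper name cp1252_to_utf8 is made up for illustration. Whether this behaves on the actual page is untested, given the report above that it did not work there.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret raw bytes as CP-1252, then emit UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);   # bytes -> Perl character string
    return encode('UTF-8', $chars);         # characters -> UTF-8 bytes
}

my $utf8 = cp1252_to_utf8("Won\x92t");

# Show the resulting bytes in hex; 0x92 becomes the three-byte
# UTF-8 sequence for U+2019 (right single quotation mark).
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";
# prints: 57 6F 6E E2 80 99 74
```

Unlike swapping the characters for ASCII, this keeps the curly quotes and dashes intact, so whatever stores the text afterwards needs to be set up for UTF-8.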

        Update: added minus sign.


        Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Re: Strange Characters - Different Encoding?
by ioannis (Abbot) on Nov 05, 2005 at 09:47 UTC
    The page from the gwinnettdailypost site was in 8859-1 (Latin1) encoding -- good news, because it should be a simpler problem to investigate. First, assuming you are on Unix, ensure you are running perl from a Latin-1 environment.
    • % export LC_ALL=en_US.iso88591
    Then run your perl program.

    If the above was not sufficient to fix the problem, check whether the db server is set up for Latin-1, or create a new database that can handle Latin-1.

    Here are the commands if you are using PostgreSQL:

    • SHOW server_encoding ;
    • CREATE DATABASE myname WITH encoding = 'LATIN1';
Re: Strange Characters - Different Encoding?
by JukeBox (Initiate) on Nov 05, 2005 at 08:56 UTC
    For future reference, take a look at escapeHTML in CGI.pm; it escapes the relevant characters as HTML entities.