in reply to Strange Characters - Different Encoding?

Can you post your small script, and the relevant parts of the raw HTML you scrape? Specifically, does the page declare which encoding it uses, and what is the exact HTML of the text in question, so we can see how those characters appear in it? Also, what database is this? I think the text you're showing us comes from taking the XXX-type characters from the HTML and shoving them into your YYY database. To figure out the problem and see where to put the solution, we need to back up and separate out the steps...
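One way to separate out the steps is to look at the raw bytes before anything else touches them. A minimal sketch, assuming the scraped page is held as an undecoded byte string; the helper name `non_ascii_bytes` and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: report every byte above 0x7F with its offset,
# so you can see which code points the page really contains.
sub non_ascii_bytes {
    my ($raw) = @_;
    my @found;
    while ($raw =~ /([\x80-\xFF])/g) {
        # pos() points just past the match, so subtract 1 for the byte's offset
        push @found, sprintf('offset %d: 0x%02X', pos($raw) - 1, ord $1);
    }
    return @found;
}

# Illustrative input only: two high bytes written as hex escapes.
my $sample = "His album, \x93Mr. A-Z,\x94 was released";
print "$_\n" for non_ascii_bytes($sample);
```

Running this on the content exactly as fetched, and again on what ends up in the output file, should show at which step the bytes change.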

Re^2: Strange Characters - Different Encoding?
by JukeBox (Initiate) on Nov 05, 2005 at 07:53 UTC
    Thanks for your response, david. Here is some of the raw HTML.
    <br> Mraz is known for “Remedy (I Won’t Worry),” which was the first single off his first album, “Waiting for My Rocket to Come.” His second album, “Mr. A-Z,” was released in July. <br>
    I'm guessing that the special characters in the HTML are made from ASCII characters. The script is very simple; it just uses LWP.
    use LWP::Simple;   # provides get(); assumed from the bare get call
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;
    An example url from above

    The $url varies between websites. For now, the script's output goes into a plain text file, but it will later be imported into a MySQL database (I have not got that far yet; I'm still trying to figure this part out).

    I then use a few regular expressions to extract the text I want.

      The page at the URL above declares an encoding of ISO-8859-1, but contains some characters from the cp-1252 character set (specifically the curly quotes, hex values 91-94, and the dashes that look like a minus sign, x96 and x97, which are the en dash and em dash). Swap those out for ASCII characters and your problem should disappear:

      tr/\x93-\x94/\x22/;   # curly double quotes -> "
      tr/\x91-\x92/\x27/;   # curly single quotes -> '
      tr/\x96-\x97/\x2d/;   # en and em dashes -> -

      Note: AFAIK you should be able to do this with the Encode or Text::Iconv modules instead of messing with the character values directly, but it didn't work for me when I tried it on this text (possibly because of the mixed encoding).

      Update: added minus sign.
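For comparison, here is a rough sketch of what the Encode route might look like. It assumes the fetched content is an undecoded byte string and that treating all of it as cp-1252 is acceptable; it has not been tried against the page in question. The helper name `cp1252_to_ascii_punct` is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the raw bytes as cp-1252, then replace the
# typographic punctuation with plain ASCII equivalents.
sub cp1252_to_ascii_punct {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);   # bytes -> Perl characters
    $text =~ tr/\x{201C}\x{201D}/"/;       # curly double quotes (0x93, 0x94)
    $text =~ tr/\x{2018}\x{2019}/'/;       # curly single quotes (0x91, 0x92)
    $text =~ tr/\x{2013}\x{2014}/-/;       # en dash, em dash (0x96, 0x97)
    return $text;
}

# Illustrative input only, with the cp-1252 bytes written as hex escapes.
my $raw = "\x93Mr. A-Z,\x94 I Won\x92t \x96 ok";
print cp1252_to_ascii_punct($raw), "\n";
```

The difference from the tr lines above is that the mapping is done on decoded characters rather than raw byte values, so other characters above 0x7F in the text are decoded as well and may still need an explicit output encoding when written to a file.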


      Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
        Thanks for your help tirwhan, much appreciated.