in reply to Re: Strange Characters - Different Encoding?
in thread Strange Characters - Different Encoding?

Thanks for your response, david. Here is some of the raw HTML.
<br> Mraz is known for “Remedy (I Won’t Worry),” which was the first single off his first album, “Waiting for My Rocket to Come.” His second album, “Mr. A-Z,” was released in July. <br>
I'm guessing that the special characters in the HTML are made from ASCII characters. The script is very simple; it's just a call to LWP.
use LWP::Simple;  # exports get()
my $content = get $url;
die "Couldn't get $url" unless defined $content;
An example url from above

The $url varies between websites. For now, the output of the script goes into a plain text file, but it will later be imported into a MySQL database (I have not got that far yet; I'm still trying to figure this part out).

I then simply use some regular expressions to extract the correct text.

Re^3: Strange Characters - Different Encoding?
by tirwhan (Abbot) on Nov 05, 2005 at 10:22 UTC

    The page at the URL above declares an encoding of ISO-8859-1, but contains some characters from the cp1252 (Windows-1252) character set: specifically the curly quote signs, hex values 91-94, and the dashes, x96 and x97. In ISO-8859-1 those byte values are control characters rather than printable ones. Swap them out for ASCII characters and your problem should disappear:

    tr/\x93-\x94/\x22/;  # curly double quotes -> "
    tr/\x91-\x92/\x27/;  # curly single quotes -> '
    tr/\x96-\x97/\x2d/;  # dashes -> -
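
    The tr statements above act on $_. A minimal sketch of applying them to a string such as the fetched $content follows; the helper name asciify_cp1252_punct and the sample string are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the input with the cp1252
# quote and dash bytes replaced by their closest ASCII equivalents.
sub asciify_cp1252_punct {
    my ($text) = @_;
    $text =~ tr/\x93-\x94/\x22/;   # curly double quotes -> "
    $text =~ tr/\x91-\x92/\x27/;   # curly single quotes -> '
    $text =~ tr/\x96-\x97/\x2d/;   # en dash and em dash -> -
    return $text;
}

# Made-up sample containing the problem bytes
my $sample = "\x93Mr. A-Z,\x94 I Won\x92t Worry";
print asciify_cp1252_punct($sample), "\n";
```

    Working on a copy leaves the original string untouched, which helps if the raw bytes are still needed later.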

    Note: you should AFAIK be able to do this with the Encode or Text::Iconv modules instead of messing with the character values directly, but somehow this didn't work for me when trying it on the text (possibly because of the mixed encoding).
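
    For reference, a sketch of what the Encode route might look like. It assumes the fetched content is still raw bytes and that every byte in the 0x80-0x9F range is meant as cp1252; it has not been tried against the page in question. The helper name cp1252_bytes_to_text and the sample string are hypothetical. Since cp1252 and ISO-8859-1 agree on 0xA0-0xFF, accented Latin-1 letters should decode the same either way.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: interpret raw bytes as cp1252 instead of the
# declared ISO-8859-1 and return a Perl character string.
sub cp1252_bytes_to_text {
    my ($bytes) = @_;
    return decode('cp1252', $bytes);
}

# Made-up sample containing the problem bytes
my $bytes = "\x93Mr. A-Z,\x94 was released in July.";
my $text  = cp1252_bytes_to_text($bytes);

# Encode on output so the curly quotes are written as valid UTF-8
binmode(STDOUT, ':encoding(UTF-8)');
print $text, "\n";
```

    Unlike the tr approach, this keeps the curly quotes and dashes as real characters instead of flattening them to ASCII, so the output file (and later the database) would need to be treated as UTF-8.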

    Update: added minus sign.


    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
      Thanks for your help tirwhan, much appreciated.