
When you parse web pages, you have to consult the HTTP headers (and perhaps the http-equiv meta tags) to find out which charset the page is in.

Then you can use Encode::decode to turn the raw bytes into a proper Perl character string.
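A minimal sketch of those two steps, assuming you already have the Content-Type header value and the raw body bytes in hand. The helper name charset_from_content_type, the sample header, the sample bytes and the ISO-8859-1 fallback are all made up for illustration; only Encode::decode comes from the advice above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns undef when the header carries no charset.
sub charset_from_content_type {
    my ($content_type) = @_;
    return $content_type =~ /charset\s*=\s*"?([\w.:-]+)"?/i ? $1 : undef;
}

# Made-up sample data: a header value and the UTF-8 bytes for "Caf" plus e-acute.
my $header = 'text/html; charset=UTF-8';
my $bytes  = "Caf\xC3\xA9";

# Assumption: fall back to ISO-8859-1 when the header names no charset.
my $charset = charset_from_content_type($header) || 'ISO-8859-1';

# decode() turns the byte string into a character string.
my $text = decode($charset, $bytes);

printf "%d bytes became %d characters\n", length($bytes), length($text);
```

If the header has no charset, the same lookup could be repeated on the http-equiv meta tag before falling back. If you fetch pages with LWP, the decoded_content method of HTTP::Response should take care of the header lookup for you.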

(Inspecting a hexdump of the string may help you find out which charset it is in.)
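As a rough illustration of the hexdump idea, the sketch below prints the bytes of two made-up sample strings. The helper name hexdump is hypothetical. The thing to look for is how a non-ASCII character is stored: an e-acute shows up as the two bytes c3 a9 in UTF-8, but as the single byte e9 in ISO-8859-1 or Windows-1252.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

print hexdump("Caf\xC3\xA9"), "\n";   # UTF-8 bytes:   43 61 66 c3 a9
print hexdump("Caf\xE9"), "\n";       # Latin-1 bytes: 43 61 66 e9
```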

Re^4: Convert & to & etc.
by loris (Hermit) on Feb 08, 2008 at 12:33 UTC

    Thanks for the advice.

    Unfortunately, there didn't seem to be a charset declaration or any other information about the encoding in the HTML. Luckily (since I am a bit of an encoding wimp), it turned out that I just had to choose 'UTF8' instead of the default 'Windows ANSI' as the 'file origin' when importing into Excel, and everything was fine. Doh!

    loris


    "It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)