in reply to Convert &amp; to & etc.

See HTML::Entities
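
A minimal sketch of what that looks like, assuming the input is an ordinary string containing named entities (the sample text and variable names are made up for illustration):

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical input; when its result is used, decode_entities
# returns a decoded copy and leaves $html untouched.
my $html = 'Fish &amp; Chips &lt;b&gt; M&uuml;ller';
my $text = decode_entities($html);

# $text now holds characters, so encode on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Called in void context, decode_entities($html) changes $html in place instead.
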
hth,

Poolpi

Re^2: Convert &amp; to & etc.
by loris (Hermit) on Feb 07, 2008 at 14:05 UTC

    Thanks, that works fine for the ampersands, but not for my umlauts. I assume this is because, say, ü is encoded not as &uuml;, but as Ã¼, whatever that is. Do you know what sort of encoding this is and how I can deal with it?

    Thanks,

    loris


    "It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)
      When you parse websites you have to consult the HTTP headers (and perhaps the http-equiv meta tags) to find out which charset it is in.
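
      As a rough sketch, assuming the page is fetched with LWP (libwww-perl): the response object can report the charset and decode the body for you. The response below is built by hand with made-up content so the example runs without a network connection.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response; in real code this would come from
# LWP::UserAgent->get($url).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>f\xC3\xBCr</p>",            # body as raw UTF-8 octets
);

my $charset = $res->content_charset;   # taken from the header here
my $text    = $res->decoded_content;   # octets decoded to characters

print "charset: $charset\n";
printf "%d characters\n", length $text;
```
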

      Then you can use Encode::decode to transform it into something useful.

      (Inspecting a hexdump of the string may help you find out which charset it is in.)
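
      A small sketch of both steps, assuming the data turns out to be UTF-8 (the sample octets are made up). The last step decodes the same octets with the wrong charset on purpose, to show where garbage like "Ã¼" comes from.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "für" as raw UTF-8 octets.
my $octets = "f\xC3\xBCr";

# Poor man's hexdump: one two-digit hex value per byte.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $octets;
print "$hex\n";   # 66 c3 bc 72

# Right charset: the pair c3 bc becomes the single character U+00FC.
my $text = decode('UTF-8', $octets);

# Wrong charset: each byte becomes its own character (U+00C3, U+00BC).
my $wrong = decode('cp1252', $octets);

printf "%d octets, %d characters, %d if misread\n",
    length $octets, length $text, length $wrong;
```
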

        Thanks for the advice.

        Unfortunately, there didn't seem to be a charset declaration or any other information about the encoding in the HTML. However, luckily (since I am a bit of an encoding wimp), it turned out that I just had to choose 'UTF8' instead of the default 'Windows ANSI' as 'file origin' when importing into Excel, and everything was fine. Doh!

        loris



      If you are using Spreadsheet::WriteExcel, you can use its functionality directly:

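      use Spreadsheet::WriteExcel;
      use HTML::Entities;
      use Encode qw( from_to );

      decode_entities ($value);             # void context: decodes $value in place
      from_to ($value, "utf-8", "ucs2");    # converts the octets in $value in place
      $wks->write_unicode ($row, $column, $value);   # WriteExcel takes the row first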

      Enjoy, Have FUN! H.Merijn