in reply to Unicode to HTML code &#....;

First, let's clear up some confusion. Unicode doesn't specify how characters are stored, so you can't possibly be talking about Unicode when you're talking about a string of bytes. It looks like you meant UTF-8 when you said Unicode. UTF-8 is a means of representing (encoding) Unicode characters in bytes.
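To make the distinction concrete, here is a minimal sketch using the core Encode module. The sample bytes are my own choice for illustration, not something taken from the original question.

```perl
use strict;
use warnings;
use Encode qw( decode );

# "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Decoding turns the UTF-8 bytes into a string of Unicode characters.
my $chars = decode('UTF-8', $bytes);

# prints "bytes: 2, characters: 1, code point: U+00E9"
printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($chars), ord($chars);
```

The same text is two bytes long before decoding and one character long after it, which is why it matters which of the two a string holds.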

$string =~ s/([^a-zA-Z0-9])/'&#'.unpack('U0U*',$1).';'/eg;

can also be written as

use HTML::Entities qw( encode_entities );
$string = encode_entities($string);

and

use Encode qw( encode );
$string = encode('US-ASCII', $string, Encode::FB_HTMLCREF);

No need to reinvent the wheel.

If you use the latter, you can combine the decoding and encoding into one step.

use Encode qw( from_to );

sub unicode_decode {
    my $string = shift;
    from_to($string, 'UTF-8', 'US-ASCII', Encode::FB_HTMLCREF);
    return($string);
}
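As a rough illustration of the one-step conversion, the sketch below calls from_to on a made-up sample string (the sample is an assumption for demonstration only). from_to converts its first argument in place, so the result is read from the variable and not from the return value.

```perl
use strict;
use warnings;
use Encode qw( from_to );

# UTF-8 bytes for "cafe" with an accented e, followed by ASCII punctuation.
my $string = "caf\xC3\xA9, 1/2";

# Decode from UTF-8 and encode to US-ASCII in one call. Characters that
# US-ASCII cannot represent become decimal character references.
from_to($string, 'UTF-8', 'US-ASCII', Encode::FB_HTMLCREF);

print "$string\n";    # prints "caf&#233;, 1/2"
```

Note that only the non-ASCII character is turned into a reference. ASCII characters such as the comma and the slash pass through unchanged, because US-ASCII can represent them.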

Re^2: Unicode to HTML code &#....;
by Forlix (Novice) on Nov 15, 2008 at 19:57 UTC
    Thanks to both of you.
    The thing is, $string must not contain certain characters like comma and slash, since I use those as separators in my text files. That's what the regex also ensures, so I think it's still the best choice here given the circumstances.
    So I now go with
    use Encode qw(decode);

    sub unicode_decode {
        my $string = decode('utf8', shift, 0);
        $string =~ tr/\x{FFFD}/\x20/;
        $string =~ s/([^a-zA-Z0-9\_\+\-\.])/'&#'.unpack('U0U*',$1).';'/eg;
        return($string);
    }
    As you can see, this also swaps the replacement character with a space should there be one.
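For reference, a hedged sketch of how such a routine behaves on a sample input. It is a variant, not the code above: it uses ord($1) where the original uses unpack('U0U*',$1), since ord returns the code point of a single decoded character. The input string is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Variant of the routine above; ord() stands in for unpack('U0U*', ...).
sub unicode_decode {
    my $string = decode('utf8', shift, 0);    # UTF-8 bytes to characters
    $string =~ tr/\x{FFFD}/\x20/;             # replacement character to space
    # Escape everything outside the allowed set as a decimal reference.
    $string =~ s/([^a-zA-Z0-9_+\-.])/'&#' . ord($1) . ';'/eg;
    return $string;
}

my $escaped = unicode_decode("caf\xC3\xA9,a/b");
print "$escaped\n";    # prints "caf&#233;&#44;a&#47;b"
```

The comma and slash are escaped along with the accented character, so the result is safe to store between those separators. A space is not in the allowed set either, so a space that replaces U+FFFD is itself written out as &#32;.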