Re: Unicode to HTML code &#....;

First, let's clear up some confusion. Unicode doesn't specify how characters are stored, so you can't possibly be talking about Unicode when you're talking about a string of bytes. It looks like you meant UTF-8 when you said Unicode. UTF-8 is a means of representing (encoding) Unicode characters in bytes.
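To make the bytes-versus-characters distinction concrete, here is a minimal sketch. The sample string and the variable names are my own choices for illustration; only `decode` from the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Two bytes: the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Decoding turns those two bytes into a single Unicode character.
my $chars = decode('UTF-8', $bytes);

printf "%d byte(s), %d character(s), code point %d\n",
    length($bytes), length($chars), ord($chars);
# prints "2 byte(s), 1 character(s), code point 233"
```

The code point (233 here) is the number that ends up inside the `&#...;` reference.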

$string =~ s/([^a-zA-Z0-9])/'&#'.unpack('U0U*',$1).';'/eg;

can also be written as

use HTML::Entities qw( encode_entities );
$string = encode_entities($string);

and

use Encode qw( encode );
$string = encode('US-ASCII', $string, Encode::FB_HTMLCREF);

No need to reinvent the wheel.

If you use the latter, you can combine the decoding and encoding into one step.

use Encode qw( from_to );

sub unicode_decode
{
  my $string = shift;
  from_to($string, 'UTF-8', 'US-ASCII', Encode::FB_HTMLCREF);
  return($string);
}
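A quick usage sketch of the one-step conversion, in case it helps. The sub name `utf8_to_ascii_ncr` and the sample input are hypothetical, picked only for this example; the `from_to` call is the same one as above.

```perl
use strict;
use warnings;
use Encode qw( from_to );

# Hypothetical name; same body as the one-step conversion above.
sub utf8_to_ascii_ncr {
    my $string = shift;
    # Converts the copy in place: UTF-8 bytes in, US-ASCII bytes out.
    # Characters that US-ASCII cannot represent become decimal &#NNN; references.
    from_to($string, 'UTF-8', 'US-ASCII', Encode::FB_HTMLCREF);
    return $string;
}

# Assumed sample input: "cafe" with an accented e, then a euro sign, as UTF-8 bytes.
my $input  = "caf\xC3\xA9, 10\xE2\x82\xAC";
my $output = utf8_to_ascii_ncr($input);

print "$output\n";   # prints "caf&#233;, 10&#8364;"
```

Note that only characters outside US-ASCII are replaced. ASCII punctuation such as the comma passes through unchanged.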


Replies are listed 'Best First'.
Re^2: Unicode to HTML code &#....;
by Forlix (Novice) on Nov 15, 2008 at 19:57 UTC

Thanks to both of you. The thing is, $string must not contain certain characters like comma and slash, since I use those as separators in my text files. That's what the regex also ensures, so I think it's still the best choice here given the circumstances. So I now go with

use Encode qw(decode);

sub unicode_decode
{
  # CHECK of 0: malformed input becomes U+FFFD instead of raising an error
  my $string = decode('utf8', shift, 0);
  # Swap any replacement character for a plain space
  $string =~ tr/\x{FFFD}/\x20/;
  # Escape everything except letters, digits, and _ + - .
  $string =~ s/([^a-zA-Z0-9\_\+\-\.])/'&#'.unpack('U0U*',$1).';'/eg;
  return($string);
}

As you can see, this also swaps the replacement character with a space should there be one.
