I am afraid I do not have the luxury of discarding all non-UTF-8 input, but I can simplify the code:
if the input is not detected as UTF-8, just treat it as ISO-8859-1.
    use Text::Unaccent;
    use Encode::Detect::Detector;

    # Test inputs (percent-encoded):
    # my $author = "Sch%F6%E5ttl";
    # my $author = "Sch%C3%A9ttl";
    # my $author = "Sch%C3%B6ttl";
    # my $author = "Sch%F6%F6ttl";
    # my $author = "Sch%F6 %F4ttl";
    my $author = "teoria elasticit%E0";

    # Percent-decode into raw bytes (match hex digits only).
    $author =~ s/%([0-9A-Fa-f]{2})/pack('C', hex($1))/eg;

    # detect() may return undef when it cannot identify the encoding,
    # so anything that is not reported as UTF-8 falls back to ISO-8859-1.
    my $encoding = Encode::Detect::Detector::detect($author);
    if (!defined $encoding || $encoding !~ m#utf-8#i) {
        $encoding = "iso-8859-1";
    }

    if ($encoding) {
        $author = unac_string($encoding, $author);
        print "after unac: $author<br>\n";
    }
It seems to be working better. Are there any potential problems?
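One possible concern is that a statistical detector can struggle with very short strings such as author names. As a point of comparison (not the approach used above), here is a minimal sketch that skips detection and instead attempts a strict UTF-8 decode with the core Encode module, falling back to ISO-8859-1 when the bytes are not valid UTF-8. The helper name decode_author is hypothetical, and the sketch assumes the only two encodings that ever arrive are UTF-8 and ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper (not from the original post): percent-decode the
# input, then turn the raw bytes into a Perl character string.
# Returns (decoded text, name of the encoding that was assumed).
sub decode_author {
    my ($raw) = @_;

    # Percent-decode into raw bytes.
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

    # Strict UTF-8 decode. FB_CROAK makes decode() die on malformed input,
    # and it may modify its argument, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return ($text, 'utf-8') if defined $text;

    # Not valid UTF-8: assume ISO-8859-1, where every byte maps to a character.
    return (decode('iso-8859-1', $bytes), 'iso-8859-1');
}

for my $sample ("Sch%C3%B6ttl", "Sch%F6%E5ttl", "teoria elasticit%E0") {
    my ($text, $encoding) = decode_author($sample);
    printf "%s was read as %s (%d characters)\n",
        $sample, $encoding, length $text;
}
```

The trade-off is that ISO-8859-1 text whose bytes happen to form valid UTF-8 sequences would be read as UTF-8, although such sequences are unusual in real names. Plain ASCII input decodes identically either way.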
In reply to Re: Perl detect utf8, iso-8859-1 encoding
by swiftlet
in thread Perl detect utf8, iso-8859-1 encoding
by swiftlet