in reply to Text File Encoding under Windows
I'd need to see specific examples to give a complete answer, but there are some things I would suggest. Don't trust text editors or the display when you view and print characters. If something strange is going on, print out the ordinal values with ord($char). That gives you numeric values you can trust, and it exposes any character that isn't visible.
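To make that concrete, here is a minimal sketch of an ordinal dump. The name dump_ordinals and the output layout are my own choices for illustration, and it assumes $text holds raw bytes (no decoding layer on the filehandle).

```perl
use strict;
use warnings;

# Return one line per character: decimal ordinal, hex ordinal, and the
# character itself. Anything outside printable ASCII (32-126) is shown
# as '.' so invisible characters stand out.
sub dump_ordinals {
    my ($text) = @_;
    my @lines;
    for my $char (split //, $text) {
        my $ord     = ord($char);
        my $display = ($ord >= 32 && $ord <= 126) ? $char : '.';
        push @lines, sprintf("%3d  0x%02X  %s", $ord, $ord, $display);
    }
    return @lines;
}

# Example: a tab, an 8-bit byte, and a newline hidden in a short string.
print "$_\n" for dump_ordinals("A\tb\xE9\n");
```

Run it on a line or two of the problem file and look for anything below 32 or above 126.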
A character in the 32-126 range is normal (printable ASCII). If it's less than 32, and it's not \n, then change it to ' '. $text =~ s/[\x00-\x09\x0B-\x1F]/ /g; If it's above 126, then it's DEL (127) or an 8-bit value that will mess up the regexes, and probably Windows. What you do with these values depends on the assignment. This will delete them:
$text =~ s/[\x7F-\xFF]//g;   # delete every character from 127 through 255
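Putting the two steps together, a rough sketch might look like this. clean_ascii is a made-up name, and it assumes $text is a byte string (characters 0-255); anything already decoded to wider characters would pass through untouched.

```perl
use strict;
use warnings;

# Reduce a byte string to printable ASCII plus newlines.
sub clean_ascii {
    my ($text) = @_;
    # Control characters below 32, except \n (0x0A), become a space.
    $text =~ s/[\x00-\x09\x0B-\x1F]/ /g;
    # DEL and every 8-bit value (127-255) are deleted outright.
    $text =~ s/[\x7F-\xFF]//g;
    return $text;
}

print clean_ascii("a\tb\xE9c\nd\x7F"), "\n";
```

Note that \r counts as a control character here, so Windows line endings come out as " \n". Strip \r first if that matters for your data.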
Some of the 8-bit values represent standard punctuation, and you can change them into 7-bit equivalents. If there are two, three, or four consecutive 8-bit characters, then you are probably looking at multi-byte UTF-8 sequences (16, 24, or 32 bits per character). There's a definition on Wikipedia. The Encode module, which ships with Perl and is also on CPAN, can decode them.
There's also a huge translation table online.
http://www.utf8-chartable.de/unicode-utf8-table.pl
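If the file does turn out to be UTF-8, one possible approach is to decode it with Encode and then map the common punctuation back to ASCII. This is only a sketch: utf8_to_ascii and the %ascii_for table are my own illustrations, the table is far from complete, and anything not listed is simply deleted.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical mapping from a few Unicode punctuation marks to ASCII.
# Extend it from the chart linked above as needed.
my %ascii_for = (
    "\x{2018}" => "'",     # left single quote
    "\x{2019}" => "'",     # right single quote
    "\x{201C}" => '"',     # left double quote
    "\x{201D}" => '"',     # right double quote
    "\x{2013}" => '-',     # en dash
    "\x{2014}" => '-',     # em dash
    "\x{2026}" => '...',   # ellipsis
);

sub utf8_to_ascii {
    my ($bytes) = @_;
    # Turn the multi-byte sequences into single characters.
    my $text = decode('UTF-8', $bytes);
    # Replace each non-ASCII character with its mapping, or drop it.
    $text =~ s/([^\x00-\x7F])/exists $ascii_for{$1} ? $ascii_for{$1} : ''/ge;
    return $text;
}

# The bytes E2 80 9C and E2 80 9D are the UTF-8 encodings of curly quotes.
print utf8_to_ascii("\xE2\x80\x9Chi\xE2\x80\x9D"), "\n";
```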
Hope this helps.
Sean