Re: Perl / FileFind or ...
by graff (Chancellor) on Nov 28, 2012 at 04:53 UTC
|
What makes you think that handling non-ASCII characters in path/file names should be simple? I suppose that if you have intimate knowledge about the OS you're using, and about the file system installed on the specific disk volume you're using, and about the capabilities of the particular terminal/browser/other application that is trying to display file name strings on your monitor, and about the environment/configuration settings that control the behavior of that application, and about the process(es) that created the file names on that specific disk volume in the first place, then you might know enough for the handling of non-ASCII file names to seem "simple."
But if you lack intimate knowledge on any of those topics, your first resort should be to get a hex-dump view of the byte sequences being used in any given file name string. That way, all you need is a general knowledge of the possible non-ASCII character encodings, and perhaps some presupposition about the (human) language being used by the person who assigned the file name (or at least, some sense of the alphabet being used - Cyrillic? Greek? Latin? Arabic? ... - including the range of diacritic marks, odd-ball punctuation and/or special symbols that are likely to show up). Not that this in itself is "simple", but at least there are fewer moving parts.
Obviously, getting a hex-dump style output just gets in the way when file paths contain nothing outside the printable ASCII range, so a useful elaboration of your File::Find callback might go something like this:
sub cbFileFind
{
    # Escape every byte outside printable ASCII (space through tilde) as \x{NN}.
    my $printable_name = $File::Find::name;
    $printable_name =~ s/([^ -~])/sprintf("\\x{%02x}",ord($1))/eg;
    print $printable_name, "\n";
}
If you happen to already know (or if the approach just shown makes it clear) what the particular character encoding is for the non-ASCII portions of your file names, you can use Encode to decode the strings as read from the file system into perl-internal (utf8) character strings, and then the "ord()" function will return Unicode code-point numbers, which you can look up in case the particular characters are unfamiliar to you (check out Re: Regular expressions and accents and tlu -- TransLiterate Unicode).
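As a rough sketch of that last step: the byte string and the choice of cp437 below are assumptions for illustration (they match the console example posted elsewhere in this thread), not something File::Find will tell you. Once the encoding is known, decoding and then calling ord() might look like this:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: raw bytes of a file name as read from the file system,
# with the encoding already identified as cp437.
my $raw_name = "da-M\x94tleyCr\x81e";

# decode() turns the byte string into a Perl character string.
my $name = decode('cp437', $raw_name);

# After decoding, ord() returns Unicode code points instead of byte values.
my @code_points = map  { sprintf('U+%04X', ord($_)) }
                  grep { ord($_) > 0x7E }
                  split //, $name;

print "@code_points\n";    # prints "U+00F6 U+00FC"
```

U+00F6 and U+00FC are the o and u with diaeresis, which is what you would hope to find in that name.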
Re: Perl / FileFind or ...
by runrig (Abbot) on Nov 27, 2012 at 21:46 UTC
|
You were expecting maybe:
C:/Tmp/Justin Bieber
C:/Tmp/The Archies
C:/Tmp/Debby Boone
???
Re: Perl / FileFind or ...
by TomDLux (Vicar) on Nov 27, 2012 at 21:20 UTC
|
I think you're complaining about getting the divide symbol, or 'ⁿ', instead of accented characters.
Try utf8 instead of USASCII.
As Occam said: Entia non sunt multiplicanda praeter necessitatem.
| [reply] |
Re: Perl / FileFind or ...
by Anonymous Monk on Nov 27, 2012 at 20:37 UTC
|
| [reply] |
Re: Perl / FileFind or ...
by blue_cowdawg (Monsignor) on Nov 27, 2012 at 20:46 UTC
|
As the mystery monk implies: what's the question?
Looks like it is working as designed...
Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
| [reply] |
Re: Perl / FileFind or ...
by Festus Hagen (Acolyte) on Nov 27, 2012 at 21:56 UTC
|
Yea Tom, that be exactly the issue.
Right or wrong, I have tried utf8, unicode and many other things found while searching, to no avail.
Guess I just don't get it: why is such a simple thing so difficult?
-Enjoy fh : )_~
| [reply] |
|
$ chcp
Active code page: 437
$ echo > "da-MötleyCrüe"
$ dir /b "da-*"
da-MötleyCrüe
$ dir /b "da-*" | perl -MData::Dump -e " dd[<>] "
["da-M\x94tleyCr\x81e\n"]
$ perl -MData::Dump -e " dd[ glob q/da-*/ ] "
["da-M\xF6tleyCr\xFCe"]
Single byte encoding can be hard to guess
$ perl -MEncode::Detective=detect -le " die detect( glob q/da-*/ ) "
windows-1252 at -e line 1.
$ perl -MEncode::Guess -e " die guess_encoding( glob q/da-*/ ) "
No appropriate encodings found! at -e line 1.
$ dir /b "da-*" | perl -MEncode::Detective=detect -e " $f = <>; die detect($f ) "
Died at -e line 1, <> line 1.
$ dir /b "da-*" | perl -MEncode::Guess -e " $f = <>; die guess_encoding($f ) "
No appropriate encodings found! at -e line 1, <> line 1.
$ dir /b "da-*" | perl -MEncode::Guess -e " $f = <>; die guess_encoding($f , q/cp437/) "
Encode::XS=SCALAR(0x9a622c)
$ dir /b "da-*" | perl -MEncode::Guess -e " $f = <>; die guess_encoding($f , q/cp437/)->name "
cp437 at -e line 1, <> line 1.
But once you know, just binmode
$ perl -le " print for glob q/da-*/ "
da-M÷tleyCrⁿe
$ perl -le " binmode STDOUT , q/:encoding(cp437)/; print for glob q/da-*/ "
da-MötleyCrüe
$ perl -Mopen=:std,encoding(cp437) -le " print for glob q/da-*/ "
da-MötleyCrüe
$ perl -MEncode::Locale -le " binmode STDOUT, q{encoding(console_out)}; print for glob q/da-*/ "
da-MötleyCrüe
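One way to read the session above: glob hands back the name in the ANSI code page (the \xF6 and \xFC bytes), while the console is on code page 437, where those byte values mean different characters. The sketch below assumes cp1252 for the file name and cp437 for the console, as in this session; other machines will differ.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: bytes as returned by glob in the session above (cp1252).
my $ansi_bytes = "da-M\xF6tleyCr\xFCe";

# What a plain print shows: a cp437 console reads 0xF6 and 0xFC as
# U+00F7 (division sign) and U+207F (superscript n).
my $misread = decode('cp437', $ansi_bytes);
my @wrong   = map  { sprintf('U+%04X', ord($_)) }
              grep { ord($_) > 0x7E }
              split //, $misread;
print "console sees: @wrong\n";

# The fix: decode from the real source encoding, then encode for the console.
my $chars     = decode('cp1252', $ansi_bytes);
my $oem_bytes = encode('cp437', $chars);

# Show the resulting bytes in escaped form so the result is terminal-independent.
(my $escaped = $oem_bytes) =~ s/([^ -~])/sprintf('\\x%02X', ord($1))/eg;
print "console needs: $escaped\n";
```

The escaped result is the same byte sequence that dir piped into Data::Dump earlier in the session, which is why the binmode variants print the name correctly.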
| [reply] [d/l] [select] |
Re: Perl / FileFind or ...
by Festus Hagen (Acolyte) on Nov 27, 2012 at 21:10 UTC
|
Y'all are kidding right ??
-Enjoy fh : )_~
| [reply] |
|
Y'all are kidding right ??
No, are you kidding?
Between the perlmonks latin-1 limitation, the variability of win32 filesystems (fat/ntfs/...), and whatever you're dealing with, I don't know what you're complaining about.
Either the problem is what you see in the console, in which case use binmode, Text::Unidecode ... whatever you want.
Or the problem is the ANSI filenames you get on win32 (When Unicode Does Not Happen), in which case you need GetLongPathName or Win32::Unicode::Native.
See "I know what I mean. Why don't you?" and "How do I post a question effectively?"
When you're asked for clarification, it probably isn't a joke.
| [reply] |
Re: Perl / FileFind or ...
by Festus Hagen (Acolyte) on Nov 28, 2012 at 15:30 UTC
|
First, Thanks to Anonymous Monk for an excellent and informative post.
Simple ... Yea, it should be!
Why?
Because it's a high-level language (or supposed to be), and it should be smart enough to handle basic OS configuration.
All Perl has to do is ask the OS and set itself accordingly!
As is pointed out in this thread.
Now if it was a string created from user data, that would be a different story.
It's not, it's OS data ... The OS knows what it is, Perl should as well!
-Enjoy fh : )_~
| [reply] |
|
There is no standard way to communicate the encoding of a file system.
| [reply] |
|
Know what?
Your comments are very confused, esp. because you're [reply]ing to yourself, again.
If I assume you're talking about the console code page: perl doesn't assume you're writing a terminal program.
| [reply] |