What makes you think that handling non-ASCII characters in path/file names should be simple? I suppose that if you have intimate knowledge about the OS you're using, and about the file system installed on the specific disk volume you're using, and about the capabilities of the particular terminal/browser/other application that is trying to display file name strings on your monitor, and about the environment/configuration settings that control the behavior of that application, and about the process(es) that created the file names on that specific disk volume in the first place, then you might know enough for the handling of non-ASCII file names to seem "simple."

But if you lack intimate knowledge on any of those topics, your first resort should be to get a hex-dump view of the byte sequences being used in any given file name string. That way, all you need is a general knowledge of the possible non-ASCII character encodings, and perhaps some presupposition about the (human) language being used by the person who assigned the file name (or at least, some sense of the alphabet being used - Cyrillic? Greek? Latin? Arabic? ... - including the range of diacritic marks, odd-ball punctuation and/or special symbols that are likely to show up). Not that this in itself is "simple", but at least there are fewer moving parts.

Obviously, getting a hex-dump style output just gets in the way when file paths contain nothing outside the printable ASCII range, so a useful elaboration of your File::Find callback might go something like this:

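sub cbFileFind
{
    # Full path of the file currently being visited by File::Find
    my $printable_name = $File::Find::name;
    # Replace every character outside printable ASCII (space through tilde)
    # with a \x{..} hex escape, so the raw byte values are visible
    $printable_name =~ s/([^ -~])/sprintf("\\x{%02x}",ord($1))/eg;
    print $printable_name, "\n";
}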
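For anyone who wants to try that callback on its own, here is a minimal sketch of how it might be wired into a complete script. The helper name escape_nonprintable and the use of @ARGV for the starting directories are my own assumptions for illustration, not part of the original code; the substitution itself is unchanged.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: the same substitution as in the callback above,
# factored out so it can be applied to any string.
sub escape_nonprintable {
    my ($name) = @_;
    $name =~ s/([^ -~])/sprintf("\\x{%02x}", ord($1))/eg;
    return $name;
}

sub cbFileFind {
    print escape_nonprintable($File::Find::name), "\n";
}

# Assumption: directories to scan are given on the command line.
# With no arguments the script does nothing.
find(\&cbFileFind, @ARGV) if @ARGV;
```

Given a file whose name contains the two bytes C3 A9 (the UTF-8 encoding of "é"), this should print something like caf\x{c3}\x{a9}.txt, while plain ASCII names pass through untouched.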

If you happen to already know (or if the approach just shown makes it clear) what the particular character encoding is for the non-ASCII portions of your file names, you can use Encode to convert (decode) the strings as read from the file system into Perl's internal character-string representation (utf8). After that, the "ord()" function will return Unicode code-point numbers, which you can look up in case the particular characters are unfamiliar to you (check out Re: Regular expressions and accents and tlu -- TransLiterate Unicode).
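As a rough sketch of that decoding step, assuming the file names turn out to be UTF-8 (substitute whatever encoding the hex dump actually suggests), something like the following would list the code points. The sample byte string and the helper name codepoints_of are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a raw byte string from the file system and
# return its Unicode code points formatted as U+XXXX.
sub codepoints_of {
    my ($raw_bytes, $encoding) = @_;
    # FB_CROAK makes decode() die on malformed input instead of silently
    # substituting U+FFFD; LEAVE_SRC keeps the input string unmodified.
    my $chars = decode($encoding, $raw_bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    return map { sprintf 'U+%04X', ord } split //, $chars;
}

# Assumed sample: "café.txt" stored as UTF-8 bytes.
my $raw = "caf\xc3\xa9.txt";
print join(' ', codepoints_of($raw, 'UTF-8')), "\n";
```

If the guess about the encoding is wrong, decode() with FB_CROAK will typically die on the first malformed sequence, which is itself a useful hint that another encoding should be tried. A single-byte encoding such as ISO-8859-1 will accept any bytes, though, so a clean decode there proves nothing by itself.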

In reply to Re: Perl / FileFind or ... by graff
in thread Perl / FileFind or ... by Festus Hagen
