in reply to utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number

What do you know about the process(es) creating these file names? That's likely to be the source of the problem.

Assuming you are using a utf8-based terminal window, the question mark that you see in the terminal at the end of the file name is a symptom of a malformed character in a utf8 string (such as a start byte like \xD0 or \xD1 that is not followed by a valid continuation byte).
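To check whether a given name really contains a malformed sequence, rather than relying on what the terminal displays, one option is a strict decode of the raw bytes. The following is only a minimal sketch: the helper name is_valid_utf8 and the two sample names are made up for illustration, and it assumes the names are held as undecoded byte strings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes is well-formed UTF-8, false otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Made-up sample names: the second ends in a lone start byte \xD0.
for my $name ("\xD0\xB0.txt", "file\xD0") {
    # Show non-printable and non-ASCII bytes as \xNN so the terminal is not involved.
    (my $shown = $name) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    printf "%-16s %s\n", $shown, is_valid_utf8($name) ? 'ok' : 'malformed';
}
```

The strict 'UTF-8' encoding name is used here (not the lax 'utf8'), so anything that is not well-formed UTF-8 is reported as malformed.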

The file system doesn't really care about how (or whether) the byte sequence used for a file name is interpreted via this or that character encoding - there are some characters in the ASCII range that can't be used in a file name (e.g. null or slash on unix/linux), but apart from that, any byte sequence is as good as any other, whether or not it makes sense when using any given character encoding.

You should be able to rename the affected files - perl is especially handy for doing this: either you can infer the intended character(s), or you can simply replace bad bytes with something valid that yields a unique file name in the given directory. In order to rename the file, you have to treat the existing (bad) file name as a raw byte sequence, not as utf8 characters.
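A rough sketch of the "replace bad bytes with something valid" option follows. The function names sanitize_name and rename_bad_names, the underscore replacement, and the numeric suffix for uniqueness are all illustrative choices, not anything prescribed above. It relies on Encode's default behaviour of substituting U+FFFD for malformed bytes, so a name that legitimately contains U+FFFD would also be altered.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return a well-formed UTF-8 byte string in which every
# malformed byte of $bytes has been replaced by an underscore.
sub sanitize_name {
    my ($bytes) = @_;
    my $copy  = $bytes;
    my $chars = decode('UTF-8', $copy);   # malformed bytes become U+FFFD
    $chars =~ s/\x{FFFD}/_/g;
    return encode('UTF-8', $chars);       # back to raw bytes for the file system
}

# Hypothetical helper: rename every entry of $dir whose name is not valid UTF-8.
sub rename_bad_names {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;   # raw bytes, never decoded
    closedir $dh;

    for my $old (@names) {
        my $new = sanitize_name($old);
        next if $new eq $old;             # name was already well-formed

        # Append a counter until the new name is unused in this directory.
        my ($candidate, $n) = ($new, 0);
        $candidate = $new . '.' . ++$n while -e "$dir/$candidate";

        rename("$dir/$old", "$dir/$candidate")
            or warn "Cannot rename entry in $dir: $!\n";
    }
}

# Only touches the file system when a directory is given on the command line.
rename_bad_names($ARGV[0]) if @ARGV;
```

The important detail is that the old name goes to rename() exactly as readdir returned it; only the new name passes through a decode/encode round trip.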

(You might consider going to ASCII-only characters for file names - e.g. using a suitable Cyrillic-to-Latin transliteration - to avoid the problems that tend to come up with multi-byte characters in file names.)

Re^2: utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number
by Anonymous Monk on Nov 19, 2014 at 04:00 UTC
    (You might consider going to ASCII-only characters for file names - e.g. using a suitable Cyrillic-to-Latin transliteration - to avoid the problems that tend to come up with multi-byte characters in file names.)
    What a... peculiar thing to say. There are no problems with multi-byte chars in file names. There might be problems with things truncating file names, and transliteration is a really bad way to fix that.
Re^2: utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number
by igoryonya (Pilgrim) on Nov 21, 2014 at 10:48 UTC

    The folders/files were recovered with the testdisk program on Linux from an accidentally deleted NTFS partition.

    When I pass a corrupt file name to my perl program and test it with -e, it says that the file doesn't exist. However, if I read the directory from within perl, the same files are listed fine, without any character problems, and -e confirms that the files listed by perl exist.

    So, if I understand correctly, when I treat the path string, piped from the find process to my program, as a byte stream, it should test correctly for existence by using -e.

    I've been trying to implement a routine that will recover from such corruption and find the file correctly when passed from stdin. I want to keep the ability to pipe the names from the external source.

      So, if I understand correctly, when I treat the path string, piped from the find process to my program, as a byte stream, it should test correctly for existence by using -e.

      And it does if you stop trying to transform the input from UTF-8 (which it isn't) into Unicode code points.

      I've been trying to implement a routine that will recover from such corruption

      Much easier to remove the erroneous conversion attempt that's corrupting it.

        The files were recovered from an NTFS file system to ext4 on Ubuntu.
        I thought that ext4 and Ubuntu use utf8 by default, but I will try setting binmode on STDIN to raw to see if it helps.
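The raw-bytes approach discussed in this thread could look roughly like the sketch below. The function name existing_paths and the demo file name are invented for illustration, and the demo assumes a file system that accepts arbitrary bytes in names (as ext4 does). To stay self-contained it reads from an in-memory handle; in the real program the same function would be called with \*STDIN.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read newline-separated paths from $fh as raw bytes
# and return the ones that exist. No :encoding or :utf8 layer is applied.
sub existing_paths {
    my ($fh) = @_;
    binmode $fh;                  # treat the input as bytes, not characters
    my @found;
    while (my $path = <$fh>) {
        chomp $path;
        push @found, $path if -e $path;
    }
    return @found;
}

# Demo: create a file whose name contains a lone \xD0 byte.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/bad\xD0name";
open(my $out, '>', $path) or die "Cannot create demo file: $!";
close $out;

# Feed the name back in the way a pipe from find would deliver it.
my $list = "$path\n$dir/no-such-file\n";
open(my $in, '<', \$list) or die "Cannot open in-memory handle: $!";
my @found = existing_paths($in);
printf "%d of 2 paths exist\n", scalar @found;
```

File names that contain newlines would break the line-based reading; piping from find with -print0 and setting $/ to "\0" is one way around that.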