Also, I get:
utf8 "\xD1" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number.

This happens with some file names piped from the find program. It started only recently, and only with some file names, for the first time in the few years that I've been using and developing this program.

Seems like some of the file names are corrupt.

When I print out such file names with my program, I get something like:

18.09.2012_-Протокол_вскрытия_конвертов_и_рассмотрения_заявок_на_участие_в_конк\xD1


Ф\xD1%80\xD1%8Dнк \xD0%9F\xD1%8C\xD1%8E\xD1%81елик. \xD0%9D\xD0%9B\xD0%9F. \xD0%9C\xD0%95Т\xD0%90 \xD0%9Cодел\xD1%8C.webm

The same file names displayed on the terminal by find before piping to my program display:

18.09.2012_-Протокол_вскрытия_конвертов_и_рассмотрения_заявок_на_участие_в_конк?


Ф?%80?%8Dнк ?%9F?%8C?%8E?%81елик. ?%9D?%9B?%9F. ?%9C?%95Т?%90 ?%9Cодел?%8C.webm

As I said, this is the first time I have encountered such a problem in a few years of daily use of this program.

Here is a sample launch of the program with piped input from the Linux terminal:
find /some/path -type f|comparebin.pl /some/path/ /path/to_folder/with_similar_dir_tree/ -parameters

Update

I've just noticed that the file names also get truncated when I try: find /some/path -type f -exec /path/comparebin.pl {} /path/to_folder/with_similar_dir_tree/ -parameters \;
The path supplied by {} is truncated significantly, so maybe the same thing happens with stdout|stdin.
It seems that either there is a very small limit on how many characters can be piped or passed by {}, or the file names are being truncated because of invalid characters.
I guess I have to resort to Perl's internal find command.
I don't see anything wrong with that command; I just wanted my program to be flexible, so that it could be used either way: with its internal directory traversal or with paths piped from some other program.
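
One way to test the "invalid characters" guess is to look at each incoming name as raw bytes and report the ones that are not well-formed UTF-8, together with the byte length of the last path component. This is only a sketch: is_valid_utf8 and basename_bytes are hypothetical helper names, not part of comparebin.pl. If the reported length is the same for every bad name, one possibility (not confirmed here) is that the names were cut mid-character by a length limit when the files were created, rather than by the pipe or by {}.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the raw bytes are well-formed UTF-8.
# utf8::decode works in place, so it operates on the local copy only.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

# Hypothetical helper: length in bytes of the last path component.
# Assumes $path holds undecoded bytes, so length() counts bytes.
sub basename_bytes {
    my ($path) = @_;
    my ($name) = $path =~ m{([^/]*)\z}s;
    return length $name;
}

binmode(STDIN);    # no decoding layer: read the names exactly as find wrote them

while (my $path = <STDIN>) {
    chomp $path;
    next if is_valid_utf8($path);
    printf "invalid UTF-8, name is %d bytes: %s\n", basename_bytes($path), $path;
}
```

It can be fed the same way as the program itself, for example: find /some/path -type f | perl check_names.pl (the script name is made up for illustration).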

Update 2

Thank you to everyone who helped solve my problem. To be honest, since I started converting my programs to Unicode, my understanding of this topic has been pretty vague. Although many things got clarified once I solved my problem, there is still a lot to understand about UTF-8 and Unicode in general. The amount of Perl's Unicode documentation is pretty daunting when I realize that I literally need to read and digest all of it. Until I stumbled on this particular problem, I thought that Unicode was the answer to all textual problems and that everything should be in UTF-8. Now I realize that there are exceptions.

At first, I didn't even have a clue where to start. After talking to you, I understood what needed to be done, but not how. That frustrated me, because I felt that Unicode should stay behind the curtains, and I didn't want to spoil the fun of programming, which I love, with daunting Unicode "bookkeeping". Also, I keep confusing the encode and decode functions. Then I calmed down, skimmed the Unicode, utf8 and Encode documentation for the parts I needed, and started experimenting.

I added a check, along the code path, on every variable involved in path and file name processing: if it is flagged as UTF-8 (utf8::is_utf8), I turn the flag off (Encode::_utf8_off). With that, the final paths started resolving for existence (-e). I realize there is a remaining risk: if some part of a path was already corrupt before it was converted to UTF-8, turning the flag off will not help, and the resulting path may still fail the existence test (-e). However, I don't know how to process certain strings without them being converted to character mode; a regex substitution, for example, always returns a value with the utf8 flag set. So, for now, I will leave it as it is, and I will work on a fix and read more of the utf8 and Unicode docs when I encounter such a problem again.
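
An alternative to switching the flag off after the fact, sketched here under my own assumptions rather than taken from comparebin.pl: never decode the path that is used for file tests, and decode only a copy for display. display_name is a hypothetical helper. It uses Encode's FB_PERLQQ check value, which renders undecodable bytes as \xHH instead of dying, so a truncated name still prints.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns a printable character string for a raw byte
# file name. It works on a copy, so the caller's bytes stay untouched.
sub display_name {
    my ($bytes) = @_;
    my $copy = $bytes;
    # FB_PERLQQ: bytes that are not valid UTF-8 come out as \xHH text.
    return decode('UTF-8', $copy, Encode::FB_PERLQQ);
}

binmode(STDIN);                       # read paths as raw bytes, no decoding
binmode(STDOUT, ':encoding(UTF-8)');  # encode characters only on output

while (my $path = <STDIN>) {
    chomp $path;
    # The file test sees exactly the bytes that find produced.
    my $state = -e $path ? 'exists' : 'missing';
    print display_name($path), ": $state\n";
}
```

With this split, any regex or other text processing would be done on the decoded copy, while the original byte string is the only thing passed to -e, open and similar calls. If a processed name has to go back to the file system, it would need to be encoded to bytes again first, and names that contained invalid bytes will not round-trip.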


In reply to utf8 "\xD0" does not map to Unicode at /path/comparebin.pl line line_number, <STDIN> line line_number by igoryonya
