Hi all:

I am writing a new tool like rhash, only with the ability to update hashes. I just got tired of waiting for this bug to be fixed:

  Update hash if file last modification date has changed
  https://github.com/rhash/RHash/issues/107

I have looked around and, surprisingly, there is no hash/checksum tool that does that properly (that I could find).

I think I will be using File::Find to scan files and directories. The first tool version will be in Perl, but it may need to be rewritten later in C for performance reasons (or whatever).

Therefore, I want the "allfiles.checksums" file to list files and their checksums ordered in such a way that you can easily and consistently reimplement the filename sorting in any other language.

I have been reading question "Sorting utf-8" here:

  https://www.perlmonks.org/?node_id=252806

And I also looked at Unicode::Collate and other Perl Unicode documentation.

It is all pretty complicated. I have come to the conclusion that the only safe way to implement this is to do a plain UTF-8 lexicographic string sort on the filenames. I know that humans will find the sort order not good, but I think I can consider the "allfiles.checksums" file an internal database. The script itself could offer options to list its contents with different locale collation orders, if anybody really cares.

How do I implement a pure UTF-8 lexicographic string sort in Perl?

I guess I need to make sure first that the filenames returned by File::Find are actually coded in UTF-8, because Perl may choose some other internal string encoding. I hope that this is what utf8::upgrade is for.

And then I can use binary comparison operators '<' or 'cmp' on those UTF-8 strings. Is that correct?

Thanks in advance,
  rdiez


In reply to UTF-8 lexicographic string sort by rdiez

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.