Hi all:

I am writing a new tool similar to rhash, but with the ability to update stored hashes. I got tired of waiting for this issue to be fixed:

Update hash if file last modification date has changed
https://github.com/rhash/RHash/issues/107

I have looked around and, surprisingly, I could not find any hash/checksum tool that does this properly.

I think I will use File::Find to scan files and directories. The first version of the tool will be in Perl, but it may need to be rewritten later in C for performance or other reasons.

Therefore, I want the "allfiles.checksums" file to list files and their checksums ordered in such a way that you can easily and consistently reimplement the filename sorting in any other language.

I have been reading the question "Sorting utf-8" here:

https://www.perlmonks.org/?node_id=252806

And I also looked at Unicode::Collate and other Perl Unicode documentation.

It is all pretty complicated. I have come to the conclusion that the only safe way to implement this is to do a plain UTF-8 lexicographic string sort on the filenames. I know that humans will find the resulting order unintuitive, but I think I can treat the "allfiles.checksums" file as an internal database. The script itself could offer options to list its contents in different locale collation orders, if anybody really cares.

How do I implement a pure UTF-8 lexicographic string sort in Perl?

I guess I first need to make sure that the filenames returned by File::Find are actually encoded in UTF-8, because Perl may choose some other internal string encoding. I hope that this is what utf8::upgrade is for.

And then I can use the string comparison operators 'lt' or 'cmp' on those UTF-8 strings (not '<', which compares numerically in Perl). Is that correct?
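
To make the question concrete, here is a minimal sketch of the byte-wise sort I have in mind. It assumes that every filename is an undecoded UTF-8 byte string (I have not verified that this is what File::Find delivers on every platform) and that no 'use locale' is in scope, since that would make 'cmp' follow the locale's collation instead. The helper name sort_filenames_bytewise and the sample filenames are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Find): sorts filenames by their
# raw byte values. Assumes each name is an undecoded UTF-8 byte string.
sub sort_filenames_bytewise {
    my @names = @_;
    # On byte strings, and without 'use locale' in scope, 'cmp' compares
    # the strings byte by byte, which is the order I am after.
    my @sorted = sort { $a cmp $b } @names;
    return @sorted;
}

# Made-up sample names. "\xC3\xA9" is the UTF-8 encoding of U+00E9
# (e with acute accent).
my @sample = ("z.txt", "\xC3\xA9.txt", "a.txt", "Z.txt");

# Print each sorted name as hex so the byte order is visible.
print unpack("H*", $_), "\n" for sort_filenames_bytewise(@sample);
```

With these sample names, "Z.txt" comes before "a.txt" (0x5A is less than 0x61) and the accented name comes last (its first byte is 0xC3). As far as I understand, the byte order of valid UTF-8 matches Unicode code point order, so the same sequence should be reproducible in C with a plain memcmp-style comparison.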

Thanks in advance,
rdiez

In reply to UTF-8 lexicographic string sort by rdiez
