in reply to Duplicates in Directories

I think that you are overlooking an obvious optimization. If the file-names come to you in alphabetical order, as they most commonly do, then all occurrences of baz.anything will necessarily be consecutive. Simply split each filename at its last "." into two pieces (name, extension), and notice when the name differs from the previous one you encountered (or when it is the very first one). At that point, test the list of extensions that you have been accumulating to see whether both .epub and .pdf are present, then reset the list. You never have to "search" for anything, nor do you ever need to store more than two names: "this" one and the "immediately previous" one.
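
For instance, here is a rough sketch of that loop in Perl. It assumes the sorted names are already in @files and that the extension is everything after the last dot; the "both .epub and .pdf" test is the OP's stated goal:

    my ($prev, %exts);
    for my $file (@files, undef) {            # the undef sentinel flushes the final group
        my ($name, $ext) = defined $file
            ? $file =~ /\A(.+)\.([^.]+)\z/    # split at the last dot
            : (undef, undef);
        if (!defined $prev or !defined $name or $name ne $prev) {
            print "$prev exists as both .epub and .pdf\n"
                if defined $prev and $exts{epub} and $exts{pdf};
            %exts = ();                       # reset for the new group
        }
        $exts{$ext} = 1 if defined $ext;
        $prev = $name;
    }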

Replies are listed 'Best First'.
Re^2: Duplicates in Directories
by haukex (Archbishop) on Oct 09, 2017 at 14:22 UTC
    If the file-names come to you in alphabetical order, as they most commonly do, then all occurrences of baz.anything will necessarily be consecutive.

    This is not something one should rely on: it varies wildly depending on the OS and the API used to list the files. And even if one sorted the list, it would not help with the OP's second requirement, "some 'dups' might have minor variations in characters".

    Oops, didn't see Corion's node before posting.

Re^2: Duplicates in Directories
by Corion (Patriarch) on Oct 09, 2017 at 14:20 UTC

    Whatever you mean by "most commonly", it is certainly not something that a program should rely on.

    Luckily for you and the OP, Perl comes with a built-in facility for sorting a list of filenames. Using it makes it easy to ensure that a list really is sorted.
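
    For instance (a minimal sketch; the directory name is made up):

        opendir my $dh, '/some/dir' or die "opendir: $!";
        my @files = sort readdir $dh;    # force alphabetical order, whatever readdir returned
        closedir $dh;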

      Well, for a given operating system (and possibly file system), either the names come in alphabetical order or they don't. If they do on the OP's system, then presumably the OP can rely on that feature (although there can be issues with the upper or lower case of the file names). And, BTW, I've just checked on the three different systems available to me (*nix, VMS, and Windows): glob returned the names of the files in the directory in alphabetical order on all three, so, yes, it is a rather common feature.

      Then, of course, as you rightly said, if they don't come in alphabetical order, or if there is any doubt, it is just as easy to use Perl's sort, and it will only take a split second for 10,000 files.

      The idea of sorting data to get better performance (avoiding lookups) is sometimes very efficient. I use it quite often in a slightly different context: comparing pairs of very large files that would not fit in a hash. I sort both files on the comparison key (using the *nix sort utility) and then read them in parallel in my Perl program, detecting records missing from either file, as well as differences in attributes between records that share the same key.
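
      For what it's worth, here is a sketch of that parallel read (the file names and the tab-separated, key-in-the-first-column layout are assumptions):

          open my $fa, '<', 'old.sorted' or die "old.sorted: $!";
          open my $fb, '<', 'new.sorted' or die "new.sorted: $!";
          my ($la, $lb) = (scalar <$fa>, scalar <$fb>);
          while (defined $la or defined $lb) {
              my $ka = defined $la ? (split /\t/, $la)[0] : undef;
              my $kb = defined $lb ? (split /\t/, $lb)[0] : undef;
              if (!defined $kb or (defined $ka and $ka lt $kb)) {
                  print "only in old: $ka\n";    # record missing from the new file
                  $la = <$fa>;
              }
              elsif (!defined $ka or $ka gt $kb) {
                  print "only in new: $kb\n";    # record missing from the old file
                  $lb = <$fb>;
              }
              else {                             # same key: compare the full records
                  print "differs: $ka\n" if $la ne $lb;
                  $la = <$fa>;
                  $lb = <$fb>;
              }
          }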

        glob returned the names of the files in the directory in alphabetical order on all three,

        I may be a caveman, but I'd use readdir to get the file list, and I think it returns the files in the order in which they are stored in the directory.
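
        A quick way to see what your own system does (a sketch; run it in the directory of interest):

            opendir my $dh, '.' or die "opendir: $!";
            my @raw = grep { $_ ne '.' and $_ ne '..' } readdir $dh;
            closedir $dh;
            print "readdir order: @raw\n";             # whatever order the filesystem hands back
            print "sorted order:  @{[ sort @raw ]}\n"; # guaranteed alphabetical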

Re^2: Duplicates in Directories
by Anonymous Monk on Oct 09, 2017 at 14:19 UTC

    Auto-logged-out again. Oh well. Just one more thing: remember to do this check once more when you reach the end of the list, because that is of course what marks the end of the final group of names.

      Oh yeah, and one more thing: this strategy assumes a case-sensitive file system, or that the case of the file-names does not actually vary, or that you have an easy means of obtaining a case-insensitively sorted filename list. But the essential algorithm remains the same. The filesystem's doing all the hard work for you.
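
      One easy way to get such a case-insensitive sorted list (a sketch; fc requires Perl 5.16 or later, and @files is assumed to hold the raw names):

          use feature 'fc';    # proper Unicode casefolding
          my @sorted = sort { fc($a) cmp fc($b) } @files;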

        Maybe you could show some actual Perl code implementing your assumptions?

        That way, you don't need to hope that users or filesystems behave in the way you describe (because, in general, they don't).