in reply to Memory issue with large array comparison

Seems to me (subject to the Monks' corrections) that the OP's precise words mean no hash is required: Just store (push) to a new array those items from the "large list" that have no match in the "large array". And that requires only a pretty basic regex ... unless the OP means (without specifying it) that the "large list" includes more than the single path in the example... and even that solves itself with the same simple regex... capturing everything after the last slash to test the match.

OP makes no statement that the "new array" referred to in the 2nd para

"I am using some code I found on here to create a new array that contains pathnames in which the id after the last slash in the pathname doesn't match any of the id's in the array of id's."
that requires anything more than an array of non-matches.

When there's a simple way and a hard way that produce the same results, go simple!
  • Comment on Re: Memory issue with large array comparison

Replies are listed 'Best First'.
Re^2: Memory issue with large array comparison
by aaron_baugher (Curate) on May 25, 2012 at 00:54 UTC

    My understanding is that he has two arrays: the first containing 50,000+ pathnames, and the second containing 10,000 strings which may match a filename in one of the pathnames. I don't see how you exclude those matches with a regex, unless you concat all 10,000+ strings together with pipe symbols, and I doubt a regex like that would be very efficient.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      You may be right. Certainly the regex you envision would be a tad clumsy. :)

      At the moment, it's looking to me as though I should have written the code to support my claim re regex solutions, before posting, rather than simply doing a mental sketch (which proved to be wrong in at least that particular instance). Thank you for making me "think, again!"

      However, on the theory that simple (or, in this case, 'concise') is desirable:

      #!/usr/bin/perl use 5.014; # 972317 my @arr1 = ('C:\abc\def\zz456','C:\abc\xyz\ab123','C:\abb\bba\ac234',' +C:\abc\xyz\ab321'); my @arr2 = qw/ab123 ac234 ad456 ab321/; # no match for C: +\abc\def\zz456 my $dir_entry; my @arr3 = grep { $dir_entry = $_; not grep { $dir_entry =~ /\Q$_/i } +@arr2 } @arr1; say "Following is/are dir entry/entries (in \$arr1) without matching i +tems in \@arr2"; say each(@arr3);

      Output ( index and array element; see each ):
      0C:\abc\def\zz456

      And can some knowledgeable Monk explain why the index of the zeroth element of the array is rendered as 0C?

      Update: A knowledgeable Monk (++) has indeed explained... and the answer has little to do with the assumption immediately above. Meh! Blech on me.

        This is what the original poster did, but with different variable names and a slightly different regex, and it ran out of memory. But isn't this O(N2)? It seems to me that it greps every item in one array against all the items in the other array, so it's really no different from this:

        my @array3; OUTER: for my $a1 (@array1){ for my $a2 (@array2){ next OUTER if they_match_somehow(); } push @array3, $a1; # it didn't match anything in @array2 }

        Both cases have two nested loops; it's just harder to see them in the grep-within-a-grep method.

        Aaron B.
        Available for small or large Perl jobs; see my home node.

        can some knowledgeable Monk explain why the index of the zeroth element of the array is rendered as 0C?
        Not aspiring at the title of knowledgeable Monk, but the index is 0 and the string starts with C:. You are grepping @arr1 which contains full paths.
      I don't see how you exclude those matches with a regex, unless you concat all 10,000+ strings together with pipe symbols, and I doubt a regex like that would be very efficient.
      One solution: Regexp::Assemble

      It creates an optimized regex that checks all your strings at once. It is usually shorter and much faster than your "pipe"d regex. For example, 10,000 strings of each 5 characters were turned into a regex less than 25,000 characters long.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics