in reply to Fastest way to merge (and de-dup) two large arrays?

A way that is most likely much faster than using the smartmatch operator is to tell Perl that you'll be looking things up in the smaller array very often, by converting the smaller array to a hash:

my %seen;
$seen{ $_ } = 1 for @data;

foreach my $rawData (@rows) {
    if ( $seen{ $rawData } ) {
        # don't do anything, already in array
    }
    else {
        push @data, $rawData;
        $seen{ $rawData } = 1;
    }
}

Replies are listed 'Best First'.
Re^2: Fastest way to merge (and de-dup) two large arrays?
by perldigious (Priest) on Aug 11, 2016 at 18:55 UTC

    This was where I was going as well, but I was going to suggest hashing the larger array instead of the smaller one. I guess I assumed it would be faster to do fewer loop iterations checking for the existence of a key in a larger hash than the other way around. Is this something that would have to be tested on a data set to determine for sure, or is the way you show guaranteed to be faster?

    I ask because I've done this exact thing before and I'm curious for an explanation on the efficiency difference.

    I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
    I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

      Every array element for the lookup will need to be hashed anyway, either for "storage" as a hash key or for lookup as a hash key, so that is a wash.

      I usually choose the smaller array for the hash because the smaller array is more likely to fit in memory.

        I think perldigious is going with the idea that building the hash from the larger array initially, then adding the smaller array to the larger one, would be faster, since the initial hash build doesn't run any if checks (saving potentially millions of cycles).

        It would need testing, but on the face of it this seems to be a reasonable idea. Given that all of the records will eventually end up in the hash, if this doesn't fit in memory, then the entire solution should be reworked to find a completely different method.

        my %seen = map { $_ => 1 } @rows;
        foreach my $rawData (@data) {
            push @rows, $rawData unless $seen{$rawData}++;
        }
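        Either ordering produces the same merged set; only the relative order of the elements differs, so the question is purely one of speed. A minimal self-contained sketch of both orderings (the sample data and the names %seen_small/%seen_large are invented for illustration), which could be wrapped in the core Benchmark module's cmpthese to time on a real data set:

```perl
use strict;
use warnings;

# Hypothetical sample data: @rows is the larger array, @data the smaller.
my @rows = ('a' .. 'j');          # ten "large" records
my @data = ('h', 'i', 'k', 'l');  # four "small" records, two overlapping

# Ordering 1: hash the smaller array, append new items from the larger one.
my %seen_small = map { $_ => 1 } @data;
my @merged_small = @data;
for my $r (@rows) {
    push @merged_small, $r unless $seen_small{$r}++;
}

# Ordering 2: hash the larger array, append new items from the smaller one.
my %seen_large = map { $_ => 1 } @rows;
my @merged_large = @rows;
for my $d (@data) {
    push @merged_large, $d unless $seen_large{$d}++;
}

# Both orderings yield the same 12 unique records.
print scalar(@merged_small), " ", scalar(@merged_large), "\n";   # 12 12
```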
Re^2: Fastest way to merge (and de-dup) two large arrays?
by perldigious (Priest) on Aug 11, 2016 at 19:34 UTC

    Isn't the $seen{ $rawData } = 1; line only needed if it's necessary to check for duplicates within each individual array, as opposed to only checking for duplicates between the two arrays? If checking for such duplicates within each array isn't necessary then I think that could be a small efficiency gain too because the lookup hash wouldn't keep growing as the loop iterated.
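    That's exactly the trade-off: skipping the update is only safe when each input array is already free of internal duplicates. A small sketch (sample data invented) showing what slips through when %seen is not updated inside the loop and @rows does contain a repeated record:

```perl
use strict;
use warnings;

my @data = ('x');
my @rows = ('a', 'a', 'x');   # note the duplicate 'a' within @rows

# Without updating %seen inside the loop, duplicates *within* @rows
# are not caught, only duplicates *between* the two arrays.
my %seen = map { $_ => 1 } @data;
my @merged = @data;
for my $rawData (@rows) {
    push @merged, $rawData unless $seen{$rawData};   # no $seen{$rawData} = 1
}
print "@merged\n";   # x a a -- 'a' appears twice
```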

      ... push(@data,$rawData);

      @data gets modified whenever a new item is found, so we need to update our cache of @data as well.

        I think I get it: you are trying to ensure the one-to-one correspondence between the ultimately combined array and the lookup hash of its lines, in other words maintaining the integrity between the two. I was actually thinking of treating the lookup hash like a temporary throwaway.

        my @rows = <$file_for_rows>;
        {
            my %seen = map { $_ => 1 } @rows;
            while (my $rawData = <$file_for_data>) {
                push @rows, $rawData if (!$seen{$rawData});
            }
        }

        I figured this way the larger array is read into @rows right away (assuming no need to check for duplicates within itself), and then the @data array is read and checked a line at a time (it may be faster to read it all into an array first, but I figured this saves a little temporary memory) before being added to @rows if it doesn't already exist there. I added the extra unlabeled code block to limit the scope of everything except @rows, assuming that once the block is done @rows will have all the lines from both arrays with no duplicates and the memory taken up by the lookup hash is freed up again (I'm under the impression that's an advantage of the added scoping anyway).

        Again, I was assuming there was no need to check for duplicates inside of each individual array, and that the lookup hash is just a temporarily created throwaway that isn't needed after the merge is finished.
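        That streaming approach can be exercised end-to-end without real files by opening in-memory filehandles on string references (the sample records here are invented for illustration):

```perl
use strict;
use warnings;

# Simulate the two input files with in-memory filehandles.
open my $file_for_rows, '<', \"alpha\nbeta\ngamma\n" or die $!;
open my $file_for_data, '<', \"beta\ndelta\n"        or die $!;

my @rows = <$file_for_rows>;
{
    # %seen is scoped to this bare block, so the memory it holds
    # can be reclaimed as soon as the merge finishes.
    my %seen = map { $_ => 1 } @rows;
    while (my $rawData = <$file_for_data>) {
        push @rows, $rawData unless $seen{$rawData};
    }
}
chomp @rows;
print "@rows\n";   # alpha beta gamma delta
```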

        Sorry, I like asking little questions and debating minutiae like this because I don't have nearly as much experience with Perl or coding in general as a lot of the Monks on here (definitely nowhere near as much as you), so asking nit-picky little questions like these helps me learn, and hopefully helps others like me learn too when they read it. It's why I've quickly grown to like this place so much.

Re^2: Fastest way to merge (and de-dup) two large arrays?
by technojosh (Priest) on Aug 14, 2016 at 01:06 UTC
    Hashes did the trick.

    The function ends now in minutes instead of hours. I think initially I was trying to keep things ordered, when there was no real requirement to do so.

    Way faster.