Compare 2 very large arrays of long strings as elements

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Compare 2 very large arrays of long strings as elements by roboticus (Chancellor) on Oct 28, 2011 at 15:18 UTC
onlyIDleft: That's far too long! It took only a few seconds^[1] on my machine: $ time perl foo.pl real 0m9.204s user 0m3.697s sys 0m1.387s $ cat foo.pl #!/usr/bin/perl -w use strict; use warnings; open F1, 'sort s1\|' or die "opening file 1"; open F2, 'sort s2\|' or die "opening file 2"; open OU1, '>', 'uniq.1' or die; open OU2, '>', 'uniq.2' or die; open OU3, '>', 'common' or die; # Prime the pump my $in1 = <F1>; my $in2 = <F2>; while (1) { last if !defined($in1) and !defined($in2); if (!defined($in1)) { # File 1 is empty, rest of File 2 is unique print OU2 $in2; $in2 = <F2>; } elsif (!defined($in2)) { # File 2 is empty, rest of File 1 is unique print OU1 $in1; $in1 = <F1>; } elsif ($in1 eq $in2) { # Line common to both print OU3 $in1; $in1 = <F1>; $in2 = <F2>; } elsif ($in1 lt $in2) { # Line unique to File 1 print OU1 $in1; $in1 = <F1>; } else { # Line unique to File 2 print OU2 $in2; $in2 = <F2>; } } [download] I generated the two test files like so: `$ time perl gen_random_strings.pl 2000000 60 70 ABC >s1 real 0m45.910s user 0m44.647s sys 0m1.091s $ time perl gen_random_strings.pl 2000000 60 70 ABC >s2 real 0m45.989s user 0m44.475s sys 0m1.138s` [download] Using a quickie random-string generator: `#!/usr/bin/perl # # gen_random_strings.pl <num_strings> <min_length> <max_length> <al +phabet> # my $num_strings = shift; my $min_length = shift; my $max_length = shift; my $alphabet = shift or die "Missing argument(s)!"; my $alphabet_len = length($alphabet); while ($num_strings--) { my $num_chars = int(rand($max_length-$min_length+1))+$min_length; my $string = ''; $string .= substr($alphabet, int(rand($alphabet_len)), 1) for 1 .. $num_chars; print $string, "\n"; }` [download] Notes: [1] It took only a few seconds because the data was still in RAM from having been generated. It would've only taken a few minutes otherwise, as you can tell from the time it took to generate the files. [2] This code is a slight hack of a merge sort I illustrated some time ago. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]
Re: Compare 2 very large arrays of long strings as elements by SuicideJunkie (Vicar) on Oct 28, 2011 at 14:33 UTC
Quantum::Superpositions doesn't sound like an appropriate module for efficiently comparing large arrays of strings, although in the future it might include some quantum XS code that will run very fast on a quantum CPU. The easy way to extract unique items is to use a hash. `# for example my %counts; # count how many of each value you see. $counts{$_}++ foreach (@array); # Pull out only the ones you've seen exactly one of. my @uniques = grep {$counts{$_} == 1} keys %counts;` [download]	[reply] [d/l]
Re: Compare 2 very large arrays of long strings as elements by GrandFather (Saint) on Oct 28, 2011 at 22:49 UTC
2 million * 100 * 2 is much less than 1e9 so should easily fit into memory even for oldish computers. A simple Perl hash based technique used for finding unique strings should work well and is both quick to write and quick to execute. If the data were 5 times as big it's likely that processing would start to slow down some, but even then processing would likely only take a few tens of seconds plus the time needed to read the data from disk. True laziness is hard work	[reply]
Re: Compare 2 very large arrays of long strings as elements by johnny_carlos (Scribe) on Oct 28, 2011 at 14:50 UTC
Can you distribute them? At the very least, run each array on two different servers. If you have more servers, I'm thinking you could slice the arrays, distribute,run SuicideJunkie's algorithm, then recombine and run again. EDIT: whoops, I misunderstood. You are comparing the two arrays. You can't separate them, making my suggesting not so hot :)	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.


P is for Practical
	PerlMonks