insensate has asked for the wisdom of the Perl Monks concerning the following question:

Greetings:
I have two large data sets contained in arrays; the sets contain about 30,000 and 100,000 elements, respectively. Each element consists of an identifier and a timestamp, delimited by a pipe.
example:
Set 1 (@data):
    g123|10249875
    g984|32004040

Set 2 (@ref):
    g123:1|10249870
    g123:2|10249875
    g123:3|10249871
    g984:2|94950595
    g984:1|32004040
My task is to create a list of identifiers from set 2 where the datestamp is not equal to the corresponding stamp in set 1. This is my code:
foreach $elem (@data) {
    ($sub, $max) = split /\|/, $elem;
    @found = grep /^$sub:/, @ref;
    foreach $f (@found) {
        ($id, $time) = split /\|/, $f;
        if ($time ne $max) {
            push @set, $id;
        }
    }
}
...And the id's I am looking for end up in @set:
@set: g123:1 g123:3 g984:2
Sadly, this method is prohibitively slow. I am looking for suggestions on how to make this process more efficient.
Thanks in advance,
Jason

Replies are listed 'Best First'.
Re: Large Set Efficiency Question
by kvale (Monsignor) on Jul 29, 2002 at 22:53 UTC
    Lookups of the time stamps could be made a lot faster by using a hash:
    my %data_map;
    foreach (@data) {
        my ($class, $stamp) = split /\|/, $_;
        $data_map{$class} = $stamp;
    }
    foreach my $elem (@ref) {
        my ($class, $subclass, $stamp) = split /[:|]/, $elem;
        push @set, "$class:$subclass" if $data_map{$class} != $stamp;
    }
    -Mark
Re: Large Set Efficiency Question
by Bird (Pilgrim) on Jul 29, 2002 at 22:48 UTC
    Hi,

    How about something like this, using a hash instead of an array for @data...

    my %data_hash;
    foreach $elem (@data) {
        my ($sub, $max) = split /\|/, $elem;
        $data_hash{$sub} = $max;
    }
    foreach $r (@ref) {
        my ($id, $time) = split /\|/, $r;
        my ($key, $instance) = split /:/, $id;
        if ($data_hash{$key} ne $time) {
            push @set, join(':', $key, $instance);
        }
    }
    This assumes that there are no duplicate $sub values in the data file. Since you refer to them as identifiers, I'm assuming they are unique identifiers. Anyway, I didn't run any benchmarks (or even test this code), but this approach should speed things up a bit.

    -Bird

    Oh, this will also include any ids from @ref that don't show up in @data at all. You can tweak that if you'd like; just modify the conditional a bit.
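    For instance (untested, and assuming you only want mismatches for keys that actually appear in @data), the conditional could become:

        if (exists $data_hash{$key} && $data_hash{$key} ne $time) {
            push @set, join(':', $key, $instance);
        }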

Re: Large Set Efficiency Question
by dpuu (Chaplain) on Jul 29, 2002 at 22:59 UTC
    In the example you give, your arrays are sorted. If this is always the case (or can be made to be), then you could iterate the two arrays in sloppy-lockstep.

    --Dave
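    A rough, untested sketch of that lockstep walk, assuming both arrays are sorted by identifier and each identifier occurs at most once in @data:

        my @set;
        my $i = 0;
        foreach my $r (@ref) {
            my ($id, $time) = split /\|/, $r;
            my ($key)       = split /:/, $id;    # "g123:1" -> "g123"
            # advance through @data until its identifier is >= the @ref key
            $i++ while $i < @data && (split /\|/, $data[$i])[0] lt $key;
            if ($i < @data) {
                my ($sub, $max) = split /\|/, $data[$i];
                push @set, $id if $sub eq $key && $time ne $max;
            }
        }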
      If I am reading the question correctly, I would use a modified version of dpuu's parent post.

      sort -u -t '|' infile1 infile2 | perl myprog.pl

      And just have myprog.pl look for lines where the leading key is repeated two lines in a row. If the timestamp was the same in both files, there will be only one row (-u means make unique).
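      A bare-bones, untested sketch of what myprog.pl might look like under this scheme, treating everything before the first ':' or '|' as the leading key (the two files would need their identifiers normalized to the same form for the -u de-duplication to apply):

          use strict;
          use warnings;

          my ($prev_key, $prev_line);
          while (my $line = <STDIN>) {
              chomp $line;
              my ($key) = split /[:|]/, $line;     # leading key
              if (defined $prev_key && $key eq $prev_key) {
                  print "$prev_line\n$line\n";     # same key two lines in a row
              }
              ($prev_key, $prev_line) = ($key, $line);
          }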

      -jack

Re: Large Set Efficiency Question
by BorgCopyeditor (Friar) on Jul 29, 2002 at 22:43 UTC

    Just a guess here, not a tip exactly: while you're iterating over @found, would it be faster only to split when the timestamp doesn't match? I.e.:

    foreach $f (@found) {
        if ($f !~ /\|\Q$max\E$/) {
            ($id) = split /\|/, $f;
            push @set, $id;
        }
    }

    BCE
    --Your punctuation skills are insufficient!

Re: Large Set Efficiency Question
by jlongino (Parson) on Jul 29, 2002 at 22:56 UTC
    This assumes you have enough memory to store the smaller dataset in a hash (after all, your code already holds both datasets in arrays).

    Instead of storing @data in an array, store it in a hash (%data) with the key being $id . $datestamp.

    Read in each $ref and test whether the key built from it exists in %data. If not, print the $ref or store it in an array for processing later.
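    A minimal, untested sketch of that idea, assuming the ":n" suffix is stripped from each @ref identifier before building its lookup key:

        my %data;
        foreach my $elem (@data) {
            my ($id, $stamp) = split /\|/, $elem;
            $data{ $id . $stamp } = 1;
        }
        foreach my $ref (@ref) {
            my ($id, $stamp) = split /\|/, $ref;
            (my $base = $id) =~ s/:.*//;           # "g123:1" -> "g123"
            push @set, $id unless exists $data{ $base . $stamp };
        }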

    --Jim

    Update: Oh well, conversation with PHB slowed my reply but looks like you've already got what you need. :)