insensate has asked for the wisdom of the Perl Monks concerning the following question:

Greetings:
I have two large data sets contained in arrays; the sets contain about 30,000 and 100,000 elements, respectively. Each element consists of an identifier and a timestamp, delimited by a pipe.
example:
Set 1 (@data):
    g123|10249875
    g984|32004040

Set 2 (@ref):
    g123:1|10249870
    g123:2|10249875
    g123:3|10249871
    g984:2|94950595
    g984:1|32004040
My task is to create a list of identifiers from set 2 where the datestamp is not equal to the corresponding stamp in set 1. This is my code:
foreach $elem (@data) {
    ($sub, $max) = split /\|/, $elem;
    @found = grep /^$sub:/, @ref;
    foreach $f (@found) {
        ($id, $time) = split /\|/, $f;
        if ($time ne $max) {
            push @set, $id;
        }
    }
}
...And the id's I am looking for end up in @set:
@set: g123:1 g123:3 g984:2
Sadly, this method is prohibitively slow. I am looking for suggestions on how to make this process more efficient.
Thanks in advance,
Jason

Replies are listed 'Best First'.
Re: Large Set Efficiency Question
by kvale (Monsignor) on Jul 29, 2002 at 22:53 UTC
    Lookups of the time stamps could be made a lot faster by using a hash:
    my %data_map;
    foreach (@data) {
        my ($class, $stamp) = split /\|/, $_;
        $data_map{$class} = $stamp;
    }
    foreach my $elem (@ref) {
        my ($class, $subclass, $stamp) = split /[:|]/, $elem;
        push @set, "$class:$subclass" if $data_map{$class} != $stamp;
    }
    -Mark
Re: Large Set Efficiency Question
by Bird (Pilgrim) on Jul 29, 2002 at 22:48 UTC
    Hi,

    How about something like this, using a hash instead of an array for @data...

    my %data_hash;
    foreach $elem (@data) {
        my ($sub, $max) = split /\|/, $elem;
        $data_hash{$sub} = $max;
    }
    foreach $r (@ref) {
        my ($id, $time) = split /\|/, $r;
        my ($key, $instance) = split /:/, $id;
        if ($data_hash{$key} ne $time) {
            push @set, join(':', $key, $instance);
        }
    }
    This assumes that there are no duplicate $sub values in the data file. Since you refer to them as identifiers, I'm assuming they are unique identifiers. Anyway, I didn't run any benchmarks (or even test this code), but this approach should speed things up a bit.

    -Bird

    Oh, this will also include any ids from @ref that don't show up in @data at all. You can tweak that if you'd like; just modify the conditional a bit.
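    For instance (untested, and assuming you only want mismatches for keys that actually appear in @data), the conditional could become:

        if (exists $data_hash{$key} && $data_hash{$key} ne $time) {
            push @set, join(':', $key, $instance);
        }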

Re: Large Set Efficiency Question
by dpuu (Chaplain) on Jul 29, 2002 at 22:59 UTC
    In the example you give, your arrays are sorted. If this is always the case (or can be made to be), then you could iterate the two arrays in sloppy-lockstep.

    --Dave
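    A rough, untested sketch of that lockstep walk, assuming both arrays are sorted by identifier and each identifier occurs at most once in @data:

        my @set;
        my $i = 0;
        foreach my $r (@ref) {
            my ($id, $time) = split /\|/, $r;
            my ($key)       = split /:/, $id;    # "g123:1" -> "g123"
            # advance through @data until its identifier is >= the @ref key
            $i++ while $i < @data && (split /\|/, $data[$i])[0] lt $key;
            if ($i < @data) {
                my ($sub, $max) = split /\|/, $data[$i];
                push @set, $id if $sub eq $key && $time ne $max;
            }
        }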
      If I am reading the question correctly, I would use a modified version of dpuu's parent post.

      sort -u -t '|' infile1 infile2 | perl myprog.pl

      And just have myprog.pl look for lines where the leading key is repeated two lines in a row. If the timestamp was the same in both files, there will be only one row (-u means make unique).
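      A bare-bones, untested sketch of what myprog.pl might look like under this scheme, treating everything before the first ':' or '|' as the leading key (the two files would need their identifiers normalized to the same form for the -u de-duplication to apply):

          use strict;
          use warnings;

          my ($prev_key, $prev_line);
          while (my $line = <STDIN>) {
              chomp $line;
              my ($key) = split /[:|]/, $line;     # leading key
              if (defined $prev_key && $key eq $prev_key) {
                  print "$prev_line\n$line\n";     # same key two lines in a row
              }
              ($prev_key, $prev_line) = ($key, $line);
          }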

      -jack

Re: Large Set Efficiency Question
by BorgCopyeditor (Friar) on Jul 29, 2002 at 22:43 UTC

    Just a guess here, not a tip exactly: while you're iterating over @found, would it be faster only to split when the timestamp doesn't match? I.e.:

    foreach $f (@found) {
        if ($f !~ /\|\Q$max\E$/) {
            ($id) = split /\|/, $f;
            push @set, $id;
        }
    }

    BCE
    --Your punctuation skills are insufficient!

Re: Large Set Efficiency Question
by jlongino (Parson) on Jul 29, 2002 at 22:56 UTC
    This assumes you have enough memory to store the smaller dataset in a hash (after all, your code already holds both datasets in arrays).

    Instead of storing @data in an array, store it in a hash (%data) with the key being $id . $datestamp.

    Read in each $ref and test whether the key built from it exists in %data. If not, print the $ref or store it in an array for processing later.
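    A minimal, untested sketch of that idea, assuming the ":n" suffix is stripped from each @ref identifier before building its lookup key:

        my %data;
        foreach my $elem (@data) {
            my ($id, $stamp) = split /\|/, $elem;
            $data{ $id . $stamp } = 1;
        }
        foreach my $ref (@ref) {
            my ($id, $stamp) = split /\|/, $ref;
            (my $base = $id) =~ s/:.*//;           # "g123:1" -> "g123"
            push @set, $id unless exists $data{ $base . $stamp };
        }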

    --Jim

    Update: Oh well, conversation with PHB slowed my reply but looks like you've already got what you need. :)