doubleqq has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I am trying to correlate two files together. Basically I am using the elements of the master file to look up the matching element in my index file and push the matching value into a new array. The thing is, it seems to run extremely slow. The files themselves are not overly huge, 30MB for the master file and 500MB for the index. I strongly suspect that this search could be done better and my logic is not quite right. Any idea to speed up this task is greatly appreciated. Thank you!

This is my code:

my $index = 'path/to/file/file.name';
tie (@master, 'Tie::File', $masterList) or die "Can't tie $masterList";
foreach my $pos (@master) {
    chomp $pos;
    my $quickLook = qx(grep '^$pos' $index);
    my @split = split(/\t/, $quickLook);
    print $tFile $split[1];
    push (@idxArray, $split[1]);
}
This is my Master data
1
2
3
4
6
8
10
12
15
17
etc...
3000000
and this is my index data
1	atc
2	gca
3	att
4	ggc
5	aaa
etc...
29000000	ttg

Replies are listed 'Best First'.
Re: Better way to search an Array?
by hippo (Archbishop) on Jun 12, 2015 at 08:33 UTC

    Your script currently shells out to a grep for every single entry in your master file. That's not good for efficiency. Why not shell out once before the loop and let join take the strain? There's even an implementation of join in pure perl if your OS doesn't have it. At that point your perl code just opens the output from join and reads in and filters the results once.

    HTH
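    A sketch of what that might look like, assuming the two files from the question (the temp files and their tiny contents below are illustrative stand-ins; note that join(1) expects both inputs sorted lexically on the join field, so large numeric keys may need a pre-sort with sort -k1,1):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Tiny stand-ins for the real 30MB/500MB files; contents are illustrative.
    my ($mfh, $master) = tempfile();
    print $mfh "1\n2\n3\n";
    close $mfh;
    my ($ifh, $index) = tempfile();
    print $ifh "1\tatc\n2\tgca\n3\tatt\n";
    close $ifh;

    # join(1) does the whole correlation in one pass; the list form of
    # open avoids the shell entirely.
    open my $fh, '-|', 'join', '-t', "\t", $master, $index
        or die "Can't run join: $!";

    my @idxArray;
    while (<$fh>) {
        chomp;
        my (undef, $value) = split /\t/;
        push @idxArray, $value;
    }
    close $fh or die "join failed: $?";

    print "@idxArray\n";    # atc gca att
    ```

    The payoff is that the 500MB index is scanned once instead of once per master entry.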

Re: Better way to search an Array?
by Anonymous Monk on Jun 12, 2015 at 02:17 UTC

    Sometimes it can be better to do it externally.

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1130122
    use strict;
    use warnings;

    my $want = undef;
    my @idxArray;

    # sort files together :)
    open my $fh, '-|', '/usr/bin/sort -n d.master d.index'
        or die "$! opening sort";
    while (<$fh>) {
        if ( /^(\d+)$/ ) {
            $want = $1;
        }
        elsif ( $want && /^$want\s+(\S+)/ ) {
            push @idxArray, $1;
            $want = undef;
        }
        else {
            $want = undef;
        }
    }
    close $fh or die "$! on close of sort";

    use YAML;
    print Dump \@idxArray;    # for debugging

    __END__

    I hope you're on an OS with a decent sort :)

      I know it's a week late, but I just wanted to say: thank you!! Your suggestion is amazing, and versatile too. I can't believe I have been limping along without even thinking about using sort.
Re: Better way to search an Array?
by Anonymous Monk on Jun 12, 2015 at 01:22 UTC

    Does all of your "index data" look exactly like that - each line begins with its line number, followed by whitespace, followed by three letters? If you could confirm that for us, that would allow for some major optimizations.

    Anyway, the usual suggestions for speeding something like this up are to load the index into a hash, or, if the data structures get too big for memory, to go to disk (e.g. a tied hash or array) or even to a database. See also the recent thread improve performance for some ideas on using a hash.
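
    A minimal sketch of the hash idea, with small in-memory sample data standing in for the real files: load the index (key, tab, value) into a hash once, after which every master lookup is O(1).

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative stand-ins for the index and master files.
    my @index_lines = ("1\tatc", "2\tgca", "3\tatt", "4\tggc", "5\taaa");
    my @master_keys = (1, 3, 5);

    # Build the lookup table once.
    my %index;
    for my $line (@index_lines) {
        my ($key, $value) = split /\t/, $line;
        $index{$key} = $value;
    }

    # Skip master keys with no index entry rather than pushing undef.
    my @idxArray = map { exists $index{$_} ? $index{$_} : () } @master_keys;
    print "@idxArray\n";    # atc att aaa
    ```

    For the real 500MB index this hash may or may not fit in memory, which is why the tied-hash and database fallbacks are mentioned above.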

      Thank you for clarifying. Every line is the format:

      {number sequential}{tab space}{letters, variable length}

Re: Better way to search an Array?
by akuk (Beadle) on Jun 12, 2015 at 12:17 UTC

    Maybe you can try reading the index data into a hash first, then match each master key against it and collect the value. I am assuming your master data is in the format you have provided.

    #!/usr/bin/perl -w
    use strict;

    my $file1 = "file1.txt";
    my $file2 = "file2.txt";
    my %hash  = ();
    my %ahash = ();

    open F1, $file1 or die "Can't open $file1 $!";
    open F2, $file2 or die "Can't open $file2 $!";

    while (<F1>) {
        my $line = $_;
        chomp($line);
        $hash{$line} = $line;
    }

    while (<F2>) {
        my $line = $_;
        chomp($line);
        (my $w1, my $w2) = split(/ +/, $line);
        $ahash{$w1} = $w2;
    }

    # Matching of two hashes
    for ( keys %hash ) {
        unless ( exists $ahash{$_} ) {
            print "$_: not found in second hash\n";
            next;
        }
        if ( $hash{$_} eq $ahash{$_} ) {
            print "$_: values are equal\n";    # Do something with the value
        }
        else {
            print "$_: values are not equal\n";
        }
    }

    Maybe it will give you an idea of how to compare the two files' data using a hash.

Re: Better way to search an Array? (Tie::File options)
by Anonymous Monk on Jun 12, 2015 at 01:11 UTC

    Yes, Tie::File can be slow, so try one of the options (about memory buffers) and see what happens
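
    For instance, a sketch of turning up Tie::File's read cache via its 'memory' option, which sets the cache size in bytes (the documented default is 2,000,000); the temp file here is an illustrative stand-in for the real master list:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Tie::File;
    use File::Temp qw(tempfile);

    # Create a small stand-in file with three records.
    my ($tmp, $file) = tempfile();
    print $tmp "1\n2\n3\n";
    close $tmp;

    # Allow a 20 MB cache instead of the ~2 MB default.
    tie my @master, 'Tie::File', $file, memory => 20_000_000
        or die "Can't tie $file: $!";

    my $count = scalar @master;
    print "$count lines\n";    # 3 lines
    ```

    Whether this helps here depends on the access pattern; a purely sequential pass gains less than repeated random access would.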

      Tie::File can be slow

      Care to back that statement up in regards to this question?

      Did you miss the qx(grep '^$pos' $index) being executed in every iteration of the loop?


        If you're trying to help the OP, you should respond to Better way to search an Array?