PerlMonks  

match two files

by yueli711 (Sexton)
on Jun 03, 2020 at 09:04 UTC

yueli711 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I wrote some Perl code to match two files. But when the input file is very large, it takes a very long time to run. How can I shorten the running time by changing some code? Thanks in advance for any help! Best, Yue

open(IN1,"tmp12") || die "Cannot open this file";
@lines1 = <IN1>;
open(IN2,"donor_82_01.csv") || die "Cannot open this file";
@lines2 = <IN2>;
open(OUT,">tmp12_01") || die "Cannot open this file";
for $item1(@lines1){
    chomp $item1;
    #print OUT $item1,"\t";
    @tmp1=split(/\t+/, $item1);
    for $item2(@lines2){
        chomp $item2;
        @tmp2=split(/\,+/, $item2);
        if ($tmp1[1] eq $tmp2[0]){
            print OUT $tmp1[0],",",$item2;
            last;
        }
        $i++
    }
    print OUT "\n";
}
close(IN1);
close(IN2);
close(OUT);

The file of tmp12 is:

A1BG ENSG00000121410
A1BG-AS1 ENSG00000268895
A1CF ENSG00000148584
A2M ENSG00000175899
A2M-AS1 ENSG00000245105
A2ML1 ENSG00000166535
A2ML1-AS1 ENSG00000256661
A2ML1-AS2 ENSG00000256904
A3GALT2 ENSG00000184389
A4GALT ENSG00000128274
A4GNT ENSG00000118017
AAAS ENSG00000094914
AACS ENSG00000081760
AADAC ENSG00000114771
AADACL2 ENSG00000197953
AADACL2-AS1 ENSG00000242908
AADACL3 ENSG00000188984
AADACL4 ENSG00000204518
AADAT ENSG00000109576
AAGAB ENSG00000103591
AAK1 ENSG00000115977
AAMDC ENSG00000087884
AAMP ENSG00000127837
AANAT ENSG00000129673
AAR2 ENSG00000131043
AARD ENSG00000205002
AARS1 ENSG00000090861
AARS2 ENSG00000124608
AARSD1 ENSG00000266967
AASDH ENSG00000157426
AASDHPPT ENSG00000149313
AASS ENSG00000008311
AATBC ENSG00000215458
AATF ENSG00000275700
AATK ENSG00000181409
ABALON ENSG00000281376
ABAT ENSG00000183044
ABCA1 ENSG00000165029
ABCA10 ENSG00000154263
ABCA12 ENSG00000144452
ABCA13 ENSG00000179869
ABCA2 ENSG00000107331
ABCA3 ENSG00000167972
ABCA4 ENSG00000198691
ABCA5 ENSG00000154265
ABCA6 ENSG00000154262
ABCA7 ENSG00000064687
ABCA8 ENSG00000141338
ABCA9 ENSG00000154258

The file of donor_82_01.csv is:

,AAACCTGAGCGTTTAC-1,AAACCTGAGTCGCCGT-1,AAACCTGGTAGGACAC-1,AAACCTGGTGCCTTGG-1,AAACCTGGTTCAGCGC-1
ENSG00000148584,0,0,0,0,0
ENSG00000237613,0,0,0,0,0
ENSG00000186092,0,0,0,0,0
ENSG00000118017,0,0,0,0,0
ENSG00000239945,0,0,0,0,0
ENSG00000205002,0,0,0,0,0
ENSG00000090861,0,0,0,0,0
ENSG00000279928,0,0,0,0,0
ENSG00000181409,0,1,0,1,0
ENSG00000228463,0,0,0,0,0
ENSG00000236743,0,0,0,0,0
ENSG00000165029,0,0,0,0,0
ENSG00000144452,0,0,0,0,0
ENSG00000278566,0,0,0,0,0
ENSG00000179869,0,0,0,0,0
ENSG00000235146,0,0,0,0,0
ENSG00000154262,0,0,0,0,0
ENSG00000141338,0,0,0,0,0
ENSG00000154258,0,0,0,0,0

Replies are listed 'Best First'.
Re: match two files
by Corion (Patriarch) on Jun 03, 2020 at 09:11 UTC

    This is a FAQ. See perlfaq4 on "How do I compute the intersection of two arrays?".

    Your code is slow because for every item in @lines1 it looks at all items in @lines2. If you precompute a lookup table ("hash", in Perl data structures) for the items in @lines2, you can find the items in @lines2 much faster.
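
A minimal sketch of the lookup-table idea, in the spirit of the perlfaq4 entry. The helper name intersect() is made up for this illustration, and the IDs are a few values copied from the sample files in the question:

```perl
use strict;
use warnings;

# intersect() is a hypothetical helper name used only for this sketch.
# It returns the items of @$list1 that also occur in @$list2, in @$list1 order.
sub intersect {
    my ( $list1, $list2 ) = @_;

    # Build the lookup table once: one hash key per item in the second list.
    my %seen = map { $_ => 1 } @$list2;

    # One pass over the first list; each test is a hash lookup, not a scan.
    return grep { exists $seen{$_} } @$list1;
}

my @ids1 = qw(ENSG00000121410 ENSG00000148584 ENSG00000118017);
my @ids2 = qw(ENSG00000148584 ENSG00000237613 ENSG00000118017);

print "$_\n" for intersect( \@ids1, \@ids2 );
```

The hash costs memory proportional to the second list, which is usually a fair trade for not rescanning that list for every item of the first.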

Re: match two files
by hippo (Bishop) on Jun 03, 2020 at 09:21 UTC
    How can I shorten the running time by changing some code?

    Although it's hard to spot because of the random indenting, you have a pair of nested loops. Inside the inner loop you have this line:

    $i++

    which serves absolutely no purpose. The first change you should make is therefore to remove this line.

    Then you might look at your algorithm. Why are you doing the same processing on the entries in @lines2 over and over again? Just process it once, pop the results in a hash for fast lookup and your code will whizz.
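
A rough sketch of "process it once and pop the results in a hash", working on in-memory lines rather than the real files. The sub name match_lines() is hypothetical, the sample rows are copied from the question, and the sketch assumes the first file really is tab-separated. Unlike the original script it does not print an empty line for unmatched entries:

```perl
use strict;
use warnings;

# match_lines() is a hypothetical helper for this sketch, not code from the thread.
# $lines1: "name<TAB>id" lines, $lines2: "id,value,value,..." lines.
sub match_lines {
    my ( $lines1, $lines2 ) = @_;

    # Process the second file once: key each CSV line on its first field.
    my %by_id;
    for my $line (@$lines2) {
        chomp( my $copy = $line );
        my ($id) = split /,/, $copy, 2;
        next unless defined $id && length $id;    # skips the header, whose first field is empty
        $by_id{$id} //= $copy;                    # keep the first line per ID, as "last" did
    }

    # One pass over the first file; each lookup is a single hash access.
    my @out;
    for my $line (@$lines1) {
        chomp( my $copy = $line );
        my ( $name, $id ) = split /\t+/, $copy;
        next unless defined $id && exists $by_id{$id};
        push @out, "$name,$by_id{$id}";
    }
    return @out;
}

my @lines1 = ( "A1BG\tENSG00000121410\n", "A1CF\tENSG00000148584\n", "AATK\tENSG00000181409\n" );
my @lines2 = (
    ",AAACCTGAGCGTTTAC-1,AAACCTGAGTCGCCGT-1,AAACCTGGTAGGACAC-1,AAACCTGGTGCCTTGG-1,AAACCTGGTTCAGCGC-1\n",
    "ENSG00000148584,0,0,0,0,0\n",
    "ENSG00000181409,0,1,0,1,0\n",
);

print "$_\n" for match_lines( \@lines1, \@lines2 );
```

For the real files the two arrays would be filled by reading tmp12 and donor_82_01.csv; the matching logic stays the same.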

    Three more tips:

    Good luck.

Re: match two files
by jwkrahn (Abbot) on Jun 03, 2020 at 12:32 UTC

    This will probably shorten the running time but I don't have your data to test it on, so good luck.

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Fcntl ':seek';

    open my $CSV, '<', 'donor_82_01.csv' or die "Cannot open 'donor_82_01.csv' because: $!";

    # Index the CSV once: remember the byte offset of each line, keyed on its first field.
    my $pos = tell $CSV;
    my %csv_data;

    while ( <$CSV> ) {
        my ( $first ) = split /,+/;
        push @{ $csv_data{ $first } }, $pos;
        $pos = tell $CSV;
    }

    open my $TAB, '<', 'tmp12' or die "Cannot open 'tmp12' because: $!";
    open my $OUT, '>', 'tmp12_02' or die "Cannot open 'tmp12_02' because: $!";

    while ( <$TAB> ) {
        chomp;    # drop the newline, otherwise $second can never equal a CSV key
        my ( $first, $second ) = split /\t+/;
        next unless exists $csv_data{ $second };
        for my $pos ( @{ $csv_data{ $second } } ) {
            # Jump back to the stored offset and re-read the matching CSV line.
            seek $CSV, $pos, SEEK_SET or die "Cannot seek on 'donor_82_01.csv' because: $!";
            print $OUT "$first,", scalar <$CSV>;
        }
    }

    close $CSV;
    close $TAB;
    close $OUT;

      Hello jwkrahn, Thank you so much for your useful code! Thank you again, it is really appreciated!

      li@li-HP-$ perl match12.pl
      Use of uninitialized value $second in exists at match12.pl line 25, <$TAB> line 1.
      Use of uninitialized value $second in hash element at match12.pl line 26, <$TAB> line 1.

        Hi!

        To get rid of the warning messages change the line:

        my ( $first, $second ) = split /\t+/;

        To this:

        my ( $first, $second ) = split or next;
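
A small sketch of why that change can help. It assumes (this is a guess, the thread does not confirm it) that the fields in tmp12 are separated by spaces rather than tabs. The two sub names are made up for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: first two fields using the tab-only pattern.
sub fields_tab {
    my ($line) = @_;
    my ( $first, $second ) = split /\t+/, $line;
    return ( $first, $second );
}

# Hypothetical helper: first two fields using split ' ', which is what a bare
# "split" does on $_: it splits on any run of whitespace and skips leading whitespace.
sub fields_ws {
    my ($line) = @_;
    my ( $first, $second ) = split ' ', $line;
    return ( $first, $second );
}

my $line = "A1BG ENSG00000121410\n";    # space-separated: an assumption about the real file

my ( undef, $tab_second ) = fields_tab($line);
my ( undef, $ws_second )  = fields_ws($line);

print defined $tab_second ? "tab pattern: '$tab_second'\n" : "tab pattern: undef\n";
print "whitespace split: '$ws_second'\n";
```

With the tab pattern the whole line ends up in the first field and the second stays undefined, which matches the warnings shown above. The "or next" part additionally skips lines on which split returns no fields at all, such as blank lines.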
Re: match two files
by perlfan (Vicar) on Jun 03, 2020 at 13:07 UTC
    Here's how I'd do it (for clarity, this was basically suggested in the first reply) - code untested:

    use strict;
    use warnings;
    use Tie::Hash::Indexed;

    tie my %lines1, 'Tie::Hash::Indexed';    # gives you the ordered hash

    open my $IN1, '<', "tmp12"           or die "Cannot open this file: $!";
    open my $IN2, '<', "donor_82_01.csv" or die "Cannot open this file: $!";

    # step 1, cache contents of $IN1 (read the first file once)
    # populate %lines1 "cache"
    for my $item1 (<$IN1>) {
        chomp $item1;    # otherwise the key below keeps its trailing newline
        my @tmp1 = split( /\t+/, $item1 );
        $lines1{ $tmp1[1] } = \@tmp1;    # save the split $item1 line, keyed on $tmp1[1]
    }

    # step 2, iterate over contents of $IN2 / look up in %lines1 to compare
    open my $OUT, '>', "tmp12_01" or die "Cannot open this file: $!";

    LOOKUP_AND_COMPARE:
    for my $item2 (<$IN2>) {
        #chomp $item2;    # not needed, see last line
        my @tmp2 = split( /\,+/, $item2 );

        # -- look up
        if ( 'ARRAY' eq ref $lines1{ $tmp2[0] } ) {
            my @tmp1 = @{ $lines1{ $tmp2[0] } };    # for clarity, not actually needed; can get value via "$lines1{ $tmp2[0] }->[0]"
            print $OUT $tmp1[0], ",", $item2;    #<-updated to fix bareword from old code
            next LOOKUP_AND_COMPARE;    # carry on with the next line of $IN2
        }
    }
    #print $OUT "\n";    # probably don't need if you don't "chomp $item2"

    Additional optimizations, depending on your constraint (time versus space):

    • if time, cache the larger of the 2 files
    • if space, cache the smaller of the 2 files

    The lesson here, as stated below, is to not nest your loops. It's called "computational complexity": the nested version does work proportional to the product of the two file sizes, while the cached version does work proportional to their sum. Basically you only want to have at most 1 level of looping. The "if" test on $lines1{ $tmp2[0] } is the "constant time" lookup that caching the first file in a hash provides, and it is how you avoid the inner loop.
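
To put rough numbers on that, here is a toy sketch that only counts operations instead of matching real data. Both sub names and the key lists are invented for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count string comparisons made by the nested-loop approach.
sub nested_comparisons {
    my ( $keys1, $keys2 ) = @_;
    my $count = 0;
    for my $k1 (@$keys1) {
        for my $k2 (@$keys2) {
            $count++;
            last if $k1 eq $k2;    # same early exit as the original script
        }
    }
    return $count;
}

# Hypothetical helper: count hash operations made by the cache-then-look-up approach.
sub hashed_operations {
    my ( $keys1, $keys2 ) = @_;
    my $count = 0;
    my %cache;
    for my $k2 (@$keys2) {
        $cache{$k2} = 1;
        $count++;    # one hash store per key in the second list
    }
    for my $k1 (@$keys1) {
        $count++;    # one hash lookup per key in the first list
        next unless exists $cache{$k1};
        # a real script would print the match here
    }
    return $count;
}

my @keys1 = ( 1 .. 1000 );
my @keys2 = ( 1001 .. 2000 );    # no overlap, the worst case for the nested loops

printf "nested: %d comparisons\n", nested_comparisons( \@keys1, \@keys2 );
printf "hashed: %d operations\n",  hashed_operations( \@keys1, \@keys2 );
```

With two lists of 1000 non-overlapping keys this reports 1000000 comparisons for the nested version and 2000 operations for the hashed one. The gap widens as the files grow.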

              print OUT $tmp1[0], ",", $item2;

      There is no bareword filehandle OUT anywhere else in your code. Perhaps you meant $OUT? warnings catches these.

        Good catch. for OP's benefit add,
        use strict; use warnings;
        And fixed the bareword file handle. Missed that when updating their code. :) ty....

      Hello perlfan, Thank you so much for your useful code! I already ran $ sudo cpan Tie::File::AsHash but I still got this error. Thank you again, it is really appreciated!

      li@lix:~$ perl match11.pl
      Can't locate Tie/Hash/Indexed.pm in @INC (you may need to install the Tie::Hash::Indexed module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at match11.pl line 4.
      BEGIN failed--compilation aborted at match11.pl line 4.

        "I already $ sudo cpan Tie::File::AsHash It still got this error."

        This module is not used by the code you thanked perlfan for. The error suggests you install Tie::Hash::Indexed, which has many install failures.

        The error message which you quoted not only tells you what's wrong but even goes so far as to suggest what you may need to do in order to fix it. Did you read it? Did you do what it suggested? What happened then?
