PerlMonks  

match two files

by yueli711 (Sexton)
on Dec 09, 2020 at 05:21 UTC ( [id://11124864]=perlquestion )

yueli711 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I want to quickly match two files according to the first column. Thank you in advance for any help! Best, Yue

Input tmp01:

    PeptideID ProteinID
    6         109521
    7         741
    11        681
    11        780
    20        2352
    27        1490
    27        1491
    27        1492
    28        51996
    29        1490
    29        1491
    29        1492
    30        1490
    30        1491
    30        1492

Input tmp02:

    PeptideID SpectrumID Sequence
    6         53663      KMGEGR
    7         53663      KPPSGK
    11        144492     NNDALR
    20        15547      SPAKPK
    27        55547      LHKPPK
    28        55547      LFVGRK
    29        55504      LHKPPK
    30        55602      LHKPPK

Output tmp11_quick:

    PeptideID ProteinID SpectrumID Sequence
    6         109521    53663      KMGEGR
    7         741       53663      KPPSGK
    11        681       144492     NNDALR
    11        780       144492     NNDALR
    20        2352      15547      SPAKPK
    27        1490      55547      LHKPPK
    27        1491      55547      LHKPPK
    27        1492      55547      LHKPPK
    28        51996     55547      LFVGRK
    29        1490      55504      LHKPPK
    29        1491      55504      LHKPPK
    29        1492      55504      LHKPPK
    30        1490      55602      LHKPPK
    30        1491      55602      LHKPPK
    30        1492      55602      LHKPPK

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Fcntl ':seek';

    open my $TAB01, '<', 'tmp01' or die "Cannot open 'tmp01' because: $!";
    open my $TAB02, '<', 'tmp02' or die "Cannot open 'tmp02' because: $!";
    open my $OUT, '>', 'tmp11_QUICK' or die "Cannot open 'tmp12_01' because: $!";

    my $pos = tell $TAB01;
    my %tab01_data;
    while ( <$TAB01> ) {
        my ( $first,$second) = split /\t+/;
        print $OUT ",$_" unless length $first;
        push @{ $tab01_data{ $first } }, $pos;
        $pos = tell $TAB01;
    }

    my %tab02_data;
    while ( <$TAB02> ) {
        my ( $first,$second, $third ) = split /\t+/ ;
        next unless exists $tab01_data{ $first };
        for my $pos ( @{ $tab01_data{ $first } } ) {
            seek $TAB01, $pos, SEEK_SET or die "Cannot seek on 'tmp01' because: $!";
            print $OUT "$tab01_data","$tab02_data{$second}" ,scalar <$TAB01>;
        }
    }
    close $TAB01;
    close $TAB02;
    close $OUT;

Replies are listed 'Best First'.
Re: match two files
by GrandFather (Saint) on Dec 09, 2020 at 06:24 UTC

    Most problems of this sort are easiest to code using a hash to store the important content of one of the files. In this case I just picked the first file, but a better choice for a real world version would probably read the second file first.

    Note that the code processing the first file builds a list of protein IDs for later use.

        #!/usr/bin/perl
        use warnings;
        use strict;

        my $tmp01 = <<FILE1;
        PeptideID	ProteinID
        6	109521
        7	741
        11	681
        11	780
        20	2352
        27	1490
        27	1491
        27	1492
        28	51996
        29	1490
        29	1491
        29	1492
        30	1490
        30	1491
        30	1492
        FILE1

        my $tmp02 = <<FILE2;
        PeptideID	SpectrumID	Sequence
        6	53663	KMGEGR
        7	53663	KPPSGK
        11	144492	NNDALR
        20	15547	SPAKPK
        27	55547	LHKPPK
        28	55547	LFVGRK
        29	55504	LHKPPK
        30	55602	LHKPPK
        FILE2

        my $tmp11_QUICK = '';
        my %peptides;

        open my $TAB01, '<', \$tmp01;

        while (<$TAB01>) {
            chomp;
            my ($peptide, $protein) = split /\t+/;
            next if $peptide !~ /\d/;

            #$peptides{$peptide} //= [];
            push @{$peptides{$peptide}}, $protein;
        }

        open my $TAB02, '<', \$tmp02;
        open my $OUT, '>', \$tmp11_QUICK;

        while (<$TAB02>) {
            chomp;
            my ($peptide, $spectrum, $sequence) = split /\t+/;
            next if $peptide !~ /\d/;

            print $OUT "$peptide\t$_\t$spectrum\t$sequence\n" for @{$peptides{$peptide}};
        }

        close $OUT;
        print $tmp11_QUICK;

    Prints:

        6         109521    53663      KMGEGR
        7         741       53663      KPPSGK
        11        681       144492     NNDALR
        11        780       144492     NNDALR
        20        2352      15547      SPAKPK
        27        1490      55547      LHKPPK
        27        1491      55547      LHKPPK
        27        1492      55547      LHKPPK
        28        51996     55547      LFVGRK
        29        1490      55504      LHKPPK
        29        1491      55504      LHKPPK
        29        1492      55504      LHKPPK
        30        1490      55602      LHKPPK
        30        1491      55602      LHKPPK
        30        1492      55602      LHKPPK

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
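
GrandFather's remark that a real-world version would probably read the second file first can be sketched roughly as follows. This is only a sketch under one stated assumption: tmp02 holds exactly one row per PeptideID (true for the sample data), so its rows fit in a plain hash while tmp01 is streamed. The names `join_on_first_column` and `%rest_for` are made up for illustration and do not appear in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join two tab-separated tables on their first column.
# $small_fh: handle with (assumed) one line per PeptideID, e.g. tmp02.
# $large_fh: handle that may repeat a PeptideID, e.g. tmp01.
# Returns the joined lines without trailing newlines.
sub join_on_first_column {
    my ($small_fh, $large_fh) = @_;

    my %rest_for;    # PeptideID => remaining columns of the small table
    while (my $line = <$small_fh>) {
        chomp $line;
        my ($id, @rest) = split /\t+/, $line;
        next unless defined $id && $id =~ /^\d+$/;    # skip header and blank lines
        $rest_for{$id} = \@rest;
    }

    my @joined;
    while (my $line = <$large_fh>) {
        chomp $line;
        my ($id, @cols) = split /\t+/, $line;
        next unless defined $id && exists $rest_for{$id};
        push @joined, join "\t", $id, @cols, @{ $rest_for{$id} };
    }
    return @joined;
}

# Demo on in-memory files, in the same style as the reply above.
my $demo_tmp01 = "PeptideID\tProteinID\n11\t681\n11\t780\n20\t2352\n";
my $demo_tmp02 = "PeptideID\tSpectrumID\tSequence\n11\t144492\tNNDALR\n20\t15547\tSPAKPK\n";
open my $demo_small, '<', \$demo_tmp02 or die "Cannot open in-memory tmp02: $!";
open my $demo_large, '<', \$demo_tmp01 or die "Cannot open in-memory tmp01: $!";
print "$_\n" for join_on_first_column($demo_small, $demo_large);
```

Only the smaller lookup table is held in memory, and the output order follows tmp01, which matches the order of the expected output in the question.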
      Hello GrandFather, Thank you so much for your great help! Really appreciate your great help! Best, Yue
          #!/usr/bin/perl
          use warnings;
          use strict;

          open my $TAB01, '<', 'tmp01_quick' or die "Cannot open 'donor_82_01.csv' because: $!";
          open my $TAB02, '<', 'tmp02_quick' or die "Cannot open 'tmp12' because: $!";
          open my $OUT, '>', 'tmp12_01_QUICK' or die "Cannot open 'tmp12_01' because: $!";

          my $tmp11_QUICK = '';
          my %peptides;

          #open my $TAB01, '<', \$tmp01_quick;

          while (<$TAB01>) {
              chomp;
              my ($peptide, $protein) = split /\t+/;
              next if $peptide !~ /\d/;

              #$peptides{$peptide} //= [];
              push @{$peptides{$peptide}}, $protein;
          }

          #open my $TAB02, '<', \$tmp02_quick;
          #open my $OUT, '>', \$tmp11_QUICK;

          while (<$TAB02>) {
              chomp;
              my ($peptide, $spectrum, $sequence) = split /\t+/;
              next if $peptide !~ /\d/;

              print $OUT "$peptide\t$_\t$spectrum\t$sequence\n" for @{$peptides{$peptide}};
          }

          close $OUT;
          print $tmp11_QUICK;

Re: match two files
by AnomalousMonk (Archbishop) on Dec 09, 2020 at 06:01 UTC
    my ( $first,$second) = split /t+/;
    ...
    my ( $first,$second, $third ) = split /t+/ ;

    No complete solution, but a quick tip: In the two quoted statements, if you want to split $_ on multiple tabs, use /\t+/ (note the backslash).
        my ( $first,$second) = split /\t+/;


    Give a man a fish:  <%-{-{-{-<
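
The difference AnomalousMonk points out can be seen in a small sketch. The two sample lines are taken from the data in the question; the variable names are chosen only for this illustration.

```perl
use strict;
use warnings;

my $data_line   = "27\t1490";
my $header_line = "PeptideID\tProteinID";

# /t+/ splits on the letter "t", not on tab characters.
my @data_wrong   = split /t+/, $data_line;      # no "t" present: one field
my @header_wrong = split /t+/, $header_line;    # cut at the "t" in "Peptide" and "Protein"

# /\t+/ splits on runs of one or more tab characters.
my @data_right   = split /\t+/, $data_line;
my @header_right = split /\t+/, $header_line;

printf "data:   %d field(s) with /t+/, %d with /\\t+/\n",
    scalar @data_wrong, scalar @data_right;
printf "header: %d field(s) with /t+/, %d with /\\t+/\n",
    scalar @header_wrong, scalar @header_right;
```

With `/t+/` the data line stays in one piece, so the whole line ends up in `$first` and `$second` is undefined.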

      Hello AnomalousMonk, Thank you so much for your help! It still can not print out what I wanted. Best, Yue

          ARRAY(0x56213ba811e0)PeptideID ProteinID
          ARRAY(0x56213ba80a48)6 109521
          ARRAY(0x56213baa3148)7 741
          ARRAY(0x56213ba75888)11 681
          ARRAY(0x56213ba75888)11 780
          ARRAY(0x56213ba759c0)20 2352
          ARRAY(0x56213bb278e8)27 1490
          ARRAY(0x56213bb278e8)27 1491
          ARRAY(0x56213bb278e8)27 1492
          ARRAY(0x56213bb27bd0)28 51996

Re: match two files
by haukex (Archbishop) on Dec 09, 2020 at 12:31 UTC

    Although overkill for the example you showed, it might be useful to see how this can be done with SQL. The following uses DBD::CSV so it can work with your input files directly to produce your expected output, though for "production" work you would probably want to use a real database. Also, my output code is somewhat simplistic, one could use Text::CSV(_XS) for that purpose as well.

        use warnings;
        use strict;
        use DBI;

        my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
            csv_sep_char => "\t",
            f_ext        => '',
            RaiseError   => 1,
        }) or die "Cannot connect: $DBI::errstr";

        my $sth = $dbh->prepare(<<'ENDSQL');
        SELECT tmp01.PeptideID as PeptideID, tmp01.ProteinID as ProteinID,
            tmp02.SpectrumID as SpectrumID, tmp02.Sequence as Sequence
        FROM tmp01 LEFT OUTER JOIN tmp02
        ON tmp01.PeptideID = tmp02.PeptideID
        ENDSQL
        $sth->execute;

        print join("\t", @{ $sth->{NAME} } ), "\n";
        while ( my $row = $sth->fetchrow_arrayref ) {
            print join("\t", @$row ), "\n";
        }

    Update: Also make sure to read up on the different kinds of SQL JOINs to see the difference between those and which one is appropriate for your case.
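
To make the difference between JOIN kinds concrete without a database, here is a rough sketch in plain Perl. The miniature tables are hypothetical: PeptideID 99 is invented to stand for a row in tmp01 that has no partner in tmp02, and the string 'NULL' is just a placeholder chosen for this illustration.

```perl
use strict;
use warnings;

# Hypothetical miniature tables; PeptideID 99 has no partner in %spectrum_for.
my @tmp01_rows   = ( [ 6, 109521 ], [ 99, 4242 ] );    # [PeptideID, ProteinID]
my %spectrum_for = ( 6 => [ 53663, 'KMGEGR' ] );        # PeptideID => [SpectrumID, Sequence]

# INNER JOIN: keep only the tmp01 rows that have a match.
my @inner = map  { [ @$_, @{ $spectrum_for{ $_->[0] } } ] }
            grep { exists $spectrum_for{ $_->[0] } } @tmp01_rows;

# LEFT OUTER JOIN: keep every tmp01 row; fill the missing columns with a placeholder.
my @left = map {
    my $match = $spectrum_for{ $_->[0] } // [ 'NULL', 'NULL' ];
    [ @$_, @$match ];
} @tmp01_rows;

print "inner: ", join("\t", @$_), "\n" for @inner;
print "left:  ", join("\t", @$_), "\n" for @left;
```

With the sample data from the question every PeptideID in tmp01 also occurs in tmp02, so both kinds of join would give the same rows there; they only differ once unmatched IDs appear.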

      Hello haukex, Thank you so much for your great help! Thank you again! Best, Yue
Re: match two files
by siberia-man (Friar) on Dec 09, 2020 at 11:16 UTC
    The script below produces the same output as in your example output. Run it as follows (assuming the script is stored as z):
    perl z tmp01 tmp02

    It's formatted with generous whitespace for better readability.

        #!/usr/bin/env perl

        use strict;
        use warnings;

        my $seen;

        while ( <> ) {
            next unless /^\d/;

            s/\s*$//;

            next unless m/
                #           [file1]     [file2]
                ^
                (\S+)     # PeptideID   PeptideID
                \s+
                (\S+)     # ProteinID   SpectrumID
                (?:
                    \s+
                    (\S+) # -----       Sequence
                )?
                $
            /x;

            if ( $3 ) {
                $seen->{$1}->{SpectrumID} = $2;
                $seen->{$1}->{Sequence} = $3;
            } else {
                push @{ $seen->{$1}->{ProteinID} }, $2;
            }
        }

        sub frmt {
            print join("\t", @_) . "\n";
        }

        frmt qw( PeptideID ProteinID SpectrumID Sequence );

        foreach my $k ( sort { $a <=> $b } keys %{ $seen } ) {
            foreach my $p ( @{ $seen->{$k}->{ProteinID} } ) {
                frmt $k, $p, $seen->{$k}->{SpectrumID}, $seen->{$k}->{Sequence};
            }
        }

      Hello siberia-man, Thank you so much for your great help! There are still some errors. Thank you again, it is really appreciated! Best, Yue

          perl match_quick02.pl tmp01_quick tmp02_quick
          syntax error at match_quick02.pl line 42, near "+}"
          syntax error at match_quick02.pl line 44, near "}"
          Execution of match_quick02.pl aborted due to compilation errors.

        I suspect you used cut+paste rather than the download link, for the  +} your error shows is not in the actual code anywhere.

Re: match two files
by tybalt89 (Monsignor) on Dec 09, 2020 at 11:31 UTC
        #!/usr/bin/perl

        use strict; # https://perlmonks.org/?node_id=11124864
        use warnings;
        use Path::Tiny;

        my %two = map /^(\d+)(.*)/, path('tmp02')->lines;
        my $out = "PeptideID ProteinID SpectrumID Sequence\n";
        s/^(\d+)(.*)/$1$2$two{$1}/ and $out .= $_ for path('tmp01')->lines;
        path('tmp11_quick')->spew($out);

      Hello tybalt89, Thank you so much for your great help! It works, but still has a problem. Thank you again, it is really appreciated! Best, Yue

          Use of uninitialized value in concatenation (.) or string at match_quick03.pl line 9.

        Please give a (small) dataset that shows that problem.
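
No dataset was posted, so the cause of that warning is not known. One plausible candidate, offered purely as an assumption, is a PeptideID that occurs in tmp01 but not in tmp02, which leaves `$two{$1}` undefined in the substitution. The sketch below reproduces that situation with an invented ID (99) and shows a guarded variant; `@unguarded`, `@guarded`, and `@seen_warnings` are names made up for this illustration.

```perl
use strict;
use warnings;

# Lookup as it would be built from tmp02: PeptideID => rest of the line (leading tab included).
my %two = ( 6 => "\t53663\tKMGEGR" );

# Collect warnings instead of printing them, so the demo can report how many occurred.
my @seen_warnings;
local $SIG{__WARN__} = sub { push @seen_warnings, $_[0] };

my ( @unguarded, @guarded );
for my $line ( "6\t109521", "99\t4242" ) {    # 99 is a hypothetical unmatched ID
    # Same substitution as in the post: warns when $two{$1} is undefined.
    ( my $copy = $line ) =~ s/^(\d+)(.*)/$1$2$two{$1}/;
    push @unguarded, $copy;

    # Guarded variant: only emit lines whose ID has a partner in %two.
    if ( $line =~ /^(\d+)/ && exists $two{$1} ) {
        push @guarded, $line . $two{$1};
    }
}

print "warnings seen: ", scalar(@seen_warnings), "\n";
print "$_\n" for @guarded;
```

Whether unmatched rows should be dropped, as here, or kept with empty columns depends on what the output is used for; the thread does not say.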

Re: match two files
by kcott (Archbishop) on Dec 10, 2020 at 06:41 UTC

    G'day yueli711,

    This uses the same principle as ++GrandFather described. See Notes below for differences and other features.

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        my ($in1, $in2, $out) = qw{tmp01 tmp02 tmp11_quick};
        my (%data, @headings);

        {
            open my $fh, '<', $in1;
            while (<$fh>) {
                if ($. == 1) {
                    push @headings, split;
                }
                else {
                    my ($pep, $prot) = split;
                    push @{$data{$pep}}, $prot;
                }
            }
        }

        {
            my $fmt = "%-9s %-9s %-10s %-8s\n";
            open my $fh_in, '<', $in2;
            open my $fh_out, '>', $out;
            while (<$fh_in>) {
                my ($id, @rest) = split;
                if ($. == 1) {
                    printf $fh_out $fmt, @headings, @rest;
                }
                else {
                    for (@{$data{$id}}) {
                        printf $fh_out $fmt, $id, $_, @rest;
                    }
                }
            }
        }

    Output:

        PeptideID ProteinID SpectrumID Sequence
        6         109521    53663      KMGEGR
        7         741       53663      KPPSGK
        11        681       144492     NNDALR
        11        780       144492     NNDALR
        20        2352      15547      SPAKPK
        27        1490      55547      LHKPPK
        27        1491      55547      LHKPPK
        27        1492      55547      LHKPPK
        28        51996     55547      LFVGRK
        29        1490      55504      LHKPPK
        29        1491      55504      LHKPPK
        29        1492      55504      LHKPPK
        30        1490      55602      LHKPPK
        30        1491      55602      LHKPPK
        30        1492      55602      LHKPPK

    Notes:

    • This code deals with real files, not in-memory files.
    • All I/O is performed in anonymous blocks. Files are only open while needed. Filehandles close automatically at the end of these blocks.
    • autodie removes the need to hand-craft your own I/O exception messages. It also won't make mistakes like you have in a later post: file with I/O problem is tmp01_quick; message refers to a different file, i.e. donor_82_01.csv. (You have three errors like that.)
    • When I copied your data to files on my system, the tabs became spaces. I only needed split without arguments; you should continue to use split /\t+/. You should also take note of chomp used in GrandFather's code.
    • I've used printf to improve output formatting; however, that may not be what you want.

    You should also take a look at Text::CSV. (Note, it runs faster if you also have Text::CSV_XS installed.)

    — Ken
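
Ken's point about autodie can be illustrated with a short sketch: the error text is generated from the arguments actually passed to `open`, so it cannot name the wrong file. The filename below is hypothetical and is assumed not to exist in the current directory.

```perl
use strict;
use warnings;
use autodie;

my $missing = 'no_such_file_for_demo.tsv';    # hypothetical name, assumed not to exist

# Under autodie a failed open throws an exception; capture it as a string.
my $error = do {
    local $@;
    eval { open my $fh, '<', $missing; 1 } ? '' : "$@";
};

print $error;
```

The captured message names `no_such_file_for_demo.tsv` together with the system error, without any hand-written `or die "..."` text that could drift out of sync with the filename.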

Re: match two files
by leszekdubiel (Scribe) on Dec 10, 2020 at 20:10 UTC

    Here is a simple solution:

        #!/usr/bin/perl

        my @tmp1 = map { chomp; [ split /\s+/, $_, -1 ] } `cat tmp01`;   # (*)
        my %tmp2 = map { chomp; my @a = split /\s+/, $_, -1; ($a[0], \@a) } `cat tmp02`;
        print join "\t", @$_, @{ $tmp2{$$_[0]} }, "\n" for @tmp1;

        # (*) instead of `cat tmp01` it is safer to use:
        #use Path::Tiny;
        #path("tmp01")->lines_utf8({chomp => 1});

      Here is a simple solution:

      "simple" by what measure? Fewer statements does not equate to simple.

      If using Path::Tiny is safer why do you demonstrate less safe code?

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

        I prefer to read whole files, and print them joined. All at once. I don't like algorithmic C-style flow: open file, read line by line, process line, print line...

        One can use whichever solution suits the situation better: `cat file` or Path::Tiny. I don't know of a simpler way to slurp a whole file.

      Here is a simple solution:

          #!/usr/bin/perl

          my @tmp1 = map { chomp; [ split /\s+/, $_, -1 ] } `cat tmp01`;   # (*)
          my %tmp2 = map { chomp; my @a = split /\s+/, $_, -1; ($a[0], \@a) } `cat tmp02`;
          print join "\t", @$_, @{ $tmp2{$$_[0]} }, "\n" for @tmp1;

          # (*) instead of `cat tmp01` it is safer to use:
          #use Path::Tiny;
          #path("tmp01")->lines_utf8({chomp => 1});

      Congratulations, you have won not just one, but two Useless Use of Cat Awards!

      Furthermore, you have prepared two shell injection vulnerabilities (The problem of "the" default shell), you have artificially restricted the "solution" to Unix systems (Windows and many other systems have no cat), and you are limiting the sum of the input file sizes to significantly less than the available RAM and swap by reading both input files all at once instead of line by line.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
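
The alternative Alexander's criticism points toward, reading with Perl's own `open` and processing one line at a time, might look like the sketch below. The helper name `each_row` and the callback interface are inventions for this illustration, and the demo writes its own temporary file rather than assuming tmp01 exists.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Call $callback with the tab-separated fields of each line of $path.
# Three-argument open with an explicit '<' mode starts no shell and no external cat,
# and only one line is held in memory at a time.
sub each_row {
    my ($path, $callback) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    while (my $line = <$fh>) {
        chomp $line;
        $callback->(split /\t+/, $line);
    }
    close $fh;
    return;
}

# Demo on a temporary file holding two rows from the question's tmp01.
my ($demo_fh, $demo_path) = tempfile(UNLINK => 1);
print {$demo_fh} "27\t1490\n27\t1491\n";
close $demo_fh;

my $row_count = 0;
each_row($demo_path, sub {
    my ($peptide, $protein) = @_;
    $row_count++;
    print "$peptide -> $protein\n";
});
```

This runs wherever Perl runs, including systems without `cat`, and memory use does not grow with the size of the streamed file.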

Node Type: perlquestion [id://11124864]
Approved by GrandFather
Front-paged by Corion