match two files

yueli711 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: match two files by GrandFather (Saint) on Dec 09, 2020 at 06:24 UTC
Most problems of this sort are easiest to code using a hash to store the important content of one of the files. In this case I just picked the first file, but a better choice for a real world version would probably read the second file first. Note that the code processing the first file builds a list of protein IDs for later use.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $tmp01 = <<FILE1;
    PeptideID	ProteinID
    6	109521
    7	741
    11	681
    11	780
    20	2352
    27	1490
    27	1491
    27	1492
    28	51996
    29	1490
    29	1491
    29	1492
    30	1490
    30	1491
    30	1492
    FILE1

    my $tmp02 = <<FILE2;
    PeptideID	SpectrumID	Sequence
    6	53663	KMGEGR
    7	53663	KPPSGK
    11	144492	NNDALR
    20	15547	SPAKPK
    27	55547	LHKPPK
    28	55547	LFVGRK
    29	55504	LHKPPK
    30	55602	LHKPPK
    FILE2

    my $tmp11_QUICK = '';
    my %peptides;

    open my $TAB01, '<', \$tmp01;

    while (<$TAB01>) {
        chomp;
        my ($peptide, $protein) = split /\t+/;

        next if $peptide !~ /\d/;
        #$peptides{$peptide} //= [];
        push @{$peptides{$peptide}}, $protein;
    }

    open my $TAB02, '<', \$tmp02;
    open my $OUT, '>', \$tmp11_QUICK;

    while (<$TAB02>) {
        chomp;
        my ($peptide, $spectrum, $sequence) = split /\t+/;

        next if $peptide !~ /\d/;
        print $OUT "$peptide\t$_\t$spectrum\t$sequence\n" for @{$peptides{$peptide}};
    }

    close $OUT;
    print $tmp11_QUICK;

Prints:

    6	109521	53663	KMGEGR
    7	741	53663	KPPSGK
    11	681	144492	NNDALR
    11	780	144492	NNDALR
    20	2352	15547	SPAKPK
    27	1490	55547	LHKPPK
    27	1491	55547	LHKPPK
    27	1492	55547	LHKPPK
    28	51996	55547	LFVGRK
    29	1490	55504	LHKPPK
    29	1491	55504	LHKPPK
    29	1492	55504	LHKPPK
    30	1490	55602	LHKPPK
    30	1491	55602	LHKPPK
    30	1492	55602	LHKPPK

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re^2: match two files by yueli711 (Sexton) on Dec 09, 2020 at 07:00 UTC
Hello GrandFather,

Thank you so much for your great help! Really appreciate your great help!

Best, Yue

    #!/usr/bin/perl
    use warnings;
    use strict;

    open my $TAB01, '<', 'tmp01_quick' or die "Cannot open 'donor_82_01.csv' because: $!";
    open my $TAB02, '<', 'tmp02_quick' or die "Cannot open 'tmp12' because: $!";
    open my $OUT, '>', 'tmp12_01_QUICK' or die "Cannot open 'tmp12_01' because: $!";

    my $tmp11_QUICK = '';
    my %peptides;

    #open my $TAB01, '<', \$tmp01_quick;

    while (<$TAB01>) {
        chomp;
        my ($peptide, $protein) = split /\t+/;

        next if $peptide !~ /\d/;
        #$peptides{$peptide} //= [];
        push @{$peptides{$peptide}}, $protein;
    }

    #open my $TAB02, '<', \$tmp02_quick;
    #open my $OUT, '>', \$tmp11_QUICK;

    while (<$TAB02>) {
        chomp;
        my ($peptide, $spectrum, $sequence) = split /\t+/;

        next if $peptide !~ /\d/;
        print $OUT "$peptide\t$_\t$spectrum\t$sequence\n" for @{$peptides{$peptide}};
    }

    close $OUT;
    print $tmp11_QUICK;
Re: match two files by AnomalousMonk (Archbishop) on Dec 09, 2020 at 06:01 UTC
`my ( $first,$second) = split /t+/;` `...` `my ( $first,$second, $third ) = split /t+/ ;`

No complete solution, but a quick tip: In the two quoted statements, if you want to split `$_` on multiple tabs, use `/\t+/` (note the backslash).

`my ( $first,$second) = split /\t+/;`

Give a man a fish: `<%-{-{-{-<`
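A minimal sketch of the difference the backslash makes. The input line is made up for illustration; it is not taken from the poster's files.

```perl
use strict;
use warnings;

# Hypothetical tab-separated input line.
my $line = "27\t1490";

# Without the backslash the pattern matches one or more literal "t" characters.
# This line contains no "t", so it comes back as a single field.
my @wrong = split /t+/, $line;

# With the backslash the pattern matches one or more tab characters.
my @right = split /\t+/, $line;

printf "split /t+/   gives %d field(s)\n", scalar @wrong;
printf "split /\\t+/ gives %d field(s)\n", scalar @right;
```

With `/t+/` the whole line ends up in `$first` and `$second` stays undefined, which is an easy mistake to miss when the data happens to contain no letter "t".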
Re^2: match two files by yueli711 (Sexton) on Dec 09, 2020 at 06:24 UTC
Hello AnomalousMonk,

Thank you so much for your help! It still does not print what I wanted.

Best, Yue

    ARRAY(0x56213ba811e0)PeptideID ProteinID
    ARRAY(0x56213ba80a48)6 109521
    ARRAY(0x56213baa3148)7 741
    ARRAY(0x56213ba75888)11 681
    ARRAY(0x56213ba75888)11 780
    ARRAY(0x56213ba759c0)20 2352
    ARRAY(0x56213bb278e8)27 1490
    ARRAY(0x56213bb278e8)27 1491
    ARRAY(0x56213bb278e8)27 1492
    ARRAY(0x56213bb27bd0)28 51996
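The script that produced this output is not shown, so the following is only a guess at the cause: text like `ARRAY(0x...)` is what Perl prints when an array reference is used as a string instead of being dereferenced. The hash name and data here are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical lookup: each peptide ID maps to a reference to a list of protein IDs.
my %proteins_for = ( 27 => [ 1490, 1491, 1492 ] );

my $ref = $proteins_for{27};

# Using the reference itself as a string yields something like "ARRAY(0x...)".
my $as_ref = "$ref";

# Dereferencing with @{ ... } yields the stored values.
my $as_list = join ',', @{$ref};

print "$as_ref\n";
print "$as_list\n";
```

If that is what happened, looping over `@{ $hash{$key} }` rather than printing `$hash{$key}` directly should give the protein IDs.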
Re: match two files by haukex (Archbishop) on Dec 09, 2020 at 12:31 UTC
Although overkill for the example you showed, it might be useful to see how this can be done with SQL. The following uses DBD::CSV so it can work with your input files directly to produce your expected output, though for "production" work you would probably want to use a real database. Also, my output code is somewhat simplistic; one could use Text::CSV(_XS) for that purpose as well.

    use warnings;
    use strict;
    use DBI;

    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        csv_sep_char => "\t",
        f_ext => '',
        RaiseError => 1,
    }) or die "Cannot connect: $DBI::errstr";

    my $sth = $dbh->prepare(<<'ENDSQL');
    SELECT tmp01.PeptideID as PeptideID, tmp01.ProteinID as ProteinID,
        tmp02.SpectrumID as SpectrumID, tmp02.Sequence as Sequence
    FROM tmp01 LEFT OUTER JOIN tmp02
    ON tmp01.PeptideID = tmp02.PeptideID
    ENDSQL
    $sth->execute;

    print join("\t", @{ $sth->{NAME} } ), "\n";
    while ( my $row = $sth->fetchrow_arrayref ) {
        print join("\t", @$row ), "\n";
    }

Update: Also make sure to read up on the different kinds of SQL JOINs to see the difference between those and which one is appropriate for your case.
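To make the JOIN distinction concrete without a database, here is a rough sketch in plain Perl of how an INNER JOIN and a LEFT OUTER JOIN treat a row that has no partner. The data and the `join_rows` helper are hypothetical, and empty strings stand in for SQL NULLs.

```perl
use strict;
use warnings;

# Hypothetical rows: peptide 99 appears only in the left-hand table.
my @left  = ( [ 6, 109521 ], [ 99, 4242 ] );
my %right = ( 6 => [ 53663, 'KMGEGR' ] );

# Returns joined rows; unmatched left rows are kept only when $keep_unmatched is true.
sub join_rows {
    my ($keep_unmatched) = @_;
    my @rows;
    for my $l (@left) {
        my ( $peptide, $protein ) = @$l;
        if ( my $r = $right{$peptide} ) {
            push @rows, join "\t", $peptide, $protein, @$r;
        }
        elsif ($keep_unmatched) {
            # A LEFT OUTER JOIN keeps the row and leaves the right-hand columns empty.
            push @rows, join "\t", $peptide, $protein, '', '';
        }
    }
    return @rows;
}

my @inner = join_rows(0);    # behaves like INNER JOIN
my @outer = join_rows(1);    # behaves like LEFT OUTER JOIN

print "inner:\n", map( "$_\n", @inner );
print "outer:\n", map( "$_\n", @outer );
```

Whether the unmatched peptide should appear in the result at all is the question that decides which JOIN fits.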
Re^2: match two files by yueli711 (Sexton) on Dec 09, 2020 at 18:17 UTC
Hello haukex,

Thank you so much for your great help! Thank you again!

Best, Yue
Re: match two files by siberia-man (Friar) on Dec 09, 2020 at 11:16 UTC
The script below produces the same output as your example. Run it as follows (assuming the script is stored as `z`):

`perl z tmp01 tmp02`

It's spread out over many lines for better readability.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $seen;

    while ( <> ) {
        next unless /^\d/;

        s/\s*$//;

        next unless m/
                            # [file1]     [file2]
            ^
            (\S+)           # PeptideID   PeptideID
            \s+
            (\S+)           # ProteinID   SpectrumID
            (?:
                \s+
                (\S+)       # -----       Sequence
            )?
            $
        /x;

        if ( $3 ) {
            $seen->{$1}->{SpectrumID} = $2;
            $seen->{$1}->{Sequence} = $3;
        } else {
            push @{ $seen->{$1}->{ProteinID} }, $2;
        }
    }

    sub frmt {
        print join("\t", @_) . "\n";
    }

    frmt qw( PeptideID ProteinID SpectrumID Sequence );

    foreach my $k ( sort { $a <=> $b } keys %{ $seen } ) {
        foreach my $p ( @{ $seen->{$k}->{ProteinID} } ) {
            frmt $k, $p, $seen->{$k}->{SpectrumID}, $seen->{$k}->{Sequence};
        }
    }
Re^2: match two files by yueli711 (Sexton) on Dec 09, 2020 at 16:35 UTC
Hello siberia-man,

Thank you so much for your great help! There are still some errors. Thank you again, really appreciated!

Best, Yue

    perl match_quick02.pl tmp01_quick tmp02_quick
    syntax error at match_quick02.pl line 42, near "+}"
    syntax error at match_quick02.pl line 44, near "}"
    Execution of match_quick02.pl aborted due to compilation errors.
Re^3: match two files by huck (Prior) on Dec 09, 2020 at 16:47 UTC
I suspect you used cut+paste rather than the download link, for the `+}` your error shows is not in the actual code anywhere.
Re^4: match two files by yueli711 (Sexton) on Dec 09, 2020 at 17:52 UTC
Re: match two files by tybalt89 (Monsignor) on Dec 09, 2020 at 11:31 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11124864
    use warnings;

    use Path::Tiny;

    my %two = map /^(\d+)(.*)/, path('tmp02')->lines;
    my $out = "PeptideID ProteinID SpectrumID Sequence\n";
    s/^(\d+)(.*)/$1$2$two{$1}/ and $out .= $_ for path('tmp01')->lines;
    path('tmp11_quick')->spew($out);
Re^2: match two files by yueli711 (Sexton) on Dec 09, 2020 at 16:45 UTC
Hello tybalt89,

Thank you so much for your great help! It works, but still has a problem. Thank you again, really appreciated!

Best, Yue

    Use of uninitialized value in concatenation (.) or string at match_quick03.pl line 9.
Re^3: match two files by tybalt89 (Monsignor) on Dec 09, 2020 at 17:27 UTC
Please give a (small) dataset that shows that problem.
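No such dataset appears in the thread, so this is only one plausible trigger: a PeptideID that is present in the first file but missing from the second leaves `$two{$1}` undefined, and interpolating it raises exactly this kind of warning. The sketch below uses made-up data (peptide 31 has no partner) and guards the lookup with `exists`; it is an illustration, not the poster's script.

```perl
use strict;
use warnings;

# Hypothetical data: peptide 31 is in the first file but not in the second.
my %two   = ( 30 => "\t55602\tLHKPPK" );
my @lines = ( "30\t1490\n", "31\t7777\n" );

my $out = '';
for my $line (@lines) {
    # Only extend lines whose ID has a partner, so $two{$1} is never undef.
    if ( $line =~ /^(\d+)/ and exists $two{$1} ) {
        $line =~ s/^(\d+)(.*)/$1$2$two{$1}/;
        $out .= $line;
    }
}

print $out;
```

Dropping unmatched lines is a design choice; substituting `$two{$1} // ''` instead would keep them with empty trailing columns.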
Re^4: match two files by yueli711 (Sexton) on Dec 09, 2020 at 18:13 UTC
Re^5: match two files by tybalt89 (Monsignor) on Dec 09, 2020 at 19:03 UTC
Re: match two files by kcott (Archbishop) on Dec 10, 2020 at 06:41 UTC
G'day yueli711,

This uses the same principle as ++GrandFather described. See Notes below for differences and other features.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    my ($in1, $in2, $out) = qw{tmp01 tmp02 tmp11_quick};
    my (%data, @headings);

    {
        open my $fh, '<', $in1;
        while (<$fh>) {
            if ($. == 1) {
                push @headings, split;
            }
            else {
                my ($pep, $prot) = split;
                push @{$data{$pep}}, $prot;
            }
        }
    }

    {
        my $fmt = "%-9s %-9s %-10s %-8s\n";
        open my $fh_in, '<', $in2;
        open my $fh_out, '>', $out;
        while (<$fh_in>) {
            my ($id, @rest) = split;
            if ($. == 1) {
                printf $fh_out $fmt, @headings, @rest;
            }
            else {
                for (@{$data{$id}}) {
                    printf $fh_out $fmt, $id, $_, @rest;
                }
            }
        }
    }

Output:

    PeptideID ProteinID SpectrumID Sequence
    6         109521    53663      KMGEGR
    7         741       53663      KPPSGK
    11        681       144492     NNDALR
    11        780       144492     NNDALR
    20        2352      15547      SPAKPK
    27        1490      55547      LHKPPK
    27        1491      55547      LHKPPK
    27        1492      55547      LHKPPK
    28        51996     55547      LFVGRK
    29        1490      55504      LHKPPK
    29        1491      55504      LHKPPK
    29        1492      55504      LHKPPK
    30        1490      55602      LHKPPK
    30        1491      55602      LHKPPK
    30        1492      55602      LHKPPK

Notes:

- This code deals with real files, not in-memory files.
- All I/O is performed in anonymous blocks. Files are only open while needed. Filehandles close automatically at the end of these blocks.
- autodie removes the need to hand-craft your own I/O exception messages. It also won't make mistakes like you have in a later post: file with I/O problem is `tmp01_quick`; message refers to a different file, i.e. `donor_82_01.csv`. (You have three errors like that.)
- When I copied your data to files on my system, the tabs became spaces. I only needed `split` without arguments; you should continue to use `split /\t+/`. You should also take note of chomp used in GrandFather's code.
- I've used printf to improve output formatting; however, that may not be what you want. You should also take a look at Text::CSV. (Note, it runs faster if you also have Text::CSV_XS installed.)

-- Ken
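As a small illustration of the autodie point, the sketch below tries to open a file that is assumed not to exist and captures the resulting exception. The file name is invented for the example; the message autodie generates names the file that was actually passed to `open`, so it cannot drift out of sync the way a hand-written message can.

```perl
use strict;
use warnings;
use autodie;

# Hypothetical file name, assumed not to exist in the current directory.
my $missing = 'no_such_file_for_this_sketch.txt';

my $error = '';

# With autodie in effect, a failed open throws an exception instead of returning false.
eval { open my $fh, '<', $missing; 1 } or $error = "$@";

print $error;
```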
Re: match two files by leszekdubiel (Scribe) on Dec 10, 2020 at 20:10 UTC
Here is simple solution:

    #!/usr/bin/perl

    my @tmp1 = map { chomp; [ split /\s+/, $_, -1 ] } `cat tmp01`;  # (*)
    my %tmp2 = map { chomp; my @a = split /\s+/, $_, -1; ($a[0], \@a) } `cat tmp02`;
    print join "\t", @$_, @{ $tmp2{$$_[0]} }, "\n" for @tmp1;

    # (*) instead of `cat tmp01` it is more safe to:
    #use Path::Tiny;
    #path("tmp01")->lines_utf8({chomp => 1});
Re^2: match two files by GrandFather (Saint) on Dec 10, 2020 at 21:01 UTC
Here is simple solution:

"simple" by what measure? Fewer statements does not equate to simple. If using Path::Tiny is safer why do you demonstrate less safe code?

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re^3: match two files by leszekdubiel (Scribe) on Dec 11, 2020 at 21:31 UTC
I prefer to read whole files, and print them joined. All at once. I don't like algorithmic C-style flow: open file, read line by line, process line, print line... One can use whichever solution is better for his situation -- `cat file` or Path::Tiny... I don't know a simpler way to slurp a whole file.
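For what it's worth, a whole file can be read in one go with core Perl only, without `cat` or an extra module: `readline` in list context returns every line. A minimal sketch follows; the temporary file and its contents are stand-ins so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small stand-in file so the sketch is self-contained.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "6\t109521\n7\t741\n";
close $tmp_fh or die "Cannot close '$path': $!";

# Three-argument open never involves a shell.
open my $fh, '<', $path or die "Cannot open '$path': $!";

# In list context <$fh> returns all remaining lines at once.
chomp( my @lines = <$fh> );
close $fh;

# Split each line into fields, as the one-liners above do.
my @rows = map { [ split /\t/ ] } @lines;

print scalar(@rows), " rows read\n";
```

This keeps the "read everything, then process" style while staying portable. It still holds the whole file in memory, so the size concern raised elsewhere in the thread applies.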
Re^2: match two files by afoken (Chancellor) on Dec 11, 2020 at 15:42 UTC
Here is simple solution:

    #!/usr/bin/perl

    my @tmp1 = map { chomp; [ split /\s+/, $_, -1 ] } `cat tmp01`;  # (*)
    my %tmp2 = map { chomp; my @a = split /\s+/, $_, -1; ($a[0], \@a) } `cat tmp02`;
    print join "\t", @$_, @{ $tmp2{$$_[0]} }, "\n" for @tmp1;

    # (*) instead of `cat tmp01` it is more safe to:
    #use Path::Tiny;
    #path("tmp01")->lines_utf8({chomp => 1});

Congratulations, you have won not just one, but two Useless Use of Cat Awards!

Furthermore, you have prepared two shell injection vulnerabilities (The problem of "the" default shell), you artificially restricted the "solution" to Unix systems (Windows and many other systems have no cat), and you are limiting the sum of the input file sizes to significantly less than available RAM and swap by not reading line-by-line, but instead reading both input files all at once.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)


	PerlMonks