yueli711 has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I want to quickly match two files according to the first column.
Thank you in advance for any help!
Best,
Yue
Input tmp01:
PeptideID ProteinID
6 109521
7 741
11 681
11 780
20 2352
27 1490
27 1491
27 1492
28 51996
29 1490
29 1491
29 1492
30 1490
30 1491
30 1492
Input tmp02:
PeptideID SpectrumID Sequence
6 53663 KMGEGR
7 53663 KPPSGK
11 144492 NNDALR
20 15547 SPAKPK
27 55547 LHKPPK
28 55547 LFVGRK
29 55504 LHKPPK
30 55602 LHKPPK
Output tmp11_quick:
PeptideID ProteinID SpectrumID Sequence
6 109521 53663 KMGEGR
7 741 53663 KPPSGK
11 681 144492 NNDALR
11 780 144492 NNDALR
20 2352 15547 SPAKPK
27 1490 55547 LHKPPK
27 1491 55547 LHKPPK
27 1492 55547 LHKPPK
28 51996 55547 LFVGRK
29 1490 55504 LHKPPK
29 1491 55504 LHKPPK
29 1492 55504 LHKPPK
30 1490 55602 LHKPPK
30 1491 55602 LHKPPK
30 1492 55602 LHKPPK
#!/usr/bin/perl
use warnings;
use strict;
use Fcntl ':seek';
open my $TAB01, '<', 'tmp01' or die "Cannot open 'tmp01' because: $!";
open my $TAB02, '<', 'tmp02' or die "Cannot open 'tmp02' because: $!";
open my $OUT, '>', 'tmp11_QUICK' or die "Cannot open 'tmp12_01' because: $!";
my $pos = tell $TAB01;
my %tab01_data;
while ( <$TAB01> ) {
    my ( $first,$second) = split /\t+/;
    print $OUT ",$_" unless length $first;
    push @{ $tab01_data{ $first } }, $pos;
    $pos = tell $TAB01;
}
my %tab02_data;
while ( <$TAB02> ) {
    my ( $first,$second, $third ) = split /\t+/ ;
    next unless exists $tab01_data{ $first };
    for my $pos ( @{ $tab01_data{ $first } } ) {
        seek $TAB01, $pos, SEEK_SET or die "Cannot seek on 'tmp01' because: $!";
        print $OUT "$tab01_data","$tab02_data{$second}" ,scalar <$TAB01>;
    }
}
close $TAB01;
close $TAB02;
close $OUT;
Re: match two files
by GrandFather (Saint) on Dec 09, 2020 at 06:24 UTC
|
Most problems of this sort are easiest to code using a hash to store the important content of one of the files. In this case I just picked the first file, but a better choice for a real-world version would probably be to read the second file first.
Note that the code processing the first file builds a list of protein IDs for later use.
#!/usr/bin/perl
use warnings;
use strict;
my $tmp01 = <<FILE1;
PeptideID ProteinID
6 109521
7 741
11 681
11 780
20 2352
27 1490
27 1491
27 1492
28 51996
29 1490
29 1491
29 1492
30 1490
30 1491
30 1492
FILE1
my $tmp02 = <<FILE2;
PeptideID SpectrumID Sequence
6 53663 KMGEGR
7 53663 KPPSGK
11 144492 NNDALR
20 15547 SPAKPK
27 55547 LHKPPK
28 55547 LFVGRK
29 55504 LHKPPK
30 55602 LHKPPK
FILE2
my $tmp11_QUICK = '';
my %peptides;
open my $TAB01, '<', \$tmp01;
while (<$TAB01>) {
    chomp;
    my ($peptide, $protein) = split /\t+/;
    next if $peptide !~ /\d/;
    #$peptides{$peptide} //= [];
    push @{$peptides{$peptide}}, $protein;
}
open my $TAB02, '<', \$tmp02;
open my $OUT, '>', \$tmp11_QUICK;
while (<$TAB02>) {
    chomp;
    my ($peptide, $spectrum, $sequence) = split /\t+/;
    next if $peptide !~ /\d/;
    print $OUT "$peptide\t$_\t$spectrum\t$sequence\n" for @{$peptides{$peptide}};
}
close $OUT;
print $tmp11_QUICK;
Prints:
6 109521 53663 KMGEGR
7 741 53663 KPPSGK
11 681 144492 NNDALR
11 780 144492 NNDALR
20 2352 15547 SPAKPK
27 1490 55547 LHKPPK
27 1491 55547 LHKPPK
27 1492 55547 LHKPPK
28 51996 55547 LFVGRK
29 1490 55504 LHKPPK
29 1491 55504 LHKPPK
29 1492 55504 LHKPPK
30 1490 55602 LHKPPK
30 1491 55602 LHKPPK
30 1492 55602 LHKPPK
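A minimal sketch of the alternative mentioned above: read the second file into the hash and stream the first. It assumes tab-separated input, a header line in each table, and at most one tmp02 row per PeptideID. The sub name join_on_peptide is made up for illustration, and in-memory strings stand in for the real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join two tab-separated tables on their first column.
# Assumes each PeptideID occurs at most once in the second table.
sub join_on_peptide {
    my ($tab01, $tab02) = @_;

    # Index the second table: PeptideID => [SpectrumID, Sequence]
    my %spectra;
    open my $in2, '<', \$tab02 or die "Cannot read second table: $!";
    while (my $line = <$in2>) {
        chomp $line;
        my ($peptide, @rest) = split /\t+/, $line;
        next unless defined $peptide && $peptide =~ /^\d+$/;    # skip header and blank lines
        $spectra{$peptide} = \@rest;
    }
    close $in2;

    # Stream the first table, emitting one joined row per matching input row.
    my @rows;
    open my $in1, '<', \$tab01 or die "Cannot read first table: $!";
    while (my $line = <$in1>) {
        chomp $line;
        my ($peptide, $protein) = split /\t+/, $line;
        next unless defined $peptide && $peptide =~ /^\d+$/;
        next unless exists $spectra{$peptide};                  # no partner: drop the row
        push @rows, join "\t", $peptide, $protein, @{ $spectra{$peptide} };
    }
    close $in1;
    return @rows;
}

my $tab01 = "PeptideID\tProteinID\n6\t109521\n11\t681\n11\t780\n";
my $tab02 = "PeptideID\tSpectrumID\tSequence\n6\t53663\tKMGEGR\n11\t144492\tNNDALR\n";
print "$_\n" for join_on_peptide($tab01, $tab02);
```

With the sample data, tmp02 has one row per PeptideID, so the hash values can be plain rows instead of lists, and the output follows the order of tmp01.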
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
|
Hello GrandFather,
Thank you so much for your great help!
Really appreciate your great help!
Best,
Yue
#!/usr/bin/perl
use warnings;
use strict;
open my $TAB01, '<', 'tmp01_quick' or die "Cannot open 'donor_82_01.csv' because: $!";
open my $TAB02, '<', 'tmp02_quick' or die "Cannot open 'tmp12' because: $!";
open my $OUT, '>', 'tmp12_01_QUICK' or die "Cannot open 'tmp12_01' because: $!";
my $tmp11_QUICK = '';
my %peptides;
#open my $TAB01, '<', \$tmp01_quick;
while (<$TAB01>) {
    chomp;
    my ($peptide, $protein) = split /\t+/;
    next if $peptide !~ /\d/;
    #$peptides{$peptide} //= [];
    push @{$peptides{$peptide}}, $protein;
}
#open my $TAB02, '<', \$tmp02_quick;
#open my $OUT, '>', \$tmp11_QUICK;
while (<$TAB02>) {
    chomp;
    my ($peptide, $spectrum, $sequence) = split /\t+/;
    next if $peptide !~ /\d/;
    print $OUT "$peptide\t$_\t$spectrum\t$sequence\n" for @{$peptides{$peptide}};
}
close $OUT;
print $tmp11_QUICK;
Re: match two files
by AnomalousMonk (Archbishop) on Dec 09, 2020 at 06:01 UTC
|
my ( $first,$second) = split /t+/;
...
my ( $first,$second, $third ) = split /t+/ ;
No complete solution, but a quick tip: in the two quoted statements, if you want to split $_ on multiple tabs, use /\t+/ (note the backslash).
my ( $first,$second) = split /\t+/;
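A small sketch of the difference, using one made-up data line; the variable names are for illustration only:

```perl
use strict;
use warnings;

my $line = "27\t1490";

# Without the backslash the pattern matches the literal letter "t".
# This line contains no "t", so it comes back as a single field.
my @wrong = split /t+/, $line;

# With the backslash the pattern matches one or more tab characters.
my @right = split /\t+/, $line;

printf "/t+/   gives %d field(s)\n", scalar @wrong;
printf "/\\t+/  gives %d field(s)\n", scalar @right;
```

On a header line such as PeptideID/ProteinID the unescaped pattern would instead cut the text at every letter "t", which is just as wrong.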
Give a man a fish: <%-{-{-{-<
|
ARRAY(0x56213ba811e0)PeptideID ProteinID
ARRAY(0x56213ba80a48)6 109521
ARRAY(0x56213baa3148)7 741
ARRAY(0x56213ba75888)11 681
ARRAY(0x56213ba75888)11 780
ARRAY(0x56213ba759c0)20 2352
ARRAY(0x56213bb278e8)27 1490
ARRAY(0x56213bb278e8)27 1491
ARRAY(0x56213bb278e8)27 1492
ARRAY(0x56213bb27bd0)28 51996
Re: match two files
by haukex (Archbishop) on Dec 09, 2020 at 12:31 UTC
|
Although overkill for the example you showed, it might be useful to see how this can be done with SQL. The following uses DBD::CSV so it can work with your input files directly to produce your expected output, though for "production" work you would probably want to use a real database. Also, my output code is somewhat simplistic; one could use Text::CSV(_XS) for that purpose as well.
use warnings;
use strict;
use DBI;
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    csv_sep_char => "\t", f_ext => '', RaiseError => 1,
}) or die "Cannot connect: $DBI::errstr";
my $sth = $dbh->prepare(<<'ENDSQL');
    SELECT
        tmp01.PeptideID as PeptideID,
        tmp01.ProteinID as ProteinID,
        tmp02.SpectrumID as SpectrumID,
        tmp02.Sequence as Sequence
    FROM tmp01
    LEFT OUTER JOIN tmp02
    ON tmp01.PeptideID = tmp02.PeptideID
ENDSQL
$sth->execute;
print join("\t", @{ $sth->{NAME} } ), "\n";
while ( my $row = $sth->fetchrow_arrayref ) {
    print join("\t", @$row ), "\n";
}
Update: Also make sure to read up on the different kinds of SQL JOINs to see the difference between those and which one is appropriate for your case.
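To illustrate the difference between two of those join types, here is a sketch in plain Perl rather than SQL. That is a choice made to keep the example free of dependencies; it is not how DBD::CSV works internally. The sub name join_rows and the sample rows, including the unmatched PeptideID 99, are made up.

```perl
use strict;
use warnings;

# Hypothetical helper emulating two SQL join types with a hash lookup.
# $left  : array ref of [PeptideID, ProteinID] rows
# $right : hash ref, PeptideID => [SpectrumID, Sequence]
# $type  : 'inner' or 'left'
sub join_rows {
    my ($left, $right, $type) = @_;
    my @out;
    for my $row (@$left) {
        my ($id, $protein) = @$row;
        if (my $match = $right->{$id}) {
            push @out, [ $id, $protein, @$match ];
        }
        elsif ($type eq 'left') {
            # LEFT OUTER JOIN keeps the unmatched row, padded with NULLs (undef here)
            push @out, [ $id, $protein, undef, undef ];
        }
        # INNER JOIN drops unmatched rows
    }
    return @out;
}

my @tmp01 = ( [ 6, 109521 ], [ 99, 12345 ] );    # 99 has no partner in %tmp02
my %tmp02 = ( 6 => [ 53663, 'KMGEGR' ] );

for my $type (qw(inner left)) {
    print "$type join:\n";
    print join("\t", map { defined $_ ? $_ : 'NULL' } @$_), "\n"
        for join_rows(\@tmp01, \%tmp02, $type);
}
```

For the sample files in this thread every PeptideID in tmp01 has a partner in tmp02, so both join types would give the same rows; the choice only matters once the data contains unmatched IDs.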
|
Hello haukex,
Thank you so much for your great help!
Thank you again!
Best,
Yue
Re: match two files
by siberia-man (Friar) on Dec 09, 2020 at 11:16 UTC
|
The script below produces the same output as your example. Run it as follows (assuming the script is stored as z):
perl z tmp01 tmp02
It's laid out over several lines for better readability.
#!/usr/bin/env perl
use strict;
use warnings;
my $seen;
while ( <> ) {
    next unless /^\d/;
    s/\s*$//;
    next unless m/          # [file1]      [file2]
        ^
        (\S+)               # PeptideID    PeptideID
        \s+
        (\S+)               # ProteinID    SpectrumID
        (?:
            \s+
            (\S+)           # -----        Sequence
        )?
        $
    /x;
    if ( $3 ) {
        $seen->{$1}->{SpectrumID} = $2;
        $seen->{$1}->{Sequence} = $3;
    } else {
        push @{ $seen->{$1}->{ProteinID} }, $2;
    }
}
sub frmt {
    print join("\t", @_) . "\n";
}
frmt qw( PeptideID ProteinID SpectrumID Sequence );
foreach my $k ( sort { $a <=> $b } keys %{ $seen } ) {
    foreach my $p ( @{ $seen->{$k}->{ProteinID} } ) {
        frmt $k, $p, $seen->{$k}->{SpectrumID}, $seen->{$k}->{Sequence};
    }
}
|
perl match_quick02.pl tmp01_quick tmp02_quick
syntax error at match_quick02.pl line 42, near "+}"
syntax error at match_quick02.pl line 44, near "}"
Execution of match_quick02.pl aborted due to compilation errors.
Re: match two files
by tybalt89 (Monsignor) on Dec 09, 2020 at 11:31 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11124864
use warnings;
use Path::Tiny;
my %two = map /^(\d+)(.*)/, path('tmp02')->lines;
my $out = "PeptideID ProteinID SpectrumID Sequence\n";
s/^(\d+)(.*)/$1$2$two{$1}/ and $out .= $_ for path('tmp01')->lines;
path('tmp11_quick')->spew($out);
|
Hello tybalt89,
Thank you so much for your great help!
It works, but still has a problem.
Thank you again and really appreciated!
Best,
Yue
Use of uninitialized value in concatenation (.) or string at match_quick03.pl line 9.
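One plausible cause, assuming some PeptideID in tmp01 has no row in tmp02: the hash lookup then yields undef and the interpolation warns. Below is a minimal sketch of that situation and one way to guard against it. The sample data, the ID 99, and the choice to leave an unmatched line unchanged are assumptions, not part of the script above.

```perl
use strict;
use warnings;

# Lookup table as it would be built from tmp02: PeptideID => rest of the line
my %two = ( 6 => "\t53663\tKMGEGR" );

my @lines = ( "6\t109521\n", "99\t12345\n" );    # 99 is missing from %two

for (@lines) {
    # Only append when the key exists, so undef is never interpolated.
    s/^(\d+)(.*)/ exists $two{$1} ? "$1$2$two{$1}" : "$1$2" /e;
    print;
}
```

Whether an unmatched line should be kept, skipped, or reported depends on what the output is used for.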
Re: match two files
by kcott (Archbishop) on Dec 10, 2020 at 06:41 UTC
|
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my ($in1, $in2, $out) = qw{tmp01 tmp02 tmp11_quick};
my (%data, @headings);
{
    open my $fh, '<', $in1;
    while (<$fh>) {
        if ($. == 1) {
            push @headings, split;
        }
        else {
            my ($pep, $prot) = split;
            push @{$data{$pep}}, $prot;
        }
    }
}
{
    my $fmt = "%-9s %-9s %-10s %-8s\n";
    open my $fh_in, '<', $in2;
    open my $fh_out, '>', $out;
    while (<$fh_in>) {
        my ($id, @rest) = split;
        if ($. == 1) {
            printf $fh_out $fmt, @headings, @rest;
        }
        else {
            for (@{$data{$id}}) {
                printf $fh_out $fmt, $id, $_, @rest;
            }
        }
    }
}
Output:
PeptideID ProteinID SpectrumID Sequence
6 109521 53663 KMGEGR
7 741 53663 KPPSGK
11 681 144492 NNDALR
11 780 144492 NNDALR
20 2352 15547 SPAKPK
27 1490 55547 LHKPPK
27 1491 55547 LHKPPK
27 1492 55547 LHKPPK
28 51996 55547 LFVGRK
29 1490 55504 LHKPPK
29 1491 55504 LHKPPK
29 1492 55504 LHKPPK
30 1490 55602 LHKPPK
30 1491 55602 LHKPPK
30 1492 55602 LHKPPK
Notes:
- This code deals with real files, not in-memory files.
- All I/O is performed in anonymous blocks. Files are only open while needed. Filehandles close automatically at the end of these blocks.
- autodie removes the need to hand-craft your own I/O exception messages. It also won't make mistakes like you have in a later post: file with I/O problem is tmp01_quick; message refers to a different file, i.e. donor_82_01.csv. (You have three errors like that.)
- When I copied your data to files on my system, the tabs became spaces. I only needed split without arguments; you should continue to use split /\t+/. You should also take note of chomp used in GrandFather's code.
- I've used printf to improve output formatting; however, that may not be what you want. You should also take a look at Text::CSV. (Note, it runs faster if you also have Text::CSV_XS installed.)
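The point about split and chomp can be seen in a short sketch. The sample line is taken from tmp02 with real tabs; the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $line = "6\t53663\tKMGEGR\n";

# split with no arguments is the same as split ' ', $_ : it splits on runs of
# whitespace and skips leading whitespace. The trailing newline counts as
# whitespace, so no chomp is needed.
my @awk_style = split ' ', $line;

# split /\t+/ only splits on tabs, so the newline stays on the last field ...
my @tabs_only = split /\t+/, $line;

# ... unless the line is chomped first.
chomp(my $clean = $line);
my @chomped = split /\t+/, $clean;

print "whitespace split: [$awk_style[-1]]\n";
print "tab split:        [$tabs_only[-1]]\n";
print "chomp + split:    [$chomped[-1]]\n";
```

Splitting on tabs only is the safer choice when a field may itself contain spaces, which is why the advice above is to keep split /\t+/ for real tab-separated data.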
Re: match two files
by leszekdubiel (Scribe) on Dec 10, 2020 at 20:10 UTC
|
#!/usr/bin/perl
my @tmp1 =
    map { chomp; [ split /\s+/, $_, -1 ] }
    `cat tmp01`;    # (*)
my %tmp2 =
    map {
        chomp; my @a = split /\s+/, $_, -1;
        ($a[0], \@a)
    }
    `cat tmp02`;
print join "\t",
    @$_,
    @{ $tmp2{$$_[0]} },
    "\n"
    for
    @tmp1;
# (*) instead of `cat tmp01` it is safer to use:
#use Path::Tiny;
#path("tmp01")->lines_utf8({chomp => 1});
|
I prefer to read whole files, and print them joined. All at once. I don't like algorithmic C-style flow: open file, read line by line, process line, print line...
One can use whichever solution suits the situation better, `cat file` or Path::Tiny... I don't know of any simpler way to slurp a whole file.
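For what it's worth, a whole file can be slurped with core Perl alone, without cat and without an extra module. A minimal sketch: the helper name slurp_lines is made up, and the throwaway temp file is there only to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return all lines of a file, chomped.
sub slurp_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    chomp(my @lines = <$fh>);    # list context reads every line at once
    close $fh;
    return @lines;
}

# Demo with a temporary file standing in for tmp01.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "PeptideID\tProteinID\n6\t109521\n7\t741\n";
close $tmp_fh;

my @rows = map { [ split /\s+/, $_, -1 ] } slurp_lines($tmp_name);
print scalar(@rows), " rows read\n";
```

This still holds the whole file in memory, so it suits files that comfortably fit in RAM.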
|
Congratulations, you have won not just one, but two Useless Use of Cat Awards!
Furthermore, you have prepared two shell injection vulnerabilities (The problem of "the" default shell), you artificially restricted the "solution" to Unix systems (Windows and many other systems have no cat), and you are limiting the sum of the input file sizes to significantly less than available RAM and swap by not reading line-by-line, but instead reading both input files all at once.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)