Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks!
I need to do the following:
In a document with the following format:
>id1|various annotation1
sequence of characters1
>id2|various annotation2
sequence of characters2
...
>idN|various annotationN
sequence of charactersN
I want to group the lines that have the same various annotation and sequence but not the same id into one line, like this:
>id1-id15-id22|various annotation1
sequence of characters1
...
When the various annotation is the same, then the sequence is also the same.
I am pretty sure I need to use the map command, but I am a bit unsure how I would handle this. So far I have used it when I have 2 things to compare, e.g. only the id that would change and the sequence that could be repeated, and I was grouping based on the sequence. Like this:
use strict;
use warnings;
my %res;
while (<>)
{
chomp;
my ( $name, $rest ) = split /\t/;
push @{ $res{$name} }, $rest;
}
for ( sort keys %res )
{
print "$_:", join( ",", @{ $res{$_} } );
print "\n";
}
But now, I have 3 things instead.
Re: How do I use the map command for this?
by haukex (Archbishop) on Jun 19, 2022 at 15:41 UTC
|
It looks like you've got a FASTA format file. There are lots of modules on the CPAN to handle that file format, for example the BioPerl suite is frequently recommended. There are various nodes on this site that show the use of that module.
Other than that, as has already been said, please see How do I post a question effectively? and Short, Self-Contained, Correct Example - in this case, your question is missing representative sample input and the expected output for that input, which means that anyone wishing to help you has to make some up, and that may not be representative of your actual input. Update: I'm going to assume that this node shows representative input, despite the fact that it doesn't compile.
| [reply] |
Re: How do I use the map command for this?
by kcott (Archbishop) on Jun 20, 2022 at 08:32 UTC
|
"How do I use the map command for this?"
No idea. :-)
"I am pretty sure I need to use the map command"
Why?
In the code below, I simplified the data so that it's easier to see what's happening.
I've also jumbled it up to demonstrate the two sorts — you may not need or want either of these
(biological data can be huge, so don't do any sorting unless it's absolutely necessary).
#!/usr/bin/env perl
use strict;
use warnings;
my %res;
{
local $/ = "\n>";
while (my $record = <DATA>) {
chomp $record;
$record = substr $record, 1 if $. == 1;
$record =~ s/\s*\z//m;
my ($id, $anno_seq) = split /\|/, $record, 2;
push @{$res{$anno_seq}}, $id;
}
}
print '>', join('|', (sort @{$res{$_}}), $_), "\n"
for sort { $res{$a}[0] cmp $res{$b}[0] } keys %res;
__DATA__
>id7|anno2
GHI
>id1|anno1
ABC
>id6|anno1
ABC
>id4|anno2
GHI
>id2|anno3
DEF
>id3|anno1
ABC
>id5|anno1
ABC
Output:
>id1|id3|id5|id6|anno1
ABC
>id2|anno3
DEF
>id4|id7|anno2
GHI
| [reply] [d/l] [select] |
Re: How do I use the map command for this?
by tybalt89 (Monsignor) on Jun 19, 2022 at 17:13 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11144847
use warnings;
my %res;
/(id\d+)(.*)/s and push @{ $res{$2} }, $1 for split />/, join '', <DAT
+A>;
for ( sort keys %res )
{
local $" = '|';
print ">@{$res{$_}}$_";
}
__DATA__
>id1|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id2|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id3|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id4|Q51487|P-474-4|86-98,113-126,297-310,322-335 alt
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id5|Q51487|P-474-4|86-98,113-126,297-310,322-335 alt
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
Outputs:
>id1|id2|id3|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id4|id5|Q51487|P-474-4|86-98,113-126,297-310,322-335 alt
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALEc>
| [reply] [d/l] [select] |
Re: How do I use the map command for this?
by LanX (Saint) on Jun 19, 2022 at 13:33 UTC
|
Sorry, I have problems to understand your question ...
Suggestions:
- get an account here, so you can edit and fix your postings
- provide us with an SSCCE with clear input and expected output
| [reply] |
|
Yes, probably it was not well-written. Imagine this script:
use strict;
use warnings;
my %res;
my $id='';
my $rest='';
my $seq='';
while (<DATA>)
{
chomp;
if($_=~/^>(.*?)\|(.*)/)
{
$id=$1;
$rest=$2;
$seq=<DATA>;
chomp $seq;
}
}
_DATA_
>id1|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id2|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id3|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
and I want to go to:
>id1|id2|id3|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
And I wonder if I can use some of Perl's data structures, or the map command? | [reply] [d/l] [select] |
|
use strict;
use warnings;
use Test::More tests => 1;
my $want = <<EOT;
>id1|id2|id3|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
EOT
my %res;
while (<DATA>) {
if (/^>(.*?)\|(.*)/s) {
my $id = $1;
my $rest = $2;
if (exists $res{$rest}) {
$res{$rest}{keys} .= "|$id";
} else {
my $seq = <DATA>;
$res{$rest} = {
keys => ">$id",
seq => $seq,
}
}
}
}
my ($k) = keys %res; # You would loop over all keys is your real prog
my $have = "$res{$k}{keys}|$k$res{$k}{seq}";
is $have, $want;
__DATA__
>id1|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id2|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
>id3|Q51487|P-474-4|86-98,113-126,297-310,322-335
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
| [reply] [d/l] |
|
>
_DATA_
I can tell immediately that this doesn't run! 👎
You should check before posting, that's the "Correct" part in SSCCE!
And it's not clear to me which part has to be unique...
This
|Q51487|P-474-4|86-98,113-126,297-310,322-335
or this
CSLIPDYQRPEAPVAAAYPQGQAYGQNTGAAAVPAADIGWREFFRDPQLQQLIGVALE
or both.
This has not much to do with map, a HoA = Hash of Arrays with unique keys (well which one?) is enough.
Or probably a HoHoA depending on your perception of unique.
And of course also depending on your desired output order.
Please show more effort!
| [reply] [d/l] [select] |
|
I had this older script that was doing something similar, but with two values separated by tab:
use strict;
use warnings;
my %res;
while (<>)
{
chomp;
my ( $name, $rest ) = split /\t/;
push @{ $res{$name} }, $rest;
}
for ( sort keys %res )
{
print "$_:", join( ",", @{ $res{$_} } );
print "\n";
}
but I do not know how to adapt it. | [reply] [d/l] |
|
Re: How do I use the map command for this?
by Marshall (Canon) on Jun 20, 2022 at 23:15 UTC
|
"How do I use the map command for this?"
"I am pretty sure I need to use the map command"
The map statement is just a short-hand way of writing a simple "for" loop.
ALL map's can be expressed as "for" (foreach) loops.
Perl generates the same code for a "map" as the equivalent "foreach" loop.
map{} is not your issue.
I think looking at Bio::Perl will lead to some suggestions for you. There are many routines optimized for the FASTA format.
| [reply] [d/l] |
|
Perl generates the same code for a "map" as the equivalent "foreach" loop.
wrong
| [reply] |
|
Would certainly enjoy being educated. I probably should have said "similar". A map is a looping statement.
| [reply] |
|
|
|
|
|
|
|