variation on splitting a string into elements of an array
by Anonymous Monk on Mar 02, 2005 at 16:13 UTC
Most Wise and Esteemed Monks,
I have the following query:
Suppose I have the following in a text file:
atgcatccctttaat
The following line will break up the string in the file into an array where each element consists of three characters of the string (where the string in the file has been captured in $line).
@triplets = unpack ("a3" x (length($line)/3), $line);
Now suppose I wish to change the reading frame. Consider the string once again:
atccatccctttaat
from which I obtained these array elements:
atc cat ccc ttt aat
But what if I also wish to obtain the following array elements from the same string (using a different reading frame):
tcc atc cct tta
(Discard the final at in this reading frame)
And a third reading frame should give me the following array:
cca tcc ctt taa
(Discard the final t in this reading frame)
Is it possible to obtain all three kinds of arrays I've outlined using unpack? Or would I have to turn to regexes?
X will back up a byte. So:
my $line ='atgcatccctttaat';
my @trips = unpack('a3X2' x (length($line)-2), $line);
print join "\n", @trips;
yields:
atg
tgc
gca
cat
atc
tcc
ccc
cct
ctt
ttt
tta
taa
aat
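To make the a3X2 trick a bit more concrete, here is a minimal sketch that wraps it in a helper. The sub name sliding_windows and its $width parameter are my own, not something from the post above; the idea is only that "a$width" reads a chunk and "X" steps back far enough that the next read starts one byte later.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions, not from the thread):
# returns every overlapping window of $width characters in $seq.
sub sliding_windows {
    my ($seq, $width) = @_;
    my $count = length($seq) - $width + 1;   # number of windows that fit
    return () if $count < 1;

    # "a$width" reads $width bytes; "X$back" moves back so that the next
    # read starts exactly one byte after the previous one.
    my $back = $width - 1;
    my $unit = "a$width" . ( $back ? "X$back" : '' );

    return unpack( $unit x $count, $seq );
}

my @windows = sliding_windows( 'atgcatccctttaat', 3 );
print scalar(@windows), " windows, first four: @windows[0 .. 3]\n";
```

With the 15-character string from the post this should give the same 13 windows listed above.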
Caution: Contents may have been coded under pressure.
This uses the same technique to extract three frame sequences into separate arrays:
my $line ='atgcatccctttaat';
my @trips;
my $frame = 0;
for (unpack('a3X2' x (length($line)-2), $line)) {
push @{$trips[$frame++]}, $_;
$frame %= 3;
}
use Data::Dumper;
print Dumper \@trips;
output:
$VAR1 = [
[
'atg',
'cat',
'ccc',
'ttt',
'aat'
],
[
'tgc',
'atc',
'cct',
'tta'
],
[
'gca',
'tcc',
'ctt',
'taa'
]
];
Caution: Contents may have been coded under pressure.
my $line = 'atccatccctttaat';
my @triplets = unpack( "a3" x int( length($line) / 3 ), $line );
my @triplets2 = unpack( 'x' . "a3" x int( ( length($line) - 1 ) / 3 ), $line );
my @triplets3 = unpack( 'xx' . "a3" x int( ( length($line) - 2 ) / 3 ), $line );
@triplets0 = $line =~ /(...)/g;
@triplets1 = substr($line, 1) =~ /(...)/g;
@triplets2 = substr($line, 2) =~ /(...)/g;
or even
push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;
Finally, a solution for arbitrary group sizes:
$group_size = 3;
push(@groups, [ substr($line, $_-1) =~ /(.{$group_size})/g ])
for 1..$group_size;
If you don't care to group them, remove the square brackets.
Test:
$line = 'atccatccctttaat';
push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;
require Data::Dumper;
print(Data::Dumper::Dumper(\@triplets));
__END__
output
======
$VAR1 = [
[
'atc',
'cat',
'ccc',
'ttt',
'aat'
],
[
'tcc',
'atc',
'cct',
'tta'
],
[
'cca',
'tcc',
'ctt',
'taa'
]
];
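The arbitrary-group-size variant can also be packaged as a small sub. This is only a sketch under my own naming (reading_frames and its two parameters are hypothetical); it uses the same substr-plus-regex approach as above and returns one array reference per frame.

```perl
use strict;
use warnings;

# Hypothetical helper: one array ref per reading frame, for any group size.
sub reading_frames {
    my ($seq, $size) = @_;
    # Frame N starts N characters in; the regex keeps only complete groups.
    return map { [ substr( $seq, $_ ) =~ /(.{$size})/g ] } 0 .. $size - 1;
}

my @frames = reading_frames( 'atccatccctttaat', 3 );
print "Frame $_: @{ $frames[$_] }\n" for 0 .. $#frames;
```

For the test string this should reproduce the three arrays in the Dumper output above, with incomplete trailing groups silently dropped by the regex.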
Note that if you're using a perl as recent as perl 5.8, you can simplify your initial unpack to:
@triplets = unpack ('(a3)*', $line);
So long as you're sure that $line has a length that is a multiple of 3. If you don't necessarily have that, you'll get trailing crud in the last element of @triplets:
# throw away trailing crud
pop @triplets if $triplets[-1] !~ /.../;
This leads to this answer for your original question (very similar to what's already been posted)
my $line = 'atccatccctttaat';
my @triplets = unpack( '(a3)*', $line);
my @triplets2 = unpack( 'x(a3)*', $line);
my @triplets3 = unpack( 'xx(a3)*', $line);
# throw away trailing crud
pop @triplets if $triplets[-1] !~ /.../;
pop @triplets2 if $triplets2[-1] !~ /.../;
pop @triplets3 if $triplets3[-1] !~ /.../;
If you want to start reading at some arbitrary point, you can do:
my @triplets = unpack("x${skip}(a3)*", $line);
pop @triplets if $triplets[-1] !~ /.../;
Of course, it might just be easier to replace $line with substr($line,$skip).
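A minimal sketch of that substr variant, assuming perl 5.8 or later for the '(a3)*' group. The sub name codons_from is made up for illustration, and the length check is borrowed as an alternative to the /.../ test.

```perl
use strict;
use warnings;

# Hypothetical helper: complete triplets of $line, starting $skip characters in.
sub codons_from {
    my ($line, $skip) = @_;
    return () if $skip >= length $line;      # nothing left to read

    my @triplets = unpack( '(a3)*', substr( $line, $skip ) );

    # throw away trailing crud (a final element shorter than three characters)
    pop @triplets if @triplets && length( $triplets[-1] ) < 3;
    return @triplets;
}

print join( ' ', codons_from( 'atccatccctttaat', $_ ) ), "\n" for 0 .. 2;
```

Taking the substr first means the template never has to change with the offset, and the guard on $skip avoids asking substr for a position past the end of the string.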
--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/
Since you had repeating code, my preference is to put it in a loop. I'd also use length instead of /.../, it's computationally less expensive which I assume is important if you're parsing lots of sequences.
my $line = 'atccatccctttaat';
my %triplets;
for (0 .. 2) {
@{$triplets{$_}} = unpack(('x' x $_).'(a3)*',$line);
pop @{$triplets{$_}} if length($triplets{$_}->[-1]) != 3;
print "Offset $_: @{$triplets{$_}}\n";
}
I have the slight feeling I've seen exactly the same question before.
1. I thank the monks for the wisdom and enlightenment.
2. For those in doubt, it's not the same question: you'd know if you looked closely.
3. I am indeed Rashmun. (I am using a different computer, and was in a hurry, so didn't login.)
A slightly longer but easily configurable code
use strict ;
use warnings ;
##updated according to suggestions
my $string="atccatccctttaat" ;
my $left_out=2 ;
my $template = "x$left_out" ;
$template .= "a3" x ((length($string)-$left_out)/3) ;
my @array=unpack($template, $string) ;
local $"="\n" ;
print "@array" ;
$left_out contains the number of characters you want to skip from the start of the string.
Manav
my $template = "a" x $left_out ;
shift @array while($left_out--);
When you could easily use this one line instead: (and earlier responses had used the "x" template character, and you'd clearly read those earlier responses)
my $template = "x$left_out" ;
This also has the advantage that you don't trash $left_out, in case you need it later. Of course, if you're using a perl 5.8 or higher, I'd really recommend my solution instead.
--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: splitting a sequence using unpack
by Anonymous Monk on Mar 01, 2005 at 11:46 UTC
The x operator has nothing to do with unpack. It creates strings (in scalar context). In this case, it's used to create as many 'a3's as needed. However, that's unnecessary - you could use "(a3)*" as first argument to unpack as well.
As for the format of unpack, it tells unpack what's in the string (second argument), and how to "unpack" it. 'a' means a string of arbitrary bytes, which for plain ASCII data is one byte per character. The '3' means three of them. The star means as many as needed. So '(a3)*' means: split the following string into pieces of three.
Alternatively (and what I would have done, being more of a regex person than an unpack one), you could use a regex:
@triplets = $seq =~ /[actg]{3}/g;
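As a quick check of the point about x and '(a3)*', the following sketch builds the repeated template by hand and compares it with the grouped one. The variable names are mine, and the '(a3)*' form assumes a perl with template groups (5.8 or later).

```perl
use strict;
use warnings;

my $seq = 'atgcatccctttaat';

# x is ordinary string repetition; unpack never sees the operator, only its result.
my $repeated  = 'a3' x ( length($seq) / 3 );
my @by_repeat = unpack( $repeated, $seq );

# The grouped template lets unpack do the repeating itself.
my @by_group = unpack( '(a3)*', $seq );

print "template: $repeated\n";
print "pieces:   @by_repeat\n";
print "both templates agree\n" if "@by_repeat" eq "@by_group";
```

Here the repeated template comes out as a3a3a3a3a3, since the string is 15 characters long.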
Re: splitting a sequence using unpack
by Corion (Patriarch) on Mar 01, 2005 at 11:49 UTC
The first argument to the pack and unpack functions is not Perl code, but a string. And what is allowed in that string is defined by the documentation for the pack function, which you can find by typing perldoc -f pack in a console window, or by reading the pack documentation online.
Stuff inside of quotes is not seen by Perl as Perl code, but only as a literal value. Just like the difference between the word "And" and the way you concatenate two sentences.
Re: splitting a sequence using unpack
by davis (Vicar) on Mar 01, 2005 at 11:46 UTC
It's a little unclear what your question is, but if you're trying to split every 3 characters, what about:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
while(<DATA>) {
chomp;
my @triplets = $_ =~ /(.{1,3})/gs;
print Dumper \@triplets;
}
__DATA__
atgcatccctttaat
tctgtctga
This splits using a regular expression, matching between 1 and 3 characters per "chunk", and it'll endeavour to match as many characters per chunk as possible.
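To see what the {1,3} quantifier does when the length is not a multiple of three, here is a small sketch with a made-up 8-character input. The grep that keeps only full triplets is my own addition, in case the short tail is unwanted.

```perl
use strict;
use warnings;

my $seq = 'atgcatcc';    # 8 characters, so the last chunk is short

# Greedy matching takes 3 characters while it can, then whatever is left.
my @chunks = $seq =~ /(.{1,3})/gs;

# Optional: keep only the complete triplets.
my @codons = grep { length($_) == 3 } @chunks;

print "chunks: @chunks\n";
print "codons: @codons\n";
```

The first line printed includes the two-character remainder as its own element; the second drops it.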
davis
It wasn't easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
Re: splitting a sequence using unpack
by manav (Scribe) on Mar 01, 2005 at 12:02 UTC
unpack accepts as its first argument a template. See
perldoc -f pack
on how to construct the template.
For this example, "a3" means a string of 3 bytes (3 characters for ASCII data). length($seq)/3 gives the number of such 3-character groups.
Hence, this template is equivalent to
"a3a3a3a3", repeated as many times as needed. This splits $seq into elements each 3 characters wide and returns them as a list.
Manav
Re: splitting a sequence using unpack
by demerphq (Chancellor) on Mar 01, 2005 at 12:13 UTC
I think if you read the documentation for pack/unpack you'd know all you need to know (see perlfunc). But the brief version is: "a3" means extract the next three bytes of the input as a string. The 'x' operator is the element/string multiplier; in this context it repeats "a3" a certain number of times. So the overall effect is to divide your input string into as many three-character sequences as there are in the string (assuming the string's length is a multiple of 3). Try some one-liners to see what is going on:
D:\Development>perl -e "print 'a3' x 10"
a3a3a3a3a3a3a3a3a3a3
This use of unpack is quite fast for the job it does, but it has the disadvantage that the string must be stored twice in memory, and the template string must be created, which will only be slightly shorter than the input. If you only need to deal with one three-character sequence at a time, it may be better to do
while ($seq =~ /\G(...)/sg) {
print "Got: $1\n";
}
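As an illustration of handling one triplet at a time, this sketch tallies codons with the same \G loop instead of building an array first. The %count hash and the tallying are my own example of "dealing with each sequence", not something the loop above requires.

```perl
use strict;
use warnings;

my $seq = 'atgcatccctttaat';
my %count;

# /g in scalar context remembers where the last match ended; \G anchors the
# next attempt there, so the loop walks the string three characters at a time.
while ( $seq =~ /\G(...)/sg ) {
    $count{$1}++;
}

print "$_: $count{$_}\n" for sort keys %count;
```

Only the current triplet and the running tallies are held at any moment, which is the point of iterating rather than unpacking everything into a list.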
If it's necessary to have the triplets in an array, this should do it.
my $seq = "atgcatccctttaat";
my @triplets = $seq =~ /(...)/g;
print "$seq: ", join(",", @triplets), "\n";
And if the last "triplet" can have only one or two characters, just change the regex.
my @triplets = $seq =~ /(.{1,3})/g;
Although I agree with demerphq that unpack should be faster than a regex, regular expressions make it far more readable what's going on. Tune your code with unpack if the script is too slow.
BTW, if you don't know what a function does, just type perldoc -f unpack ;-)
The string is originally in $seq and is duplicated into the array @triplets. If you only want to print the triplets, it's unnecessary to hold them all in memory; you can go through them one by one. Besides that, storing the triplets in an array should IMHO use more memory than the string stored in a scalar.
Re: splitting a sequence using unpack
by insaniac (Friar) on Mar 01, 2005 at 12:16 UTC
ever thought about perldoc -f pack or perldoc -f unpack.. there you will find explanations for the mysterious a3 and b3 (and why they give a different result!)
also: your message isn't that fun to read.. try creating more sentences with a dot at the end... if your message is enjoyable to read, more ppl will reply
hth
--
to ask a question is a moment of shame
to remain ignorant is a lifelong shame
also: your message isn't that fun to read.. try creating more sentences with a dot at the end
Yeah, pots, kettles, black. Your message contains ellipses where they are misplaced (and 2 out of three only have 2 instead of three dots), misses dots on places where one belongs, lacks capital letters, and uses incorrect abbreviations. You ought to clean up your own writing art before complaining about the acts of others.