splitting a sequence using unpack

Rashmun has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
variation on splitting a string into elements of an array by Anonymous Monk on Mar 02, 2005 at 16:13 UTC
Most Wise and Esteemed Monks, I have the following query. Suppose I have the following in a text file: `atgcatccctttaat`
The following line breaks the string (captured in `$line`) into an array in which each element holds three characters of the string: `@triplets = unpack("a3" x (length($line)/3), $line);`
Now suppose I wish to change the reading frame. Consider the string once again: `atccatccctttaat`, from which I obtained these array elements: `atc cat ccc ttt aat`
But what if I also wish to obtain the following array elements from the same string, using a different reading frame: `tcc atc cct tta` (discarding the final `at` in this reading frame)? A third reading frame should give me the following array: `cca tcc ctt taa` (discarding the final `t` in this reading frame).
Is it possible to obtain all three kinds of arrays I've outlined using unpack, or would I have to turn to regexes?
Re: variation on splitting a string into elements of an array by Roy Johnson (Monsignor) on Mar 02, 2005 at 16:29 UTC
`X` will back up a byte, so `a3X2` reads three characters and then steps back two, advancing one character per repetition: `my $line ='atgcatccctttaat'; my @trips = unpack('a3X2' x (length($line)-2), $line); print join "\n", @trips;` yields: `atg tgc gca cat atc tcc ccc cct ctt ttt tta taa aat` Caution: Contents may have been coded under pressure.
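The same idea can be sketched for an arbitrary window width. The helper name `sliding_windows` and the `$width` parameter are assumptions made for illustration, not part of the reply above; the template is built exactly as in the `a3X2` example.

```perl
use strict;
use warnings;

# Hypothetical helper: every overlapping window of $width characters.
sub sliding_windows {
    my ($seq, $width) = @_;
    my $count = length($seq) - $width + 1;    # number of windows
    return () if $width < 1 || $count < 1;
    # Read $width characters, then back up $width - 1 so the next
    # read starts one character after the previous one.
    my $template = ("a$width" . 'X' . ($width - 1)) x $count;
    return unpack($template, $seq);
}

print join(' ', sliding_windows('atgcatccc', 3)), "\n";
# prints: atg tgc gca cat atc tcc ccc
```

Guarding against a window wider than the string keeps the repetition count from going negative, in which case the helper simply returns an empty list.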
Re^2: variation on splitting a string into elements of an array by Roy Johnson (Monsignor) on Mar 02, 2005 at 19:00 UTC
This uses the same technique to extract the three reading frames into separate arrays: `my $line = 'atgcatccctttaat'; my @trips; my $frame = 0; for (unpack('a3X2' x (length($line)-2), $line)) { push @{$trips[$frame++]}, $_; $frame %= 3; } use Data::Dumper; print Dumper \@trips;` Output: `$VAR1 = [ [ 'atg', 'cat', 'ccc', 'ttt', 'aat' ], [ 'tgc', 'atc', 'cct', 'tta' ], [ 'gca', 'tcc', 'ctt', 'taa' ] ];` Caution: Contents may have been coded under pressure.
Re: variation on splitting a string into elements of an array by borisz (Canon) on Mar 02, 2005 at 16:39 UTC
`my $line = 'atccatccctttaat'; my @triplets = unpack( "a3" x int( length($line) / 3 ), $line ); my @triplets2 = unpack( 'x' . "a3" x int( ( length($line) - 1 ) / 3 ), $line ); my @triplets3 = unpack( 'xx' . "a3" x int( ( length($line) - 2 ) / 3 ), $line );` Boris
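A minimal sketch of the same approach wrapped in a function, assuming the frame is passed as an offset of 0, 1 or 2. The name `reading_frame` is hypothetical; the template mirrors the `x` skip plus repeated `a3` used above.

```perl
use strict;
use warnings;

# Hypothetical helper: triplets of the reading frame starting at $offset.
sub reading_frame {
    my ($seq, $offset) = @_;
    my $n = int((length($seq) - $offset) / 3);   # whole triplets available
    return () if $n < 1;
    # "x$offset" skips $offset characters; each a3 then takes one triplet.
    # The x operator binds tighter than the dot, so only 'a3' is repeated.
    return unpack("x$offset" . 'a3' x $n, $seq);
}

for my $offset (0 .. 2) {
    my @codons = reading_frame('atccatccctttaat', $offset);
    print "frame $offset: @codons\n";
}
# prints:
# frame 0: atc cat ccc ttt aat
# frame 1: tcc atc cct tta
# frame 2: cca tcc ctt taa
```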
Re: variation on splitting a string into elements of an array by ikegami (Patriarch) on Mar 02, 2005 at 18:15 UTC
A variation that doesn't use `unpack`: `@triplets0 = $line =~ /(...)/g; @triplets1 = substr($line, 1) =~ /(...)/g; @triplets2 = substr($line, 2) =~ /(...)/g;` or even `push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;`
Finally, a solution for arbitrary group sizes: `$group_size = 3; push(@groups, [ substr($line, $_-1) =~ /(.{$group_size})/g ]) for 1..$group_size;` If you don't care to group them, remove the square brackets.
Test: `$line = 'atccatccctttaat'; push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2; require Data::Dumper; print(Data::Dumper::Dumper(\@triplets));` Output: `$VAR1 = [ [ 'atc', 'cat', 'ccc', 'ttt', 'aat' ], [ 'tcc', 'atc', 'cct', 'tta' ], [ 'cca', 'tcc', 'ctt', 'taa' ] ];`
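For reference, the arbitrary-group-size version can be sketched as a function under `strict`. The name `groups_by_offset` is an assumption for illustration; the body is the same `substr` plus list-context `/g` match shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: one array ref per starting offset 0 .. $size - 1.
sub groups_by_offset {
    my ($seq, $size) = @_;
    # A /g match in list context returns every captured group, so each
    # offset yields the complete list of full-length chunks.
    return map { [ substr($seq, $_) =~ /(.{$size})/gs ] } 0 .. $size - 1;
}

my @frames = groups_by_offset('atccatccctttaat', 3);
print "@$_\n" for @frames;
# prints:
# atc cat ccc ttt aat
# tcc atc cct tta
# cca tcc ctt taa
```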
Re: variation on splitting a string into elements of an array by fizbin (Chaplain) on Mar 02, 2005 at 16:59 UTC
Note that if you're using a perl as recent as perl 5.8, you can simplify your initial unpack to: `@triplets = unpack('(a3)*', $line);` so long as you're sure that `$line` has a length that is a multiple of 3. If you can't guarantee that, you'll get trailing crud in the last element of `@triplets`, which you can throw away with: `pop @triplets if $triplets[-1] !~ /.../;`
This leads to the following answer to your original question (very similar to what's already been posted): `my $line = 'atccatccctttaat'; my @triplets = unpack('(a3)*', $line); my @triplets2 = unpack('x(a3)*', $line); my @triplets3 = unpack('xx(a3)*', $line); pop @triplets if $triplets[-1] !~ /.../; pop @triplets2 if $triplets2[-1] !~ /.../; pop @triplets3 if $triplets3[-1] !~ /.../; # throw away trailing crud`
If you want to start reading at some arbitrary point, you can do: `my @triplets = unpack("x${skip}(a3)*", $line); pop @triplets if $triplets[-1] !~ /.../;` Of course, it might just be easier to replace `$line` with `substr($line,$skip)`.
-- `@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/`
Re^2: variation on splitting a string into elements of an array by bageler (Hermit) on Mar 02, 2005 at 17:49 UTC
Since you had repeating code, my preference is to put it in a loop. I'd also use `length` instead of `/.../`, since it's computationally less expensive, which I assume matters if you're parsing lots of sequences. `my $line = 'atccatccctttaat'; my %triplets; for (0 .. 2) { @{$triplets{$_}} = unpack(('x' x $_).'(a3)*',$line); pop @{$triplets{$_}} if length($triplets{$_}->[-1]) != 3; print "Offset $_: @{$triplets{$_}}\n"; }`
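A sketch of the group-template approach as a reusable function, assuming perl 5.8 or later for the `(a3)*` syntax. The name `codons` and the two guards are additions for illustration: one avoids skipping past the end of the string, the other avoids inspecting the last element of an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper: triplets starting $skip characters into $seq.
sub codons {
    my ($seq, $skip) = @_;
    return () if $skip > length $seq;     # 'x' cannot skip past the end
    my @codons = unpack("x${skip}(a3)*", $seq);
    # Drop a short trailing fragment, if there is one.
    pop @codons if @codons && length($codons[-1]) != 3;
    return @codons;
}

my @frame2 = codons('atccatccctttaat', 2);
print "@frame2\n";
# prints: cca tcc ctt taa
```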
Re^3: variation on splitting a string into elements of an array by fizbin (Chaplain) on Mar 02, 2005 at 18:11 UTC
Re: variation on splitting a string into elements of an array by holli (Abbot) on Mar 02, 2005 at 18:44 UTC
I have the slight feeling I've seen exactly the same question before. holli, /regexed monk/
Re^2: variation on splitting a string into elements of an array by Anonymous Monk on Mar 03, 2005 at 00:59 UTC
1. I thank the monks for the wisdom and enlightenment. 2. For those in doubt, it's not the same question: you'd know if you looked closely. 3. I am indeed Rashmun. (I am using a different computer and was in a hurry, so I didn't log in.)
Re: variation on splitting a string into elements of an array by manav (Scribe) on Mar 02, 2005 at 16:53 UTC
A slightly longer but easily configurable version: `use strict ; use warnings ; my $string="atccatccctttaat" ; my $left_out=2 ; my $template = "x$left_out" ; $template .= "a3" x ((length($string)-$left_out)/3) ; my @array=unpack($template, $string) ; local $"="\n" ; print "@array" ; ## updated according to suggestions` `$left_out` holds the number of characters you want to skip at the start. Manav
Re^2: variation on splitting a string into elements of an array by fizbin (Chaplain) on Mar 02, 2005 at 17:07 UTC
Why these two lines: `my $template = "a" x $left_out ; shift @array while($left_out--);` when you could easily use this one line instead: `my $template = "x$left_out" ;` (Earlier responses had used the "x" template character, and you'd clearly read those earlier responses.) This also has the advantage that you don't trash `$left_out`, in case you need it later. Of course, if you're using perl 5.8 or higher, I'd really recommend my solution instead.
-- `@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/`
Re^3: variation on splitting a string into elements of an array by manav (Scribe) on Mar 02, 2005 at 17:18 UTC
Re: splitting a sequence using unpack by Anonymous Monk on Mar 01, 2005 at 11:46 UTC
The `x` operator has nothing to do with unpack. It repeats a string (in scalar context); in this case, it's used to create as many 'a3's as needed. However, that's unnecessary: you could use "(a3)*" as the first argument to unpack as well. As for the format of unpack, it tells unpack what's in the string (the second argument) and how to "unpack" it. The 'a' means a string of arbitrary characters, the '3' means three of them, and the star means as many as needed. So '(a3)*' means: split the following string into pieces of three. Alternatively (and what I would have done, being more of a regex person than an unpack one), you could use a regex: `@triplets = $seq =~ /[actg]{3}/g;`
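A small sketch makes the distinction concrete: the repetition happens in ordinary Perl before unpack ever sees the template. The variable names are chosen for illustration, and the group form assumes perl 5.8 or later.

```perl
use strict;
use warnings;

my $seq = 'atgcatccctttaat';

# The x operator just repeats a string; unpack never sees the "x".
my $template = 'a3' x (length($seq) / 3);
print "$template\n";                      # prints: a3a3a3a3a3

my @by_repeat = unpack($template, $seq);
my @by_group  = unpack('(a3)*', $seq);    # group syntax, perl 5.8 or later

print "@by_repeat\n";                     # prints: atg cat ccc ttt aat
print "same result\n" if "@by_repeat" eq "@by_group";
```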
Re: splitting a sequence using unpack by Corion (Patriarch) on Mar 01, 2005 at 11:49 UTC
The first argument to the `pack` and `unpack` functions is not Perl code, but a string. What is allowed in that string is defined by the documentation for the `pack` function, which you can find by typing `perldoc -f pack` in a console window, or by reading the online documentation for pack. Stuff inside quotes is not seen by Perl as Perl code, but only as a literal value. Just like the difference between the word "And" and the way you concatenate two sentences.
Re: splitting a sequence using unpack by davis (Vicar) on Mar 01, 2005 at 11:46 UTC
It's a little unclear what your question is, but if you're trying to split every 3 characters, what about: `#!/usr/bin/perl use warnings; use strict; use Data::Dumper; while(<DATA>) { chomp; my @triplets = $_ =~ /(.{1,3})/gs; print Dumper \@triplets; } __DATA__ atgcatccctttaat tctgtctga` This splits using a regular expression, matching between 1 and 3 characters per "chunk", and it'll endeavour to match as many characters per chunk as possible. davis It wasn't easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
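To see what the `{1,3}` quantifier changes, here is a minimal comparison on a string whose length is not a multiple of 3. The input and variable names are made up for the example.

```perl
use strict;
use warnings;

my $seq = 'atgcatcc';                      # 8 characters, not a multiple of 3

my @strict  = $seq =~ /(...)/gs;           # only complete triplets
my @lenient = $seq =~ /(.{1,3})/gs;        # keeps the short tail as well

print "@strict\n";                         # prints: atg cat
print "@lenient\n";                        # prints: atg cat cc
```

Because the quantifier is greedy, every chunk except possibly the last is exactly three characters long.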
Re: splitting a sequence using unpack by manav (Scribe) on Mar 01, 2005 at 12:02 UTC
unpack accepts a template as its first argument. See `perldoc -f pack` on how to construct the template. For this example, "a3" means 3 ASCII characters, and `length($seq)/3` gives the number of such 3-character sets in the string. Hence the template is equivalent to "a3a3a3a3...", repeated as many times as needed. This splits `$seq` into elements that are each 3 ASCII characters wide and returns them as a list. Manav
Re: splitting a sequence using unpack by demerphq (Chancellor) on Mar 01, 2005 at 12:13 UTC
I think if you read the documentation for pack/unpack you'd know all you need to know (see perlfunc). But the brief version is: "a3" means extract the input as sequences of 3 ASCII chars (i.e. make a string out of the first three bytes). The `x` operator is the element/string multiplier; in this context it repeats "a3" a certain number of times. So the overall effect is to divide your input string into as many three-character sequences as there are in the string (assuming the string's length is a multiple of 3). Try doing some one-liners to see what is going on: `D:\Development>perl -e "print 'a3' x 10"` prints `a3a3a3a3a3a3a3a3a3a3`
This use of unpack is quite fast for the job it does but has a disadvantage that the string must be stored twice in memory, and the template string must also be created, which will only be slightly shorter. If you only need to deal with each three-character sequence one at a time, it may be better to do `while ($seq =~ /\G(...)/sg) { print "Got: $1\n"; }`
--- demerphq
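As a sketch of the one-at-a-time approach, the loop below tallies triplets without ever building an array of them. The input string and the `%count` hash are assumptions for the example.

```perl
use strict;
use warnings;

my $seq = 'atgatgccc';
my %count;

# /g in scalar context remembers its position between iterations, and \G
# anchors each match where the previous one ended.
while ($seq =~ /\G(...)/sg) {
    $count{$1}++;
}

print "$_: $count{$_}\n" for sort keys %count;
# prints:
# atg: 2
# ccc: 1
```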
Re^2: splitting a sequence using unpack by Taulmarill (Deacon) on Mar 01, 2005 at 12:23 UTC
If it's necessary to have the triplets in an array, this should do it: `my $seq = "atgcatccctttaat"; my @triplets = $seq =~ /(...)/g; print "$seq: ", join(",", @triplets), "\n";` And if the last "triplet" can have only one or two characters, just change the regex: `my @triplets = $seq =~ /(.{1,3})/g;` I agree with demerphq that unpack should be faster than a regex, but it's far more readable what's going on when you use regular expressions. Tune your code with unpack if the script is too slow. By the way, if you don't know what a function does, just type `perldoc -f unpack` ;-)
Re^2: splitting a sequence using unpack by BrowserUk (Patriarch) on Mar 01, 2005 at 12:25 UTC
"...it does but has a disadvantage that the string must be stored twice in memory..." Would you explain this please? Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re^3: splitting a sequence using unpack by Taulmarill (Deacon) on Mar 01, 2005 at 12:33 UTC
The string is originally in `$seq` and is duplicated into the array `@triplets`. If you only want to print the triplets, it's unnecessary to hold them all in memory; you can go through them one by one. Besides that, storing triplets in an array should, IMHO, use more memory than the string stored in a scalar.
Re^4: splitting a sequence using unpack by BrowserUk (Patriarch) on Mar 01, 2005 at 12:47 UTC
Re^5: splitting a sequence using unpack by Anonymous Monk on Mar 01, 2005 at 13:05 UTC
Re: splitting a sequence using unpack by insaniac (Friar) on Mar 01, 2005 at 12:16 UTC
ever thought about `perldoc -f pack` or `perldoc -f unpack`.. there you will find explanations for the mysterious a3 and b3 (and why they give a different result!) also: your message isn't that fun to read.. try creating more sentences with a dot at the end... if your message is enjoyable to read, more ppl will reply hth -- to ask a question is a moment of shame to remain ignorant is a lifelong shame
Re^2: splitting a sequence using unpack by Anonymous Monk on Mar 01, 2005 at 13:11 UTC
also: your message isn't that fun to read.. try creating more sentences with a dot at the end Yeah, pots, kettles, black. Your message contains ellipses where they are misplaced (and 2 out of three only have 2 instead of three dots), misses dots on places where one belongs, lacks capital letters, and uses incorrect abbreviations. You ought to clean up your own writing art before complaining about the acts of others.
Re^3: splitting a sequence using unpack by insaniac (Friar) on Mar 01, 2005 at 14:06 UTC
:-D I see I stepped on someone's toes :-D At least I had the guts to use my own name! :-p btw: what are ellipses? I know the mathematical ones... -- to ask a question is a moment of shame to remain ignorant is a lifelong shame
Re^4: splitting a sequence using unpack by gellyfish (Monsignor) on Mar 01, 2005 at 14:56 UTC
Re^4: splitting a sequence using unpack by Anonymous Monk on Mar 02, 2005 at 09:41 UTC
Re^5: splitting a sequence using unpack by insaniac (Friar) on Mar 02, 2005 at 10:37 UTC