in reply to splitting a sequence using unpack

Most Wise and Esteemed Monks, I have the following query. Suppose I have the following in a text file:
atgcatccctttaat
The following line breaks the string from the file (captured in $line) into an array where each element consists of three characters of the string:
@triplets = unpack ("a3" x (length($line)/3), $line);
Now suppose I wish to change the reading frame. Consider the string once again:
atccatccctttaat
from which I obtained these array elements:
atc cat ccc ttt aat
But what if I also wish to obtain the following array elements from the same string (using a different reading frame):
tcc atc cct tta
(Discard the final "at" in this reading frame.) And a third reading frame should give me the following array:
cca tcc ctt taa
(Discard the final "t" in this reading frame.) Is it possible to obtain all three kinds of arrays I've outlined using unpack? Or would I have to turn to regexes?

Replies are listed 'Best First'.
Re: variation on splitting a string into elements of an array
by Roy Johnson (Monsignor) on Mar 02, 2005 at 16:29 UTC
    X will back up a byte. So:
    my $line ='atgcatccctttaat';
    my @trips = unpack('a3X2' x (length($line)-2), $line);
    print join "\n", @trips;
    yields:
    atg tgc gca cat atc tcc ccc cct ctt ttt tta taa aat
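
A minimal sketch of how the 'a3X2' stepping works, using a hypothetical five-character input ('abcde' is not from the thread) so that each window is easy to follow. The variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical short input, chosen only to make the stepping visible.
my $line = 'abcde';

# 'a3' reads three bytes; 'X2' then backs up two bytes, so each
# group starts one byte after the previous one.
my $template = 'a3X2' x (length($line) - 2);
my @windows  = unpack($template, $line);

print "@windows\n";   # prints "abc bcd cde"
```

A string of length N has N-2 overlapping windows of width three, which is where the repeat count of length($line)-2 comes from.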

    Caution: Contents may have been coded under pressure.
      This uses the same technique to extract three frame sequences into separate arrays:
      my $line ='atgcatccctttaat';
      my @trips;
      my $frame = 0;
      for (unpack('a3X2' x (length($line)-2), $line)) {
          push @{$trips[$frame++]}, $_;
          $frame %= 3;
      }
      use Data::Dumper;
      print Dumper \@trips;
      output:
      $VAR1 = [
        [ 'atg', 'cat', 'ccc', 'ttt', 'aat' ],
        [ 'tgc', 'atc', 'cct', 'tta' ],
        [ 'gca', 'tcc', 'ctt', 'taa' ]
      ];

      Caution: Contents may have been coded under pressure.
Re: variation on splitting a string into elements of an array
by borisz (Canon) on Mar 02, 2005 at 16:39 UTC
    my $line = 'atccatccctttaat';
    my @triplets  = unpack( "a3" x int( length($line) / 3 ), $line );
    my @triplets2 = unpack( 'x' . "a3" x int( ( length($line) - 1 ) / 3 ), $line );
    my @triplets3 = unpack( 'xx' . "a3" x int( ( length($line) - 2 ) / 3 ), $line );
    Boris
Re: variation on splitting a string into elements of an array
by ikegami (Patriarch) on Mar 02, 2005 at 18:15 UTC

    A variation that doesn't use unpack:

    @triplets0 = $line =~ /(...)/g;
    @triplets1 = substr($line, 1) =~ /(...)/g;
    @triplets2 = substr($line, 2) =~ /(...)/g;

    or even

    push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;

    Finally, a solution for arbitrary group sizes:

    $group_size = 3;
    push(@groups, [ substr($line, $_-1) =~ /(.{$group_size})/g ]) for 1..$group_size;

    If you don't care to group them, remove the square brackets.
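
A minimal sketch of the ungrouped variant, assuming the sample string from the question; @flat is a name introduced here for illustration. Without the square brackets, the matches from every frame are pushed into a single flat list.

```perl
use strict;
use warnings;

my $line       = 'atccatccctttaat';   # sample string from the question
my $group_size = 3;

# No square brackets: every match from every frame lands in one flat list.
my @flat;
push(@flat, substr($line, $_ - 1) =~ /(.{$group_size})/g) for 1 .. $group_size;

print scalar(@flat), " groups: @flat\n";
```

The trade-off is that the frame boundaries are lost, so this form only suits cases where the frame each group came from does not matter.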

    Test:

    $line = 'atccatccctttaat';
    push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;
    require Data::Dumper;
    print(Data::Dumper::Dumper(\@triplets));
    __END__
    output
    ======
    $VAR1 = [
      [ 'atc', 'cat', 'ccc', 'ttt', 'aat' ],
      [ 'tcc', 'atc', 'cct', 'tta' ],
      [ 'cca', 'tcc', 'ctt', 'taa' ]
    ];
Re: variation on splitting a string into elements of an array
by fizbin (Chaplain) on Mar 02, 2005 at 16:59 UTC
    Note that if you're using a perl as recent as perl 5.8, you can simplify your initial unpack to:
    @triplets = unpack ('(a3)*', $line);
    So long as you're sure that $line has a length that is a multiple of 3. If you don't necessarily have that, you'll get trailing crud in the last element of @triplets:
    # throw away trailing crud
    pop @triplets if $triplets[-1] !~ /.../;
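
A minimal sketch of the trailing-element case, assuming a hypothetical eight-character input that is not a multiple of three. The extra @triplets guard is an addition here, so that an empty list is never tested.

```perl
use strict;
use warnings;

my $line = 'atccatcc';   # hypothetical input, 8 characters

my @triplets = unpack('(a3)*', $line);
print "before: @triplets\n";   # before: atc cat cc

# Drop the last element if it is shorter than three characters.
# The @triplets guard (an addition here) avoids testing an empty list.
pop @triplets if @triplets && $triplets[-1] !~ /.../;
print "after:  @triplets\n";   # after:  atc cat
```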
    This leads to this answer for your original question (very similar to what's already been posted)
    my $line = 'atccatccctttaat';
    my @triplets  = unpack( '(a3)*', $line);
    my @triplets2 = unpack( 'x(a3)*', $line);
    my @triplets3 = unpack( 'xx(a3)*', $line);
    # throw away trailing crud
    pop @triplets  if $triplets[-1]  !~ /.../;
    pop @triplets2 if $triplets2[-1] !~ /.../;
    pop @triplets3 if $triplets3[-1] !~ /.../;
    If you want to start reading at some arbitrary point, you can do:
    my @triplets = unpack("x${skip}(a3)*", $line);
    pop @triplets if $triplets[-1] !~ /.../;
    Of course, it might just be easier to replace $line with substr($line,$skip).
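
A minimal sketch of the substr alternative, assuming the sample string from the question and a hypothetical $skip of 2. The template stays fixed and only the starting point of the string changes.

```perl
use strict;
use warnings;

my $line = 'atccatccctttaat';   # sample string from the question
my $skip = 2;                   # hypothetical offset for the third frame

# Start the string at the offset instead of skipping inside the template.
my @triplets = unpack('(a3)*', substr($line, $skip));

# throw away a short trailing element, if any
pop @triplets if @triplets && $triplets[-1] !~ /.../;

print "@triplets\n";   # prints "cca tcc ctt taa"
```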
    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Since you had repeating code, my preference is to put it in a loop. I'd also use length instead of /.../; it's computationally less expensive, which I assume matters if you're parsing lots of sequences.
      my $line = 'atccatccctttaat';
      my %triplets;
      for (0 .. 2) {
          @{$triplets{$_}} = unpack(('x' x $_).'(a3)*',$line);
          pop @{$triplets{$_}} if length($triplets{$_}->[-1]) != 3;
          print "Offset $_: @{$triplets{$_}}\n";
      }
        Yeah, I thought about a loop, but I was afraid it would obscure what the code is doing - I'd certainly switch to a loop if doing more than 3 or four repetitions, and if I were doing more than two statements per offset, but at less than that the loop syntax just clutters things up. (I realize that this is a matter of personal taste, and I might change my answer depending on my mood).

        Good point about using length - again, I think my way is clearer, but I'm not sure whether that's the use of length or the ->. (which you could drop)

        By the way, I would have written the first line of your loop as:

        $triplets{$_} = [unpack("x$_ (a3)*", $line)];
        But that's only because I don't like using @{$unusedvarref} to auto-vivify.
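
A minimal sketch comparing the two ways of filling the hash, assuming the sample string from the question; %by_deref and %by_ref are names introduced here for illustration. Both forms should end up with the same structure, so the choice is a matter of style.

```perl
use strict;
use warnings;

my $line = 'atccatccctttaat';   # sample string from the question
my (%by_deref, %by_ref);

for my $offset (0 .. 2) {
    # Assigning through @{...} autovivifies the array reference.
    @{ $by_deref{$offset} } = unpack("x$offset (a3)*", $line);

    # Building the reference explicitly creates it in plain sight.
    $by_ref{$offset} = [ unpack("x$offset (a3)*", $line) ];
}

# Trailing short elements are left in place here to keep the sketch small.
print "offset $_: @{ $by_ref{$_} }\n" for sort keys %by_ref;
```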
        -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: variation on splitting a string into elements of an array
by holli (Abbot) on Mar 02, 2005 at 18:44 UTC
    I have the slight feeling I've seen exactly the same question before.


    holli, /regexed monk/
      1. I thank the monks for the wisdom and enlightenment.
      2. For those in doubt, it's not the same question: you'd know if you looked closely.
      3. I am indeed Rashmun. (I am using a different computer and was in a hurry, so I didn't log in.)
Re: variation on splitting a string into elements of an array
by manav (Scribe) on Mar 02, 2005 at 16:53 UTC
    A slightly longer but easily configurable version:
    use strict ;
    use warnings ;
    ##updated according to suggestions
    my $string="atccatccctttaat" ;
    my $left_out=2 ;
    my $template = "x$left_out" ;
    $template .= "a3" x ((length($string)-$left_out)/3) ;
    my @array=unpack($template, $string) ;
    local $"="\n" ;
    print "@array" ;
    $left_out holds the number of characters you want to skip from the start.

    Manav
      Why these two lines:
      my $template = "a" x $left_out ;
      shift @array while($left_out--);
      When you could easily use this one line instead: (and earlier responses had used the "x" template character, and you'd clearly read those earlier responses)
      my $template = "x$left_out" ;
      This also has the advantage that you don't trash $left_out, in case you need it later. Of course, if you're using a perl 5.8 or higher, I'd really recommend my solution instead.
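
A minimal sketch of the side effect described above. The first half is an assumed reconstruction of the earlier approach, built from the two quoted lines plus the rest of the updated post. $skip and @kept are names introduced here for illustration.

```perl
use strict;
use warnings;

my $string = 'atccatccctttaat';   # sample string from the thread

# Assumed reconstruction of the earlier approach: unpack the skipped
# characters as one-byte fields, then shift them off the result.
my $left_out = 2;
my $template = 'a' x $left_out;
$template .= 'a3' x int((length($string) - $left_out) / 3);
my @array = unpack($template, $string);
shift @array while ($left_out--);
print "after shifting: left_out = $left_out\n";   # the value is now -1, not 2

# With 'x' the skipped bytes never enter the list, so nothing is
# shifted and the counter keeps its value.
my $skip = 2;
my @kept = unpack("x$skip" . 'a3' x int((length($string) - $skip) / 3), $string);
print "with x: skip = $skip, triplets = @kept\n";
```

Both halves produce the same four triplets; only the first leaves its counter changed.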
      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        You are right about the use of 'x'. By the way, I didn't read the earlier comments, as they were not there when I answered.

        Manav