Rashmun has asked for the wisdom of the Perl Monks concerning the following question:

Salutations to the esteemed Monks here,
My problem is as follows.
This is what my program looks like:

print "Enter the filename of file having codon data:\n";
$file = <STDIN>;
chomp ($file);
open (IN, $file) or die "Error opening $file: $!\n";
print "\n\n";
while ($seq = <IN>) {
    print "$seq\n";
    @triplets = unpack ("a3" x (length($seq)/3), $seq);
    print "@triplets\n\n";
}
print "\n\n";

Now suppose the text file "codons" (no quotes) contains the following:

atgcatccctttaat
tctgtctga

Then the output of my program is something like this:

atgcatccctttaat
atg cat ccc ttt aat
tctgtctga
tct gtc tga

Note that I am splitting a sequence into array elements, where each element of the array corresponds to three characters of the sequence, and I am doing this without using any delimiter like , or : (so I can't use split).

My problem is that the unpack function is highly mysterious to me. The "x", for example, is the cross operator, I believe, but I don't know what it does. The "a3" inside the parentheses within unpack turns out to be compulsory; replace "a3" with "b3" and you get something weird as output. So could someone please shed some light on the unpack function and explain how it operates here in my program?

20050301 Janitored by Corion: Added formatting

variation on splitting a string into elements of an array
by Anonymous Monk on Mar 02, 2005 at 16:13 UTC
    Most Wise and Esteemed Monks, I have the following query. Suppose I have the following in a text file:
    atgcatccctttaat
    The following line breaks the string from the file (captured in $line) into an array in which each element consists of three characters of the string:
    @triplets = unpack ("a3" x (length($line)/3), $line);
    Now suppose i wish to change the reading frame. Consider the string once again:
    atccatccctttaat
    for which I obtained these array elements:
    atc cat ccc ttt aat
    But what if I also wish to obtain the following array elements from the same string (using a different reading frame):
    tcc atc cct tta
    (Discard the final "at" in this reading frame.) And a third reading frame should give me the following array:
    cca tcc ctt taa
    (Discard the final "t" in this reading frame.) Is it possible to obtain all three kinds of arrays I've outlined using unpack? Or would I have to turn to regexes?
      X will back up a byte. So:
      my $line ='atgcatccctttaat';
      my @trips = unpack('a3X2' x (length($line)-2), $line);
      print join "\n", @trips;
      yields:
      atg
      tgc
      gca
      cat
      atc
      tcc
      ccc
      cct
      ctt
      ttt
      tta
      taa
      aat

      Caution: Contents may have been coded under pressure.
        This uses the same technique to extract three frame sequences into separate arrays:
        my $line ='atgcatccctttaat';
        my @trips;
        my $frame;
        for (unpack('a3X2' x (length($line)-2), $line)) {
            push @{$trips[$frame++]}, $_;
            $frame %= 3;
        }
        use Data::Dumper;
        print Dumper \@trips;
        output:
        $VAR1 = [
          [ 'atg', 'cat', 'ccc', 'ttt', 'aat' ],
          [ 'tgc', 'atc', 'cct', 'tta' ],
          [ 'gca', 'tcc', 'ctt', 'taa' ]
        ];

        Caution: Contents may have been coded under pressure.
      my $line = 'atccatccctttaat';
      my @triplets  = unpack( "a3" x int( length($line) / 3 ), $line );
      my @triplets2 = unpack( 'x' . "a3" x int( ( length($line) - 1 ) / 3 ), $line );
      my @triplets3 = unpack( 'xx' . "a3" x int( ( length($line) - 2 ) / 3 ), $line );
      Boris

      A variation that doesn't use unpack:

      @triplets0 = $line =~ /(...)/g;
      @triplets1 = substr($line, 1) =~ /(...)/g;
      @triplets2 = substr($line, 2) =~ /(...)/g;

      or even

      push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;

      Finally, a solution for arbitrary group sizes:

      $group_size = 3;
      push(@groups, [ substr($line, $_-1) =~ /(.{$group_size})/g ]) for 1..$group_size;

      If you don't care to group them, remove the square brackets.

      Test:

      $line = 'atccatccctttaat';
      push(@triplets, [ substr($line, $_) =~ /(...)/g ]) for 0..2;
      require Data::Dumper;
      print(Data::Dumper::Dumper(\@triplets));
      __END__
      output
      ======
      $VAR1 = [
        [ 'atc', 'cat', 'ccc', 'ttt', 'aat' ],
        [ 'tcc', 'atc', 'cct', 'tta' ],
        [ 'cca', 'tcc', 'ctt', 'taa' ]
      ];
      Note that if you're using a perl as recent as perl 5.8, you can simplify your initial unpack to:
      @triplets = unpack ('(a3)*', $line);
      So long as you're sure that $line has a length that is a multiple of 3. If you don't necessarily have that, you'll get trailing crud in the last element of @triplets:
      # throw away trailing crud
      pop @triplets if $triplets[-1] !~ /.../;
      This leads to this answer for your original question (very similar to what's already been posted)
      my $line = 'atccatccctttaat';
      my @triplets  = unpack( '(a3)*', $line);
      my @triplets2 = unpack( 'x(a3)*', $line);
      my @triplets3 = unpack( 'xx(a3)*', $line);
      # throw away trailing crud
      pop @triplets  if $triplets[-1]  !~ /.../;
      pop @triplets2 if $triplets2[-1] !~ /.../;
      pop @triplets3 if $triplets3[-1] !~ /.../;
      If you want to start reading at some arbitrary point, you can do:
      my @triplets = unpack("x${skip}(a3)*", $line);
      pop @triplets if $triplets[-1] !~ /.../;
      Of course, it might just be easier to replace $line with substr($line,$skip).
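The substr alternative mentioned above might look something like the following. This is only a sketch: the helper name frame_triplets is made up for illustration, and it assumes perl 5.8 or later for the '(a3)*' group syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: the triplets of $line read from offset $skip,
# with an incomplete final group thrown away.
sub frame_triplets {
    my ($line, $skip) = @_;

    # Drop the first $skip characters, then cut the rest into threes.
    my @triplets = unpack('(a3)*', substr($line, $skip));

    # A short last element is a partial codon; discard it.
    pop @triplets if @triplets && length($triplets[-1]) < 3;

    return @triplets;
}

my $line = 'atccatccctttaat';
for my $skip (0 .. 2) {
    my @triplets = frame_triplets($line, $skip);
    print "frame $skip: @triplets\n";
}
```

Checking the length of the last element instead of matching /.../ also avoids a warning when the list happens to be empty.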
      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        Since you had repeating code, my preference is to put it in a loop. I'd also use length instead of /.../; it's computationally less expensive, which I assume matters if you're parsing lots of sequences.
        my $line = 'atccatccctttaat';
        my %triplets;
        for (0 .. 2) {
            @{$triplets{$_}} = unpack(('x' x $_).'(a3)*',$line);
            pop @{$triplets{$_}} if length($triplets{$_}->[-1]) != 3;
            print "Offset $_: @{$triplets{$_}}\n";
        }
      I have the slight feeling I've seen exactly the same question before.


      holli, /regexed monk/
        1. I thank the monks for the wisdom and enlightenment. 2. For those in doubt, it's not the same question: you'd know if you looked closely. 3. I am indeed Rashmun. (I am using a different computer and was in a hurry, so I didn't log in.)
      A slightly longer but easily configurable code
      use strict ;
      use warnings ;
      ##updated according to suggestions
      my $string="atccatccctttaat" ;
      my $left_out=2 ;
      my $template = "x$left_out" ;
      $template .= "a3" x ((length($string)-$left_out)/3) ;
      my @array=unpack($template, $string) ;
      local $"="\n" ;
      print "@array" ;
      $left_out contains the number of characters you want to skip from the start.

      Manav
        Why these two lines:
        my $template = "a" x $left_out ;
        shift @array while($left_out--);
        When you could easily use this one line instead: (and earlier responses had used the "x" template character, and you'd clearly read those earlier responses)
        my $template = "x$left_out" ;
        This also has the advantage that you don't trash $left_out, in case you need it later. Of course, if you're using a perl 5.8 or higher, I'd really recommend my solution instead.
        -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: splitting a sequence using unpack
by Anonymous Monk on Mar 01, 2005 at 11:46 UTC
    The x operator has nothing to do with unpack. It creates strings (in scalar context). In this case, it's used to create as many 'a3's as needed. However, that's unnecessary - you could use "(a3)*" as first argument to unpack as well.

    As for the format of unpack: it tells unpack what's in the string (the second argument) and how to "unpack" it. 'a' means a string of characters (arbitrary bytes, in the terms of the pack documentation). The '3' means three of them. The star means as many as needed. So '(a3)*' means: split the following string into pieces of three.
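To make the two points above concrete, here is a small sketch (the variable names are mine, and '(a3)*' assumes perl 5.8 or later). It shows what x builds and that the repeated template and the grouped template split the string the same way.

```perl
use strict;
use warnings;

my $seq = 'atgcatccctttaat';

# In scalar context, x repeats a string: 15 / 3 = 5 copies of 'a3'.
my $template = 'a3' x (length($seq) / 3);

# Both templates ask unpack for consecutive 3-character fields.
my @by_repeat = unpack($template, $seq);
my @by_group  = unpack('(a3)*', $seq);

# With a parenthesised list on the left, x repeats the list instead.
my @zeros = (0) x 3;

print "$template\n";      # a3a3a3a3a3
print "@by_repeat\n";     # atg cat ccc ttt aat
print "@by_group\n";      # atg cat ccc ttt aat
print "@zeros\n";         # 0 0 0
```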

    Alternatively (and what I would have done, being more of a regex person than an unpack one), you could use a regex:

    @triplets = $seq =~ /[actg]{3}/g;
Re: splitting a sequence using unpack
by Corion (Patriarch) on Mar 01, 2005 at 11:49 UTC

    The first argument to the pack and unpack functions is not Perl code, but a string. What is allowed in that string is defined by the documentation for the pack function, which you can find by typing perldoc -f pack in a console window, or by reading the online documentation for pack.

    Stuff inside of quotes is not seen by Perl as Perl code, but only as a literal value. Just like the difference between the word "And" and the way you concatenate two sentences.

Re: splitting a sequence using unpack
by davis (Vicar) on Mar 01, 2005 at 11:46 UTC
    It's a little unclear what your question is, but if you're trying to split every 3 characters, what about:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    while(<DATA>) {
        chomp;
        my @triplets = $_ =~ /(.{1,3})/gs;
        print Dumper \@triplets;
    }
    __DATA__
    atgcatccctttaat
    tctgtctga
    This splits using a regular expression, matching between 1 and 3 characters per "chunk", and it'll endeavour to match as many characters per chunk as possible.
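The difference between matching exactly three characters and matching one to three only shows up when the length isn't a multiple of 3. A quick sketch, using an 8-character string that I made up for the purpose:

```perl
use strict;
use warnings;

my $seq = 'atgcatcc';   # 8 characters, so the last chunk is short

my @full_only = $seq =~ /(...)/g;      # complete triplets only
my @with_tail = $seq =~ /(.{1,3})/g;   # also keeps the leftover 'cc'

print "@full_only\n";   # atg cat
print "@with_tail\n";   # atg cat cc
```

Which one you want depends on whether a partial codon at the end means anything to you.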

    davis
    It wasn't easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
Re: splitting a sequence using unpack
by manav (Scribe) on Mar 01, 2005 at 12:02 UTC
    unpack accepts as its first argument a template.
    See perldoc -f pack on how to construct the template.

    For this example, "a3" means 3 ASCII characters. length($seq)/3 gives you how many such 3-character sets there are.

    Hence, this template is equivalent to "a3a3a3a3", repeated for as long as needed. This splits $seq into elements that are each 3 ASCII characters wide and returns them as a list.
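One detail worth seeing for yourself: in the original program the line is not chomped, so its length includes the newline and length($seq)/3 is not a whole number. As far as I know, x simply truncates a fractional count, which is why the program still works. A sketch, with the input line written out by hand:

```perl
use strict;
use warnings;

# A line as read from a file, before chomp: 15 bases plus a newline.
my $seq = "atgcatccctttaat\n";

my $count    = length($seq) / 3;   # 16 / 3 = 5.333...
my $template = 'a3' x $count;      # x truncates the count to 5

my @triplets = unpack($template, $seq);

print "template: $template\n";     # template: a3a3a3a3a3
print "triplets: @triplets\n";     # triplets: atg cat ccc ttt aat
```

The trailing newline is left over and silently ignored, since the template only asks for 15 characters.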

    Manav
Re: splitting a sequence using unpack
by demerphq (Chancellor) on Mar 01, 2005 at 12:13 UTC

    I think if you read the documentation for pack/unpack you'd know all you need to know (see perlfunc). But the brief version is: "a3" means extract the input as sequences of 3 ASCII chars (i.e. make a string out of the first three bytes). The 'x' operator is the element/string multiplier; in this context it repeats "a3" a certain number of times. So the overall effect is to divide your input string into as many three-character sequences as there are in the string (assuming the string's length is a multiple of 3). Try doing some one-liners to see what is going on:

    D:\Development>perl -e "print 'a3' x 10"
    a3a3a3a3a3a3a3a3a3a3

    This use of unpack is quite fast for the job it does, but it has the disadvantage that the string must be stored twice in memory, and the template string must also be created, which will only be slightly shorter than the input. If you only need to deal with one three-character sequence at a time, it may be better to do

    while ($seq =~ /\G(...)/sg) {
        print "Got: $1\n";
    }
    ---
    demerphq

      If it's necessary to have the triplets in an array, this should do it:
      my $seq = "atgcatccctttaat";
      my @triplets = $seq =~ /(...)/g;
      print "$seq: ", join(",", @triplets), "\n";

      And if the last "triplet" can have only one or two characters, just change the regex:

      my @triplets = $seq =~ /(.{1,3})/g;

      While I agree with demerphq that unpack should be faster than a regex, it's far more readable what's going on when you use regular expressions. Tune your code with unpack if the script is too slow.
      BTW, if you don't know what a function does, just type perldoc -f unpack ;-)
      ...it does but has a disadvantage that the string must be stored twice in memory...

      Would you explain this please?


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        The string is originally in $seq and is duplicated into the array @triplets. If you only want to print the triplets, it's unnecessary to hold all of them in memory; you can go through them one by one. Besides that, storing triplets in an array should IMHO use more memory than the string stored in a scalar.
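Going through the triplets one by one could also be done with plain substr, along these lines. This is a sketch, not a benchmark; the %count tally is only there to give the loop something to do.

```perl
use strict;
use warnings;

my $seq = 'atgcatccctttaat';
my %count;   # hypothetical tally of how often each triplet occurs

# Walk the string in steps of 3; only one triplet exists at a time,
# and an incomplete group at the end is never read.
for (my $pos = 0; $pos + 3 <= length($seq); $pos += 3) {
    my $codon = substr($seq, $pos, 3);
    $count{$codon}++;
}

print "$_: $count{$_}\n" for sort keys %count;
```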
Re: splitting a sequence using unpack
by insaniac (Friar) on Mar 01, 2005 at 12:16 UTC
    ever thought about perldoc -f pack or perldoc -f unpack.. there you will find explanations for the mysterious a3 and b3 (and why they give a different result!)
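For the curious, the a3 versus b3 difference can be seen in a few lines. This is a sketch using just the first codon of the example data: 'a' returns the bytes as a string, while 'b' returns bits (lowest bit first) as a string of 0s and 1s, which is where the weird output comes from.

```perl
use strict;
use warnings;

my $seq = 'atg';

# 'a3' takes three bytes and returns them as one string.
my ($chars) = unpack('a3', $seq);

# 'b3' reads the first byte and returns its three lowest bits,
# least significant bit first, as a string of 0s and 1s.
my ($bits) = unpack('b3', $seq);

# 'b8' shows the whole first byte: 'a' is 0x61, binary 01100001.
my ($byte) = unpack('b8', $seq);

print "a3: $chars\n";   # a3: atg
print "b3: $bits\n";    # b3: 100
print "b8: $byte\n";    # b8: 10000110
```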

    also: your message isn't that fun to read.. try creating more sentences with a dot at the end... if your message is enjoyable to read, more ppl will reply
    hth

    --
    to ask a question is a moment of shame
    to remain ignorant is a lifelong shame
      also: your message isn't that fun to read.. try creating more sentences with a dot at the end
      Yeah, pots, kettles, black. Your message contains ellipses where they are misplaced (and 2 out of three only have 2 instead of three dots), misses dots on places where one belongs, lacks capital letters, and uses incorrect abbreviations. You ought to clean up your own writing art before complaining about the acts of others.
        :-D
        I see I stepped on someone's toes :-D
        At least I had the guts to use my own name! :-p

        btw: what are ellipses? I know the mathematical ones...

        --
        to ask a question is a moment of shame
        to remain ignorant is a lifelong shame