MiamiGenome has asked for the wisdom of the Perl Monks concerning the following question:

Greetings PMs,

I have a series of strings which I wish to cleave on a given identifier ('X') so I am using the split function.
However, I cannot determine how to get the positions of the extracted substrings:

#! /usr/bin/perl -w
use strict;

my @genes;
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";

if ($screen =~ /X/) {
    @genes = split /X+/, $screen;
    my $genecounter = @genes;
    print "Number of components = $genecounter\n\n";
}

Essentially what I wish to obtain is the starting position for each member of the @genes array within the original $screen variable.

Sometimes the members of the @genes array are identical to each other, so I'm not sure how to use the substr function.
Many thanks for any tips!

Re: Split Function - Positions
by Enlil (Parson) on Jun 02, 2004 at 00:43 UTC
    Perhaps something like this:
    use strict;
    use warnings;
    use Data::Dumper;

    my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
    my @genes;

    while ( $screen =~ /([^X]+)/g ) {
        # pos() is the offset just past the match, so subtract its length
        my $start = pos($screen) - length($1);
        push @genes, [$start, $1];
    }

    print Dumper \@genes;

    -enlil

      That can be written more simply as:

      while ($screen =~ /([^X]+)/g) {
          push @genes, [ $-[0], $1 ];
      }

Re: Split Function - Positions
by dragonchild (Archbishop) on Jun 02, 2004 at 00:57 UTC
    You don't want split. You want to use a regex.
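    use strict;

    my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
    my @genes;

    while ($screen =~ /([^X]+)/g) {
        push @genes, $1;
        # start of the match = end of the match - length of the match
        print "$1 occurred at ", pos($screen) - length($1), $/;
    }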

    The trick is that pos() returns the position that the matcher is currently at, after the current match occurred. So, if you want to find out where the match started, you subtract the length of the match from the place the match ended.
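    As a minimal sketch of that arithmetic, the snippet below runs the same loop over the sample string from the question and records the start of each match two ways: once as pos() minus the match length, and once from $-[0]. The array names @starts and @check are illustrative only and do not come from any reply in this thread.

```perl
use strict;
use warnings;

my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";

my (@starts, @check);   # illustrative names, not from the thread
while ( $screen =~ /([^X]+)/g ) {
    # pos() is the offset just past the end of the current match
    push @starts, pos($screen) - length($1);
    # $-[0] is the offset at which the whole match began
    push @check, $-[0];
}

print "@starts\n";   # 0 13 22 38 45 52
print "@check\n";    # 0 13 22 38 45 52
```

    Both lines print the same offsets, which is why either form can be used to locate a gene.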

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Split Function - Positions
by kvale (Monsignor) on Jun 02, 2004 at 00:44 UTC
    One way to do this is to capture the junk along with the substrings:
    #! /usr/bin/perl -w
    use strict;

    my (@genes, @positions);
    my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";

    if ($screen =~ /X/) {
        # The capturing parentheses make split return the X runs as well
        my @parts = split /(X+)/, $screen;
        my $pos = 0;
        foreach my $part (@parts) {
            if ($part !~ /X/) {
                push @genes, $part;
                push @positions, $pos;
                print "Gene $part is at position $pos\n";
            }
            # Advance past this part, whether it was a gene or a separator
            $pos += length $part;
        }
        my $genecounter = @genes;
        print "Number of components = $genecounter\n\n";
    }

    -Mark

      One way to do this is to capture the junk along with the substrings

      This is a fine use for the special treatment of capturing parentheses in the first (regex) argument to split:

      my @chunks;
      my @start_pos;
      my $pos = 0;

      foreach my $chunk ( split /(X+)/, $string ) {
          unless ( $chunk =~ /^X/ ) {
              push @start_pos, $pos;
              push @chunks, $chunk;
          }
          $pos += length $chunk;
      }
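
      To make the effect of the capturing parentheses concrete, here is a small sketch comparing the two forms of split. The input string is a short made-up example, not the data from the question, and the array names are illustrative.

```perl
use strict;
use warnings;

my $string = "ATCGXXGGXT";   # short made-up input for illustration

# Without capturing parentheses the separators are discarded.
my @plain = split /X+/, $string;

# With capturing parentheses the separators come back as fields too,
# so the field lengths add up to the length of the input.
my @with_seps = split /(X+)/, $string;

print join('|', @plain), "\n";       # ATCG|GG|T
print join('|', @with_seps), "\n";   # ATCG|XX|GG|X|T
```

      Because nothing is thrown away in the second form, a running total of the field lengths gives the offset of each field in the original string.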
Re: Split Function - Positions
by meetraz (Hermit) on Jun 02, 2004 at 00:40 UTC
    Will this do what you want?

    use strict;

    my @genes;
    my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";

    while ($screen =~ /(^|(?<=X))([^X]+)/g) {
        # $` holds everything before the match, so its length is the start offset
        print "Gene $2 found at position ", length($`), "\n";
    }

    Update: added $2 gene string.

Re: Split Function - Positions
by kesterkester (Hermit) on Jun 02, 2004 at 00:47 UTC
    The code below might be good enough for your purposes.
    It splits $screen into an array of single characters (using the split // idiom), then loops through the resulting array to find non-X characters that are either
    • 1) at the start of the array, and therefore a starting point, or
    • 2) just after an X character, and therefore a starting point.

    #! /usr/bin/perl -w
    use strict;

    my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
    my @chars = split //, $screen;
    my @starts;

    foreach ( 0 .. $#chars ) {
        # A start is a non-X character at index 0 or directly after an X
        push @starts, $_
            if $chars[$_] ne 'X' && ( 0 == $_ || $chars[$_-1] eq 'X' );
    }

    print "$screen\n@starts\n";

    Also try checking out the bioperl modules at http://search.cpan.org/dist/bioperl/. I'm not familiar with them, but they may well have a prepackaged routine for this.

Re: Split Function - Positions
by tkil (Monk) on Jun 02, 2004 at 05:03 UTC
    Essentially what I wish to obtain is the starting position for each member of the @genes array within the original $screen variable.

    I gave a solution using split in an earlier reply. I think I would prefer to solve this with a simple iterative regex, taking advantage of the @- array, which tracks the starting index of each sub-match:

    my @genes;
    my @gene_pos;

    while ( $screen =~ m/([^X]+)/g ) {
        push @genes, $1;
        push @gene_pos, $-[1];
    }
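
    As a sketch of how those offsets can be used afterwards, the snippet below also records $+[1] (the offset just past the end of group 1) and then recovers each gene from the original string with substr. The @spans array and its [start, end, text] layout are illustrative choices, not part of the reply above. Since each gene is identified by its offsets, identical genes at different positions stay distinct.

```perl
use strict;
use warnings;

my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";

my @spans;   # illustrative: one [start, end, text] triple per gene
while ( $screen =~ m/([^X]+)/g ) {
    # $-[1] is where group 1 starts, $+[1] is just past where it ends
    push @spans, [ $-[1], $+[1], $1 ];
}

# substr with (start, length) recovers each gene from the original string
for my $span (@spans) {
    my ($start, $end, $text) = @$span;
    my $again = substr $screen, $start, $end - $start;
    print "$start-$end $text\n" if $again eq $text;
}
```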
Re: Split Function - Positions
by BrowserUk (Patriarch) on Jun 02, 2004 at 01:09 UTC

    Update: Ignore this! It is much slower.

    This might work out a little faster if performance is an issue, which it usually is with genome-related stuff.

    #! perl -slw
    use strict;

    # One mandatory capture followed by 100 optional ones, all in a single regex
    my $re = '([^X]+)X*' . '(?:([^X]+)X*)?' x 100;
    $re = qr[$re];

    my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";

    $screen =~ $re;
    print for @-[ 1 .. $#- ];

    __END__
    P:\test>test2
    0
    13
    22
    38
    45
    52

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      It is much slower.

      Surely it must be so; look at the size of your RE! :-)

        Agreed. Though I had thought that by grabbing the matches using a standard m[([^X]+)]g first, I would know how big to make the big re. Then a second pass would populate @-.

        As it turns out,

        push @posns, pos($screen) - length $1 while $screen =~ /([^X]+)/g;

        is substantially faster than

        push @posns, $-[ 0 ] while $screen =~ m[([^X]+)]g;

        which surprised me. I'm not sure why that would be?

        My best guess is that @- uses tie-style magic, and isn't populated unless it is accessed rather than when the regex runs? Perhaps the captures are made in the form of LVALUE refs and @- and @+ are derived from those if and when they are called for?
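
        For anyone who wants to repeat the comparison, here is a rough sketch using the core Benchmark module. The sub names via_pos and via_minus, the labels passed to cmpthese, and the choice to repeat the sample string 100 times are all assumptions made for illustration; the timings will vary with Perl version and machine, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample string from the question, repeated to give the loops more work
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT" x 100;

# Both subs collect gene start offsets; only the bookkeeping differs.
sub via_pos {
    my @posns;
    push @posns, pos($screen) - length $1 while $screen =~ /([^X]+)/g;
    return @posns;
}

sub via_minus {
    my @posns;
    push @posns, $-[0] while $screen =~ /([^X]+)/g;
    return @posns;
}

# Run each variant for about one CPU second and print a comparison table.
cmpthese( -1, { pos_minus_length => \&via_pos, at_minus => \&via_minus } );
```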

