Split Function - Positions

MiamiGenome has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Split Function - Positions by Enlil (Parson) on Jun 02, 2004 at 00:43 UTC
Perhaps something like this: `use strict; use warnings; use Data::Dumper; my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT"; my @genes; while ( $screen =~ /([^X]+)/g ) { my $start = pos($screen) - length($1); push @genes, [$start, $1]; } print Dumper \@genes;` -enlil
Re: Re: Split Function - Positions by duff (Parson) on Jun 02, 2004 at 03:01 UTC
That can be written more simply as: `while ($screen =~ /([^X]+)/g) { push @genes, [ $-[0], $1 ]; }` duff
Re: Split Function - Positions by dragonchild (Archbishop) on Jun 02, 2004 at 00:57 UTC
You don't want split. You want to use a regex. `use strict; my @genes; while ($screen =~ /([^X]+)/g) { push @genes, $1; print "$1 occurred at ", pos($screen) - length $1, $/; }` The trick is that pos() returns the position that the matcher is currently at, after the current match occurred. So, if you want to find out where the match started, you subtract the length of the match from the place the match ended. ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose I shouldn't have to say this, but any code, unless otherwise stated, is untested
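The `pos()` arithmetic described above can be tried on the sample string from the other replies. This is only a sketch: `find_runs` is a hypothetical helper name that does not appear in the thread, and the input string is borrowed from the neighbouring posts.

```perl
use strict;
use warnings;

# Return [start, text] pairs for every run of non-X characters.
# find_runs is a hypothetical helper name, not something from the thread.
sub find_runs {
    my ($str) = @_;
    my @runs;
    while ( $str =~ /([^X]+)/g ) {
        my $end   = pos($str);           # offset just past the end of this match
        my $start = $end - length($1);   # start = end minus length of the match
        push @runs, [ $start, $1 ];
    }
    return @runs;
}

my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
printf "%-13s starts at %d\n", $_->[1], $_->[0] for find_runs($screen);
```

For this input the reported starts should be 0, 13, 22, 38, 45 and 52.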
Re: Split Function - Positions by kvale (Monsignor) on Jun 02, 2004 at 00:44 UTC
One way to do this is to capture the junk along with the substrings: `#! /usr/bin/perl -w use strict; my (@genes, @positions); my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT"; if ($screen =~ /X/) { my @parts = split /(X+)/, $screen; my $pos = 0; foreach my $part (@parts) { if ($part !~ /X/) { push @genes, $part; push @positions, $pos; print "Gene $part is at position $pos\n"; } $pos += length $part; } my $genecounter = @genes; print "Number of components = $genecounter\n\n"; }` -Mark
Re^2: Split Function - Positions by tkil (Monk) on Jun 02, 2004 at 04:51 UTC
"One way to do this is to capture the junk along with the substrings" This is a fine use for the special treatment of capturing parentheses in the first (regex) argument to `split`: `my @chunks; my @start_pos; my $pos = 0; foreach my $chunk ( split /(X+)/, $string ) { unless ( $chunk =~ /^X/ ) { push @start_pos, $pos; push @chunks, $chunk; } $pos += length $chunk; }`
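To see why the capturing parentheses matter: when the pattern given to `split` contains a capture group, the separators are returned in the list alongside the fields, so the piece lengths still add up to offsets in the original string. A small sketch follows. The helper name `runs_via_split` and the short input string are made up for illustration.

```perl
use strict;
use warnings;

# With a capture group in the pattern, split returns the separators too.
print join( "|", split /(X+)/, "ATCGXXATXT" ), "\n";    # prints ATCG|XX|AT|X|T

# runs_via_split is a hypothetical helper name, not something from the thread.
sub runs_via_split {
    my ($str) = @_;
    my @runs;
    my $pos = 0;
    for my $piece ( split /(X+)/, $str ) {
        # Skip the X separators and the empty leading field that split
        # produces when the string itself starts with X.
        push @runs, [ $pos, $piece ] if length($piece) && $piece !~ /^X/;
        $pos += length $piece;    # every piece, junk included, advances the offset
    }
    return @runs;
}

print "$_->[1] at $_->[0]\n" for runs_via_split("XXATXG");
```

The second call exercises a string that begins with X, where `split` yields an empty first field; the loop counts its zero length and moves on.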
Re: Split Function - Positions by meetraz (Hermit) on Jun 02, 2004 at 00:40 UTC
Will this do what you want? `` use strict; my @genes; my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT"; while ($screen =~ /(^|(?<=X))([^X]+)/g) { print "Gene $2 found at position ", length($`), "\n"; } `` Update: added $2 gene string.
Re: Split Function - Positions by kesterkester (Hermit) on Jun 02, 2004 at 00:47 UTC
The code below might be good enough for your purposes: it splits $screen into an array of single characters (using the split // idiom), then loops through the resulting array to find non-X characters that are either 1) at the start of the array, and therefore a starting point, or 2) just after an X character, and therefore a starting point. `#! /usr/bin/perl -w use strict; my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT"; my @chars = split //, $screen; my @starts; foreach ( 0 .. scalar @chars - 1 ) { push @starts, $_ if $chars[$_] ne 'X' && ( 0 == $_ || $chars[$_-1] eq 'X' ); } print "$screen\n@starts\n";` Also try checking out the bioperl modules at http://search.cpan.org/dist/bioperl/. I'm not familiar with them, but they may have a prepackaged routine for this.
Re^2: Split Function - Positions by stajich (Chaplain) on Jun 16, 2004 at 18:13 UTC
You might find Bio::Restriction::Analysis useful.
Re: Split Function - Positions by tkil (Monk) on Jun 02, 2004 at 05:03 UTC
"Essentially what I wish to obtain is the starting position for each member of the `@genes` array within the original `$screen` variable." I gave a solution using `split` in an earlier reply. I think I would prefer to solve this with a simple iterative regex, taking advantage of the `@-` array, which tracks the starting index of each sub-match: `my @genes; my @gene_pos; while ( $screen =~ m/([^X]+)/g ) { push @genes, $1; push @gene_pos, $-[1]; }`
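In the loop above `$-[0]` and `$-[1]` happen to be equal, because the capture group spans the whole match. The difference shows up once the pattern has text outside the group. A minimal sketch, using a made-up input string and a hypothetical helper named `offsets`:

```perl
use strict;
use warnings;

# Hypothetical input; the leading X in the pattern sits outside the capture group.
my $str = "ATXXGCG";

# offsets is a hypothetical helper name. It returns
# (whole-match start, group-1 start, group-1 end), or an empty list on no match.
sub offsets {
    my ($s) = @_;
    return unless $s =~ /X([^X]+)/;
    return ( $-[0], $-[1], $+[1] );
}

my ( $match_start, $group_start, $group_end ) = offsets($str);
print "match starts at $match_start, group 1 spans $group_start..$group_end\n";

# substr with the @- and @+ offsets recovers the same text as $1 did.
print "group text: ", substr( $str, $group_start, $group_end - $group_start ), "\n";
```

Here the whole match starts at 3 (the second X), while group 1 starts at 4 and ends at 7, so the recovered text is `GCG`.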
Re: Split Function - Positions by BrowserUk (Patriarch) on Jun 02, 2004 at 01:09 UTC
Update: Ignore this! It is much slower. This might work out a little faster if performance is an issue--which it usually is with genome-related stuff. `#! perl -slw use strict; my $re = '([^X]+)X' . '(?:([^X]+)X)?' x 100; $re = qr[$re]; my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT"; $screen =~ $re; print for @-[ 1 .. $#- ]; __END__ P:\test>test2 0 13 22 38 45 52` Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
Re: Re: Split Function - Positions by duff (Parson) on Jun 02, 2004 at 03:10 UTC
"It is much slower." Surely it must be so; look at the size of your RE! :-) duff
Re: Re: Re: Split Function - Positions by BrowserUk (Patriarch) on Jun 02, 2004 at 03:28 UTC
Agreed. Though I had thought that by grabbing the matches using a standard `m[([^X]+)]g` first, I would know how big to make the big re. Then a second pass would populate `@-`. As it turns out, `push @posns, pos($screen) - length $1 while $screen =~ /([^X]+)/g;` is substantially faster than `push @posns, $-[ 0 ] while $screen =~ m[([^X]+)]g;` which surprised me. I'm not sure why that would be. My best guess is that `@-` uses tie-style magic, and isn't populated unless it is accessed rather than when the regex runs? Perhaps the captures are made in the form of LVALUE refs and `@-` and `@+` are derived from those if and when they are called for? Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
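The timing claim above is easy to re-check with the core `Benchmark` module. The sketch below is not the original poster's test harness: the sub names `via_pos` and `via_minus` and the repeated input string are assumptions made for illustration, and results will likely differ between perl versions and machines.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test input: the thread's sample string repeated to make it longer.
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT" x 1000;

# Start offsets computed from pos() minus the length of the match.
sub via_pos {
    my @posns;
    push @posns, pos($screen) - length $1 while $screen =~ /([^X]+)/g;
    return \@posns;
}

# Start offsets read directly from the @- array.
sub via_minus {
    my @posns;
    push @posns, $-[0] while $screen =~ /([^X]+)/g;
    return \@posns;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, { pos_length => \&via_pos, at_minus => \&via_minus } );
```

Both subs should return the same list of offsets, so only the speed differs between them.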