Re: Split Function - Positions
by Enlil (Parson) on Jun 02, 2004 at 00:43 UTC
Perhaps something like this:
use strict;
use warnings;
use Data::Dumper;
my @genes;
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
while ( $screen =~ /([^X]+)/g ) {
my $start = pos($screen) - length($1);
push @genes, [$start, $1];
}
print Dumper \@genes;
-enlil
That can be written more simply as:
while ($screen =~ /([^X]+)/g) {
push @genes, [ $-[0], $1 ];
}
Re: Split Function - Positions
by dragonchild (Archbishop) on Jun 02, 2004 at 00:57 UTC
You don't want split. You want to use a regex.
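use strict;
# same sample string as the other replies, declared so the snippet compiles under strict
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
my @genes;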
while ($screen =~ /([^X]+)/g)
{
push @genes, $1;
print "$1 occurred at ", pos($screen) - length $1, $/;
}
The trick is that pos() returns the offset at which the matcher currently sits, that is, just past the end of the most recent match. So, to find out where the match started, you subtract the length of the match from the offset where it ended.
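As a minimal sketch of that arithmetic (the sample string is the one used elsewhere in this thread; the substr check is only an illustration added here, not part of the reply above):

```perl
use strict;
use warnings;

my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
my @starts;
while ( $screen =~ /([^X]+)/g ) {
    my $end   = pos($screen);         # offset just past the end of this match
    my $start = $end - length($1);    # so the match began this many chars in
    # sanity check: reading back from $start must reproduce the match
    die "offset mismatch at $start\n"
        unless substr( $screen, $start, length($1) ) eq $1;
    push @starts, $start;
}
print "@starts\n";    # prints: 0 13 22 38 45 52
```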
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
Re: Split Function - Positions
by kvale (Monsignor) on Jun 02, 2004 at 00:44 UTC
One way to do this is to capture the junk along with the substrings:
#! /usr/bin/perl -w
use strict;
my (@genes, @positions);
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
if ($screen =~ /X/) {
my @parts = split /(X+)/, $screen;
my $pos = 0;
foreach my $part (@parts) {
if ($part !~ /X/) {
push @genes, $part;
push @positions, $pos;
print "Gene $part is at position $pos\n";
}
$pos += length $part;
}
my $genecounter = @genes;
print "Number of components = $genecounter\n\n";
}
my @chunks;
my @start_pos;
my $pos = 0;
foreach my $chunk ( split /(X+)/, $string )
{
unless ( $chunk =~ /^X/ )
{
push @start_pos, $pos;
push @chunks, $chunk;
}
$pos += length $chunk;
}
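One edge case worth noting with the split approach: when the string begins with X, split returns an empty leading field, which would otherwise be recorded as a zero-length chunk at position 0. Below is a rough sketch that skips empty pieces. The helper name chunk_offsets and the test string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, chunk] pairs for every run of non-X chars.
sub chunk_offsets {
    my ($string) = @_;
    my @found;
    my $pos = 0;
    # The capturing parens make split return the X runs too,
    # so the running offset stays in step with the original string.
    for my $piece ( split /(X+)/, $string ) {
        push @found, [ $pos, $piece ]
            if length($piece) && $piece !~ /^X/;    # skip empty fields and X runs
        $pos += length($piece);
    }
    return @found;
}

my @found = chunk_offsets("XXATCGXXXGGTXT");
print join( " ", map { "$_->[0]:$_->[1]" } @found ), "\n";    # prints: 2:ATCG 9:GGT 13:T
```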
Re: Split Function - Positions
by meetraz (Hermit) on Jun 02, 2004 at 00:40 UTC
Will this do what you want?
use strict;
my @genes;
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
while ($screen =~ /(^|(?<=X))([^X]+)/g) {
print "Gene $2 found at position ", length($`), "\n";
}
Update: added $2 gene string.
Re: Split Function - Positions
by kesterkester (Hermit) on Jun 02, 2004 at 00:47 UTC
The code below might be good enough for your purposes.
It splits $screen into an array of single characters (using the split // idiom), then loops through the resulting array to find non-X characters that are either
- 1) at the start of the array, and therefore a starting point, or
- 2) just after an X character, and therefore a starting point.
#! /usr/bin/perl -w
use strict;
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
my @chars = split //, $screen;
my @starts;
foreach ( 0 .. $#chars ) {
push @starts, $_
if $chars[$_] ne 'X' && ( 0 == $_ || $chars[$_-1] eq 'X' );
}
print "$screen\n@starts\n";
Try checking out the bioperl modules, at http://search.cpan.org/dist/bioperl/, also. I'm not familiar with them, but they have a prepackaged routine for this.
Re: Split Function - Positions
by tkil (Monk) on Jun 02, 2004 at 05:03 UTC
Essentially what I wish to obtain is the starting position for each member of the @genes array within the original $screen variable.
I gave a solution using split in an earlier reply. I think I would prefer to solve this with a simple iterative regex, taking advantage of the @- array, which tracks the starting index of each sub-match:
my @genes;
my @gene_pos;
while ( $screen =~ m/([^X]+)/g )
{
push @genes, $1;
push @gene_pos, $-[1];
}
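As a companion sketch, @+ holds the matching end offsets (one past the last character of each sub-match), so the two arrays together give the full span of each gene. The @spans name and the printf layout are just for illustration:

```perl
use strict;
use warnings;

my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
my @spans;
while ( $screen =~ m/([^X]+)/g ) {
    # $-[1] is where capture group 1 starts; $+[1] is one past where it ends
    push @spans, [ $-[1], $+[1], $1 ];
}
# show each gene with its first and last index in $screen
printf "%2d..%2d %s\n", $_->[0], $_->[1] - 1, $_->[2] for @spans;
```

For this sample string the first line printed is " 0.. 7 ATCGATCG" and the last is "52..52 T".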
Re: Split Function - Positions
by BrowserUk (Patriarch) on Jun 02, 2004 at 01:09 UTC
Update: Ignore this! It is much slower.
This might work out a little faster if performance is an issue, which it usually is with genome-related stuff.
#! perl -slw
use strict;
my $re = '([^X]+)X*' . '(?:([^X]+)X*)?' x 100;
$re = qr[$re];
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT";
$screen =~ $re;
print for @-[ 1 .. $#- ];
__END__
P:\test>test2
0
13
22
38
45
52
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
push @posns, pos($screen) - length $1 while $screen =~ /([^X]+)/g;
is substantially faster than
push @posns, $-[ 0 ] while $screen =~ m[([^X]+)]g;
which surprised me, and I'm not sure why.
My best guess is that @- uses tie-style magic and is populated only when it is accessed, not when the regex runs. Perhaps the captures are stored as LVALUE refs, and @- and @+ are derived from those if and when they are asked for?
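For anyone who wants to repeat the comparison, here is a rough sketch using the core Benchmark module. The sub names and the x 100 repetition of the sample string are arbitrary choices made here, and the relative timings may well differ between perl versions and machines, so treat the output as a measurement on your own setup rather than a confirmation of the numbers above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Longer input so each iteration does a measurable amount of work (assumption).
my $screen = "ATCGATCGXXXXXATCGATXXXACTGCTACGGTACXXXAATTATXGCGCGXXT" x 100;

sub via_pos {
    my @posns;
    push @posns, pos($screen) - length $1 while $screen =~ /([^X]+)/g;
    return \@posns;
}

sub via_minus {
    my @posns;
    push @posns, $-[0] while $screen =~ /([^X]+)/g;
    return \@posns;
}

# Both approaches must agree before timing them means anything.
die "results differ\n" unless "@{ via_pos() }" eq "@{ via_minus() }";

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { pos_length => \&via_pos, at_minus => \&via_minus } );
```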