neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
my $str_type1 = "ccaatTTTGACACACACAGAAgggca"; # no dash
my $str_type2 = "--aatTTTGACACACACAGAAgggca"; # with dash
# note: number of dash(es) can be >= 0
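How can we write one universal function that identifies the start and end positions of the uppercase substring in both kinds of string above?
Positions are counted from the first 'atcgn' character, so leading dashes are not counted. So, for example, the uppercase substring runs from 6 to 21 in $str_type1 and from 4 to 19 in $str_type2.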

---
neversaint and everlastingly indebted.......

Re: Finding Start/End Position of the Uppercase Substring
by BrowserUk (Patriarch) on Jun 24, 2007 at 18:42 UTC

    Updated.

    Needs a better name and more tests:

    #! perl -slw
    use strict;

    sub findUC {
        if( $_[0] =~ m[^(-*)[^A-Z]*([A-Z]+)] ) {
            return ( $-[2] - $+[1] + 1, $+[2] - $+[1] );
        }
        return ( 0, 0 );
    }

    my $str_type1 = "ccaatTTTGACACACACAGAAgggca"; # no dash
    my $str_type2 = "--aatTTTGACACACACAGAAgggca"; # with dash

    printf "%s start:%d end %d\n", $_, findUC( $_ ) for
        $str_type1,
        $str_type2,
        'ctcgttccgaatagacgaatatgcgat',
        '--tcgcgaataggaactatacgatacgatac',
        'CGCTAGTCACACTTTACGGACCAacac',
        '--GTACTATTACGAGCTATCTAGATActag';

    __END__
    c:\test>junk4
    ccaatTTTGACACACACAGAAgggca start:6 end 21
    --aatTTTGACACACACAGAAgggca start:4 end 19
    ctcgttccgaatagacgaatatgcgat start:0 end 0
    --tcgcgaataggaactatacgatacgatac start:0 end 0
    CGCTAGTCACACTTTACGGACCAacac start:1 end 23
    --GTACTATTACGAGCTATCTAGATActag start:1 end 24
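
Since the post asks for more tests, here is a minimal sketch of how the function could be exercised with the core Test::More module. The sub is copied unchanged from the post; the expected values are the ones from the output shown above, and the table layout is only an illustration, not part of the original reply.

```perl
use strict;
use warnings;
use Test::More;

# findUC as posted: returns the 1-based (start, end) of the first
# uppercase run, counted from the first non-dash character, or (0, 0).
sub findUC {
    if ( $_[0] =~ m[^(-*)[^A-Z]*([A-Z]+)] ) {
        return ( $-[2] - $+[1] + 1, $+[2] - $+[1] );
    }
    return ( 0, 0 );
}

# Each case is [ input string, [ expected start, expected end ] ],
# with the expected values taken from the posted output.
my @cases = (
    [ 'ccaatTTTGACACACACAGAAgggca',      [ 6, 21 ] ],
    [ '--aatTTTGACACACACAGAAgggca',      [ 4, 19 ] ],
    [ 'ctcgttccgaatagacgaatatgcgat',     [ 0, 0 ] ],
    [ '--tcgcgaataggaactatacgatacgatac', [ 0, 0 ] ],
    [ 'CGCTAGTCACACTTTACGGACCAacac',     [ 1, 23 ] ],
    [ '--GTACTATTACGAGCTATCTAGATActag',  [ 1, 24 ] ],
);

for my $case (@cases) {
    my ( $input, $expected ) = @$case;
    is_deeply( [ findUC($input) ], $expected, "findUC('$input')" );
}

done_testing();
```

Further cases worth adding might be an empty string or a string of dashes only; those are left out here because the thread does not say what the expected result should be.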

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding Start/End Position of the Uppercase Substring
by FunkyMonk (Bishop) on Jun 24, 2007 at 18:50 UTC
    This uses your convention of the first letter in the string being 1, rather than the Perl way of 0. If you want to do anything in Perl with these numbers, you need to subtract one from both the start and end offsets.

    #                         1         2         3
    #                123456789012345678901234567890
    my $str_type1 = "ccaatTTTGACACACACAGAAgggca"; # no dash
    my $str_type2 = "--aatTTTGACACACACAGAAgggca"; # with dash

    for ( $str_type1, $str_type2 ) {
        if ( m{^(-*)[^A-Z]*([A-Z]*)} ) {
            my ( $s, $e ) = ( $-[2] + 1, $+[2] );
            $_ -= length $1 for $s, $e;
            print "$_\n";
            print "From = $s to $e\n\n";
        }
    }

    #output:
    ccaatTTTGACACACACAGAAgggca
    From = 6 to 21

    --aatTTTGACACACACAGAAgggca
    From = 4 to 19
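
To make the conversion concrete, here is a minimal sketch of turning the 1-based positions back into arguments for substr. The helper name uc_substr is hypothetical. It assumes, as in the code above, that positions are counted from the first non-dash character, so the number of leading dashes has to be added back to get the real offset.

```perl
use strict;
use warnings;

# Hypothetical helper: given a string and the 1-based, dash-relative
# start/end positions, return the substring they describe.
sub uc_substr {
    my ( $str, $start, $end ) = @_;

    # Capture leading dashes only; this match always succeeds.
    my ($dashes) = $str =~ m{^(-*)};

    # Subtract one to get a 0-based offset, then add the dashes back.
    my $offset = length($dashes) + $start - 1;
    my $length = $end - $start + 1;

    return substr( $str, $offset, $length );
}

print uc_substr( 'ccaatTTTGACACACACAGAAgggca', 6, 21 ), "\n";
print uc_substr( '--aatTTTGACACACACAGAAgggca', 4, 19 ), "\n";
```

Both calls print TTTGACACACACAGAA.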

Re: Finding Start/End Position of the Uppercase Substring
by johngg (Canon) on Jun 24, 2007 at 22:03 UTC
    To cope with the variable leading hyphens and the counting from 1 rather than 0, I decided to substitute zero or more hyphens at the beginning of the string with a single underscore to get the position as the OP wanted. I also used look-arounds and regex code blocks. This caused me problems until I realised that the code blocks had created closures around $str, $startPos and $endPos when they were lexical. Declaring them with local our got things working.

    use strict;
    use warnings;

    my @strings = qw{
        ccaatTTTGACACACACAGAAgggca
        --aatTTTGACACACACAGAAgggca
        --aatTTTGACACACACAGAA
        ---aagctaagattca
        TTTGACACACACAGAAgggca
        ---TTTGACACACACAGAAgggca
    };

    foreach my $string ( @strings )
    {
        my ($sp, $ep) = ucRange($string);
        print
            qq{ String - $string\n},
            qq{Start position - $sp\n},
            qq{ End position - $ep\n\n};
    }

    sub ucRange
    {
        local our $str = shift;
        $str =~ s{\A-*}{_};

        local our $startPos = 0;
        $str =~ m{(?<=[a-z_])(?=[A-Z])(?{$startPos = pos $str})};

        local our $endPos = 0;
        $str =~ m{(?<=[A-Z])(?=[A-Z](?:[a-z]|\z))(?{$endPos = pos $str})};

        return ($startPos, $endPos);
    }

    The output.

     String - ccaatTTTGACACACACAGAAgggca
    Start position - 6
     End position - 21

     String - --aatTTTGACACACACAGAAgggca
    Start position - 4
     End position - 19

     String - --aatTTTGACACACACAGAA
    Start position - 4
     End position - 19

     String - ---aagctaagattca
    Start position - 0
     End position - 0

     String - TTTGACACACACAGAAgggca
    Start position - 1
     End position - 16

     String - ---TTTGACACACACAGAAgggca
    Start position - 1
     End position - 16
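
As a possible alternative that sidesteps the closure issue, the offsets of a capture group can be read from the built-in @- and @+ arrays after the match, so no regex code block is needed. The following is a sketch under that assumption, not a drop-in replacement. The name uc_range is hypothetical, and because it captures the whole uppercase run in one group, it also reports a run consisting of a single uppercase letter.

```perl
use strict;
use warnings;

# Hypothetical variant of ucRange without (?{ ... }) code blocks.
sub uc_range {
    my $str = shift;

    # Same trick as in the post: replace the leading hyphens (if any) with
    # one underscore so that offsets in $str are already 1-based positions.
    $str =~ s{\A-*}{_};

    # First uppercase run that follows a lowercase letter or the underscore.
    # $-[1] is the offset where group 1 starts, $+[1] the offset just past it.
    if ( $str =~ m{(?<=[a-z_])([A-Z]+)} ) {
        return ( $-[1], $+[1] - 1 );
    }
    return ( 0, 0 );
}

for my $string ( qw{ ccaatTTTGACACACACAGAAgggca --aatTTTGACACACACAGAA ---aagctaagattca } ) {
    my ( $sp, $ep ) = uc_range($string);
    print "$string: $sp to $ep\n";
}
```

Plain lexical variables are enough here because nothing inside the pattern refers to them.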

    I hope this is of interest.

    Cheers,

    JohnGG

    Update: Added string with no uppercase to check that script handled that.

Re: Finding Start/End Position of the Uppercase Substring
by shigetsu (Hermit) on Jun 24, 2007 at 19:37 UTC
    Using pos to extract the character positions and the corresponding actual offsets (useful for substr), it could look as outlined below. I'm not saying, however, that this 'solution' is better; it is just another possible approach:
    use strict;
    use warnings;

    my @strings = do { local $/; split /\n/, <DATA> };

    foreach my $str (@strings) {
        my $ret = offset($str);
        my $substring = substr($ret->[0], $ret->[2][0], $ret->[2][1]);
        print <<"EOT";
    $substring
    start character: $ret->[1][0]
    end character: $ret->[1][1]
    start offset: $ret->[2][0]
    end offset: $ret->[2][1]
    EOT
    }

    sub offset {
        my $str = shift;

        my $hyphens = 0;
        $hyphens++ while $str =~ /-/g;

        $str =~ /[A-Z]/g and my $pos_start = pos($str);
        $str =~ /[a-z]/g and my $pos_end = pos($str);

        return [ $str, [ ($pos_start - $hyphens), ($pos_end - $hyphens) - 1 ],
                       [ $pos_start - 1, ($pos_end - $pos_start) - 1 ] ];
    }

    __DATA__
    ccaatTTTGACACACACAGAAgggca
    --aatTTTGACACACACAGAAgggca
    outputs
    TTTGACACACACAGA
    start character: 6
    end character: 21
    start offset: 5
    end offset: 15
    TTTGACACACACAGA
    start character: 4
    end character: 19
    start offset: 5
    end offset: 15
    Update: fix formatting.
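
Note that the substring printed above has 15 characters, one fewer than the 16 uppercase characters in the input, because the length handed to substr is computed as ($pos_end - $pos_start) - 1. As a hedged sketch of one way to get an offset and length that substr can use directly, the start and end of the whole match can be read from $-[0] and $+[0]. The helper name uc_offset_length is hypothetical and not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [ offset, length ] of the first uppercase
# run, suitable for substr, or undef if there is no uppercase run.
sub uc_offset_length {
    my $str = shift;
    return unless $str =~ /[A-Z]+/;

    # $-[0] is the offset where the match starts, $+[0] the offset just
    # past its end, so their difference is the length of the run.
    return [ $-[0], $+[0] - $-[0] ];
}

for my $str ( 'ccaatTTTGACACACACAGAAgggca', '--aatTTTGACACACACAGAAgggca' ) {
    my $ol = uc_offset_length($str) or next;
    print substr( $str, $ol->[0], $ol->[1] ), "\n";
}
```

Both strings print TTTGACACACACAGAA. This also works when the uppercase run reaches the end of the string, since it does not rely on finding a following lowercase letter.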
      my $hyphens = 0;
      $hyphens++ while $str =~ /-/g;

      Although the data given seems clean in this regard, your code will give wrong results if there are any hyphens in the string other than leading ones. Capturing zero or more hyphens at the beginning of the string and finding the length of the capture might be safer.

      my $hyphens = length $1 if $str =~ m{\A(-*)};

      The match will always succeed so if there are no leading hyphens the length of the capture will be zero.

      $ perl -Mstrict -Mwarnings -le '
      > my @strings = qw{--aacgtACG ctgGTTAtga};
      > foreach my $str ( @strings )
      > {
      >     my $hyphens = length $1 if $str =~ m{\A(-*)};
      >     print qq{$str - $hyphens};
      > }'
      --aacgtACG - 2
      ctgGTTAtga - 0
      $
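
One caveat on the idiom above: perlsyn documents the behaviour of a my declaration combined with a statement modifier (my $x = ... if ...) as undefined. It works here because the match always succeeds, but a sketch that avoids the construct, under the same assumption that only leading hyphens count, could look like this. The name leading_hyphens is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: number of hyphens at the very start of the string.
sub leading_hyphens {
    my $str = shift;

    # In list context the match returns its captures; \A(-*) always
    # matches, so $run is the (possibly empty) run of leading hyphens.
    my ($run) = $str =~ m{\A(-*)};

    return length $run;
}

print "$_ - ", leading_hyphens($_), "\n" for qw{ --aacgtACG ctgGTTAtga ac-gtACG };
```

The third string has a hyphen in the middle, which is not counted.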

      Cheers,

      JohnGG

Re: Finding Start/End Position of the Uppercase Substring
by ysth (Canon) on Jun 24, 2007 at 19:21 UTC
    I'd like to see what you've already tried and an example of how this function would be called.