Finding Start/End Position of the Uppercase Substring

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding Start/End Position of the Uppercase Substring by BrowserUk (Patriarch) on Jun 24, 2007 at 18:42 UTC
Updated. Needs a better name and more tests:

    #! perl -slw
    use strict;

    sub findUC {
        # $1 = leading dashes (possibly none), $2 = the first uppercase run
        if( $_[0] =~ m[^(-*)[^A-Z]*([A-Z]+)] ) {
            # 1-based start and end, not counting the leading dashes
            return ( $-[2] - $+[1] + 1, $+[2] - $+[1] );
        }
        return ( 0, 0 );
    }

    my $str_type1 = "ccaatTTTGACACACACAGAAgggca";  # no dash
    my $str_type2 = "--aatTTTGACACACACAGAAgggca";  # with dash

    printf "%s start:%d end %d\n", $_, findUC( $_ ) for
        $str_type1, $str_type2,
        'ctcgttccgaatagacgaatatgcgat',
        '--tcgcgaataggaactatacgatacgatac',
        'CGCTAGTCACACTTTACGGACCAacac',
        '--GTACTATTACGAGCTATCTAGATActag';

    __END__
    c:\test>junk4
    ccaatTTTGACACACACAGAAgggca start:6 end 21
    --aatTTTGACACACACAGAAgggca start:4 end 19
    ctcgttccgaatagacgaatatgcgat start:0 end 0
    --tcgcgaataggaactatacgatacgatac start:0 end 0
    CGCTAGTCACACTTTACGGACCAacac start:1 end 23
    --GTACTATTACGAGCTATCTAGATActag start:1 end 24
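For readers unfamiliar with the `@-` and `@+` arrays that `findUC` relies on, here is a minimal sketch of what they hold after a match. The helper name `group_offsets` is made up for illustration and is not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw 0-based offsets of both capture groups,
# or an empty list if the string contains no uppercase run.
sub group_offsets {
    my ($str) = @_;
    return unless $str =~ m{^(-*)[^A-Z]*([A-Z]+)};
    # $-[N] is the offset where group N starts,
    # $+[N] is the offset just past where group N ends.
    return ( $-[1], $+[1], $-[2], $+[2] );
}

my ( $dash_start, $dash_end, $uc_start, $uc_end )
    = group_offsets("--aatTTTGACACACACAGAAgggca");

print "leading dashes span offsets $dash_start..$dash_end\n";
print "uppercase run spans offsets $uc_start..$uc_end\n";

# The same arithmetic as findUC: shift by the dash count, start counting at 1.
printf "start:%d end %d\n", $uc_start - $dash_end + 1, $uc_end - $dash_end;
```

Since `$+[1]` is the offset just past the dashes, it doubles as the dash count, which is why it can be subtracted directly.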
Re: Finding Start/End Position of the Uppercase Substring by FunkyMonk (Bishop) on Jun 24, 2007 at 18:50 UTC
This uses your convention of the first letter in the string being position 1, rather than the Perl way of 0. If you want to do anything in Perl with these numbers, you need to subtract one from both the start and end offsets.

    #                         1         2         3
    #                123456789012345678901234567890
    my $str_type1 = "ccaatTTTGACACACACAGAAgggca";  # no dash
    my $str_type2 = "--aatTTTGACACACACAGAAgggca";  # with dash

    for ( $str_type1, $str_type2 ) {
        if ( m{^(-*)[^A-Z]*([A-Z]*)} ) {
            my ( $s, $e ) = ( $-[2] + 1, $+[2] );
            $_ -= length $1 for $s, $e;   # discount the leading dashes
            print "$_\n";
            print "From = $s to $e\n\n";
        }
    }

    #output:
    ccaatTTTGACACACACAGAAgggca
    From = 6 to 21

    --aatTTTGACACACACAGAAgggca
    From = 4 to 19
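As a small illustration of the subtract-one point, here is a sketch that turns the 1-based, dash-adjusted positions back into arguments for `substr`. The helper name `uc_substr` is hypothetical, and it assumes the string passed in still carries its leading dashes.

```perl
use strict;
use warnings;

# Hypothetical helper: given the string and the 1-based start/end positions
# (counted after any leading dashes), return that slice via substr.
sub uc_substr {
    my ( $str, $start, $end ) = @_;
    my ($dashes) = $str =~ m{\A(-*)};             # leading dashes, possibly empty
    my $offset = length($dashes) + $start - 1;    # back to a 0-based offset
    my $len    = $end - $start + 1;               # inclusive range to a length
    return substr $str, $offset, $len;
}

print uc_substr( "ccaatTTTGACACACACAGAAgggca", 6, 21 ), "\n";
print uc_substr( "--aatTTTGACACACACAGAAgggca", 4, 19 ), "\n";
```

Both calls should print the same uppercase run, since the two strings differ only in their first two characters.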
Re: Finding Start/End Position of the Uppercase Substring by johngg (Canon) on Jun 24, 2007 at 22:03 UTC
To cope with the variable leading hyphens and the counting from 1 rather than 0, I decided to substitute zero or more hyphens at the beginning of the string with a single underscore to get the position as the OP wanted. I also used look-arounds and regex code blocks. This caused me problems until I realised that the code blocks had created closures around `$str`, `$startPos` and `$endPos` when they were lexical. Declaring them with `local our` got things working.

    use strict;
    use warnings;

    my @strings = qw{
        ccaatTTTGACACACACAGAAgggca
        --aatTTTGACACACACAGAAgggca
        --aatTTTGACACACACAGAA
        ---aagctaagattca
        TTTGACACACACAGAAgggca
        ---TTTGACACACACAGAAgggca
    };

    foreach my $string ( @strings )
    {
        my ($sp, $ep) = ucRange($string);
        print
            qq{        String - $string\n},
            qq{Start position - $sp\n},
            qq{  End position - $ep\n\n};
    }

    sub ucRange
    {
        local our $str = shift;
        $str =~ s{\A-*}{_};   # leading hyphens (if any) become one underscore

        local our $startPos = 0;
        $str =~ m{(?<=[a-z_])(?=[A-Z])(?{$startPos = pos $str})};

        local our $endPos = 0;
        $str =~ m{(?<=[A-Z])(?=[A-Z](?:[a-z]|\z))(?{$endPos = pos $str})};

        return ($startPos, $endPos);
    }

The output.

            String - ccaatTTTGACACACACAGAAgggca
    Start position - 6
      End position - 21

            String - --aatTTTGACACACACAGAAgggca
    Start position - 4
      End position - 19

            String - --aatTTTGACACACACAGAA
    Start position - 4
      End position - 19

            String - ---aagctaagattca
    Start position - 0
      End position - 0

            String - TTTGACACACACAGAAgggca
    Start position - 1
      End position - 16

            String - ---TTTGACACACACAGAAgggca
    Start position - 1
      End position - 16

I hope this is of interest.

Cheers,
JohnGG

Update: Added a string with no uppercase to check that the script handles that case.
Re: Finding Start/End Position of the Uppercase Substring by shigetsu (Hermit) on Jun 24, 2007 at 19:37 UTC
Using pos to extract the character positions and the corresponding actual offsets (useful for substr), it could look as outlined below. I'm not saying, however, that this 'solution' is better; it is just another possible approach:

    use strict;
    use warnings;

    my @strings = do { local $/; split /\n/, <DATA> };

    foreach my $str (@strings) {
        my $ret = offset($str);
        my $substring = substr($ret->[0], $ret->[2][0], $ret->[2][1]);
        print <<"EOT";
    $substring
    start character: $ret->[1][0]
    end character: $ret->[1][1]
    start offset: $ret->[2][0]
    end offset: $ret->[2][1]
    EOT
    }

    sub offset {
        my $str = shift;

        my $hyphens = 0;
        $hyphens++ while $str =~ /-/g;

        $str =~ /[A-Z]/g and my $pos_start = pos($str);
        $str =~ /[a-z]/g and my $pos_end   = pos($str);

        return [ $str,
                 [ ($pos_start - $hyphens), ($pos_end - $hyphens) - 1 ],
                 [ $pos_start - 1, ($pos_end - $pos_start) - 1 ] ];
    }

    __DATA__
    ccaatTTTGACACACACAGAAgggca
    --aatTTTGACACACACAGAAgggca

outputs

    TTTGACACACACAGA
    start character: 6
    end character: 21
    start offset: 5
    end offset: 15
    TTTGACACACACAGA
    start character: 4
    end character: 19
    start offset: 5
    end offset: 15

Update: fix formatting.
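A possible variant of the pos idea, sketched under the assumption that only the first uppercase run matters: match the whole run in one go with `/g` in scalar context, so that pos lands just past it, then derive the `substr` offset and length from that. The name `uc_offset_length` is hypothetical and this is not the code from the post above.

```perl
use strict;
use warnings;

# Hypothetical variant: return (offset, length) suitable for substr,
# or an empty list when the string has no uppercase run.
sub uc_offset_length {
    my ($str) = @_;
    # In scalar (boolean) context, /g leaves pos($str) just past the match.
    return unless $str =~ /([A-Z]+)/g;
    my $len = length $1;
    return ( pos($str) - $len, $len );
}

for my $str ( "ccaatTTTGACACACACAGAAgggca", "--aatTTTGACACACACAGAAgggca" ) {
    my ( $offset, $len ) = uc_offset_length($str) or next;
    print substr( $str, $offset, $len ), " offset $offset length $len\n";
}
```

Matching the run as a whole avoids a second search for the next lowercase letter, so a string that ends in uppercase needs no special handling.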
Re^2: Finding Start/End Position of the Uppercase Substring by johngg (Canon) on Jun 25, 2007 at 09:49 UTC
`my $hyphens = 0; $hyphens++ while $str =~ /-/g;`

Although the data given seems clean in this regard, your code will give wrong results if there are any hyphens in the string other than leading ones. Capturing zero or more hyphens at the beginning of the string and finding the length of the capture might be safer.

`my $hyphens = length $1 if $str =~ m{\A(-*)};`

The match will always succeed, so if there are no leading hyphens the length of the capture will be zero.

    $ perl -Mstrict -Mwarnings -le '
    > my @strings = qw{--aacgtACG ctgGTTAtga};
    > foreach my $str ( @strings )
    > {
    >     my $hyphens = length $1 if $str =~ m{\A(-*)};
    >     print qq{$str - $hyphens};
    > }'
    --aacgtACG - 2
    ctgGTTAtga - 0
    $

Cheers,
JohnGG
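To make the difference concrete, here is a minimal sketch comparing the two counts. Both helper names are hypothetical, and the string with an interior hyphen is invented for illustration; nothing like it appears in the OP's data.

```perl
use strict;
use warnings;

# Count every hyphen anywhere in the string.
sub count_all_hyphens {
    my ($str) = @_;
    my $n = () = $str =~ /-/g;    # list-assignment idiom: number of matches
    return $n;
}

# Count only the hyphens at the very start of the string.
sub count_leading_hyphens {
    my ($str) = @_;
    my ($lead) = $str =~ m{\A(-*)};   # always matches, possibly an empty capture
    return length $lead;
}

for my $str ( '--aacgtACG', '--aac-gtACG', 'ctgGTTAtga' ) {
    printf "%-12s all=%d leading=%d\n",
        $str, count_all_hyphens($str), count_leading_hyphens($str);
}
```

The two counts agree on the clean strings and differ only on the one with the interior hyphen, which is where positions derived from the global count would be shifted.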
Re: Finding Start/End Position of the Uppercase Substring by ysth (Canon) on Jun 24, 2007 at 19:21 UTC
I'd like to see what you've already tried and an example of how this function would be called.