janusmccarthy has asked for the wisdom of the Perl Monks concerning the following question:

Here's a question that's scrambled my brains, I'm designing right now so I don't have code. I'm kind of looking for a "big picture" answer.

I am working with BioPerl (not a bioperl question), which provides me with gene sequence data which I can pull out and manipulate as a large unbroken string of characters (several hundred to several thousand characters long).

It looks something like this...
atgcatgcatgcatgcatgcatgcatgcaattggccatgcatgcatgcaattggccgcat...
The eventual output will be displayed in a Text or ROText widget, in GenBank sequence format which simply means spaced out every ten characters and the index of the base pair is displayed like this...
1 atgcatgcat gcatgcatgc atgcatgcaa 31 ttggccatgc atgcatgcaa ttggccgcat ...
I want to be able to input a small subsequence of that sequence (say aattggcc which is in my sample string), then I want to highlight any found sequence in my formatted string. So if capital letters were highlighted, this would look something like...
1 atgcatgcat gcatgcatgc atgcatgcAA 31 TTGGCCatgc atgcatgcAA TTGGCCgcat ...
So I'm wondering about a nice clean way to do this, and it's kind of escaping me at the moment. I was thinking I could:
  1. Look for the subsequence in the unformatted string (I can do this!)
  2. Mark it in a way that isn't lost when formatting (maybe change the letters to uppercase) (I can do this!)
  3. Format the string and place it in the Text or ROText object (I can do this!)
  4. Tag the data there with a tag configure/regex search (I can do this!)
  5. Remove my old marks so that it's properly in the format
    (kind of at a loss here, I'm not sure if changing the text within a tag using FindAndReplaceAll will cause the widget to clear the tags, any help here would be FANTASTIC!)
  6. Highlight off the tags. (I can do this!)
I'm relatively new to perl, and I was thinking there might be a way easier way (maybe tag the unformatted sequence within the Text widget, and build the structure within the widget with a regex? can I place tags on the data outside the widget? is there a tool already that I just have to write a formatter for?). Any ideas and/or help with #5 would be greatly appreciated.

Replies are listed 'Best First'.
Re: Searching Formatting and Highlighting Text Problem
by zentara (Cardinal) on Apr 26, 2009 at 10:50 UTC

      Actually close to a much simpler example here, but not quite what I'm looking for:

      http://www.perlmonks.org/?node=169190

      Here's a quick example of the problem I have at this point. Let's say I want to search for 'gca'. There are 13 occurrences in my unformatted string. When I format the string however, the sequence gets broken up, spread across groups and lines, and I can't find them anymore. Here is the formatted string:

      1 atggcgacga aggccgtgtg cgtgctgaag ggcgacggcc cagtgcaggg catcatcaat 61 ttcgagcaga aggaaagtaa tggaccagtg aaggtgtggg gaagcattaa aggactgact 121 gaaggcctgc atggattcca tgttcatgag tttggagata atacagcagg ctgtaccagt 181 gcaggtcctc actttaatcc tctatccaga aaacacggtg ggccaaagga tgaagagagg 241 catgttggag acttgggcaa tgtgactgct gacaaagatg gtgtggccga tgtgtctatt 301 gaagattctg tgatctcact ctcaggagac cattgcatca ttggccgcac actggtggtc 361 catgaaaaag cagatgactt gggcaaaggt ggaaatgaag aaagtacaaa gacaggaaac 421 gctggaagtc gtttggcttg tggtgtaatt gggatcgccc aataaacatt cccttggatg 481 tagtctgagg cccct
      The triplet's I would find if I just searched the unformatted string are at (with problem strings in the formatted string bolded):
      • Base pair 55 (Row 1, Group 5, 5th letter in)
      • Base pair 60 (Row 1, Group 5, 10th letter in)
      • Base pair 66 (Row 2, Group 1, 6th letter in)
      • Base pair 104 (Row 2, Group 5, 4th letter in)
      • Base pair 129 (Row 3, Group 1, 9th letter in)
      • Base pair 166 (Row 3, Group 5, 6th letter in)
      • Base pair 181 (Row 4, Group 1, 1st letter in)
      • Base pair 240 (Row 4, Group 6, 10th letter in) ** BIG PROBLEM
      • Base pair 257 (Row 5, Group 2, 7th letter in)
      • Base pair 335 (Row 6, Group 4, 5th letter in)
      • Base pair 347 (Row 6, Group 5, 7th letter in)
      • Base pair 370 (Row 7, Group 1, 10th letter in)
      • Base pair 383 (Row 7, Group 3, 3rd letter in)

      The problem is that the search can go across group and across lines. A pure regex across the newlines won't work, because I'll end up highlighting the numbers. I suppose that I *could* go back in after tagging, find the numbers, and retag them back, and that might be an option.

      I tried setting some code to uppercase, then marking the uppercase letters, and then switching the letters back to lowercase. It cleared the highlight tags when I did that, so that option is out.

      I might try just building a widget where the position numbers are outside the Text widget, but it seems like a lot of work.

      Here's some test code that's a little easier to swallow than yours if anyone wants to play around with it.
      use Tk; $sequence=" 1 atggcgacga aggccgtgtg cgtgctgaag ggcgacggcc cagtg +caggg catcatcaat 61 ttcgagcaga aggaaagtaa tggaccagtg aaggtgtggg gaagcattaa aggac +tgact 121 gaaggcctgc atggattcca tgttcatgag tttggagata atacagcagg ctgta +ccagt 181 gcaggtcctc actttaatcc tctatccaga aaacacggtg ggccaaagga tgaag +agagg 241 catgttggag acttgggcaa tgtgactgct gacaaagatg gtgtggccga tgtgt +ctatt 301 gaagattctg tgatctcact ctcaggagac cattgcatca ttggccgcac actgg +tggtc 361 catgaaaaag cagatgactt gggcaaaggt ggaaatgaag aaagtacaaa gacag +gaaac 421 gctggaagtc gtttggcttg tggtgtaatt gggatcgccc aataaacatt ccctt +ggatg 481 tagtctgagg cccct"; # Main Window my $mainWindow = MainWindow->new(); $mainWindow->title("Regex Problem Example"); #Sets Title $mainWindow->minsize(qw(500 500)); $mainWindow->geometry('+500+200'); $mainWindow->optionAdd('*font'=>'Courier 10'); $ROText = $mainWindow->Scrolled('ROText', -scrollbars=>'osoe'); $ROText->Insert($sequence); $ROText->pack; highlightText($ROText, "gca"); MainLoop(); sub highlightText { my ($widget, $searchString) = @_; # Create a tag to configure the text $widget->tagConfigure('foundtag', -foreground => "white", -background => "red"); $widget->FindAll(-regex, -nocase, $searchString); if ($widget->tagRanges('sel')) { my %startfinish = $widget->tagRanges('sel'); foreach(sort keys %startfinish) { $widget->tagAdd("foundtag", $_, $startfinish{$_}); } $widget->tagRemove('sel', '1.0', 'end'); } }
Re: Searching Formatting and Highlighting Text Problem
by almut (Canon) on Apr 26, 2009 at 11:06 UTC
    5. Remove my old marks

    So if I'm understanding correctly, the problem is that the FindAndReplaceAll method does not support backreferences/eval in the replacement expression (like Perl's s/// does) — so something like $text->FindAndReplaceAll(-regexp,-case,'([ACGT])','lc($1)') doesn't work.

    Maybe you could instead retrieve the entire contents, do the lowercasing, and then put the modified content back into the widget:

    my $s = $text->Contents(); $s =~ tr/ACGT/acgt/; $text->Contents($s);
      You're not understanding correctly, and that's probably my fault. If I retrieve the entire contents and change it there, I've messed the tags that I put in the string to mark where I want to highlight. Here's a better set of sample code. I'm demonstrating my first idea if you set $idea=0. If you go into highlightText2 and uncomment where I've said to break, you can see why I'm not using the original idea, but it does appear better here then when I originally wrote the test code on slightly different text.
      use Tk; $idea = 0; $rawSequence="atggcgacgaaggccgtgtgcgtgctgaagggcgacggcccagtgcagggcatc +atcaatttcgagcagaaggaaagtaatggaccagtgaaggtgtggggaagcattaaaggactgactgaa +ggcctgcatggattccatgttcatgagtttggagataatacagcaggctgtaccagtgcaggtcctcac +tttaatcctctatccagaaaacacggtgggccaaaggatgaagagaggcatgttggagacttgggcaat +gtgactgctgacaaagatggtgtggccgatgtgtctattgaagattctgtgatctcactctcaggagac +cattgcatcattggccgcacactggtggtccatgaaaaagcagatgacttgggcaaaggtggaaatgaa +gaaagtacaaagacaggaaacgctggaagtcgtttggcttgtggtgtaattgggatcgcccaataaaca +ttcccttggatgtagtctgaggcccct"; my $mainWindow = MainWindow->new(); if( $idea) { # Main Window $mainWindow->title("Regex Problem Example 1"); $mainWindow->minsize(qw(500 500)); #Sets Minimum Size - note lack + of comma $mainWindow->geometry('+500+200'); #Sets Position - note lack of +comma $mainWindow->optionAdd('*font'=>'Courier 10'); $ROText = $mainWindow->Scrolled('ROText', -scrollbars=>'osoe'); $formattedSequence = formatSequence($rawSequence); $ROText->Insert($formattedSequence); $ROText->pack; highlightText($ROText, "gca"); } else { # Main Window $mainWindow->title("Regex Problem Example 2"); #Sets Title $mainWindow->minsize(qw(500 500)); #Sets Minimum Size - note lack + of comma $mainWindow->geometry('+500+200'); #Sets Position - note lack of +comma $mainWindow->optionAdd('*font'=>'Courier 10'); $ROText = $mainWindow->Scrolled('ROText', -scrollbars=>'osoe'); $formattedSequence = searchAndFormatSequence($rawSequence); $ROText->Insert($formattedSequence); $ROText->pack; #This works, but I can't make Uppercase->Lowercase #without removing the tags. highlightText2($ROText, "[A-Z]+"); } MainLoop(); sub highlightText { my ($widget, $searchString) = @_; #alter search string - HOW? # Create a tag to configure the text $widget->tagConfigure('foundtag', -foreground => "white", -background => "red"); $widget->FindAll(-regex, -nocase, $searchString); if ($widget->tagRanges('sel')) { my %startfinish = $widget->tagRanges('sel'); foreach(sort keys %startfinish) { $widget->tagAdd("foundtag", $_, $startfinish{$_}); } $widget->tagRemove('sel', '1.0', 'end'); } } sub highlightText2 { my ($widget, $searchString) = @_; #alter search string - HOW? # Create a tag to configure the text $widget->tagConfigure('foundtag', -foreground => "white", -background => "red"); $widget->FindAll(-regex, -case, $searchString); if ($widget->tagRanges('sel')) { my %startfinish = $widget->tagRanges('sel'); foreach(sort keys %startfinish) { $widget->tagAdd("foundtag", $_, $startfinish{$_}); } $widget->tagRemove('sel', '1.0', 'end'); } #UNCOMMENT TO BREAK! Doesn't like a regex in the replace string #$widget->FindAndReplaceAll(-regex, -case, "G", "g"); #$widget->FindAndReplaceAll(-regex, -case, "C", "c"); #$widget->FindAndReplaceAll(-regex, -case, "A", "a"); } sub searchAndFormatSequence { #I search the sequence and change any gca in the sequence to GCA t +hen #format the sequence, the end result of which is return " 1 atggcgacga aggccgtgtg cgtgctgaag ggcgacggcc cagt +GCAggG CAtcatcaat 61 ttcgaGCAga aggaaagtaa tggaccagtg aaggtgtggg gaaGCAttaa aggac +tgact 121 gaaggcctGC Atggattcca tgttcatgag tttggagata atacaGCAgg ctgta +ccagt 181 GCAggtcctc actttaatcc tctatccaga aaacacggtg ggccaaagga tgaag +agagG 241 CAtgttggag acttggGCAa tgtgactgct gacaaagatg gtgtggccga tgtgt +ctatt 301 gaagattctg tgatctcact ctcaggagac cattGCAtca ttggccGCAc actgg +tggtc 361 catgaaaaaG CAgatgactt ggGCAaaggt ggaaatgaag aaagtacaaa gacag +gaaac 421 gctggaagtc gtttggcttg tggtgtaatt gggatcgccc aataaacatt ccctt +ggatg 481 tagtctgagg cccct"; } sub formatSequence { #I do some really neat formatting here the end result of which is. +.. return " 1 atggcgacga aggccgtgtg cgtgctgaag ggcgacggcc cagt +gcaggg catcatcaat 61 ttcgagcaga aggaaagtaa tggaccagtg aaggtgtggg gaagcattaa aggac +tgact 121 gaaggcctgc atggattcca tgttcatgag tttggagata atacagcagg ctgta +ccagt 181 gcaggtcctc actttaatcc tctatccaga aaacacggtg ggccaaagga tgaag +agagg 241 catgttggag acttgggcaa tgtgactgct gacaaagatg gtgtggccga tgtgt +ctatt 301 gaagattctg tgatctcact ctcaggagac cattgcatca ttggccgcac actgg +tggtc 361 catgaaaaag cagatgactt gggcaaaggt ggaaatgaag aaagtacaaa gacag +gaaac 421 gctggaagtc gtttggcttg tggtgtaatt gggatcgccc aataaacatt ccctt +ggatg 481 tagtctgagg cccct"; }

        Can't you just do the lowercasing before you apply the ranges (using the uppercase approach you outlined)?  The tag positions in %startfinish would still be valid, as the case has no influence on length...

        my %startfinish = $widget->tagRanges('sel'); # undo uppercase my $s = $widget->Contents(); $s =~ tr/A-Z/a-z/; $widget->Contents($s); foreach(sort keys %startfinish) { $widget->tagAdd("foundtag", $_, $startfinish{$_}); }

        That would at least highlight all (lowercased) subsequences. But I'm still not entirely sure what the ultimate objective is... Is it just the highlighting?