Becky has asked for the wisdom of the Perl Monks concerning the following question:

I have a program which looks through a protein sequence (just a string of letters) to see if it has N's in it. The results look roughly like this:

BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST

I just need to match the numbers (91, 262, 293...) as these are the positions of the N's. I've tried this:

if ($string =~/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/sm){ $sites = $1; }

and it matches the first number 91, fine, but no more. I tried s and m modifiers but no luck. How can I search through every line and record the value of every match? In addition, there can be any number of such letters present, so I need to search until there are no more lines left. Any ideas?

Re: matching every occurrence of a regex
by broquaint (Abbot) on Jan 07, 2003 at 13:41 UTC
    If that data is just a string, how about some splitting instead:
    my $data = <<STR;
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST
    STR

    my @nums = map { (split)[1] } split /\n/, $data;
    print join(', ', @nums), $/;

    __output__
    91, 262, 293

    HTH

    _________
    broquaint

      That's a good and very natural approach to the problem. What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

      However - especially if the string is long - your double splitting might be doing a lot of unnecessary extra work: it first creates an array by splitting on '\n', then a second array by splitting each line, and then slices that array.

      If the data is really as simple and consistent as in the given example, this will do as well with much less effort:

      my $data = <<STR;
      BC001593 91 NPSL
      BC001593 262 NASS
      BC001593 293 NAST
      STR

      my @nums = $data =~ /\s(\d+)\s/g;
      print join(', ', @nums), $/;

      If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.
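
      As a rough sketch of what such "massaging" might look like: assuming every line has the shape "ID, whitespace, number, whitespace, four letters", the pattern can be anchored to whole lines with /m so that other whitespace-surrounded numbers are not picked up. The helper name extract_positions is hypothetical and not from the thread.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper (an assumption, not part of the original post):
      # returns the second field of every line that looks like
      # "ID <number> <four letters>". Call it in list context.
      sub extract_positions {
          my ($data) = @_;
          # /m lets ^ and $ match at every line boundary; /g returns all captures.
          return $data =~ /^\S+[ \t]+(\d+)[ \t]+[A-Za-z]{4}[ \t]*$/mg;
      }

      # Sample data from the question, built without a heredoc.
      my $data = join "\n",
          'BC001593 91 NPSL',
          'BC001593 262 NASS',
          'BC001593 293 NAST',
          '';

      my @nums = extract_positions($data);
      print join(', ', @nums), "\n";
      ```

      The anchoring trades generality for safety: lines that do not fit the assumed shape are silently skipped rather than contributing a wrong number.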

      -- Hofmator

        What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

        If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.

        In the second quote you speak of the general idea; in the first you complain that people show general ideas and don't bother with the details. One could easily have included your quote, slightly modified, for each of the other replies: "If the input is gotten/stored in any other way, then the loop construct has to be massaged accordingly. Nevertheless the general idea should still work."

        Many posts here are about general ideas (and that's good). It's silly to always have to point out that if it's read from a file it should use while (...) { ... } but if it's in some form of list it should be for (...) { ... }. The idea is still that the problem is solved by looping through every sequence. How you choose to do that is up to the final implementor. Imho, the questioner should be skilled enough to know how to read a file line by line, or how to loop through an array.

        Personally I would use the same approach as you did, but that's irrelevant right now.

        What's more important to note is that none of the replies that used while (...) { ... } local()ized $_!
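
        A minimal sketch of why that matters, assuming the loop lives in a subroutine called from code that is itself using $_. The helper name positions_from_fh and the in-memory filehandle are illustrative assumptions, not taken from any reply. while (<$fh>) assigns to the global $_ and leaves it undefined at end of file, so without the local the caller's aliased value would be clobbered.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: collect the second field of every non-blank line.
        sub positions_from_fh {
            my ($fh) = @_;
            local $_;    # protect the caller's $_; while (<$fh>) assigns to it
            my @positions;
            while (<$fh>) {
                next unless /\S/;               # skip blank lines
                push @positions, (split)[1];    # second whitespace-separated field
            }
            return @positions;
        }

        my $data = "BC001593 91 NPSL\nBC001593 262 NASS\nBC001593 293 NAST\n";
        open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

        my @labels = ('caller value');
        for (@labels) {    # $_ is aliased to $labels[0] inside this loop
            my @positions = positions_from_fh($fh);
            print "positions: @positions\n";
            print "caller's \$_ is still: $_\n";
        }
        close $fh;
        ```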

        ihb
Re: matching every occurrence of a regex
by helgi (Hermit) on Jan 07, 2003 at 13:49 UTC
    First, I would suggest an entirely different approach, something like

    while (<DATA>) {
        next if /^\s+$/;
        my (undef, $pos, undef) = split;
        print "$pos\n";
    }
    __END__
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST

    Second, I can find nothing wrong with your regex above, which gives me the correct output when I enclose it in a loop similar to the one above:

    while (<DATA>) {
        next if /^\s+$/;
        my $sites = 0;
        if (/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/sm) { $sites = $1 }
        print "$sites\n";
    }
    __END__
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST

    Therefore, I suggest that there is something wrong with your looping structure, rather than your regex.

    I hope this helps.

    --
    Regards,
    Helgi Briem
    helgi AT decode DOT is

Re: matching every occurrence of a regex
by virtualsue (Vicar) on Jan 07, 2003 at 14:01 UTC
    Hi Becky,
    It would be interesting to know how $string gets its value. It appears you are only getting it one line at a time, which is why you only get the first match. I've modified your program very slightly by adding a loop and it prints out all 3 values from the sample lines you provided.
    while (<DATA>) {
        if (/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/sm) {
            print $1, "\n";
        }
    }
    __DATA__
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST

    You know your data better than we do, but on the basis of what you've said it looks as though the regex could be simplified or even eliminated through the use of split like so:
    while (<DATA>) {
        my $pos = (split ' ')[1];
        print $pos, "\n";
    }
    __DATA__
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST

    Just food for thought. This split trick will only work if your value of interest is always the 2nd field on each line.
Re: matching every occurrence of a regex
by foxops (Monk) on Jan 07, 2003 at 13:43 UTC
    Have you tried local $/ ? The regex seems to function fine if you flatten the file. This works for me:
    use strict;
    my ($FILE, $SITE);
    print "Protein Sequencer\n\nInput Dataset file name: ";
    $FILE = <STDIN>;
    #chomp($FILE);
    local $/;                 # Null the $/ to search through a flat file
    print "\nLoading Dataset - Be patient.\n";
    open DATA, $FILE or die $!;    # Open File
    $_ = <DATA>;              # Load File to Ram
    close DATA or die $!;     # Close File
    print "\"N\" Sites\n---------\n";
    while ($_ =~ m/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/gs) {
        $SITE = $1;
        print "$SITE\n";
    }
Re: matching every occurrence of a regex
by emilford (Friar) on Jan 07, 2003 at 13:43 UTC
    If all you wanted to do was grab the middle number on each line, this worked for me:
    while (<DATA>) {
        if (m/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/) {
            my $site = $1;
            print "$site\n";
        }
    }
    __DATA__
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST

    # the output
    91
    262
    293

    As for performing this regex until there are no more lines, I think it depends on where the protein strings are stored. Either way, if you can get the lines into an array, you could use the foreach construct to touch every one. HTH
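
    A minimal sketch of that foreach variant, assuming the lines have already been loaded into an array by some means (here they are simply hard-coded from the question's sample). The array names @lines and @sites are illustrative; each match is recorded with push so that all positions are available after the loop.

    ```perl
    use strict;
    use warnings;

    # Assumption: the lines are already in an array, however they were obtained.
    my @lines = (
        'BC001593 91 NPSL',
        'BC001593 262 NASS',
        'BC001593 293 NAST',
    );

    my @sites;    # collects every matched position
    foreach my $line (@lines) {
        if ($line =~ m/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/) {
            push @sites, $1;
        }
    }

    print "$_\n" for @sites;
    ```

    Using a lexical loop variable ($line) rather than $_ keeps the loop from interfering with any surrounding code that relies on $_.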
Re: matching every occurrence of a regex
by jacques (Priest) on Jan 07, 2003 at 13:44 UTC
    You are making this more difficult than it is.

    $string =~ m/\s(\d+)\s/;
    $nposition = $1;

    If the sequences are in a file, you can just open the file and read through every line with a while loop. If you find a match, you can put it in an array.