matching every occurrence of a regex

Becky has asked for the wisdom of the Perl Monks concerning the following question:

Re: matching every occurrence of a regex by broquaint (Abbot) on Jan 07, 2003 at 13:41 UTC
If that data is just a string, how about some `split`ting instead?

```perl
my $data = <<STR;
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
STR

# Split into lines, then take the second whitespace-separated field of each.
my @nums = map { (split)[1] } split /\n/, $data;
print join(', ', @nums), $/;
```

Output: `91, 262, 293`

HTH

broquaint
Re: Re: matching every occurrence of a regex by Hofmator (Curate) on Jan 07, 2003 at 14:06 UTC
That's a good and very natural approach to the problem. What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these `while (<FILE>) {...}` approaches might not work.

However, especially if the string is long, your double splitting might be doing a lot of unnecessary extra work: first it creates an array by splitting on '\n', then a second array by splitting each line, and then it slices that array. If the data is really as simple and consistent as in the example, this will do just as well with much less effort:

```perl
my $data = <<STR;
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
STR

# A /g match in list context returns every captured number at once.
my @nums = $data =~ /\s(\d+)\s/g;
print join(', ', @nums), $/;
```

If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.

-- Hofmator
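As a rough sketch of the kind of "massaging" mentioned above: if a line might carry other whitespace-surrounded numbers, anchoring the pattern to the start of each line with `/m` keeps only the second field. The sub name `second_field_numbers` and the extra sample line are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the second field of every line,
# but only where that field consists of digits.
sub second_field_numbers {
    my ($data) = @_;
    # /m lets ^ match at the start of each line; /g in list context
    # collects the capture from every match.
    my @nums = $data =~ /^\S+[ \t]+(\d+)(?=\s)/mg;
    return @nums;
}

# Assumed sample: the second line has an extra number that should be ignored.
my $data = <<'STR';
BC001593 91 NPSL
BC001593 262 NASS 17 extra
BC001593 293 NAST
STR

print join(', ', second_field_numbers($data)), "\n";
```

Because the match must start at the beginning of a line, the stray `17` is skipped even though it is surrounded by whitespace.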
Re: Re: Re: matching every occurrence of a regex by ihb (Deacon) on Jan 07, 2003 at 16:44 UTC
> What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

> If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.

In the second quote you speak of a general idea; in the first you complain that people show general ideas and don't bother about details. One could easily have included your quote, slightly modified, in each of the other replies: "If the input is gotten/stored in any other way, then the loop construct has to be massaged accordingly. Nevertheless the general idea should still work."

Many posts here are about general ideas (and that's good). It's silly to always have to point out that if it's read from a file it should use `while (...) { ... }` but if it's in some form of list it should be `for (...) { ... }`. The idea is still that the problem is solved by looping through every sequence. How you choose to do that is up to the final implementor. IMHO, the questioner should be skilled enough to know how to read a file line by line, or how to loop through an array.

Personally I would use the same approach as you did, but that's irrelevant right now. What's more important to note is that none of the replies that used `while (...) { ... }` `local()`ized `$_`!

ihb
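To make the `$_` point concrete, here is a small sketch of what can go wrong when a `while (<FH>)` loop runs inside a sub that is called from a `for` loop. The sub names and sample strings are invented for the example; it assumes a Perl recent enough to open in-memory filehandles (5.8 or later).

```perl
use strict;
use warnings;

# Hypothetical helper that does NOT protect the caller's $_.
sub count_lines_careless {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my $n = 0;
    while (<$fh>) { $n++ }    # assigns each line, and finally undef, to the global $_
    return $n;
}

# Same loop, but $_ is localized first, so the caller's value comes back on exit.
sub count_lines_careful {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $_;
    my $n = 0;
    while (<$fh>) { $n++ }
    return $n;
}

my $text = "BC001593 91 NPSL\nBC001593 262 NASS\n";

# In a for loop, $_ is an alias to the current array element,
# so a careless callee overwrites the array itself.
my @careless = ('first', 'second');
for (@careless) { count_lines_careless($text) }

my @careful = ('first', 'second');
for (@careful) { count_lines_careful($text) }

print "careless: ", join(', ', map { defined($_) ? $_ : 'undef' } @careless), "\n";
print "careful:  ", join(', ', map { defined($_) ? $_ : 'undef' } @careful), "\n";
```

Reading into a lexical instead, as in `while (my $line = <$fh>)`, sidesteps the issue altogether and is probably the simpler habit.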
Re4: matching every occurrence of a regex by Hofmator (Curate) on Jan 07, 2003 at 17:13 UTC
Localizing $_ and while (<FH>) { ... }. (Is really: "Re: Re4: matching every occurrence of a regex") by ihb (Deacon) on Jan 08, 2003 at 23:21 UTC
Some notes below your chosen depth have not been shown here
Re: matching every occurrence of a regex by helgi (Hermit) on Jan 07, 2003 at 13:49 UTC
First, I would suggest an entirely different approach, something like:

```perl
while (<DATA>) {
    next if /^\s+$/;                    # skip blank lines
    my (undef, $pos, undef) = split;    # keep only the second field
    print "$pos\n";
}
__END__
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
```

Second, I can find nothing wrong with your regex above, which gives me the correct output when I enclose it in a loop similar to the one above:

```perl
while (<DATA>) {
    next if /^\s+$/;
    my $sites = 0;
    if (/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/sm) { $sites = $1 }
    print "$sites\n";
}
__END__
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
```

Therefore, I suggest that there is something wrong with your looping structure, rather than your regex. I hope this helps.

-- Regards, Helgi Briem helgi AT decode DOT is
Re: matching every occurrence of a regex by virtualsue (Vicar) on Jan 07, 2003 at 14:01 UTC
Hi Becky,

It would be interesting to know how $string gets its value. It appears you are only getting it one line at a time, which is why you only get the first match. I've modified your program very slightly by adding a loop, and it prints out all 3 values from the sample lines you provided.

```perl
while (<DATA>) {
    if (/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/sm) {
        print $1, "\n";
    }
}
__DATA__
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
```

You know your data better than we do, but on the basis of what you've said it looks as though the regex could be simplified or even eliminated through the use of split, like so:

```perl
while (<DATA>) {
    my $pos = (split ' ')[1];    # second whitespace-separated field
    print $pos, "\n";
}
__DATA__
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
```

Just food for thought. This split trick will only work if your value of interest is always the 2nd field on each line.
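One detail of the split trick that may be worth a quick illustration: `split ' '` skips leading whitespace, while `split /\s+/` produces an empty first field when the line is indented, which shifts the index of the value you want. The helper names and the indented sample line below are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two ways of splitting on whitespace.
sub second_field_awk_style {
    my ($line) = @_;
    my @fields = split ' ', $line;     # ' ' discards leading whitespace first
    return $fields[1];
}

sub second_field_regex_style {
    my ($line) = @_;
    my @fields = split /\s+/, $line;   # leading whitespace gives an empty first field
    return $fields[1];
}

# Assumed input: the same kind of record, but indented.
my $indented = "   BC001593 91 NPSL\n";

print "split ' '   : ", second_field_awk_style($indented), "\n";
print "split /\\s+/ : ", second_field_regex_style($indented), "\n";
```

With the indented line, the first form still returns the number, whereas the second returns the identifier, so the `' '` form is the safer choice if indentation can vary.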
Re: matching every occurrence of a regex by foxops (Monk) on Jan 07, 2003 at 13:43 UTC
Have you tried `local $/`? The regex seems to function fine if you flatten the file. This works for me:

```perl
use strict;
my($FILE,$SITE);
print "Protein Sequencer\n\nInput Dataset file name: ";
$FILE = <STDIN>;
#chomp($FILE);
local $/;                    #Null the $/ to search through a flat file
print "\nLoading Dataset - Be patient.\n";
open DATA, $FILE or die $!;  # Open File
$_ = <DATA>;                 # Load File to Ram
close DATA or die $!;        # Close File
print "\"N\" Sites\n---------\n";
while ($_ =~ m/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/gs) {
    $SITE = $1;
    print "$SITE\n";
}
```
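For comparison, a minimal sketch of the same slurp-then-match idea with `$/` localized only inside a small block and a lexical filehandle, so the rest of the program still reads line by line. The sub name `slurp_sites` is hypothetical, and an in-memory filehandle stands in for a real file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp everything from a filehandle, then walk the
# matches one at a time with a scalar-context //g loop.
sub slurp_sites {
    my ($fh) = @_;
    my $content = do { local $/; <$fh> };    # $/ is undef only inside this block
    my @sites;
    while ($content =~ /\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/g) {
        push @sites, $1;                     # //g resumes where the last match ended
    }
    return @sites;
}

# Stand-in for a real file: an in-memory handle over the sample data.
my $sample = "BC001593 91 NPSL\nBC001593 262 NASS\nBC001593 293 NAST\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
print "$_\n" for slurp_sites($fh);
close $fh;
```

Slurping is convenient for small inputs; for very large files, a line-by-line loop keeps memory use flat.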
Re: matching every occurrence of a regex by emilford (Friar) on Jan 07, 2003 at 13:43 UTC
If all you wanted to do was grab the middle number on each line, this worked for me:

```perl
while (<DATA>) {
    if (m/\w{1,12}\s+(\d{1,5})\s+[a-zA-Z]{4}/) {
        my $site = $1;
        print "$site\n";
    }
}
__DATA__
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST

# the output
91
262
293
```

As far as performing this regex until there are no more lines, I think it depends on where the protein strings are stored. Either way, if you can get the lines into an array, you could use the foreach construct to touch every one.

HTH
Re: matching every occurrence of a regex by jacques (Priest) on Jan 07, 2003 at 13:44 UTC
You are making this more difficult than it is.

```perl
# Only trust $1 when the match actually succeeded.
if ($string =~ m/\s(\d+)\s/) {
    $nposition = $1;
}
```

If the sequences are in a file, you can just open the file and read through every line with a while loop. If you find a match, you can put it in an array.
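The loop described above could look roughly like the following. This is only a sketch: the sub name `collect_positions` is made up, and an in-memory filehandle stands in for the real file so the snippet runs as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: read a filehandle line by line and collect the first
# whitespace-surrounded number found on each line.
sub collect_positions {
    my ($fh) = @_;
    my @positions;
    while (my $line = <$fh>) {               # lexical $line leaves $_ alone
        push @positions, $1 if $line =~ /\s(\d+)\s/;
    }
    return @positions;
}

# Stand-in for a real file: an in-memory handle over the sample data.
my $sample = "BC001593 91 NPSL\nBC001593 262 NASS\nBC001593 293 NAST\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @positions = collect_positions($fh);
close $fh;

print join(', ', @positions), "\n";
```

Lines that do not match are simply skipped, so the array ends up with one entry per matching line.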