in reply to matching every occurrence of a regex

If that data is just a string, how about some splitting instead?

my $data = <<STR;
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
STR

my @nums = map { (split)[1] } split /\n/, $data;
print join(', ', @nums), $/;

__output__
91, 262, 293

HTH

_________
broquaint

Re: Re: matching every occurrence of a regex
by Hofmator (Curate) on Jan 07, 2003 at 14:06 UTC

    That's a good and very natural approach to the problem. What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

    However, especially if the string is long, your double splitting might be doing a lot of unnecessary extra work: it first creates an array by splitting on '\n', then creates a second array for each line by splitting it again, and finally slices that array.
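
    Whether the extra work matters is easy to measure rather than guess. The following is only a rough sketch using the core Benchmark module; the 2000-line input is made up by repeating the shape of the sample lines, and the sub names by_split and by_regex are mine, not from the thread. No timings are claimed here, since they depend on the machine and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical larger input: 2000 lines shaped like the sample data.
my $data = join '', map { "BC001593 $_ NPSL\n" } 1 .. 2000;

# Split into lines, then split each line and take the second field.
sub by_split {
    my @nums = map { (split)[1] } split /\n/, $data;
    return scalar @nums;
}

# One global match over the whole string.
sub by_regex {
    my @nums = $data =~ /\s(\d+)\s/g;
    return scalar @nums;
}

# Sanity check before timing: both approaches must find the same count.
die "approaches disagree\n" unless by_split() == by_regex();

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    double_split => \&by_split,
    single_regex => \&by_regex,
});
```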

    If the data is really as simple and consistent as is given in the example, this will do as well with much less effort:

    my $data = <<STR;
    BC001593 91 NPSL
    BC001593 262 NASS
    BC001593 293 NAST
    STR

    my @nums = $data =~ /\s(\d+)\s/g;
    print join(', ', @nums), $/;

    If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.
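
    As one possible way of massaging the regex, the sketch below anchors the match to the start of each line so that only the second field is captured. The trailing numeric column in the sample data is invented purely for illustration; nothing in the original question says the real data looks like this.

```perl
use strict;
use warnings;

# Hypothetical data: an extra whitespace-surrounded number ends each line.
my $data = <<STR;
BC001593 91 NPSL 4
BC001593 262 NASS 7
BC001593 293 NAST 2
STR

# The simple pattern also picks up the trailing numbers.
my @naive = $data =~ /\s(\d+)\s/g;

# With /m, ^ matches at the start of every line; \S+ skips the ID,
# so only the second field of each line is captured.
my @nums = $data =~ /^\S+\s+(\d+)/mg;

print 'simple:   ', join(', ', @naive), "\n";
print 'anchored: ', join(', ', @nums), "\n";
```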

    -- Hofmator

      What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

      If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.

      In the second quote you speak of the general idea; in the first you complain that people show general ideas and don't bother with the details. One could easily have included your quote, slightly modified, in each of the other replies: "If the input is gotten/stored in any other way, then the loop construct has to be massaged accordingly. Nevertheless the general idea should still work."

      Many posts here are about general ideas (and that's good). It's silly to always have to point out that if it's read from a file it should use while (...) { ... } but if it's in some form of list it should be for (...) { ... }. The idea is still that the problem is solved by looping through every sequence. How you choose to do that is up to the final implementor. IMHO, the questioner should be skilled enough to know how to read a file line by line, or how to loop through an array.
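
      To make the point concrete, here is a minimal sketch of the same extraction with both loop constructs. The helper second_field is a made-up name, and the in-memory filehandle (open on a scalar reference, available since Perl 5.8) merely stands in for a real file.

```perl
use strict;
use warnings;

my $data = <<STR;
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
STR

# Hypothetical helper: second whitespace-separated field of one line.
sub second_field {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return $fields[1];
}

# Input as a list of lines: loop with for.
my @from_list;
for my $line (split /\n/, $data) {
    push @from_list, second_field($line);
}

# Input as a filehandle: loop with while.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @from_handle;
while (my $line = <$fh>) {
    push @from_handle, second_field($line);
}
close $fh;

print join(', ', @from_list), "\n";
print join(', ', @from_handle), "\n";
```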

      Personally I would use the same approach as you did, but that's irrelevant right now.

      What's more important to note is that none of the replies that used while (...) { ... } local()ized $_!
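
      A small sketch of why this matters, assuming the while loop lives in a sub that gets called from inside a for loop over $_. The sub names and the @ids arrays are invented for the demonstration. Since for aliases $_ to each element, the unlocalized read overwrites the caller's array.

```perl
use strict;
use warnings;

my $data = "BC001593 91 NPSL\nBC001593 262 NASS\n";

# Reads lines into the global $_ without localizing it.
sub count_lines_unsafe {
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my $n = 0;
    while (<$fh>) { $n++ }
    return $n;
}

# Same loop, but the caller's $_ is protected by local.
sub count_lines_safe {
    local $_;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my $n = 0;
    while (<$fh>) { $n++ }
    return $n;
}

my @safe_ids   = ('BC001593', 'BC001594');
my @unsafe_ids = ('BC001593', 'BC001594');

# for aliases $_ to each array element in turn.
count_lines_safe()   for @safe_ids;
count_lines_unsafe() for @unsafe_ids;

# The final failed read stores undef in $_, i.e. in the aliased element.
print 'safe:   ', join(', ', map { defined($_) ? $_ : 'undef' } @safe_ids), "\n";
print 'unsafe: ', join(', ', map { defined($_) ? $_ : 'undef' } @unsafe_ids), "\n";
```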

      ihb

        Well, the original message stated that the protein sequence is "just a string of letters". From that I drew the conclusion that the whole thing is stored in a scalar. On rereading, that's no longer so obvious, and it may have been my misinterpretation.

        Concerning the localizing of $_, that's the right thing to do in certain circumstances. However, we are showing snippets of code without context here, presenting general ideas, as you write yourself. We can't know whether it's reasonable to localize $_ or not; the author of the script has to decide that himself. If you suspect that the person asking might not be aware of it, though, you could point out (as a general remark) that it might be a good idea to use local $_.

        Update: When I'm talking about 'certain circumstances' I'm thinking about short scripts that act like a Unix filter, with one main while (<>) {} loop. Localizing makes (most of the time) no sense there. However, ihb makes a very valid point for the more general case below.

        I'd agree that we mostly agree ;-)

        -- Hofmator