in reply to Re: matching every occurrence of a regex
in thread matching every occurrence of a regex

That's a good and very natural approach to the problem. What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

However - especially if the string is long - your double splitting might be doing a lot of unnecessary extra work: it first creates an array by splitting on '\n', then creates a second array by splitting each line, and finally slices that second array.

If the data is really as simple and consistent as in the given example, this will do just as well with much less effort:

my $data = <<STR;
BC001593 91 NPSL
BC001593 262 NASS
BC001593 293 NAST
STR

my @nums = $data =~ /\s(\d+)\s/g;
print join(', ', @nums), $/;

If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.
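
As a rough illustration of such massaging, here is a minimal sketch. It assumes the wanted number is always the second whitespace-separated column; the extra trailing columns in the sample data are invented for the example and are not from the original question. Anchoring with ^ under /m keeps the later numbers on each line from being picked up.

```perl
use strict;
use warnings;

# Hypothetical sample: each line now carries a second
# whitespace-surrounded number that should be ignored.
my $data = join "\n",
    'BC001593 91 NPSL 7 x',
    'BC001593 262 NASS 12 y',
    'BC001593 293 NAST 5 z',
    '';

# /m lets ^ match at the start of every line, so only the number
# in the second column of each line is captured.
my @nums = $data =~ /^\S+\s+(\d+)\s/mg;

print join(', ', @nums), "\n";   # prints "91, 262, 293"
```

The unanchored /\s(\d+)\s/g would also return 7, 12 and 5 on this data, which is the kind of case the anchoring is meant to handle.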

-- Hofmator

Re: Re: Re: matching every occurrence of a regex
by ihb (Deacon) on Jan 07, 2003 at 16:44 UTC
    What most other posters here didn't consider is that the whole sequence is stored in a string. This string might be read in from a file but doesn't have to be, so all these while (<FILE>) {...} approaches might not work.

    If the real string differs from the given example and the lines contain e.g. more 'whitespace surrounded numbers', then the regex has to be massaged accordingly. Nevertheless the general idea should still work.

    In the second quote you speak of the general idea; in the first you complain that people show general ideas and don't bother about details. One could easily have included your quote, slightly modified, in each of the other replies: "If the input is obtained or stored in any other way, then the loop construct has to be massaged accordingly. Nevertheless the general idea should still work."

    Many posts here are about general ideas (and that's good). It's silly to always have to point out that if the data is read from a file the code should use while (...) { ... }, but if it's in some form of list it should use for (...) { ... }. The idea is still that the problem is solved by looping through every sequence. How you choose to do that is up to the final implementor. Imho, the questioner should be skilled enough to know how to read a file line by line, or how to loop through an array.
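
    To make that concrete, here is a small sketch of one per-line operation driven by either loop construct. The helper second_column and the sample text are made up for illustration, and the in-memory filehandle merely stands in for a real file.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: the per-record work is the same
    # no matter where the lines come from.
    sub second_column {
        my ($line) = @_;
        my (undef, $num) = split ' ', $line;
        return $num;
    }

    my $text = "BC001593 91 NPSL\nBC001593 262 NASS\n";

    # Source 1: a filehandle (here an in-memory one), read with while.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @from_handle;
    while (my $line = <$fh>) {
        push @from_handle, second_column($line);
    }
    close $fh;

    # Source 2: a list of lines, walked with for.
    my @from_list;
    for my $line (split /\n/, $text) {
        push @from_list, second_column($line);
    }

    print "handle: @from_handle\n";   # prints "handle: 91 262"
    print "list:   @from_list\n";     # prints "list:   91 262"
    ```

    Only the loop header changes between the two sources; the body is identical.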

    Personally I would use the same approach as you did, but that's irrelevant right now.

    What's more important to note is that none of the replies that used while (...) { ... } local()ized $_!

    ihb

      Well, the original message stated that the protein sequence is "just a string of letters". From that I drew the conclusion that the whole thing is stored in a scalar. On rereading, maybe that's not so obvious after all, and the misinterpretation is mine.

      Concerning the localizing of $_, that's the right thing to do in certain circumstances. However, we are showing snippets of code without context here, presenting general ideas, as you write yourself. We can't know whether it's reasonable to localize $_ or not; the author of the script has to decide that himself. You could point out, though, (as a general remark) that it might be a good idea to use local $_ if you suspect that the person asking might not be aware of that.

      Update: When I'm talking about 'certain circumstances' I'm thinking about short scripts that act like a unix filter, with one main while (<>) {} loop. Localizing makes no sense there (most of the time). However, ihb makes a very valid point for the more general case below.

      I'd agree that we mostly agree ;-)

      -- Hofmator

        Concerning the localizing of $_, that's the right thing to do in certain circumstances.

        I'd argue that almost always that's the right thing to do. I'd say that it's an exception not to localize $_, and so I really think that it should be put in demonstrative code--especially code targeting not-too-advanced Perl programmers. I'd go as far as putting a "Just do it unless you understand why you usually should do it and have a good reason not to" mark on this issue.

        The result of not localizing $_ is in practice the same as doing $_ = undef, unless something breaks out of the loop, because the loop continues until $_ is undefined. If you actually want $_ to be undefined after the loop then you're probably better off explicitly undefining $_ after the loop. Or you should probably ask yourself why you want to explicitly undefine it.
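
        A minimal sketch of the effect, assuming two made-up helper subs that read from an in-memory filehandle. The caller's for loop aliases $_ to each array element, so the non-localizing helper ends up overwriting the caller's data.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper that reads line by line WITHOUT localizing $_.
        sub count_lines_clobbering {
            my ($text) = @_;
            open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
            my $count = 0;
            while (<$fh>) { $count++ }   # assigns to the caller's $_
            close $fh;
            return $count;
        }

        # Same helper, but the caller's $_ is saved and restored on exit.
        sub count_lines_localized {
            my ($text) = @_;
            open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
            my $count = 0;
            local $_;
            while (<$fh>) { $count++ }
            close $fh;
            return $count;
        }

        my @clobbered = ('a', 'b');
        count_lines_clobbering("x\ny\n") for @clobbered;   # $_ aliases each element

        my @intact = ('a', 'b');
        count_lines_localized("x\ny\n") for @intact;

        printf "clobbered: %d undef of %d\n",
            scalar(grep { !defined } @clobbered), scalar @clobbered;
        print "intact: @intact\n";
        ```

        Here both elements of @clobbered end up undefined, while @intact still holds 'a' and 'b'.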

        Afaik (but I have no reliable source at the moment--and I started Perling right about v5.6's release, so I have no historical perspective of my own), foreach didn't use to localize its associated variable. At some point the porters (or whoever it was) decided that foreach really ought to localize its variable. And I think everyone agrees that's a good thing. Why while wasn't given the same treatment I can only speculate about. I find it somewhat likely, though, that constructs like the one below could've had something to do with it. (If such decision-making ever took place.)
        /pat/ or last while <FOO>;
        print;

        You could point out, though, (as a general remark) that it might be a good idea to use local $_

        ... and I sure will. ;) (Actually, this post isn't directly targeted towards you, as you seem to mostly agree with me.)

        This is such a serious topic it's even been mentioned in Sins of Perl Revisited.

        Personally I'm quite paranoid about modules that are likely to use this particular while loop. I always (unless I'm familiar with the author and know that such mistakes aren't likely to happen) check the source to make sure that the module won't make me scratch my head for hours and hours because of a destroyed $_. Actually, I entertained myself for a little while looking for places of improper non-localization of $_ in my Perl installation. Out of 1076 scanned .pm files I found 28 that matched the very simple pattern /while \s* \(? \s* </x. Out of these 28 modules, 13 (!) were not localizing $_ properly. I checked all 28 modules briefly to make sure that the match was indeed code (not in quotes, pod, or what-not), and I also tried to see whether $_ was localized in the current subroutine and, if not, whether it was used in some particular way. (In Pod::Functions the code was at file scope, but I still marked it as improper use.) But I don't claim to be perfect, so some cases below might have legitimate reasons for not localizing $_.
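
        For anyone who wants to repeat the experiment, here is a rough sketch of such a scan. It is not the script used for the numbers above: the helper names are made up, it takes the directories to search from the command line, and it only flags candidate files. Whether $_ is actually localized still has to be checked by hand.

        ```perl
        use strict;
        use warnings;
        use File::Find;

        # The "very simple pattern" quoted above.
        my $while_readline = qr/while \s* \(? \s* </x;

        # Hypothetical helper: true if the source text contains a candidate loop.
        sub has_while_readline {
            my ($source) = @_;
            return $source =~ $while_readline ? 1 : 0;
        }

        # Hypothetical driver: list the .pm files under the given
        # directories whose source matches the pattern.
        sub find_candidates {
            my @dirs = grep { -d $_ } @_;
            return () unless @dirs;
            my @hits;
            find(
                sub {
                    return unless -f $_ && /\.pm\z/;
                    open my $fh, '<', $_ or return;        # skip unreadable files
                    my $source = do { local $/; <$fh> };   # slurp the whole file
                    close $fh;
                    push @hits, $File::Find::name if has_while_readline($source);
                },
                @dirs,
            );
            return @hits;
        }

        # Directories to scan are taken from the command line.
        print "$_\n" for find_candidates(@ARGV);
        ```

        Run without arguments it prints nothing; passing one or more library directories lists the files worth a closer look.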

        (It seems like the authors were consistent. Those that localized $_ did that for all whiles, whereas those that didn't localize for one case didn't localize for any case.)

        ihb

        Curious about the 13 modules?