in reply to Efficiently Extracting a Range of Lines (was: Range This)

None of the answers so far correctly replicates the behaviour of the original code. I don't know whether this is really an issue, but multiple START and END blocks (sic! END and not STOP - there's a typo in the original post) aren't matched. To handle them I would suggest:

my @good_stuff = $stuff =~ /^START\n(.*?)^END$/gms;
# or alternatively with START and END around
my @good_stuff = $stuff =~ /(^START$.*?^END$)/gms;

This captures into the array @good_stuff, which can then be processed further. It is quite similar to the solution from tadman. Note the /m modifier, which lets ^ match just after a newline and $ match just before one inside the string. The /s modifier ensures that .*? matches everything, including newlines.
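
To make the multiple-block behaviour concrete, here is a minimal sketch run against a made-up input (the sample text and the names @inner and @whole are my own, not from the original post). The second pattern is written with \n after START, so that the $ anchor is never directly followed by a dot and there is no chance of Perl reading it as the line-number variable $.

```perl
use strict;
use warnings;

# Made-up sample input with two START/END blocks and noise around them.
my $stuff = join "\n",
    'noise before',
    'START', 'line 1', 'line 2', 'END',
    'noise between',
    'START', 'line 3', 'END',
    'noise after', '';

# One list element per block, without the START and END lines.
my @inner = $stuff =~ /^START\n(.*?)^END$/gms;

# One list element per block, START and END lines included.
my @whole = $stuff =~ /(^START\n.*?^END$)/gms;

printf "%d blocks found\n", scalar @inner;
print "[$_]\n" for @whole;
```

Both arrays end up with one element per block, so the noise lines outside the tags are dropped.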

Another issue is the benchmarks done in this thread. They don't show much on their own, because the outcome depends on the structure of the real data.

So to sum up, you should always run benchmark tests on some real data to get an impression of how different methods compare. Try different test strings, for both successful and failing matches.
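
As a rough sketch of what such a test could look like, the following uses the core Benchmark module on one string that matches and one that never finds an END. The two subs by_regex and by_lines, and both test strings, are hypothetical stand-ins for this example; in practice you would plug in the solutions from this thread and your real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test strings: one with many blocks, one with no END at all.
my $hit  = join '', map { "junk $_\nSTART\npayload $_\nEND\n" } 1 .. 200;
my $miss = "START\n" . join '', map { "payload $_\n" } 1 .. 400;

# Candidate 1: the regex from above.
sub by_regex {
    my ($text) = @_;
    my @blocks = $text =~ /^START\n(.*?)^END$/gms;
    return \@blocks;
}

# Candidate 2: a simple line-by-line scan (assumes every line ends in \n).
sub by_lines {
    my ($text) = @_;
    my (@blocks, $current);
    for my $line (split /^/m, $text) {
        if    ($line eq "START\n")                   { $current = '' }
        elsif ($line eq "END\n" && defined $current) { push @blocks, $current; undef $current }
        elsif (defined $current)                     { $current .= $line }
    }
    return \@blocks;
}

# Run each candidate for at least one CPU second per case and compare rates.
for my $case ([success => $hit], [failure => $miss]) {
    my ($name, $text) = @$case;
    print "--- $name case ---\n";
    cmpthese(-1, {
        regex => sub { by_regex($text) },
        lines => sub { by_lines($text) },
    });
}
```

The numbers you get say something only about these two strings, which is exactly the point: swap in data shaped like yours before drawing conclusions.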

-- Hofmator

(particle) Re: Re: Efficiently Extracting a Range of Lines (was: Range This)
by particle (Vicar) on Jul 11, 2001 at 19:27 UTC
    i'm curious why you overlooked my solution. although it may not be the "best" way, my solution was created from lessons i learned from posts like Code Smarter, and Death to Dot Star!

    although my solution breaks on multiple START/STOP tags, this requirement was not specified in the question. i would add this functionality for a more general solution, but i'd also need to know if it should handle nested tags or not.

    my solution will, however, match START/STOP tags anywhere in the input stream, as was specified by the code in the original post. it will work if the STOP tag does not exist, as i got from the original data (granted this might be a typo). and it matches the behaviour of including the START/STOP tags in the results. mine outputs a string instead of a list, but that is easily remedied with split, either in the return statements or outside the find_between_tags() function.

    ~Particle

      particle, I overlooked your solution on purpose ;-). That has absolutely nothing to do with the quality - all of them work fine on single START/END tags.

      I just was not sure how quick index is in comparison to the regex solutions. The other two are both regex and thus easy to compare. I thought index should be quicker than a regex on a fixed string:

      $i = index $stuff, 'START';
      # compared to
      $stuff =~ /START/;

      but the benchmark suggested otherwise. The difference might be because of the function overhead and the surrounding code in your solution. But I was too lazy to test this with some proper benchmarks ...
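
      For anyone less lazy, a minimal sketch of such an isolated test might look like this. The haystack is made up, and it only measures the bare search, without any of the surrounding code from particle's solution:

      ```perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Made-up haystack: the tag sits after a long run of filler lines.
      my $stuff = ("filler line\n" x 5_000) . "START\npayload\nEND\n";

      # Run each search for at least one CPU second and compare the rates.
      cmpthese(-1, {
          # Plain substring search; returns the offset, or -1 if absent.
          index => sub { my $i = index $stuff, 'START'; $i },
          # The same fixed string searched for as a regex.
          regex => sub { my $ok = $stuff =~ /START/; $ok },
      });
      ```

      Moving the tag to the front of the string, or removing it altogether, would be worth trying too, since the position of the match may change the picture.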

      -- Hofmator

Re: Re: Efficiently Extracting a Range of Lines (was: Range This)
by skazat (Chaplain) on Jul 11, 2001 at 20:39 UTC
    hmm, I didn't specify multiple start/stops in the string that will be given, but in the application this is very much a possibility (and is really the norm). After all this, I actually split the string I get at the END match, to make an array of these buggers.

    The test example was pretty simplified, but in the Real World use of this, different START and END patterns are used, and these START and END parts are needed later down the line.

    Thanks for all your help so far. I've found that this may not have been the crux of my script's slowness: even though this actual parsing of chunks 'o text isn't very optimized, the subroutine that it is housed in is never called more than it needs to be.

    but still, converse among yourselves :)

     

    -justin simoni
    !skazat!