Re: Efficiently Extracting a Range of Lines (was: Range This)

None of the answers so far correctly replicates the behaviour of the original code. I don't know if this is really an issue, but multiple START and END blocks aren't matched (sic! END and not STOP; there's a typo in the original post). To handle them I would suggest:

```
my @good_stuff = $stuff =~ /^START\n(.*?)^END$/gms;

# or alternatively with START and END around
my @good_stuff = $stuff =~ /(^START$.*?^END$)/gms;
```

This captures into the array @good_stuff, which can then be processed further. It is quite similar to the solution from tadman. Note the /m modifier, which lets ^ match just after a newline and $ match just before one inside the string. The /s modifier ensures that .*? matches everything, including newlines, and /g in list context returns the captures from every match.
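To see the multiple-block behaviour concretely, here is a minimal sketch. The sample data and the helper name extract_blocks are made up for illustration; only the regex is taken from the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the text between
# every START/END pair, without the marker lines themselves.
# Call it in list context; in scalar context the match only returns true/false.
sub extract_blocks {
    my ($stuff) = @_;
    # /g  collect the captures of every match (list context)
    # /m  let ^ and $ match at line boundaries inside the string
    # /s  let . match newlines as well
    return $stuff =~ /^START\n(.*?)^END$/gms;
}

# Made-up sample input containing two START/END blocks.
my $stuff = <<'EOF';
blah
START
first block
END
noise
START
second block
more of it
END
trailing
EOF

my @good_stuff = extract_blocks($stuff);
print scalar(@good_stuff), " blocks found\n";
print "---\n$_" for @good_stuff;
```

Each element of @good_stuff keeps its trailing newline, so a further split /\n/ gives the individual lines if those are needed.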

Another issue is the benchmarks done in this thread. On their own they don't show much, because the outcome depends on the structure of the real data.

The solution from tachyon - apart from breaking on multiple START/END combinations - is slow on something like
```
$stuff = <<EOF;
blah
START
a few lines
END
lots of lots of lines
EOF
```
as it has to backtrack from the end of the string.
Something like tadman's solution works well on the example above but is slower than tachyon's on this
```
$stuff = <<EOF;
blah
START
lots of lots of lines
END
blah
EOF
```
as non-greedy .*? matching advances more slowly than greedy .*: after every character it consumes, it has to try the rest of the pattern again.

So to sum up, you should always run benchmark tests on some real data to get an impression of how different methods compare. Try different test strings, both for cases where the match succeeds and for cases where it fails.
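A minimal sketch of such a comparison, using the core Benchmark module, might look like the following. The two regexes are only stand-ins for the greedy and non-greedy styles discussed above, not the exact solutions posted in this thread, and the data sizes and filler text are arbitrary assumptions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test data in the two shapes discussed above.
my $filler = "some line of text\n" x 10_000;
my %data = (
    short_block_long_tail => "blah\nSTART\na few lines\nEND\n" . $filler,
    long_block_short_tail => "blah\nSTART\n" . $filler . "END\nblah\n",
);

# Stand-in patterns: same anchors, differing only in greediness.
my %regex = (
    greedy     => qr/^START\n(.*)^END$/ms,
    non_greedy => qr/^START\n(.*?)^END$/ms,
);

for my $shape (sort keys %data) {
    my $stuff = $data{$shape};

    # Build one closure per pattern for this data shape.
    my %subs;
    for my $name (keys %regex) {
        my $re = $regex{$name};
        $subs{$name} = sub { my ($block) = $stuff =~ $re; return $block };
    }

    print "\n$shape\n";
    # A negative count asks Benchmark to run each sub for at least
    # that many CPU seconds instead of a fixed number of iterations.
    cmpthese(-1, \%subs);
}
```

Swapping in strings taken from the real input, including ones with no END line at all, should give a more honest picture than either synthetic shape.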

-- Hofmator


(particle) Re: Re: Efficiently Extracting a Range of Lines (was: Range This) by particle (Vicar) on Jul 11, 2001 at 19:27 UTC
i'm curious why you overlooked my solution. although it may not be the "best" way, my solution was created from lessons i learned from posts like Code Smarter, and Death to Dot Star! although my solution breaks on multiple START/STOP tags, this requirement was not specified in the question. i would add this functionality for a more general solution, but i'd also need to know if it should handle nested tags or not. my solution will, however, match START/STOP tags anywhere in the input stream, as was specified by the code in the original post. it will work if the STOP tag does not exist, as i gathered from the original data (granted this might be a typo). and it matches the behaviour of including the START/STOP tags in the results. mine outputs a string instead of a list, but that is easily remedied with split, either in the return statements or outside the find_between_tags() function. ~Particle
Re3: Efficiently Extracting a Range of Lines (was: Range This) by Hofmator (Curate) on Jul 11, 2001 at 19:45 UTC
particle, I overlooked your solution on purpose ;-). That has absolutely nothing to do with the quality: all of them work fine on single START/END tags. I just was not sure how quick index is compared with the regex solutions. The other two are both regex-based and thus easy to compare. I thought index should be quicker than a regex on a fixed string: `$i = index $stuff, 'START'; # compared to $stuff =~ /START/;` but the benchmark suggested otherwise. The difference might be due to the function overhead and the surrounding code in your solution. But I was too lazy to test this with some proper benchmarks ... -- Hofmator
Re: Re: Efficiently Extracting a Range of Lines (was: Range This) by skazat (Chaplain) on Jul 11, 2001 at 20:39 UTC
hmm, I didn't specify multiple start/stops in the string that will be given, but in the application this is very much a possibility (and is really the norm). After all this, I actually split the string I get at the END match to make an array of these buggers. The test example was pretty simplified; in the Real World use of this, different START and END patterns are used, and these START and END parts are needed later down the line. Thanks for all your help so far. I've found that this may not have been the crux of my script's slowness: even though this parsing of chunks 'o text isn't very optimized, the subroutine that houses it is never called more often than it needs to be. but still, converse among yourselves :) -justin simoni !skazat!