in reply to Re3: Regexes on Streams - Revisited!
in thread Regexes on Streams - Revisited!
If you want to attempt to deliver the promise, then one idea is the approach that I suggested at Re: Regexes on Streams, which the code above implements a variation on. If the programmer wants to give a potentially infinite match, you do your best to satisfy them. Perhaps it is qr/.*/ and life sucks. Perhaps it is /[\r\n](?:\s*[\r\n]|)/ (ie match end of line and any following blank lines, with either Unix or DOS line endings) and even though it is potentially infinite, with real data it is also pretty sensible and shouldn't be broken too easily. (Note: the $/ example that Dominus used in his chapter was potentially infinite...)
A sample program to play with is the following. Save it and feed it different buffer sizes. The end of record expression is greedy, of size 12. Yet from sizes of 1-17 only twice does it produce the result which Dominus' original description would lead you to expect. And in a longer data example, it would continue to mess up, and there is no simple way to say that it shouldn't.
And about using capturing parens into a terminator pattern, there is no real reason for wanting to do so, but it is incredibly easy to do by accident when not thinking about it. If you don't know a fair amount about regexes, you could take a good while to figure out why things look weird and what to do to fix it. (Confession: When writing the above sample code I originally had capturing parens and was somewhat surprised at the results. This is how I became aware of that issue...)#! /usr/bin/perl -w use strict; sub blocks { my $fh = shift; my $blocksize = shift || 8192; sub { return unless read $fh, my($block), $blocksize; return $block; } } sub records { my $blocks = shift; my $terminator = @_ ? shift : quotemeta($/); my @records; my ($buf, $finished) = (""); sub { while (@records == 0 && ! $finished) { if (defined(my $block = $blocks->())) { $buf .= $block; my @newrecs = split /($terminator)/, $buf; while (@newrecs > 2) { push @records, shift(@newrecs).shift(@newrecs); } $buf = join "", @newrecs; } else { @records = $buf; $finished = 1; } } return shift(@records); } } my $iter_block = blocks(\*DATA, shift || 10); my $iter_record = records($iter_block, "(?:foo\n)+"); while (my $record = $iter_record->()) { print "GOT A RECORD:\n$record\n"; } __DATA__ hello foo foo foo world foo foo foo !
|
---|
Replies are listed 'Best First'. | |
---|---|
Re5: Regexes on Streams - Revisited!
by Hofmator (Curate) on Oct 16, 2003 at 14:49 UTC |