
You don't see it as buggy that how $/ matches a record depends, in some rather complex way, on the layout of the data and your buffer size, with the only guarantee being that it works as expected if you suck in the whole stream? It isn't enough to say, "The program does what it is coded to." If you are going to promise to allow people to use regexes to match input, then make some attempt to do it consistently, and if you can't, then at least make some attempt to fail in an easily explainable way. For instance I can be OK with, "I can produce strange results if $/ is supposed to match more than one block." If you can't deliver it, then don't even seem to promise it.

If you want to attempt to deliver on the promise, then one idea is the approach that I suggested at Re: Regexes on Streams, which the code above implements a variation on. If the programmer wants to give a potentially infinite match, you do your best to satisfy them. Perhaps it is qr/.*/ and life sucks. Perhaps it is /[\r\n](?:\s*[\r\n]|)/ (i.e., match end of line and any following blank lines, with either Unix or DOS line endings) and even though it is potentially infinite, with real data it is also pretty sensible and shouldn't be broken too easily. (Note: the $/ example that Dominus used in his chapter was potentially infinite...)
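
To make that second pattern concrete, here is a minimal sketch. The sample strings and the terminator_length helper are made up for illustration; only the regex comes from the paragraph above. It shows the pattern swallowing a DOS line ending or a run of blank lines as a single terminator, and also how the same terminator comes out shorter when the buffer happens to stop in the middle of it.

```perl
use strict;
use warnings;

# End of line plus any following blank lines, Unix or DOS line endings.
my $eol = qr/[\r\n](?:\s*[\r\n]|)/;

# Hypothetical helper: length of the first terminator match in $text,
# or 0 if the pattern does not match at all.
sub terminator_length {
  my ($text) = @_;
  return $text =~ $eol ? $+[0] - $-[0] : 0;
}

# One DOS line ending, then a Unix line ending followed by blank lines.
my $sample = "one\r\ntwo\n\n  \nthree\n";
print join(",", split $eol, $sample), "\n";   # prints "one,two,three"

# The full terminator after "two" is "\n\n  \n" (5 characters) ...
print terminator_length("two\n\n  \nthree"), "\n";
# ... but if the buffer ends partway through it, the match is shorter.
print terminator_length("two\n\n "), "\n";
```

The last two lines are the whole problem in miniature: the regex engine cannot know that more of the terminator is still sitting unread in the stream.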

A sample program to play with is the following. Save it and feed it different buffer sizes. The end-of-record expression is greedy, and in this data each terminator it should match is 12 characters long. Yet for buffer sizes from 1 to 17, only twice does it produce the result which Dominus' original description would lead you to expect. And in a longer data example, it would continue to mess up, and there is no simple way to say when it shouldn't.

#! /usr/bin/perl -w
use strict;

sub blocks {
  my $fh = shift;
  my $blocksize = shift || 8192;
  sub {
    return unless read $fh, my($block), $blocksize;
    return $block;
  }
}

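# Returns an iterator that turns a block iterator into a record iterator.
# $terminator is a regex source string; it defaults to the literal value of $/.
sub records {
  my $blocks = shift;
  my $terminator = @_ ? shift : quotemeta($/);
  my @records;
  my ($buf, $finished) = ("");
  sub {
    while (@records == 0 && ! $finished) {
      if (defined(my $block = $blocks->())) {
        $buf .= $block;
        # The capturing parens make split return the terminators as well.
        my @newrecs = split /($terminator)/, $buf;
        # Emit each record together with its terminator. Whatever is left
        # over (at most one record and one terminator) goes back into $buf.
        while (@newrecs > 2) {
          push @records, shift(@newrecs).shift(@newrecs);
        }
        $buf = join "", @newrecs;
      } else {
        # End of input: whatever remains is the final record.
        @records = $buf;
        $finished = 1;
      }
    }
    return shift(@records);
  }
}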

my $iter_block = blocks(\*DATA, shift || 10);
my $iter_record = records($iter_block, "(?:foo\n)+");
while (my $record = $iter_record->()) {
  print "GOT A RECORD:\n$record\n";
}

__DATA__
hello
foo
foo
foo
world
foo
foo
foo
!
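
If you would rather not rerun the script by hand for every buffer size, something like the following sketch may help. It copies blocks() and records() from the program above and adds two helpers that are not part of the original, records_for_size and same_list, which read the same data through an in-memory filehandle and compare each result against what you get when the whole stream fits in one block. It only reports matches and mismatches; treat it as a convenience for experimenting, not as a fix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# blocks() and records() behave as in the sample program above.
sub blocks {
  my $fh = shift;
  my $blocksize = shift || 8192;
  sub {
    return unless read $fh, my($block), $blocksize;
    return $block;
  }
}

sub records {
  my $blocks = shift;
  my $terminator = @_ ? shift : quotemeta($/);
  my @records;
  my ($buf, $finished) = ("");
  sub {
    while (@records == 0 && ! $finished) {
      if (defined(my $block = $blocks->())) {
        $buf .= $block;
        my @newrecs = split /($terminator)/, $buf;
        while (@newrecs > 2) {
          push @records, shift(@newrecs).shift(@newrecs);
        }
        $buf = join "", @newrecs;
      } else {
        @records = $buf;
        $finished = 1;
      }
    }
    return shift(@records);
  }
}

# Hypothetical helper: collect every record produced for one block size,
# reading $data through an in-memory filehandle.
sub records_for_size {
  my ($data, $blocksize, $terminator) = @_;
  open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
  my $next = records(blocks($fh, $blocksize), $terminator);
  my @got;
  while (defined(my $record = $next->())) {
    push @got, $record;
  }
  return @got;
}

# Hypothetical helper: true if two lists of strings are identical.
sub same_list {
  my ($x, $y) = @_;
  return 0 unless @$x == @$y;
  for my $i (0 .. $#$x) {
    return 0 if $x->[$i] ne $y->[$i];
  }
  return 1;
}

my $data = "hello\nfoo\nfoo\nfoo\nworld\nfoo\nfoo\nfoo\n!\n";
my $terminator = "(?:foo\n)+";

# Reference result: the whole stream arrives in a single block.
my @expected = records_for_size($data, length($data), $terminator);

for my $size (1 .. 17) {
  my @got = records_for_size($data, $size, $terminator);
  printf "blocksize %2d: %d record(s), %s\n", $size, scalar(@got),
    same_list(\@got, \@expected) ? "same as whole-stream result" : "DIFFERS";
}
```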

And about using capturing parens in a terminator pattern: there is no real reason for wanting to do so, but it is incredibly easy to do by accident when you are not thinking about it. If you don't know a fair amount about regexes, you could take a good while to figure out why things look weird and what to do to fix it. (Confession: When writing the above sample code I originally had capturing parens and was somewhat surprised at the results. This is how I became aware of that issue...)
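
For anyone who has not been bitten by this yet, here is a small sketch of what goes wrong. The pieces helper and the sample buffer are made up for illustration; the split call is the same shape as the one in records(). Because split returns every capture group for each separator it finds, an extra pair of capturing parens inside the terminator sneaks an extra field into the list.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on a terminator pattern while keeping
# the terminators, the same way records() does.
sub pieces {
  my ($terminator, $text) = @_;
  return split /($terminator)/, $text;
}

my $buf = "hello\nfoo\nfoo\nworld\n";

# Non-capturing group: record, terminator, remainder.
my @plain = pieces("(?:foo\n)+", $buf);
print scalar(@plain), " fields with (?:...)\n";     # prints "3 fields with (?:...)"

# Capturing group: the inner capture (the last "foo\n" it matched)
# shows up as a field of its own.
my @captured = pieces("(foo\n)+", $buf);
print scalar(@captured), " fields with (...)\n";    # prints "4 fields with (...)"
```

In records() that stray field shifts the record/terminator pairing, so the extra "foo\n" ends up glued onto the front of whatever is left in the buffer. Writing the group as (?:...) avoids it.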

In reply to Re: Re3: Regexes on Streams - Revisited! by tilly
in thread Regexes on Streams - Revisited! by tsee
