Usage of regular expressions in input separator

archer has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Usage of regular expressions in input separator by Eliya (Vicar) on Dec 30, 2011 at 12:32 UTC
No, this is not possible, as mentioned in the docs: "Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)"
Re: Usage of regular expressions in input separator by JavaFan (Canon) on Dec 30, 2011 at 12:51 UTC
No, it's not possible. I expect the reason is that, with some patterns, you may have to read the entire input stream up to EOF, and then put the excess back on the input stream if there turns out to be no longer match. (Suppose your delimiter is `$/ = /(.).*\1/`.) That may be a costly operation, especially in memory usage, or your process could just "hang" forever (if it's trying to read all of standard input, or reading from a (bidirectional) pipe or network socket).

I can see the point, but I would be willing to pay the price. Sure, in degenerate cases it would be costly (so, don't do that), but in practice people would use patterns (like the one you gave) that only require a limited lookahead. But it's too late in the game to change $/ from a fixed string to a pattern, nor does it seem to be an itch of any of the active porters. So, I don't expect this to change any time soon.
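To make the lookahead concern concrete, here is a small sketch (not taken from any reply in this thread) that feeds input in pieces and checks where the example delimiter would match in everything buffered so far. The piece strings are made up purely for illustration.

```perl
use strict;
use warnings;

# The example delimiter: a character, anything, then the same character again.
my $delim = qr/(.).*\1/s;

# Made-up input, arriving in three pieces.
my @pieces = ('abca', 'xyz', 'a!');

my $buffer = '';
my @match_ends;
for my $piece (@pieces) {
    $buffer .= $piece;
    next unless $buffer =~ $delim;
    # @- and @+ hold the start and end offsets of the last successful match.
    my $matched = substr($buffer, $-[0], $+[0] - $-[0]);
    push @match_ends, $+[0];
    print "buffered '$buffer': delimiter would be '$matched'\n";
}
```

The delimiter stays at 'abca' after the second piece but grows to 'abcaxyza' once the third arrives, so a reader could not safely commit to any match before EOF. A pattern such as `Separator \d+`, by contrast, only needs to see one character past the last digit.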
Re: Usage of regular expressions in input separator by NetWallah (Canon) on Dec 30, 2011 at 14:49 UTC
Depending on how complex your requirements are, you may be able to use Stream::Reader to match one of multiple delimiters, and accomplish your task. Although it does not support regular expressions, in your example case you could use `map {"Separator $_"} 0..9` as your delimiter list.

"Battle not with trolls, lest ye become a troll; and if you gaze into the Internet, the Internet gazes also into you." -Friedrich Nietzsche: A Dynamic Translation
Re: Usage of regular expressions in input separator by hyvatti (Initiate) on Nov 27, 2024 at 07:43 UTC
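With PerlIO::via you can add a layer that converts whatever you want to line feeds. For example, if you want to treat both CR and LF as line feeds:

    # PerlIO/via/normeol.pm
    package PerlIO::via::normeol;

    # PerlIO::via documents PUSHED as mandatory; return an object for the layer.
    sub PUSHED { my ($class) = @_; return bless {}, $class; }

    # Read one character from the layer below and turn CR into LF.
    sub FILL {
        my ($obj, $fh) = @_;
        my $c;
        my $n = read($fh, $c, 1);
        return undef unless $n;
        $c =~ tr/\r/\n/;
        return $c;
    }
    1;

    # main script
    use PerlIO::via::normeol;
    open(A, "<:via(normeol)", "foo.bar");
    while (<A>) { ...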
Re: Usage of regular expressions in input separator by AnomalousMonk (Archbishop) on Dec 31, 2011 at 19:11 UTC
Quite by accident, I happened on a discussion in Dominus's Higher-Order Perl (free PDF download) of the "make `$/` a regex" question in section 8.1.1, "Emulating the <> Operator".
Re: Usage of regular expressions in input separator by ww (Archbishop) on Dec 30, 2011 at 15:45 UTC
Since you, Robin Hood, provided no sample data nor indication of your required output, this is a WAG... but it may provide a workaround... or some ideas for one.

    #!/usr/bin/perl
    use Modern::Perl;
    use Data::Dumper;

    # 945627
    # Workaround if the distinction among elements in each data
    # segment need not be retained; if retention
    # is required, read DATA into a HoA with the
    # separator-and-its-following-digit(s) as keys.

    say "\n\t \$/ is a string, not a regex,"
      . "\n\t so, using an input_separator without any regex metachar \n";

    $/ = "FOO";
    my @newarr;
    my @arr = <DATA>;

    for my $item (@arr) {
        $item =~ s/\n//sg;
        # The trailing FOO is optional because the final record has none.
        if ( $item =~ /^\d+(.+?)(?:FOO)?$/s ) {
            my $out = $1;
            push @newarr, $out;
        } else {
            say "\t Discarding $item (ie, \$arr[0])";  # discarding the initial "FOO" in $arr[0]
        }
    }
    print Dumper @newarr;

    =head1 OUTPUT

             $/ is a string, not a regex,
             so, using an input_separator without any regex metachar

             Discarding FOO (ie, $arr[0])
    $VAR1 = 'abcdefghi';
    $VAR2 = 'jkl-123-';
    $VAR3 = 'mnopqrstu';
    $VAR4 = 'vwxyz';

    =cut

    __DATA__
    FOO0
    abc
    def
    ghi
    FOO1
    jkl
    -123-
    FOO2
    mno
    pqr
    stu
    FOO3
    vwxy
    z

Of course, it's also possible that this has no bearing on your problem... :-(
Re: Usage of regular expressions in input separator by TJPride (Pilgrim) on Dec 30, 2011 at 16:06 UTC
You could read in the whole file and regex on that, but I'm assuming that's not something you want to do. You could read in chunks and look for the separator that way, but what if the separator crosses the chunk barrier? For instance, if you're matching on `Separator \d+` and the barrier splits it into `Separator 2|3` instead of `Separator 23`. That's no good. Lastly, if your file spans multiple lines, this is fairly easy using a line-by-line technique:

    use strict;
    use warnings;
    use Data::Dumper;

    my $data = '';
    my @records;
    open(my $fh, '<', 'data.txt') or die "Cannot open data.txt: $!";
    while (<$fh>) {
        $data .= $_;
        # Peel off every complete record that now precedes a separator.
        push @records, $1 while $data =~ s/(.*?)Separator \d+//s;
    }
    push @records, $data;    # whatever follows the last separator

    print Dumper(\@records);

Data: `Record A Separator 9 Record B Separator 10 Record C Separator 11 Record D`

Output: `$VAR1 = [ 'Record A ', ' Record B ', ' Record C ', ' Record D' ];`
Re: Usage of regular expressions in input separator by jdrago999 (Pilgrim) on Dec 30, 2011 at 22:31 UTC
It would be slick if we could:

    $/ = sub {
        my ($line) = @_;
        $line =~ m{Separator\s+\d+};
    };
Re^2: Usage of regular expressions in input separator by afoken (Chancellor) on Dec 31, 2011 at 13:13 UTC
> It would be slick if we could: `$/ = sub { my ($line) = @_; $line =~ m{Separator\s+\d+}; };`

Yes, but the whole point of `$/` is to make lines or records from the bits in a file. So there is no "line" before `$/` ...

You could feed the code ref with chunks of a file, but even that would not be sufficient. Imagine a file with a two-byte record separator (e.g. that old CR-LF from DOS). The first chunk ends with the first byte of the record separator (i.e. CR), the second chunk begins with the second byte of the record separator (i.e. LF). Unless you manage to maintain some state information, you would not be able to detect the record separator.

That state information has to be per file handle, or else you mix data from different files. So you cannot use global or `state` variables, unless you also pass the handle to the code ref and use it to index arrays or hashes with status data.

Alexander
-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
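One way to keep that state per file handle is a closure, sketched below. Everything here is an assumption made for illustration: `make_record_reader` is a hypothetical helper, not a core or CPAN function; the separator pattern must never match the empty string and must not use lookaround; and `$max_sep` is a caller-supplied upper bound on the separator's length. A match is accepted only once at least `$max_sep` characters from its start are buffered (or EOF has been reached), so a separator split across two chunks is never cut short.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator yielding one record per call,
# or undef once the input is exhausted.
#   $fh        - handle to read from
#   $sep_re    - compiled separator pattern (assumed non-empty, no lookaround)
#   $max_sep   - assumed upper bound on the separator length, in characters
#   $chunk_len - how much to request per read() call
sub make_record_reader {
    my ($fh, $sep_re, $max_sep, $chunk_len) = @_;
    $chunk_len ||= 8192;
    my $buffer = '';    # state private to this handle's closure
    my $eof    = 0;

    return sub {
        while (1) {
            if ($buffer =~ $sep_re) {
                my ($start, $end) = ($-[0], $+[0]);
                # Commit only if later input could no longer change the match.
                if ($eof || $start + $max_sep <= length $buffer) {
                    my $record = substr($buffer, 0, $start);
                    substr($buffer, 0, $end) = '';
                    return $record;
                }
            }
            if ($eof) {
                return unless length $buffer;
                my $record = $buffer;    # trailing data after the last separator
                $buffer = '';
                return $record;
            }
            # Append the next chunk to the end of the buffer.
            my $got = read($fh, $buffer, $chunk_len, length $buffer);
            die "read failed: $!" unless defined $got;
            $eof = 1 if $got == 0;
        }
    };
}

# Usage: an in-memory handle and a tiny chunk size, so that separators
# straddle chunk boundaries.
my $data = "Record A Separator 9 Record B Separator 10 Record C";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $next = make_record_reader($fh, qr/Separator \d+/, 16, 4);

my @records;
while (defined(my $record = $next->())) {
    push @records, $record;
}
print "[$_]\n" for @records;
```

Because an empty record is a legitimate return value, callers should test with `defined` rather than truth. The sketch rescans the buffer after every read, which is simple but not fast; a long record with no separator still has to be held in memory in full.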
Re^3: Usage of regular expressions in input separator by jdrago999 (Pilgrim) on Jan 04, 2012 at 23:39 UTC
Thanks for giving my "wouldn't it be cool if..." the full treatment. In the meantime, I suppose we'll have to:

    open my $ifh, '<', $filename
      or die "Cannot open '$filename' for reading: $!";
    local $/;
    foreach my $chunk ( split /Separator\s+\d+/, scalar(<$ifh>) ) {
      # yay chunk!
    }

Unfortunately this will not do well for very large files. We'd have to check against the regexp as each byte is read into memory.

    # I might be way off-base here:
    no warnings 'uninitialized';
    my $pattern  = qr{Separator\s\d+};
    my $callback = sub { warn "Chunk: @_" };
    binmode($ifh);
    my $buffer = '';
    # Read one byte at a time. sysread's fourth argument is an offset into
    # the target scalar, not into the file, so none is passed here.
    while ( sysread($ifh, my $byte, 1) ) {
        $buffer .= $byte;
        # Fires as soon as the first digit arrives, so "Separator 23"
        # is cut after the "2".
        if ( $buffer =~ $pattern ) {
            $callback->( $buffer );
            $buffer = '';
        }
    }

Even that won't work correctly, and it would be really, really slow. I only wrote it here for the sake of discussion.