Re: Usage of regular expressions in input separator
by Eliya (Vicar) on Dec 30, 2011 at 12:32 UTC
|
No, this is not possible — as mentioned in the docs:
"Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)"
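A minimal sketch of what the quoted sentence means in practice, assuming an in-memory filehandle and a hypothetical helper named records_with_separator (neither is from the docs): a value of $/ that looks like a regex is still matched character for character.

```perl
use strict;
use warnings;

# Hypothetical helper: read all records from a string using the given $/.
sub records_with_separator {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = $separator;
    my @records = <$fh>;
    chomp @records;    # chomp strips the current value of $/ from each record
    return @records;
}

# 'Sep \d+' is taken literally (backslash, "d", plus sign), so it never
# matches "Sep 1" or "Sep 2" and the whole input comes back as one record.
my @literal = records_with_separator("a Sep 1 b Sep 2 c", 'Sep \d+');

# A plain string separator splits the input as expected.
my @plain = records_with_separator("a Sep b Sep c", 'Sep');

print scalar(@literal), " record(s) with 'Sep \\d+'\n";    # 1 record(s)
print scalar(@plain),   " record(s) with 'Sep'\n";         # 3 record(s)
```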
Re: Usage of regular expressions in input separator
by JavaFan (Canon) on Dec 30, 2011 at 12:51 UTC
|
No, it's not possible.
I expect the reason is that, with some patterns, you may have to read the entire input stream up to EOF, and then put data back on the input stream if no longer match turns up. (Suppose your delimiter is $/ = /(.).*\1/.) That may be a costly operation (especially in memory usage) -- or your process could just "hang" forever (if it's trying to read all your standard input, or reading from a (bidirectional) pipe or network socket).
I can see the point, but I would be willing to pay the price. Sure, in degenerate cases it would be costly (so, don't do that), but in practice people would use patterns (like the one you gave) that only require a limited lookahead.
But it's too late in the game to change $/ from a fixed string to a pattern -- nor does it seem to be an itch of any of the active porters. So, I don't expect this to change any time soon.
Re: Usage of regular expressions in input separator
by NetWallah (Canon) on Dec 30, 2011 at 14:49 UTC
|
Depending on how complex your requirements are, you may be able to use Stream::Reader to match one of multiple delimiters, and accomplish your task.
Although it does not support regular expressions, in your example case, you could use
map {"Separator $_"} 0..9
as your delimiter list.
"Battle not with trolls, lest ye become a troll; and if you gaze into the Internet, the Internet gazes also into you."
-Friedrich Nietzsche: A Dynamic Translation
Re: Usage of regular expressions in input separator
by hyvatti (Initiate) on Nov 27, 2024 at 07:43 UTC
|
With PerlIO::via you can add a layer that converts whatever you want to line feeds. For example, if you want to accept CR and LF as line feeds:
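package PerlIO::via::normeol;
# PerlIO::via calls PUSHED when the layer is pushed onto a handle;
# it has to return an object (or the class name).
sub PUSHED
{
    my ($class, $mode, $fh) = @_;
    return bless {}, $class;
}
# FILL returns the next piece of input, or undef at end of file.
sub FILL
{
    my ($obj, $fh) = @_;
    my ($c);
    my $n = read($fh, $c, 1);    # one character per call: simple but slow
    return undef unless $n;
    $c =~ tr/\r/\n/;             # CR becomes LF, so a CR-LF pair yields an extra empty line
    return $c;
}
1;
# The package above goes in PerlIO/via/normeol.pm; the calling script then does:
use PerlIO::via::normeol;
open(A, "<:via(normeol)", "foo.bar") or die "Cannot open foo.bar: $!";
while (<A>) {
    ...
}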
Re: Usage of regular expressions in input separator
by AnomalousMonk (Archbishop) on Dec 31, 2011 at 19:11 UTC
|
Re: Usage of regular expressions in input separator
by ww (Archbishop) on Dec 30, 2011 at 15:45 UTC
|
Since you, Robin Hood, provided no sample data nor indication of your required output, this is a WAG... but it may provide a workaround... or some ideas for one.
#!/usr/bin/perl
use Modern::Perl;
use Data::Dumper;
#945627 Workaround if the distinction among elements in each data
# segment need not be retained; if retention
# is required, read DATA into a HoA with the
# separator-and-its-following-digit(s) as keys.
say "\n\t \$/ is a string, not a regex," .
"\n\t so, using an input_separator without any regex metachar \n";
$/ = "FOO";
my @newarr;
my @arr = <DATA>;
for my $item (@arr) {
    $item =~ s/\n//sg;
    if ( $item =~ /^\d+(.+?)(?:FOO)*$/s ) {
        my $out = $1;
        push @newarr, $out;
    } else {
        say "\t Discarding $item (ie, \$arr[0])"; # discarding the initial "FOO" in $arr[0]
    }
}
print Dumper @newarr;
=head1 OUTPUT
$/ is a string, not a regex,
so, using an input_separator without any regex metachar
Discarding FOO (ie, $arr[0])
$VAR1 = 'abcdefghi';
$VAR2 = 'jkl-123-';
$VAR3 = 'mnopqrstu';
$VAR4 = 'vwxyz';
=cut
__DATA__
FOO0
abc
def
ghi
FOO1
jkl
-123-
FOO2
mno
pqr
stu
FOO3
vwxy
z
Of course, it's also possible that this has no bearing on your problem... :-(
Re: Usage of regular expressions in input separator
by TJPride (Pilgrim) on Dec 30, 2011 at 16:06 UTC
|
You could read in the whole file and regex on that, but I'm assuming that's not something you want to do. You could read in chunks and look for the separator that way, but what if the separator crosses a chunk boundary? For instance, if you're matching on Separator \d+, the boundary might split Separator 23 into Separator 2|3, and you'd match Separator 2 and leave a stray 3 at the start of the next record. That's no good. Lastly, if your file spans multiple lines and a separator never spans a line break, this is fairly easy using a line-by-line technique:
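use strict;
use warnings;
use Data::Dumper;
my ($data, @records);
open(my $fh, '<', 'data.txt') or die "Cannot open data.txt: $!";
while (<$fh>) {
    $data .= $_;
    # Peel off every record that is already followed by a separator.
    push @records, $1
        while $data =~ s/(.*?)Separator \d+//s;
}
close $fh;
# Whatever follows the last separator is the final record.
push @records, $data;
print Dumper(\@records);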
Data:
Record A Separator 9 Record B
Separator 10 Record C Separator 11
Record D
Output:
$VAR1 = [
'Record A ',
' Record B
',
' Record C ',
'
Record D'
];
Re: Usage of regular expressions in input separator
by jdrago999 (Pilgrim) on Dec 30, 2011 at 22:31 UTC
|
$/ = sub {
    my ($line) = @_;
    $line =~ m{Separator\s+\d+};
};
|
|
It would be slick if we could:
$/ = sub {
    my ($line) = @_;
    $line =~ m{Separator\s+\d+};
};
Yes, but the whole point of $/ is to make lines or records from the bits in a file. So there is no "line" before $/ ...
You could feed the code ref with chunks of a file, but even that would not be sufficient. Imagine a file with a two-byte record separator (e.g. that old CR-LF from DOS). The first chunk ends with the first byte of the record separator (i.e. CR), the second chunk begins with the second byte of the record separator (i.e. LF). Unless you manage to maintain some state information, you would not be able to detect the record separator.
That state information has to be per file handle, or else you mix data from different files. So you cannot use global or state variables, unless you also pass the handle to the code ref and use it to index arrays or hashes with status data.
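To make the CR-LF case concrete, here is a rough sketch of per-handle state, assuming a hash keyed by the handle's address. The names %pending, feed_chunk, finish_handle and read_records are made up for illustration; nothing like them exists in Perl itself.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my %pending;    # leftover bytes per handle, keyed by the handle's address

# Feed one chunk that was read from $fh; return the records completed so far.
sub feed_chunk {
    my ($fh, $chunk) = @_;
    my $buffer = \$pending{ refaddr($fh) };
    $$buffer = '' unless defined $$buffer;
    $$buffer .= $chunk;
    my @records;
    # A CR at the very end of the buffer simply stays there until the next
    # chunk shows whether an LF follows it.
    push @records, $1 while $$buffer =~ s/\A(.*?)\r\n//s;
    return @records;
}

# Call at end of file: whatever is left over is the final record.
sub finish_handle {
    my ($fh) = @_;
    my $rest = delete $pending{ refaddr($fh) };
    return ( defined $rest && length $rest ) ? ($rest) : ();
}

# Read $fh in fixed-size chunks and collect all CR-LF separated records.
sub read_records {
    my ($fh, $chunk_size) = @_;
    my @records;
    while ( read( $fh, my $chunk, $chunk_size ) ) {
        push @records, feed_chunk( $fh, $chunk );
    }
    push @records, finish_handle($fh);
    return @records;
}

my $text = "one\r\ntwo\r\nthree";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
binmode $fh;
# With 4-byte chunks the first chunk is "one\r", so CR and LF are split.
my @records = read_records( $fh, 4 );
print "$_\n" for @records;
```

The state is deleted in finish_handle, since an address can be reused once a handle has been destroyed.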
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
|
|
Thanks for giving my "wouldn't it be cool if..." the full treatment.
In the meantime, I suppose we'll have to:
open my $ifh, '<', $filename
    or die "Cannot open '$filename' for reading: $!";
local $/;    # slurp mode: the next read returns the whole file
foreach my $chunk ( split /Separator\s+\d+/, scalar(<$ifh>) ) {
    # yay chunk!
}
Unfortunately this will not do well for very large files. We'd have to check against the regexp as each byte is read into memory.
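# I might be way off-base here:
no warnings 'uninitialized';
my $pattern  = qr{Separator\s\d+};
my $callback = sub { warn "Chunk: @_" };
binmode($ifh);
my $buffer = '';
# sysread's optional 4th argument is an offset into the target scalar,
# not into the file, so it is left out here.
while ( sysread($ifh, my $byte, 1) ) {
    $buffer .= $byte;
    if ( $buffer =~ $pattern ) {
        # Fires at the first digit, so "Separator 10" is cut after the "1",
        # and the matched separator is still part of $buffer.
        $callback->( $buffer );
        $buffer = '';
    }
}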
Even that won't work correctly, and it would be really, really slow.
I only wrote it here for the sake of discussion.
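For the sake of the same discussion, here is a rough sketch of a chunked reader that gets further with this particular pattern. The name make_record_reader and the chunk size are made up. It assumes the separator never matches the empty string, and that a match ending before the end of the buffer cannot be changed by later data. That holds for Separator\s+\d+ but not for every regex, and memory still grows without bound if the separator never shows up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator that yields one record per call
# and undef once the input is exhausted.
sub make_record_reader {
    my ($fh, $separator, $chunk_size) = @_;
    $chunk_size ||= 8192;
    my $buffer = '';
    my $eof    = 0;

    return sub {
        while (1) {
            # Accept a match only if it ends before the end of the buffer,
            # or if there is no more input; otherwise more data might
            # lengthen it ("Separator 1" could still become "Separator 10").
            if ( $buffer =~ $separator and ( $eof or $+[0] < length $buffer ) ) {
                my $record = substr $buffer, 0, $-[0];
                substr $buffer, 0, $+[0], '';    # drop record and separator
                return $record;
            }
            if ($eof) {
                return unless length $buffer;
                my $record = $buffer;            # final record, no separator
                $buffer = '';
                return $record;
            }
            my $got = read $fh, my $chunk, $chunk_size;
            die "Read error: $!" unless defined $got;
            if ($got) { $buffer .= $chunk } else { $eof = 1 }
        }
    };
}

my $text = "Record A Separator 9 Record B\nSeparator 10 Record C Separator 11\nRecord D";
open my $in, '<', \$text or die "Cannot open in-memory handle: $!";

# A tiny chunk size forces separators to straddle chunk boundaries.
my $next = make_record_reader( $in, qr/Separator\s+\d+/, 5 );
my @records;
while ( defined( my $record = $next->() ) ) {
    push @records, $record;
}
print scalar(@records), " records\n";
```

Keeping the buffer inside the closure gives each handle its own state without any global bookkeeping.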