When is an ^$ not a ^$ ?

Clownburner has asked for the wisdom of the Perl Monks concerning the following question:

OK, I've searched for this one and I'm coming up blank, so I submit to the Wiser Ones for help.

I am trying to split a text file into many smaller files, breaking it down by blank lines. I have isolated the problem to a split function, but I'm baffled why this wouldn't work.

The function looks like this:

{
open (FILE,"<$input_file") or die "Couldn't open file $input_file, $!\n";
local $/ = undef;
$slurp = <FILE>;
close FILE;
}
@data = split (/\n{2,}/,$slurp);

Once I've got the data into an array, the rest is cake and works fine.

Now, if I use an input file I create on a *NIX machine, it works flawlessly as written. But, if the input file comes from a Windows machine, it doesn't work.

Now before you go clicking "--" and screaming about newline conversion, let me tell you what I've tried...

If I first run the file through VI and do a :%s/^$/-==-/g and then change the SPLIT to cut on that string, it works fine. If I change the SPLIT to /^$/ it doesn't work, everything ends up in @data[0]. I've also tried:

split (/^\s+$/,$slurp);
split (/\r\n\r\n/,$slurp);
split (/\0x0D\0x0A/,$slurp);
split (/(?:\cM\cJ){2,}/,$slurp);
split (/^\s+$|^\n+$|^$/,$slurp);

...And a few others too desperate to mention. None of these have any effect on this intractable file; the whole file always ends up in @data[0]. My CTS is acting up, so I'm probably missing something really obvious here, or there's some not-well-documented way to accomplish this on DOS/Windows text files that I can't figure out. The really confusing thing is that vi can do it easily and Perl can't. Any clues???

Signature void where prohibited by law.

2001-04-05 Edit by Corion : Corrected title
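One of the patterns listed in the question, `/\0x0D\0x0A/`, cannot match a CRLF pair regardless of the file: Perl reads `\0` as an octal escape for a NUL byte and treats the following `x0D` as literal text. The hex escape is spelled `\x0D`. A minimal sketch of the difference (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $crlf = "\r\n";

# \x0D\x0A is the hex-escape spelling of CR LF.
my $hex_matches  = ($crlf =~ /\x0D\x0A/)   ? 1 : 0;

# \0x0D means: NUL byte, then the literal characters "x0D".
my $typo_matches = ($crlf =~ /\0x0D\0x0A/) ? 1 : 0;

# The mistyped pattern only matches a string that really contains
# NUL + "x0D" + NUL + "x0A".
my $literal         = "\0x0D\0x0A";
my $literal_matches = ($literal =~ /\0x0D\0x0A/) ? 1 : 0;

print "hex=$hex_matches typo=$typo_matches literal=$literal_matches\n";
# prints: hex=1 typo=0 literal=1
```

This explains only that one attempt; it says nothing about why the other patterns failed on the problem file.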


Re: When is an `^$` not a `^$`? by dws (Chancellor) on Apr 05, 2001 at 04:27 UTC
You can further simplify by reading the description of `$/` in perlvar. Therein you'll discover the magic of setting it to "", which allows 1 or more blank lines to delimit paragraphs, allowing you to write:

open(FILE,"<$input_file") or die "$input_file: $!\n";
{ local $/ = ""; @data = <FILE> }
close(FILE);
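A caveat worth sketching: paragraph mode looks for consecutive `"\n"` characters, so if the CRs of a Windows file reach Perl untranslated (as they do when the file is read on a *NIX box), a blank line is `"\r\n"` and never looks empty. The sketch below assumes it is acceptable to delete every CR first; `read_paragraphs` is a hypothetical helper, and it reads from an in-memory filehandle only to stay self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into paragraphs via paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    $text =~ tr/\r//d;    # assumption: every CR is part of a line ending
    open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!\n";
    local $/ = "";        # paragraph mode: blank lines end a record
    my @paragraphs = <$fh>;
    close($fh);
    return @paragraphs;
}

my @unix = read_paragraphs("one\ntwo\n\nthree\n");
my @dos  = read_paragraphs("one\r\ntwo\r\n\r\nthree\r\n");

print scalar(@unix), " ", scalar(@dos), "\n";    # prints: 2 2
```

Each record keeps its trailing newlines, so a `chomp(@paragraphs)` under the same `local $/ = ""` may be wanted afterwards.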
Re: When is an `^$` not a `^$`? by chromatic (Archbishop) on Apr 05, 2001 at 04:39 UTC
I'd go with dws on this one, $/ is a nice thing to be localizing, and often. On the other hand, there are a couple of other approaches that are possible in this case. The first is to use my super-friend, the transliteration operator: `$file =~ tr/\r//d;` The other idea that comes to mind is a more complicated regex as an argument to split. I don't see this used often, which is a pity: `my @lines = split(/(?:\r?\n){2,}/, $file);` Take these with a grain of salt, because they could require some tweaking depending on your circumstances. Just remember that hammers in Perl usually have GPS and homing as well.
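To make the second suggestion concrete, here is a small sketch that runs the `(?:\r?\n){2,}` split over a made-up string mixing CRLF and LF endings. The sample data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Made-up sample: CRLF blank line, then a run of bare LFs.
my $file = "a1\r\na2\r\n\r\nb1\nb2\n\n\nc1\r\n";

# Two or more line endings in a row, each either CRLF or LF.
my @chunks = split /(?:\r?\n){2,}/, $file;

print scalar(@chunks), " chunks\n";    # prints: 3 chunks

# CRs inside a chunk survive the split; drop them if they are unwanted.
tr/\r//d for @chunks;
```

The single line endings inside each chunk are untouched by the split, which is why the final `tr` pass is still needed for CRLF input.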
Re: Re: When is an `^$` not a `^$`? by Clownburner (Monk) on Apr 05, 2001 at 06:10 UTC
Another great suggestion! I tried both of those, but neither worked at all. Same results as before - everything got pushed into `$data[0]`. :-( I also tried dws' suggestion, with no results. I don't know what's wrong with this evil, twisted text file, but I can tell you it's got me up to HERE right now. And yet, 5 seconds in vi fixes it perfectly. I'm baffled. A hex editor sees this where the newlines would be: `0d 0a 0d 0a` Any clues? Signature void where prohibited by law.
Re: Re: Re: When is an `^$` not a `^$`? by jeroenes (Priest) on Apr 05, 2001 at 13:21 UTC
I'm pretty baffled that all these things don't work. Can you put your problem files online maybe? 0d is hex for \r, 0a is hex for \n. I made myself a test file by: `perl -e 'print "aa\r\nab\r\n\r\naa\nbb\n"' >test` Then I tried to get items out of it by: `perl -e '$/="\r\n\r\n"; print "\tMy item: $_" while (<>);' <test` This gave me the following output: `My item: aa ab My item: aa bb` So that seems to work. I don't see why this would work and the others not, so I tried another one: `perl -e 'undef $/;@a=split(/\r\n\r\n/,<>); print join "\t My item: ", @a;' <test` Which gave a similar result. There must be something special to your case. Hope this helps a bit, Jeroen "We are not alone"(FZ)
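When a file behaves differently from a hand-made test case like the one above, it can help to count what is actually in it. The following is a rough diagnostic sketch, not something proposed in the thread: `count_line_endings` is a hypothetical helper that tallies CRLF, bare LF and bare CR sequences, and the `unpack` line reproduces a hex-editor view of the bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: tally each kind of line ending in a byte string.
sub count_line_endings {
    my ($bytes) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # Alternation order matters: try CRLF before a lone LF or CR.
    while ($bytes =~ /(\r\n|\n|\r)/g) {
        my $kind = $1 eq "\r\n" ? 'crlf' : $1 eq "\n" ? 'lf' : 'cr';
        $count{$kind}++;
    }
    return \%count;
}

# Same bytes as the test file built with the one-liner above.
my $sample = "aa\r\nab\r\n\r\naa\nbb\n";
my $tally  = count_line_endings($sample);
printf "CRLF=%d LF=%d CR=%d\n", @{$tally}{qw(crlf lf cr)};
# prints: CRLF=3 LF=2 CR=0

# Hex view of a CRLF blank-line boundary, as a hex editor would show it.
my $hex = join ' ', unpack('(H2)*', "\r\n\r\n");
print "$hex\n";    # prints: 0d 0a 0d 0a
```

For a real file, the string would come from a slurp with `binmode` applied to the handle, so that no I/O layer translates the line endings before they are counted.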
Re: Re: Re: Re: When is an `^$` not a `^$`? by Clownburner (Monk) on Apr 05, 2001 at 21:47 UTC
Re:{5} When is an `^$` not a `^$`? by jeroenes (Priest) on Apr 06, 2001 at 00:02 UTC
(tye)Re: When is an ^$ not a ^$ ? by tye (Sage) on Apr 05, 2001 at 20:17 UTC
One thing I didn't see anyone else cover was that: `split (/^\s+$/,$slurp);` should instead be `split (/^\s+$/m,$slurp);` so that ^ and $ can match at line endings in the middle of the string. - tye (but my friends call me "Tye")
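The effect of `/m` is easy to see on a tiny CRLF string. This is only a sketch with made-up data: without `/m` the `^` can match solely at the start of the string, so `split` finds no separator at all.

```perl
use strict;
use warnings;

# Made-up sample: two records separated by one CRLF blank line.
my $slurp = "one\r\n\r\ntwo\r\n";

my @without_m = split /^\s+$/,  $slurp;   # ^ only matches at offset 0
my @with_m    = split /^\s+$/m, $slurp;   # ^ and $ match at inner line breaks

print scalar(@without_m), " vs ", scalar(@with_m), "\n";   # prints: 1 vs 2
```

In this sample the `/m` version consumes only the `"\r"` of the blank line, so the second piece starts with a stray `"\n"`. Some cleanup of the pieces is still likely to be needed.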
Re: When is an `^$` not a `^$`? by voyager (Friar) on Apr 05, 2001 at 04:53 UTC
You might want to look at this node to see the help I got on the carriage return / line feed / new line trickiness.
Re: When is an ^$ not a ^$ ? by Xxaxx (Monk) on Apr 05, 2001 at 12:41 UTC
If the existence of the \r in the Windows files is the only thing buggering the works up then simply removing the \r should do the trick. `$slurp =~ s/\r//g;` or (from chromatic): `$slurp =~ tr/\r//d;` Do this right after you "slurp" the contents in. Then all should run the same for Linux flavors and Windows flavored files. Claude
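Put together with the original slurp-and-split code, that advice might look like the sketch below. `split_on_blank_lines` is a hypothetical helper, and the temporary file exists only so the example can run on its own. It assumes every CR in the file belongs to a line ending.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file, drop CRs, split on blank lines.
sub split_on_blank_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Couldn't open file $path: $!\n";
    binmode($fh);                          # read the bytes untranslated
    my $slurp = do { local $/; <$fh> };    # undef $/ slurps the whole file
    close($fh);
    $slurp =~ tr/\r//d;                    # right after the slurp, as advised
    return split /\n{2,}/, $slurp;
}

# Build a small Windows-style file to try it on.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "first\r\nrecord\r\n\r\nsecond\r\n";
close($out);

my @data = split_on_blank_lines($path);
print scalar(@data), " records\n";         # prints: 2 records
```

The `do { local $/; ... }` block keeps the change to `$/` confined to the slurp, much as the bare block in the original question does.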
Re: When is an `^$` not a `^$`? by ton (Friar) on Apr 05, 2001 at 04:21 UTC
Don't make things harder than they have to be! Just do this:

open (FILE,"<$input_file") or die "Couldn't open file $input_file, $!\n";
while (<FILE>) {
    chomp;
    if ($_) {
        $entry .= "$_\n";
    } else {
        push @data, $entry;
        $entry = "";
    }
}
close FILE;
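Three details of the loop above are worth a hedged sketch. `chomp` removes only the current `$/` (normally `"\n"`), so a CRLF blank line read on a *NIX box is left as `"\r"` and still tests true. A line containing just `0` tests false. And the final entry is never pushed if the file does not end with a blank line. The variant below is one possible way to handle all three, with `group_lines` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: collect runs of non-empty lines into entries.
sub group_lines {
    my ($fh) = @_;
    my @data;
    my $entry = '';
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip LF or CRLF, unlike chomp
        if (length $line) {              # length, so a line "0" still counts
            $entry .= "$line\n";
        }
        elsif (length $entry) {          # skip runs of extra blank lines
            push @data, $entry;
            $entry = '';
        }
    }
    push @data, $entry if length $entry; # flush the last entry
    return @data;
}

# In-memory handle with made-up CRLF data, including a "0" line.
my $text = "a\r\n0\r\n\r\n\r\nb\r\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!\n";
my @entries = group_lines($fh);
close($fh);

print scalar(@entries), " entries\n";    # prints: 2 entries
```

Taking a filehandle rather than a filename is a design choice made here so the same sub can be fed a real file or an in-memory string.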
Re: Re: When is an `^$` not a `^$`? by Clownburner (Monk) on Apr 05, 2001 at 06:02 UTC
I dunno about this. When I tried it that way, the first blank line I hit knocked me out of my while loop... Signature void where prohibited by law.
Re: When is an ^$ not a ^$? by Clownburner (Monk) on Apr 05, 2001 at 04:15 UTC
Sorry about the weird title - preview behaves differently than submit in this particular case. :-P Signature void where prohibited by law.