Clownburner has asked for the wisdom of the Perl Monks concerning the following question:

OK, I've searched for this one and I'm coming up blank, so I submit to the Wiser Ones for help.

I am trying to split a text file into many smaller files, breaking it down by blank lines. I have isolated the problem to a split function, but I'm baffled why this wouldn't work.

The function looks like this:
{
    open (FILE,"<$input_file") or die "Couldn't open file $input_file, $!\n";
    local $/ = undef;
    $slurp = <FILE>;
    close FILE;
}
@data = split (/\n{2,}/,$slurp);
Once I've got the data into an array, the rest is cake and works fine.

Now, if I use an input file I create on a *NIX machine, it works flawlessly as written. But, if the input file comes from a Windows machine, it doesn't work.

Now before you go clicking "--" and screaming about newline conversion, let me tell you what I've tried...

If I first run the file through vi and do a :%s/^$/-==-/g and then change the split to cut on that string, it works fine. If I change the split to /^$/ it doesn't work; everything ends up in $data[0]. I've also tried:

...And a few others too desperate to mention. None of these have any effect on this intractable file; the whole file always ends up in $data[0]. My CTS is acting up, so I'm probably missing something really obvious here, or there's some not-well-documented way to accomplish this on DOS/Windows text files that I can't figure out. The really confusing thing is that vi can do it easily and Perl can't. Any clues?


Signature void where prohibited by law.

2001-04-05 Edit by Corion : Corrected title

Re: When is an ^$ not a ^$?
by dws (Chancellor) on Apr 05, 2001 at 04:27 UTC
    You can further simplify by reading the description of $/ in perlvar. Therein you'll discover the magic of setting it to "", which allows one or more blank lines to delimit paragraphs, allowing you to write:
    open(FILE,"<$input_file") or die "$input_file: $!\n";
    { local $/ = ""; @data = <FILE> }
    close(FILE);
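
A small sketch of how paragraph mode treats the two kinds of line endings may help here. The sample strings and the read_paragraphs helper are made up for illustration, and the strings are read through in-memory filehandles so no real file is needed. On a perl that does no CRLF translation (a typical *NIX build), the CRLF sample should come back as a single record, because "\r\n\r\n" never contains two consecutive "\n" characters.

```perl
use strict;
use warnings;

# Hypothetical helper: read paragraph-separated records from a string
# through an in-memory filehandle.
sub read_paragraphs {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "in-memory open failed: $!\n";
    local $/ = "";    # paragraph mode: runs of blank lines end a record
    my @records = <$fh>;
    close($fh);
    return @records;
}

# Made-up sample data: the same layout with LF and with CRLF endings.
my @unix = read_paragraphs("aa\nab\n\n\nbb\ncc\n");
my @dos  = read_paragraphs("aa\r\nab\r\n\r\nbb\r\ncc\r\n");

printf "unix: %d records, dos: %d records\n", scalar @unix, scalar @dos;
```

If the CRLF count stays at one, stripping the \r characters before relying on paragraph mode is one way around it.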
Re: When is an ^$ not a ^$?
by chromatic (Archbishop) on Apr 05, 2001 at 04:39 UTC
    I'd go with dws on this one, $/ is a nice thing to be localizing, and often.

    On the other hand, there are a couple of other approaches that are possible in this case. The first is to use my super-friend, the transliteration operator:

    $file =~ tr/\r//d;

    The other idea that comes to mind is a more complicated regex as an argument to split. I don't see this used often, which is a pity:

    my @lines = split(/(?:\r?\n){2,}/, $file);

    Take these with a grain of salt, because they could require some tweaking depending on your circumstances. Just remember that hammers in Perl usually have GPS and homing as well.
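
Both ideas above can be tried side by side on a small string. This is only a sketch: $file here holds made-up CRLF data rather than the real problem file, and the variable names other than $file are assumptions.

```perl
use strict;
use warnings;

# Made-up CRLF sample with one blank line, then two blank lines.
my $file = "aa\r\nab\r\n\r\nbb\r\ncc\r\n\r\n\r\ndd\r\n";

# Approach 1: split on two or more line endings, each with an optional \r.
# The \r characters inside each chunk are left in place.
my @by_regex = split /(?:\r?\n){2,}/, $file;

# Approach 2: delete every \r from a copy, then split on plain blank lines.
(my $clean = $file) =~ tr/\r//d;
my @by_tr = split /\n{2,}/, $clean;

printf "regex: %d chunks, tr: %d chunks\n", scalar @by_regex, scalar @by_tr;
```

The tr version also normalizes the line endings inside each chunk, which may or may not be what you want when writing the pieces back out.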

      Another great suggestion!

      I tried both of those, but neither worked at all. Same results as before - everything got pushed into $data[0]. :-(

      I also tried dws' suggestion, with no results.

      I don't know what's wrong with this evil, twisted text file, but I can tell you it's got me up to HERE right now.

      And yet, 5 seconds in vi fixes it perfectly. I'm baffled.

      A hex editor sees this where the newlines would be:
      0d 0a 0d 0a

      Any clues?


      Signature void where prohibited by law.
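
The same byte check can be done from Perl itself, which also shows what the program actually received after any I/O translation. A minimal sketch, where hex_bytes is a hypothetical helper name and $sample is made-up data rather than the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes of a string as space-separated
# two-digit hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $sample = "aa\r\n\r\nbb";    # made-up data with one CRLF blank line
print hex_bytes($sample), "\n";
# prints "61 61 0d 0a 0d 0a 62 62"
```

Running this on the slurped string around a supposed blank line would reveal any stray spaces or tabs (20 or 09) hiding between the line endings.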
        I'm pretty baffled that all these things don't work. Can you put your problem files online maybe?

        0d is hex for \r, 0a is hex for \n. I made myself a test file with:

        perl -e 'print "aa\r\nab\r\n\r\naa\nbb\n"' >test

        Then I tried to get the items out of it with:

        perl -e '$/="\r\n\r\n"; print "\tMy item: $_" while (<>);' <test

        This gave me the following output:

        My item: aa ab My item: aa bb

        So that seems to work. I don't see why this would work and the others not, so I tried another one:

        perl -e 'undef $/;@a=split(/\r\n\r\n/,<>); print join "\t My item: ", @a;' <test

        Which gave a similar result. There must be something special about your case.

        Hope this helps a bit,

        Jeroen
        "We are not alone"(FZ)

(tye)Re: When is an ^$ not a ^$ ?
by tye (Sage) on Apr 05, 2001 at 20:17 UTC

    One thing I didn't see anyone else cover was that: split (/^\s+$/,$slurp); should instead be split (/^\s+$/m,$slurp); so that ^ and $ can match at line endings in the middle of the string.

            - tye (but my friends call me "Tye")
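
To see the difference the /m modifier makes, here is a minimal sketch on made-up CRLF data ($slurp is a sample string, not the real file). Without /m, ^ can only match at the very start of the string and $ only at the end (or just before a final newline), so the pattern never matches in the middle of a slurped file.

```perl
use strict;
use warnings;

# Made-up CRLF sample with a single blank line in the middle.
my $slurp = "aa\r\nab\r\n\r\nbb\r\ncc\r\n";

# No /m: ^ and $ are anchored to the ends of the whole string.
my @no_m = split /^\s+$/, $slurp;

# With /m: ^ and $ can also match at line boundaries inside the string.
# Here \s+ matches the lone \r on the blank line, so the pieces keep
# some line-ending characters attached.
my @with_m = split /^\s+$/m, $slurp;

printf "without /m: %d piece(s), with /m: %d piece(s)\n",
    scalar @no_m, scalar @with_m;
```

Note that \s+ needs at least one whitespace character on the "blank" line, so on a pure LF file a single empty line would need /^\s*$/m or a pattern like /\n{2,}/ instead.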
Re: When is an ^$ not a ^$?
by voyager (Friar) on Apr 05, 2001 at 04:53 UTC
    You might want to look at this node to see the help I got on the carriage return / line feed / new line trickiness.
Re: When is an ^$ not a ^$ ?
by Xxaxx (Monk) on Apr 05, 2001 at 12:41 UTC
    If the existence of the \r in the Windows files is the only thing buggering the works up then simply removing the \r should do the trick.
    $slurp =~ s/\r//g;
    or (from chromatic):
    $slurp =~ tr/\r//d;
    Do this right after you "slurp" the contents in. Then everything should run the same for Linux-flavored and Windows-flavored files. Claude
Re: When is an ^$ not a ^$?
by ton (Friar) on Apr 05, 2001 at 04:21 UTC
    Don't make things harder than they have to be! Just do this:
    open (FILE,"<$input_file") or die "Couldn't open file $input_file, $!\n";
    while (<FILE>) {
        chomp;
        if ($_) {
            $entry .= "$_\n";
        } else {
            push @data, $entry;
            $entry = "";
        }
    }
    close FILE;
      I dunno about this. When I tried it that way, the first blank line I hit knocked me out of my while loop...
      Signature void where prohibited by law.
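
For what it's worth, a variation on the line-by-line loop that tolerates CRLF endings might look like the sketch below. It is not ton's code: the split_on_blank_lines name, the in-memory sample data, and three changes are assumptions made for illustration. It strips an optional \r along with the newline, tests the line with length (so a line containing only "0" is not mistaken for a blank one), and pushes the final entry after the loop.

```perl
use strict;
use warnings;

# Hypothetical helper: collect blank-line-separated entries from a filehandle.
sub split_on_blank_lines {
    my ($fh) = @_;
    my @data;
    my $entry = '';
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # strip an LF or CRLF line ending
        if (length $line) {
            $entry .= "$line\n";           # non-blank line: extend the entry
        }
        elsif (length $entry) {
            push @data, $entry;            # blank line closes a non-empty entry
            $entry = '';
        }
    }
    push @data, $entry if length $entry;   # keep the last entry too
    return @data;
}

# Made-up CRLF sample, including a line that is just "0".
my $text = "aa\r\nab\r\n\r\n0\r\n\r\n\r\ncc\r\n";
open(my $fh, '<', \$text) or die "in-memory open failed: $!\n";
my @data = split_on_blank_lines($fh);
close($fh);

print scalar(@data), " entries\n";
```

Passing a filehandle opened on the real input file instead of the in-memory string should work the same way.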
Re: When is an ^$ not a ^$?
by Clownburner (Monk) on Apr 05, 2001 at 04:15 UTC
    Sorry about the weird title - preview behaves differently than submit in this particular case. :-P
    Signature void where prohibited by law.