neodymium has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file where sentences have been wrapped around to the next line and the space between the words has been removed... I'm attempting to fix it (append the wrapped line to the previous line where applicable) using the following code:
#!perl.exe
my $file="infile.txt";

//stick infile in $contents
open (FILE, $file) or die "Can't open $file: $!\n";
select((select(FILE), $/ = undef)[0]);
my $contents = <FILE>;
close (FILE);

$contents =~ s/(\w)\r\n/($1) /mg; //attempt to fix wrapped
                                  //sentences
open(OUTFILE, "<outfile.txt");    //stick result in outfile
print OUTFILE $contents;
close OUTFILE;
what I'm hoping for is that stuff like this:
this is a wrapped
line.
"This is a wrapped
quote."
this first line is fine.
so is this one.
will be changed to:
this is a wrapped line.
"This is a wrapped quote."
This first line is fine.
So is this one.
but it's not working at all... the output file always seems to contain no characters at all... take pity on me... and help me fix this dismal kludge before I sleep. Thank you in advance.
-----NOTE---
GAK I'm a moron... code is fixed now... and I'm going to sleep...

Replies are listed 'Best First'.
Re: replacement of newlines
by McDarren (Abbot) on Apr 14, 2006 at 07:10 UTC
    If you assume that a line can only end with a period (.) or a double-quote ("), then the following seems to give the desired output:
    #!/usr/bin/perl -w
    use strict;

    undef $/;
    my $text = <DATA>;
    $text =~ s/([^."])\n/$1 /g;
    print "$text";

    __DATA__
    this is a wrapped
    line.
    "This is a wrapped
    quote."
    this first line is fine.
    so is this one.
    Prints:
    this is a wrapped line.
    "This is a wrapped quote."
    this first line is fine.
    so is this one.
    Cheers,
    Darren :)
Re: replacement of newlines
by spiritway (Vicar) on Apr 14, 2006 at 08:00 UTC

    First of all, I can't even see how you managed to get this code to run. You've got C++ comments in there. Assuming this was a typo caused by sleep deprivation, we move on to the next issue:

    use warnings;

    use strict;

    Had you done that, you'd have found that your output file wasn't open when you tried to write to it. This could have saved you some tearing out of hair. So use something like open(OUTFILE, ">", "outfile.txt");
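For illustration, here is a minimal sketch of opening the output file for writing, assuming the three-argument form of open with a lexical filehandle. The helper name write_file is hypothetical and not from the original post; the file name outfile.txt is taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): write a string to a file,
# replacing whatever the file held before.
sub write_file {
    my ($path, $contents) = @_;

    # '>' opens for writing. The original "<outfile.txt" opens read-only,
    # so nothing printed to that handle ever reaches the file.
    open(my $out, '>', $path) or die "Can't open $path for writing: $!\n";
    print {$out} $contents    or die "Can't write to $path: $!\n";
    close($out)               or die "Can't close $path: $!\n";
    return;
}

write_file('outfile.txt', "this is a wrapped line.\n");
```

Checking the return value of open, print and close means a failure is reported immediately instead of leaving an empty file behind.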

    Next we've got the regex. You were on the right track, but you were trying to do too much. All you needed for the newline marker was \n. However, there were a couple of other problems. The parentheses around the $1 would be added to your text; you need to eliminate them. Finally, the regex was slightly off. Basically, you wanted to capture any characters, up to the newline. If there was a period, you wanted to ignore that (not make the substitution), but you still needed to capture it. So you get:

    $contents =~ s/(.*?[^\.])\n/$1/mg;

    I hope you're able to sleep now... but knowing programmers, you'll probably think to yourself, "just one more little change here...".

    Update: McDarren's regex is correct; mine lacks the double-quote, which would create problems with the second item.
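As a rough sketch of the points above, the substitution can be tried on an in-memory string. It uses the character class from McDarren's reply and puts a space back after $1. The helper name unwrap_lines is hypothetical, and the positions of the line breaks in the sample text are assumed. It also assumes that every complete line ends in a period or a double quote and that the text contains no blank lines, since the class [^."] matches a newline too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join a wrapped line to the line that follows it.
sub unwrap_lines {
    my ($text) = @_;

    # Capture the character before the newline unless it is a period or a
    # double quote, then put it back followed by a space instead of "\n".
    # Note: no parentheses around $1 in the replacement; they would be
    # inserted into the text literally.
    $text =~ s/([^."])\n/$1 /g;
    return $text;
}

# Sample input; where the lines wrap is an assumption.
my $sample = join "\n",
    'this is a wrapped', 'line.',
    '"This is a wrapped', 'quote."',
    'this first line is fine.',
    'so is this one.', '';

print unwrap_lines($sample);
```

With only the period excluded from the class, the newline after the closing quote would also be replaced, which is the problem with the second item mentioned in the update.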

Re: replacement of newlines
by japhy (Canon) on Apr 14, 2006 at 15:04 UTC
    select((select(FILE), $/ = undef)[0]); is cargo-cultism. All you need is $/ = undef; (or more safely, local $/;).
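A minimal sketch of the local $/ approach, assuming the read is wrapped in a small sub so that the change to $/ ends when the sub returns. The helper name slurp_file is hypothetical; infile.txt is the file name from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into a single string.
sub slurp_file {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Can't open $path: $!\n";

    # local sets $/ to undef only until this sub returns, so reads
    # elsewhere in the program still stop at each newline.
    local $/;
    my $contents = <$in>;
    close($in);
    return $contents;
}

my $file = 'infile.txt';
print length(slurp_file($file)), " characters read\n" if -e $file;
```

A plain $/ = undef; works as well, but it stays in effect for the rest of the program unless it is reset by hand.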

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart