bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have a number of text files that I need to fix up, and doing this by hand is just way too tedious when Perl exists to do just this. I have long paragraphs that sometimes end in the middle of a sentence, and I need each paragraph to break at the end of a sentence instead. For example:

This is a sample sentence to format. This is a sample sentence to format. This is a sample sentence to format. This is a sample

sentence to format. This is a sample sentence to format. This is a sample sentence to format. This is a sample sentence to format.

I have no idea how to go about this but I need to add the incomplete sentence fragment from the previous paragraph to the beginning of the next paragraph.

Any ideas on how to go about this?

EDIT: I was not really clear about the final output that I needed, so I am clarifying that here. The help given so far works great, except that it produces different output than I need.

Here is what I need the output to look like after processing:

This is a sample sentence to format. This is a sample sentence to format. This is a sample sentence to format. This is a sample sentence to format.

This is a sample sentence to format. This is a sample sentence to format. This is a sample sentence to format.

Re: fixing paragraphs
by TedPride (Priest) on Jun 17, 2005 at 07:36 UTC
    You have to remove all whitespace from the beginning and end of each line. Then, wherever a line break follows a character that is not an end-of-sentence marker (. ? !), the break is replaced with a space.

    EDIT: Oops, forgot the g flag, and also forgot to exclude \n itself in the character class. Fixed the problem.

    EDIT: Oops again. I really must have been tired when I wrote this! As jwest points out via PM, I was only removing white space from the beginning OR end of the lines, not both at the same time. Fixed the problem.

    use strict;
    use warnings;

    @_ = <DATA>;
    s/^\s+|\s+$//g for @_;
    $_ = join "\n", @_;

    ##### END OF LINE MARKERS
    s/([^\.\?!\n])\n+/$1 /g;
    ##### ADD TO AS NECESSARY

    print;

    __DATA__
    This line breaks
    in the middle.
    This one doesn't.
    Neither does this!
    Or this?
    I hope.

      This actually works great. But I should have been clearer in my original post about the output format that I needed. I am editing it now to add the final output that I am looking for. As I do not understand the code you provided, except to say that it works, I do not know how to alter it to do what I want.
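
For readers in the same position, the technique described above (trim each line, then turn a line break into a space unless it follows a sentence terminator) can be spelled out step by step. This is only an annotated sketch of that technique, not a replacement for the posted code; the sub name `join_broken_lines` and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): join lines that were broken
# in mid-sentence, keeping the breaks that follow . ? or !
sub join_broken_lines {
    my @lines = @_;

    # Strip leading and trailing whitespace, including the newline.
    s/^\s+|\s+$//g for @lines;

    # Put the lines back together with exactly one newline between them.
    my $text = join "\n", @lines;

    # A run of newlines preceded by anything other than . ? ! or a
    # newline is replaced by a single space.
    $text =~ s/([^.?!\n])\n+/$1 /g;

    return $text;
}

# Made-up sample input for illustration.
my @input = (
    "This line breaks in\n",
    "   the middle.  \n",
    "This one doesn't.\n",
);
print join_broken_lines(@input), "\n";
```

Note that a blank line following an unfinished sentence is swallowed as well, so this joins paragraphs rather than moving the break, which is why it does not produce the output requested in the edited question.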
Re: fixing paragraphs
by sh1tn (Priest) on Jun 17, 2005 at 06:39 UTC
Re: fixing paragraphs
by mda2 (Hermit) on Jun 17, 2005 at 18:57 UTC
    You can use Text::Wrap to do it.

    Sample:
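
The sample originally posted here is not part of this text. As a stand-in, a minimal sketch of a Text::Wrap call might look like the following; it is not mda2's code, and the column width and the text are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Assumed width for illustration; the module's default is 76.
$Text::Wrap::columns = 40;

# Made-up text: the sample sentence from the question, three times.
my $text = "This is a sample sentence to format. " x 3;

# wrap(INITIAL_INDENT, SUBSEQUENT_INDENT, TEXT...) returns the text
# re-broken so that no line is longer than $columns - 1 characters.
my $wrapped = wrap('', '', $text);
print $wrapped, "\n";
```

This re-breaks lines at word boundaries to fit a width; it does not look at sentence boundaries.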

    --
    Marco Antonio
    Rio-PM

      That is not a bad idea and I will remember this for the future.

      But the text to format is several long paragraphs where I just want to join the incomplete sentences together. I do not need to format or wrap the text, as I am going to end up importing this into LaTeX/LyX for formatting the final output.

Re: fixing paragraphs
by graff (Chancellor) on Jun 20, 2005 at 02:18 UTC
    Based on your update, it looks like you are trying to do two things: remove a paragraph break when it occurs in the middle of a sentence, and add a paragraph break somewhere else where there wasn't one originally.

    If that's the case, you need some sort of rule for saying where paragraph breaks need to be added. (The rule for breaks that need to be removed is clear enough.) Maybe the rule is more like a single move rather than a delete plus an insert? That is, if a break is found in mid-sentence, move it to the end of that sentence. This should also be simple.

    Here's a slightly different approach that uses the special Perl variable $/ (input record separator) to read a whole paragraph at a time, assuming that paragraph breaks are consistently marked by one or more blank lines:

    #!/usr/bin/perl
    use strict;

    my $Usage = "Usage: $0 filename.txt > fixed.txt\n";
    die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

    $/ = '';  # empty string means blank lines mark end-of-record (cf. perldoc perlvar)

    my @pars = <>;  # read all paragraphs into @pars

    my $sterm = qr/[.!?][)"']*/;  # regex for end-of-sentence

    for ( my $i = 0; $i < $#pars; $i++ )  # skip last paragraph
    {
        next if ( $pars[$i] =~ /$sterm\s*$/ );

        # get here when paragraph ends in mid-sentence

        my $j = $i + 1;  # refer to next par for tail part of sentence
        ( my $tail ) = ( $pars[$j] =~ /(.*?$sterm)\s*/ );
        $pars[$i] =~ s/\s*$/ $tail\n\n/;  # add tail to current par
        $pars[$j] =~ s/\Q$tail\E\s*//;    # remove it from next par
    }
    print @pars;

    (Note that the end-of-sentence pattern allows for "quoted and/or parenthesized sentences.")
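
To see the two ingredients in isolation, here is a small sketch that reads paragraphs in paragraph mode from an in-memory string and applies the same end-of-sentence pattern. The input text is made up for illustration, and reading from a string instead of a file is an assumption made only to keep the example self-contained.

```perl
use strict;
use warnings;

# Made-up input: two paragraphs separated by a blank line; the first
# one stops in mid-sentence.
my $text = "First sentence. The second one stops\n\n"
         . "right here (\"or so.\") And a third.\n";

my @pars;
{
    local $/ = '';    # paragraph mode: blank lines end a record
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    @pars = <$fh>;
    close $fh;
}

my $sterm = qr/[.!?][)"']*/;    # same end-of-sentence pattern as above

printf "%d paragraphs\n", scalar @pars;
for my $par (@pars) {
    my $state = $par =~ /$sterm\s*$/ ? 'complete' : 'mid-sentence';
    print "$state: $par";
}

# The leading sentence tail of the second paragraph; the closing quote
# and parenthesis are captured along with the period.
my ($tail) = $pars[1] =~ /(.*?$sterm)\s*/;
print "tail: $tail\n";
```

Using `local` inside a block keeps the change to `$/` from leaking into the rest of a larger program.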
      Actually, that is precisely what I am looking for. The help of the monks here is pretty stellar and your time is very much appreciated!