tom2112 has asked for the wisdom of the Perl Monks concerning the following question:

OK, I've been away from Perl programming for some time - and I was never that experienced with it anyway. Well, I came across a situation that seems to be begging for some Perl code to solve it, and I can't remember how to do what I was thinking. Maybe some kind monk here can point me in the right direction.

The problem is I have a ton of text files (ebooks, but in plain text) that need to be cleaned up. The biggest problem with the files is that the text is not wrapped: it contains a new line character after every 60 or 70 characters or so. This makes the "paragraphs" in the text actually a collection of single lines. Also, the paragraphs are usually separated by a couple of new line characters in a row.

I would like to do a search and replace on the file as a whole, removing all single new line characters.

I recall that there is a way to tell Perl to not look at the new line character as the end of a record, so that the whole file could be globbed into one string for processing, then written back out easily. But I can't remember where I read that, or what it was called.

Does anyone know what I'm talking about?

I'm also game for hearing suggestions on how best to approach the logic of this code. (I'm not asking people to write it - just for tips on good tactics to use to approach it.)

Thanks in advance,
Tom

Replies are listed 'Best First'.
Re: Can't remember the term to search for help on! (slurp)
by toolic (Bishop) on Dec 04, 2009 at 17:59 UTC
    I recall that there is a way to tell Perl to not look at the new line character as the end of a record, so that the whole file could be globbed into one string for processing, then written back out easily. But I can't remember where I read that, or what it was called. Does anyone know what I'm talking about?
    Slurp. Read about the INPUT_RECORD_SEPARATOR variable: $/

    Once you have your slurped input, if you need to output as paragraphs, Text::Wrap may be helpful.

    Update: If you don't want to muck with $/, you could use the slurp function from File::Slurp.
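
To make the slurp-then-rewrap idea concrete, here is a minimal sketch using only `$/` and the core Text::Wrap module. The helper names `slurp_fh` and `reflow`, the sample text, and the width of 72 are made up for illustration; the sketch also assumes paragraphs are separated by completely empty lines.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: read everything from a filehandle in one go.
sub slurp_fh {
    my ($fh) = @_;
    local $/;                 # undef $/ means "no record separator": read to EOF
    return scalar <$fh>;
}

# Hypothetical helper: join the lines of each paragraph, then re-wrap them.
sub reflow {
    my ($text, $columns) = @_;
    local $Text::Wrap::columns = $columns;
    # Assumption: paragraphs are separated by two or more newlines in a row.
    my @paragraphs = grep { length } split /\n{2,}/, $text;
    for my $para (@paragraphs) {
        $para =~ s/\s+/ /g;   # newlines and runs of blanks become one space
        $para =~ s/^ | $//g;  # trim a leading or trailing space
        $para = wrap('', '', $para);
    }
    return join("\n\n", @paragraphs) . "\n";
}

# Demo on an in-memory "file" so the sketch runs without touching disk.
my $sample = "one two\nthree four\n\n\nfive\nsix\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
print reflow(slurp_fh($in), 72);
```

For real files you would open each one by name instead of the in-memory string; `local $/` keeps the change to the record separator confined to the helper.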

Re: Can't remember the term to search for help on! (paragraph mode)
by shmem (Chancellor) on Dec 04, 2009 at 18:15 UTC

    See perlrun, and look for "paragraph mode" there (the -00 switch). The boundary between paragraphs in that mode may comprise several blank lines.

    Given a file named "text" as follows

    this is the first paragraph.
    It has three lines.
    Of which this one is the third.

    A blank line denotes a paragraph.
    This one has got two lines only.



    The third paragraph is preceded
    with several blank lines, and
    it has itself three, no, wait,
    four lines.

    Last paragraph. All paragraphs
    should be shown each as one line
    running "perl -p00 -le 's/\n/ /gs;s/\s+/ /g;'"
    on that file, with multiple blanks
    condensed into one.

    the snippet

    perl -p00 -le 's/\n/ /gs;s/\s+/ /g;' text

    does what you want. If you want inplace-edit (see perlrun again), say

    perl -p00 -i.bak -le 's/\n/ /gs;s/\s+/ /g;' text

    to have the file backed up with the suffix .bak as text.bak

    You can provide multiple files on the command line; each will be processed in turn (and backed up, if requested).
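
If you would rather have this in a script than on the command line, the same paragraph mode is available by setting `$/` to the empty string. A rough sketch follows; the helper name `paragraphs_as_lines` and the sample text are invented for the example, and it reads from an in-memory string rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read a filehandle in paragraph mode and return
# one joined line per paragraph.
sub paragraphs_as_lines {
    my ($fh) = @_;
    local $/ = "";            # paragraph mode, which is what -00 sets
    my @out;
    while (my $para = <$fh>) {
        chomp $para;          # in paragraph mode chomp strips all trailing newlines
        $para =~ s/\s+/ /g;   # inner newlines and repeated blanks become one space
        push @out, $para;
    }
    return @out;
}

# Demo on an in-memory "file"; the sample text is made up.
my $sample = "first line\nsecond  line\n\n\n\nnext paragraph\nends here\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n\n" for paragraphs_as_lines($in);
```

Printing each paragraph followed by a blank line mirrors what the -l switch does in paragraph mode on the command line.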

      Thank you to all you kind Perl Monks!! All of the above are great solutions. The INPUT_RECORD_SEPARATOR was what I was trying to recall, but you've all offered good solutions. I've only had a chance to try the last solution, and it works like a charm. I didn't realize I could do search and replaces direct from the command line like that. That's awesome! Thanks again!

        For anyone else who needs to clean up poorly formatted ebooks in text files, here's what I came up with from the help I received above:


        perl -p00 -i.bak -le "s/-\n//gs;s/([^!\?\.\"\'\`])\n/\1 /gs;" myfile.txt

        This will remove newline characters at the end of lines that do NOT end in a period, question mark, exclamation point or some form of quote. However, prior to removing those newline characters, it removes any newline character preceded by a hyphen as well as removing the hyphen.

        It works great. Thanks Perl Monks!
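
For anyone who wants to see what those two substitutions do to a single paragraph, here is a small sketch. The sub name `unwrap_paragraph` and the sample sentence are made up, and it writes `$1` in the replacement where the one-liner uses `\1`.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions from the one-liner.
sub unwrap_paragraph {
    my ($para) = @_;
    $para =~ s/-\n//g;               # rejoin words hyphenated across a line break
    $para =~ s/([^!?."'`])\n/$1 /g;  # join lines that do not end a sentence or a quote
    return $para;
}

# Made-up sample: one hyphenated break, one mid-sentence break,
# and one break after a full stop (which is kept).
my $sample = "The quick brown fox jum-\nped over the lazy\ndog. It slept.\nThe end.";
print unwrap_paragraph($sample), "\n";
```

One caveat: a genuinely hyphenated compound that happens to break at its hyphen (say "well-" at the end of a line, "known" on the next) loses the hyphen as well.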

Re: Can't remember the term to search for help on!
by ikegami (Patriarch) on Dec 04, 2009 at 17:59 UTC

    You're talking about undefined $/. Reading a whole file at once is called "slurping the file" or "slurping the file's content".

    my $file; { local $/; $file = <>; }
    $file =~ s/(?<!\n)\n(?!\n)//g;
    print $file;

    Usage for above:

    perl script.pl infile > outfile
    perl -i.bak script.pl file         # in-place

    As a one-liner:

    perl -0777pe's/(?<!\n)\n(?!\n)//g' infile > outfile
    perl -i.bak -0777pe's/(?<!\n)\n(?!\n)//g' file

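
To see what the lookbehind and lookahead are doing, here is a small sketch that applies the same pattern to a string. The helper name `join_single_newlines` and its optional replacement argument are additions for illustration, not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every newline that is neither preceded
# nor followed by another newline.
sub join_single_newlines {
    my ($text, $replacement) = @_;
    $replacement = '' unless defined $replacement;
    # (?<!\n) : the character before is not a newline
    # (?!\n)  : the character after is not a newline
    $text =~ s/(?<!\n)\n(?!\n)/$replacement/g;
    return $text;
}

my $sample = "wrapped\nline\n\nnext\nparagraph\n";
print join_single_newlines($sample), "\n";        # newlines removed, as in the reply
print join_single_newlines($sample, ' '), "\n";   # variation: a space keeps words apart
```

Two things show up when you run it: removing the newline outright glues "wrapped" and "line" together, so substituting a space may be closer to what you want for wrapped prose, and the single newline at the very end of the text counts as a lone newline too.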
Re: Can't remember the term to search for help on!
by BioLion (Curate) on Dec 04, 2009 at 18:06 UTC

    You are thinking of 'file slurping' - which is done by modifying $/, Perl's input record separator (use $/ = undef; to slurp in one go), or using a module like File::Slurp.

    Then you can use a regex to remove single newline characters, but leave alone double ones using a lookahead assertion, or something simple like:

    use strict;
    use warnings;

    my $line = "foo\nbar\n\nbaz";
    print "Before :\n \'$line\'\n";
    #Before :
    #'foo
    #bar
    #
    #baz'

    $line =~ s/\n([^\n])/$1/;
    print "after : \'$line\'\n";
    #after :
    #'foobar
    #
    #baz'
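
One thing to watch if you run that substitution over a whole file with /g: the second newline of a blank-line pair is itself followed by an ordinary character, so the simple pattern removes it too. The sketch below compares the two approaches on the same string; the sub names `simple_join` and `guarded_join` are made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: the simple pattern, applied globally.
sub simple_join {
    my ($text) = @_;
    $text =~ s/\n([^\n])/$1/g;       # any newline followed by a non-newline
    return $text;
}

# Hypothetical helper: lookarounds on both sides, applied globally.
sub guarded_join {
    my ($text) = @_;
    $text =~ s/(?<!\n)\n(?!\n)//g;   # only newlines with no newline neighbour
    return $text;
}

my $text = "foo\nbar\n\nbaz";
for my $result (simple_join($text), guarded_join($text)) {
    (my $visible = $result) =~ s/\n/\\n/g;   # show newlines as a literal \n
    print "$visible\n";
}
# prints:
# foobar\nbaz
# foobar\n\nbaz
```

With the lookbehind in place the paragraph break survives as two newlines.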

    HTH!

    Just a something something...