thurinus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am struggling with text files, which I want to modify with regex. I converted a pdf file to a text file which then contains the following text:

"Inventories

Trade receivables

Assets

This item includes impairment losses."

I now want to remove all lines that do not end with a period (the lines Inventories, Trade receivables, Assets) and have the following code:

use strict;
use warnings;
use File::Copy;

my $destination = "/directory/originalinputfiles/";
my @files = glob ("/directoy/*.txt");

foreach my $file (@files) {
    my $newfile = $file.".tmp";
    open (IN,'<:encoding(UTF-8)', $file) or die "Could not open '$file' $!";
    open (OUT,'>:encoding(UTF-8)', $newfile) or die "Could not open '$newfile' $!";
    while (my $text = <IN>) {
        $text =~ s/^[a-z].*[a-z]$//gmi;
        print OUT $text;
    }
    close (IN) or die "Could not close input file: '$file' $!";
    close (OUT) or die "Could not close output file '$newfile' $!";
    move $file, $destination or die "Could move the orignal input file: $file to the folder: '$destination' $!";
    rename $newfile, $file or die "Could not rename file: '$newfile' $!";
    print "\n done \n";
}

If I run this script, the first three lines (Inventories, Trade receivables, Assets) are not identified and removed. However, if I manually add a line break after each of these lines in the text file, they are identified and removed. That is not an option, as I have thousands of text files.

I also tried adding a newline after every line with a regex, which does not work either:

 $text =~ s/$/\n/;

Why is my code not working and how can I solve it?

Re: Perl regex txt file new line (not recognised?)
by LanX (Saint) on Jan 21, 2020 at 15:31 UTC
    > I now want to remove all lines which are not ending with a period

    this works for me

    use strict;
    use warnings;

    while (<DATA>) {
        print if /\.$/;
    }

    __DATA__
    Inventories
    Trade receivables
    Assets
    This item includes impairment losses.

    -->

    This item includes impairment losses.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery
    FootballPerl is like chess, only without the dice

      Dear LanX,

      thanks for the fast reply. Even with this code, the lines (Inventories, Trade receivables, Assets) in my text file are not recognized. Is there something wrong with the text file, and how can I change it?

        Maybe also try a chomp $text in your code.

        update

        Here are several methods to clarify what your file contains.

        I prefer Data::Dump but you might not have it installed.

        use strict;
        use warnings;
        use Data::Dump qw/dd/;
        use Data::Dumper;

        while (<DATA>) {
            chomp;
            print "<$_>\n" if /\.$/;
            dd $_;
            warn Dumper $_;
        }

        __DATA__
        Inventories

        Trade receivables

        Assets

        This item includes impairment losses.

        with chomp

        $VAR1 = 'Inventories';
        $VAR1 = '';
        $VAR1 = 'Trade receivables';
        $VAR1 = '';
        $VAR1 = 'Assets';
        $VAR1 = '';
        $VAR1 = 'This item includes impairment losses.';
        "Inventories"
        ""
        "Trade receivables"
        ""
        "Assets"
        ""
        <This item includes impairment losses.>
        "This item includes impairment losses."

        without chomp

        $VAR1 = 'Inventories
        ';
        $VAR1 = '
        ';
        $VAR1 = 'Trade receivables
        ';
        $VAR1 = '
        ';
        $VAR1 = 'Assets
        ';
        $VAR1 = '
        ';
        $VAR1 = 'This item includes impairment losses.
        ';
        "Inventories\n"
        "\n"
        "Trade receivables\n"
        "\n"
        "Assets\n"
        "\n"
        <This item includes impairment losses.
        >
        "This item includes impairment losses.\n"

        Cheers Rolf

        > the lines (Inventories, Trade receivables, Assets) in my text file are not recognized.
        • What do you mean by "not recognized"? Are they printed?
        • What happens if you run my code?
        • What happens if you copy my DATA lines into a file?

        My guess is that your input is different to what you think.

        Do you perhaps have weird line endings?

        Cheers Rolf

        Even if your disk file is exactly what you expect, you must fix your regex to match and remove the newlines from the unwanted lines.
        use strict;
        use warnings;

        my $file = \do{my $disk_file =
             "Inventorier\n"
            ."Trade receivables\n"
            ."Assets\n"
            ."This item includes impairment losses.\n"
            ;
        };
        open (IN,'<:encoding(UTF-8)', $file) or die "Could not open '$file'$!";
        *OUT = *stdout;
        while (my $text = <IN>) {
            #$text =~ s/^[a-z].*[a-z]$//gmi;
            $text =~ s/^[a-z].*[a-z]\n$//i;
            print OUT $text;
        }
        close (IN) or die "Could not close input file: '$file' $!";

        Note that I have not made any other changes to your posted code (except to remove the looping and the output files.) Even though this should 'work', I prefer LanX's first suggestion.

        Bill
Re: Perl regex txt file new line (not recognised?)
by haukex (Archbishop) on Jan 21, 2020 at 19:00 UTC

    I can confirm that your code works fine for me on Linux when the input file has LF line endings, but not when it has CRLF line endings, so I am guessing that is the problem. To confirm, see the Basic debugging checklist and use e.g. Data::Dumper with $Data::Dumper::Useqq=1; turned on. If CRLF is indeed the issue, there are a couple of ways to fix this:
    • You could add the :crlf PerlIO layer when reading the files, as in open (IN,'<:raw:encoding(UTF-8):crlf', $file) (note I added the :raw as technically, the CRLF conversion should happen after the decoding, although that shouldn't really be a problem with UTF-8 I believe).
    • You could convert the files before processing using e.g. fromdos from Tofrodos.
    • You could change the regex to adapt, as in s/^[a-z].*[a-z]\r?$//gmi (wouldn't be my preferred solution, but TIMTOWTDI).
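
The check and the first fix could look roughly like this. It is only a sketch: the sample text and the helper read_lines are made up for illustration, and an in-memory file stands in for one of the converted files.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters as escapes (\r, \n) instead of printing them raw.
$Data::Dumper::Useqq = 1;
$Data::Dumper::Terse = 1;

# Hypothetical sample standing in for one converted file with CRLF endings.
my $bytes = "Inventories\r\nAssets\r\nThis item includes impairment losses.\r\n";

# Hypothetical helper: read all lines of an in-memory "file" through the
# given PerlIO layers.
sub read_lines {
    my ($layers, $data) = @_;
    open my $fh, "<$layers", \$data
        or die "Could not open in-memory file: $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Same layers as in the original post: the \r stays in the data.
my @plain = read_lines(':encoding(UTF-8)', $bytes);

# With :crlf added, each CRLF pair is translated to a single \n on input.
my @fixed = read_lines(':raw:encoding(UTF-8):crlf', $bytes);

print "without :crlf: ", Dumper($plain[0]);
print "with :crlf:    ", Dumper($fixed[0]);
```

If the first dumped string shows a \r before the \n, the file has CRLF line endings, and reading it through :crlf should let the original regex see a letter directly before the end of the line again.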

      haukex,

      I suspected the record-separator issue so I wrote my test case (above) to allow me to experiment with them. I duplicated the original problem when I thought that I was simulating newline separators and the original regex and concluded that the problem was with the regex. Your comments and the OP's conclusions strongly suggest that I was wrong. Can you explain how my memory file fails to simulate the test case you used to confirm the original code?

      OOPS, Simulation is correct. The problem is not duplicated. "Deleted" lines are replaced with a newline. This may be intentional.

      Bill
        > Can you explain how my memory file fails to simulate the test case you used to confirm the original code?

        If you change the \n's (LF) in your $disk_file to \r\n (CRLF), you should see the problem that looks like what the OP was describing.
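
A minimal way to see the difference, assuming a system where \n is a plain LF (the sample text and the helper strip_headings are invented for this sketch; the regex is the one from the original post):

```perl
use strict;
use warnings;

# Hypothetical helper: run the regex from the original post over every
# line of an in-memory file and return the resulting text.
sub strip_headings {
    my ($content) = @_;
    open my $in, '<', \$content
        or die "Could not open in-memory file: $!";
    my $out = '';
    while (my $text = <$in>) {
        $text =~ s/^[a-z].*[a-z]$//gmi;
        $out .= $text;
    }
    close $in;
    return $out;
}

# The same sample text, once with LF and once with CRLF line endings.
my $lf = "Inventories\nAssets\nThis item includes impairment losses.\n";
(my $crlf = $lf) =~ s/\n/\r\n/g;

print "LF input unchanged?   ", (strip_headings($lf)   eq $lf   ? "yes" : "no"), "\n";
print "CRLF input unchanged? ", (strip_headings($crlf) eq $crlf ? "yes" : "no"), "\n";
```

With LF input the heading lines are emptied (their newline stays behind). With CRLF input nothing matches, because the character directly before the end of each line is \r rather than a letter.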

Re: Perl regex txt file new line (not recognised?)
by kcott (Archbishop) on Jan 22, 2020 at 06:20 UTC

    G'day thurinus,

    Welcome to the Monastery.

    As already indicated, it's likely you have line terminators not matching the '$'. You may have more success using '\R', the generic newline. See perlrebackslash: \R and note you'll need at least Perl v5.10 to use this.

    Here's a quick-and-dirty example to compare '$' with '\R':

    $ perl -E '
        my @x = (
            "A\n", "B.\n", "C\r", "D.\r",
            "E\r\n", "F.\r\n", "G\f", "H.\f"
        );
        say q{*** With $ ***};
        say "|$_|" for grep /\.$/, @x;
        say q{*** With \R ***};
        say "|$_|" for grep /\.\R/, @x;
    '
    *** With $ ***
    |B.
    |
    *** With \R ***
    |B.
    |
    |D.
    |F.
    |
    |H.
       |

    In case you didn't know, the '\f' is a form-feed.

    — Ken
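
Applied to the original task, \R could be used along these lines. This is a sketch only: the helper ends_with_period and the sample lines are hypothetical, and it assumes the files are still read line by line on LF as in the original post.

```perl
use strict;
use warnings;
use 5.010;    # \R needs Perl v5.10 or later

# Hypothetical helper: true if the line ends with a period once any
# trailing line terminator (LF, CRLF, CR, FF, ...) has been stripped.
sub ends_with_period {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return $line =~ /\.\z/ ? 1 : 0;
}

# Invented sample lines with mixed terminators.
my @lines = (
    "Inventories\r\n",
    "Assets\n",
    "This item includes impairment losses.\r\n",
);

# Keep only the lines that end with a period.
print grep { ends_with_period($_) } @lines;
```

The kept lines are printed with their original terminators, so a CRLF file stays a CRLF file.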

      Dear Monks,

      the line terminators were the issue and I could fix it with your help and suggestions.

      Thanks a lot for your help!