Perl regex txt file new line (not recognised?)

thurinus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am struggling with text files, which I want to modify with regex. I converted a pdf file to a text file which then contains the following text:

"Inventories

Trade receivables

Assets

This item includes impairment losses."

I now want to remove all lines that do not end with a period (the lines Inventories, Trade receivables, Assets) and have the following code:

use strict;
use warnings;
use File::Copy;

my $destination = "/directory/originalinputfiles/";

my @files = glob ("/directory/*.txt");


foreach my $file (@files) {

  my $newfile = $file.".tmp";

  open (IN,'<:encoding(UTF-8)', $file) or die "Could not open '$file' $!";
  open (OUT,'>:encoding(UTF-8)', $newfile) or die "Could not open '$newfile' $!";


    while (my $text = <IN>) {

      $text =~ s/^[a-z].*[a-z]$//gmi;

      print OUT $text;

    }

  close (IN) or die "Could not close input file: '$file' $!";
  close (OUT) or die "Could not close output file '$newfile' $!";


  move $file, $destination or die "Could not move the original input file: $file to the folder: '$destination' $!";

  rename $newfile, $file or die "Could not rename file: '$newfile' $!";

  print "\n done \n";

}

If I run this script, the first three lines (Inventories, Trade receivables, Assets) are not identified and removed. However, if I manually add a paragraph break in the text file after every line, these lines are identified and removed. That is not an option, as I have thousands of text files.

I already tried to add a newline after every line with a regex, which also does not work:

$text =~ s/$/\n/;

Why is my code not working and how can I solve it?


Replies are listed 'Best First'.
Re: Perl regex txt file new line (not recognised?) by LanX (Saint) on Jan 21, 2020 at 15:31 UTC
> I now want to remove all lines which are not ending with a period

This works for me:

    use strict;
    use warnings;

    while (<DATA>) {
        print if /\.$/;
    }

    __DATA__
    Inventories

    Trade receivables

    Assets

    This item includes impairment losses.

-->

    This item includes impairment losses.

Cheers Rolf
Re^2: Perl regex txt file new line (not recognised?) by thurinus (Initiate) on Jan 21, 2020 at 15:52 UTC
Dear LanX, thanks for the fast reply. With this code as well, the lines (Inventories, Trade receivables, Assets) in my text file are not recognized. Is there something wrong with the text file, and how can I change it?
Re^3: Perl regex txt file new line (not recognised?) by LanX (Saint) on Jan 21, 2020 at 16:02 UTC
Maybe also try a `chomp $text` in your code.

Update: here are several methods to clarify what your file contains. I prefer Data::Dump, but you might not have it installed.

    use strict;
    use warnings;
    use Data::Dump qw/dd/;
    use Data::Dumper;

    while (<DATA>) {
        chomp;
        print "<$_>\n" if /\.$/;
        dd $_;
        warn Dumper $_;
    }

    __DATA__
    Inventories

    Trade receivables

    Assets

    This item includes impairment losses.

With chomp:

    $VAR1 = 'Inventories';
    $VAR1 = '';
    $VAR1 = 'Trade receivables';
    $VAR1 = '';
    $VAR1 = 'Assets';
    $VAR1 = '';
    $VAR1 = 'This item includes impairment losses.';
    "Inventories"
    ""
    "Trade receivables"
    ""
    "Assets"
    ""
    <This item includes impairment losses.>
    "This item includes impairment losses."

Without chomp:

    $VAR1 = 'Inventories
    ';
    $VAR1 = '
    ';
    $VAR1 = 'Trade receivables
    ';
    $VAR1 = '
    ';
    $VAR1 = 'Assets
    ';
    $VAR1 = '
    ';
    $VAR1 = 'This item includes impairment losses.
    ';
    "Inventories\n"
    "\n"
    "Trade receivables\n"
    "\n"
    "Assets\n"
    "\n"
    <This item includes impairment losses.
    >
    "This item includes impairment losses.\n"

Cheers Rolf
Re^3: Perl regex txt file new line (not recognised?) by LanX (Saint) on Jan 21, 2020 at 15:57 UTC
> the lines (Inventories, Trade receivables, Assets) in my text file are not recognized.

What do you mean by "not recognized"? Are they printed? What happens if you run my code? What happens if you copy my DATA lines into a file?

My guess is that your input is different from what you think. Do you perhaps have weird line endings?

Cheers Rolf
Re^3: Perl regex txt file new line (not recognised?) by BillKSmith (Monsignor) on Jan 21, 2020 at 20:41 UTC
Even if your disk file is exactly what you expect, you must fix your regex to match and remove the newlines from the unwanted lines.

    use strict;
    use warnings;

    # In-memory file standing in for the disk file
    my $file = \do{my $disk_file =
         "Inventories\n"
        ."Trade receivables\n"
        ."Assets\n"
        ."This item includes impairment losses.\n"
        ;
    };

    open (IN,'<:encoding(UTF-8)', $file) or die "Could not open '$file'$!";
    *OUT = *STDOUT;

    while (my $text = <IN>) {
        #$text =~ s/^[a-z].*[a-z]$//gmi;
        $text =~ s/^[a-z].*[a-z]\n$//i;
        print OUT $text;
    }

    close (IN) or die "Could not close input file: '$file' $!";

Note that I have not made any other changes to your posted code (except to remove the looping and the output files). Even though this should 'work', I prefer LanX's first suggestion.

Bill
Re: Perl regex txt file new line (not recognised?) by haukex (Archbishop) on Jan 21, 2020 at 19:00 UTC
I can confirm that your code works fine for me on Linux when the input file has LF line endings, but not when it has CRLF line endings, so I am guessing that is the problem. To confirm, see the Basic debugging checklist and use e.g. Data::Dumper with `$Data::Dumper::Useqq=1;` turned on.

If CRLF is indeed the issue, there are a couple of ways to fix this: you could add the `:crlf` PerlIO layer when reading the files, as in `open (IN,'<:raw:encoding(UTF-8):crlf', $file)` (note I added the `:raw` because, technically, the CRLF conversion should happen after the decoding, although I believe that shouldn't really be a problem with UTF-8); you could convert the files before processing using e.g. `fromdos` from Tofrodos; or you could change the regex to adapt, as in `s/^[a-z].*[a-z]\r?$//gmi` (this wouldn't be my preferred solution, but TIMTOWTDI).
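A minimal sketch of the two steps described above: inspecting the raw line endings with Data::Dumper and comparing a read with and without the `:crlf` layer. The temporary file, its sample contents, and the helper `read_lines` are hypothetical stand-ins for the real input, and the sketch assumes a Unix-like system where `:crlf` is not already a default layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Data::Dumper;

$Data::Dumper::Useqq  = 1;   # show control characters as escapes such as "\r"
$Data::Dumper::Terse  = 1;   # print the bare value without "$VAR1 = "
$Data::Dumper::Indent = 0;   # keep each dump on a single line

# Hypothetical sample: write a small file with CRLF line endings.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} "Inventories\r\n", "This item includes impairment losses.\r\n";
close $tmp or die "Could not close '$path': $!";

# Read all lines of a file using the given open mode (including layers).
sub read_lines {
    my ($mode, $file) = @_;
    open my $in, $mode, $file or die "Could not open '$file': $!";
    my @lines = <$in>;
    close $in or die "Could not close '$file': $!";
    return @lines;
}

my @plain = read_lines('<:encoding(UTF-8)', $path);           # CR is kept
my @fixed = read_lines('<:raw:encoding(UTF-8):crlf', $path);  # CRLF becomes LF

print "without :crlf: ", Dumper(\@plain), "\n";
print "with :crlf:    ", Dumper(\@fixed), "\n";
```

If the first dump shows `\r\n` at the end of each string and the second shows only `\n`, the original regex should start matching once the files are opened with the second mode.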
Re^2: Perl regex txt file new line (not recognised?) by LanX (Saint) on Jan 21, 2020 at 19:18 UTC
Another way would be to use the `$INPUT_RECORD_SEPARATOR`. That way he could adapt to any line ending.

Cheers Rolf
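A rough sketch of the record-separator idea, assuming the line endings are already known to be CRLF. The sample data and the helper `lines_ending_with_period` are hypothetical, and the in-memory filehandle only stands in for a real file.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);   # provides $INPUT_RECORD_SEPARATOR as an alias for $/

# Hypothetical CRLF data standing in for the converted PDF text.
my $data = "Inventories\r\nTrade receivables\r\nThis item includes impairment losses.\r\n";

# Return the records that end with a period, without their line endings.
sub lines_ending_with_period {
    my ($text) = @_;
    open my $in, '<', \$text or die "Could not open in-memory file: $!";
    local $INPUT_RECORD_SEPARATOR = "\r\n";   # records now end at CRLF
    my @kept;
    while (my $line = <$in>) {
        chomp $line;                          # removes the whole "\r\n"
        push @kept, $line if $line =~ /\.$/;
    }
    close $in or die "Could not close in-memory file: $!";
    return @kept;
}

print "$_\n" for lines_ending_with_period($data);
```

Because `chomp` removes whatever the record separator currently is, the regex no longer has to know about the carriage return.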
Re^2: Perl regex txt file new line (not recognised?) by BillKSmith (Monsignor) on Jan 22, 2020 at 15:45 UTC
haukex, I suspected the record-separator issue, so I wrote my test case (above) to allow me to experiment with it. I duplicated the original problem when I thought that I was simulating newline separators and the original regex, and concluded that the problem was with the regex. Your comments and the OP's conclusions strongly suggest that I was wrong. Can you explain how my memory file fails to simulate the test case you used to confirm the original code?

OOPS, the simulation is correct. The problem is not duplicated. "Deleted" lines are replaced with a newline. This may be intentional.

Bill
Re^3: Perl regex txt file new line (not recognised?) by haukex (Archbishop) on Jan 22, 2020 at 16:58 UTC
> Can you explain how my memory file fails to simulate the test case you used to confirm the original code?

If you change the `\n`'s (LF) in your `$disk_file` to `\r\n` (CRLF), you should see the problem that looks like what the OP was describing.
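The suggestion above can be tried with a short sketch like the following, which runs the substitution from the original post over the same text twice, once with LF and once with CRLF endings. The helper `filter_text` and the sample strings are hypothetical, and a Unix-like system is assumed.

```perl
use strict;
use warnings;

# Apply the original substitution to every line read from an in-memory file.
sub filter_text {
    my ($content) = @_;
    open my $in, '<', \$content or die "Could not open in-memory file: $!";
    my $out = '';
    while (my $text = <$in>) {
        $text =~ s/^[a-z].*[a-z]$//gmi;   # the regex from the original post
        $out .= $text;
    }
    close $in or die "Could not close in-memory file: $!";
    return $out;
}

my $lf = "Inventories\nAssets\nThis item includes impairment losses.\n";
(my $crlf = $lf) =~ s/\n/\r\n/g;          # same text with CRLF endings

for my $case (['LF', $lf], ['CRLF', $crlf]) {
    my ($name, $input) = @$case;
    my $state = filter_text($input) ne $input ? 'modified' : 'left unchanged';
    print "$name input was $state\n";
}
```

With CRLF input the `\r` sits between the last letter and the newline, so `[a-z]$` cannot match and every line passes through untouched.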
Re: Perl regex txt file new line (not recognised?) by kcott (Archbishop) on Jan 22, 2020 at 06:20 UTC
G'day thurinus,

Welcome to the Monastery.

As already indicated, it's likely you have line terminators not matching the '`$`'. You may have more success using '`\R`', the generic newline. See perlrebackslash: \R and note you'll need at least Perl v5.10 to use this.

Here's a quick-and-dirty example to compare '`$`' with '`\R`':

    $ perl -E '
        my @x = (
            "A\n", "B.\n", "C\r", "D.\r",
            "E\r\n", "F.\r\n", "G\f", "H.\f"
        );
        say q{*** With $ ***};
        say "|$_|" for grep /\.$/, @x;
        say q{*** With \R ***};
        say "|$_|" for grep /\.\R/, @x;
    '
    *** With $ ***
    |B.
    |
    *** With \R ***
    |B.
    |
    |D.
    |F.
    |
    |H.
    |

In case you didn't know, the '`\f`' is a form-feed.

Ken
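As a small sketch of how `\R` might be used for the original task, the following keeps only the lines that end with a period, whatever their terminator, and writes them back with plain LF endings. The helper `keep_sentences` and the sample input are hypothetical.

```perl
use strict;
use warnings;
use 5.010;   # \R requires Perl v5.10 or later

# Return only the lines that end with a period, normalised to LF endings.
sub keep_sentences {
    my (@lines) = @_;
    my @kept;
    for my $line (@lines) {
        (my $bare = $line) =~ s/\R\z//;            # strip one trailing generic newline
        push @kept, "$bare\n" if $bare =~ /\.\z/;  # keep lines whose last character is "."
    }
    return @kept;
}

my @input = (
    "Inventories\r\n",
    "Assets\n",
    "This item includes impairment losses.\r\n",
);
print keep_sentences(@input);
```

Note that this only helps once the input has been split into lines; a file that uses a bare `\r` as its terminator would still be read as a single record by `<IN>` with the default record separator.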
Re^2: Perl regex txt file new line (not recognised?) by Anonymous Monk on Jan 22, 2020 at 08:37 UTC
Dear Monks, the line terminators were the issue and I could fix it with your help and suggestions. Thanks a lot for your help!
