thedoe has asked for the wisdom of the Perl Monks concerning the following question:

I am using ActiveState Perl in a Windows environment to read and process a file. This file has line terminations in CRLF (carriage return, line feed) format. The problem is that some of the lines contain a rogue LF in them. There is no way I can correct the way this data comes to me.

If the file is simply opened and read using while (<IN>), the lines with a rogue LF are split into two separate lines. This data is therefore not processed and all line counts for further error reporting are thrown off.

To attempt to correct for this, I set the $/ variable to "\x0D\x0A". While processing the files, I then did a quick s/\x0A/ /g to "fix" the data. When tested in Linux, this worked perfectly. Unfortunately, this program must run in Windows.

When I ran the program under ActiveState in a Windows command window, however, I found that the while (<IN>) would try to read the entire file at once, similar to setting $/ to undef. Upon further investigation by commenting out the $/ assignment, I found that each line read by while (<IN>) under this environment was giving me a line terminated by simply LF, with the CR completely removed. The length was one character less than the true line length.

I am not sure when the CR is being removed. I tried opening the file in different modes (currently I am using utf8, since the file contains Unicode). I found that under the :raw mode, the while loop would break on lines if $/ was set to "\x0D\x0A". The problem was that it would break on both a CRLF and an LF. This brought me right back to my original problem.

I am aware that the C function read() will remove the CR from a line when reading it in, while the function _read() does not. Could this be the problem that I am dealing with? If so, is there a way to force Perl to use _read()? If I can't force it to do so, does anyone know another way around the read() difficulty? And if this is not the problem at all, but something else, could someone point me in the right direction?

Thanks!

Re: Windows file read
by ikegami (Patriarch) on May 01, 2006 at 16:18 UTC
    To attempt to correct for this, I set the $/ variable to "\x0D\x0A". While processing the files, I then did a quick s/\x0A/ /g to "fix" the data. When tested in Linux, this worked perfectly. Unfortunately, this program must run in Windows.

    That will work in Windows if (and only if) you binmode IN first. For example:

    local $/ = "\x0D\x0A";
    open(local *IN, '<', $filename)
        or die("Unable to open input file $filename: $!\n");
    binmode(IN);
    while (<IN>) {
        chomp;
        s/\x0A/ /g;
        ...
    }

    Without binmode, occurrences of "\x0D\x0A" are converted to "\x0A" before Perl looks for the line ending.
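The translation described above can be reproduced on any platform by pushing the :crlf layer explicitly, which is roughly what a Windows text-mode handle does by default. The following is a minimal sketch, not code from the thread: the helper name count_records and the scratch file crlf_demo.txt are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the records seen when reading $path
# through the given PerlIO layers with $/ set to CRLF.
sub count_records {
    my ($path, $layers) = @_;
    open(my $fh, "<$layers", $path) or die("open<: $!\n");
    local $/ = "\x0D\x0A";
    my $count = 0;
    while (defined(my $record = <$fh>)) {
        $count++;
    }
    close($fh);
    return $count;
}

# Scratch file (name is an assumption): two CRLF-terminated lines,
# the second containing a lone LF.
my $path = 'crlf_demo.txt';
open(my $out, '>:raw', $path) or die("open>: $!\n");
print $out "abc\x0D\x0Ade\x0Afg\x0D\x0A";
close($out);

# :raw leaves the bytes alone, so the CRLF separator is found twice.
print "raw:  ", count_records($path, ':raw'), "\n";

# :crlf turns every CRLF into LF before $/ is applied, so the CRLF
# separator never matches and the whole file comes back as one record.
print "crlf: ", count_records($path, ':crlf'), "\n";
```

With :raw the sketch should report 2 records and with :crlf only 1, which matches the "reads the entire file at once" symptom in the original question.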

      While this did preserve the CRLF at the end of each line, I still ran into the same problem as when I opened the file simply doing:

      open IN, '<:raw', $file or die "Unable to open file: $file";

      This problem is that, despite setting $/ to "\x0D\x0A", the lines with a rogue LF (\x0A) are still split into two lines. I have used a hex editor to make sure that ONLY a LF is present, so I am not misreading the data.

      Sadly, this puts me right back at the original problem. I appreciate all your assistance so far, ikegami. I now know about layering the file modes, but unfortunately my line-break problem still exists.

        It works for me. Show your code, please. Mine is the following:
        {
            open(my $fh, '>:raw', 'file')
                or die("open>: $!\n");
            print $fh ("abc\x{0D}\x{0A}de\x{0A}fg\x{0D}\x{0A}");
            #           [------5------][---------7----------]
            #           [------5------][---3--][------4-----]
        }

        {
            open(my $fh, '<:raw', 'file')
                or die("open<: $!\n");
            local $/ = "\x0D\x0A";
            print length, "\n" while <$fh>;
        }
        outputs
        5
        7

        If you remove the assignment to $/, the output is

        5
        3
        4

        Oddly enough, when I extracted the hex information to a simple text file containing only a problem line and one line before and after it, the same code I was having a problem with worked.

        I am now re-running on the much larger, original file. If I run into another problem now, though, I will know that there must be some type of extra character which, for some reason or another, is not being reported by my hex editor.

        Thank you again for your help ikegami. I will update this post with the results of the larger run.

        Update: Unfortunately, the source file I have is still giving me this problem. I am looking into what could be doing this. I have extracted the lines around it into a temporary file, but do not have this problem with the temp file. It seems to only happen in the main source. Thank you again for your help, as I now know where I need to look for the (hopeful) solution to my dilemma.

        Update 2: Wow... after spending two days on this, I have just learned that someone modified the input file after I looked at it in hex, adding a true line break in that position. Why? I have no idea. But the mystery has finally been solved. Thank you again to ikegami for pointing me back towards where I had been looking. At least now I know I'm not too crazy.
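Since the confusion here came from checking the file by hand in a hex editor, a scripted check run against the real input file may be easier to repeat. This is a rough sketch under stated assumptions: the helper name bare_lf_offsets and the scratch file lf_scan_demo.txt are hypothetical, and the whole file is slurped into memory, which may not suit a very large input.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of every LF in $path
# that is not immediately preceded by a CR.
sub bare_lf_offsets {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die("open<: $!\n");
    local $/;                          # slurp mode; simple, but memory-hungry
    my $data = <$fh>;
    close($fh);
    $data = '' unless defined $data;   # empty file

    my @offsets;
    while ($data =~ /(?<!\x0D)\x0A/g) {
        push @offsets, pos($data) - 1; # pos() points just past the LF
    }
    return @offsets;
}

# Scratch file (name is an assumption) with one rogue LF at byte offset 7.
my $path = 'lf_scan_demo.txt';
open(my $out, '>:raw', $path) or die("open>: $!\n");
print $out "abc\x0D\x0Ade\x0Afg\x0D\x0A";
close($out);

print "bare LF at byte offset $_\n" for bare_lf_offsets($path);
```

If the positions it reports change between runs, that would suggest the file itself is being modified.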

      I am currently opening the file using:

      open IN, '<:utf8', $file or die "Can not open input file: $file";

      It was my understanding that specifying a format while opening the file will open it, then call binmode on it. Is this incorrect?

        I'm not very familiar with PerlIO (:utf8, etc). I suspect that if you do

        open IN, '<:utf8', $fn or ...;
        binmode(IN);   # Short for binmode(IN, ':raw') in v5.8

        you will lose the :utf8 property. You could try

        open IN, '<:raw:utf8', $fn or ...;

        but :raw and :utf8 might be mutually exclusive. Fortunately, it's easy to try these and see if they work.

        Update: This page says the previous snippet will work. Your code would look like:

        local $/ = "\x0D\x0A";
        open(local *IN, '<:raw:utf8', $filename)
            or die("Unable to open input file $filename: $!\n");
        while (<IN>) {
            chomp;
            s/\x0A/ /g;
            ...
        }
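One way to try the '<:raw:utf8' combination is a small round trip like the sketch below. The helper read_records and the scratch file utf8_demo.txt are made up for this example. It uses a lexical filehandle instead of the IN glob, and it assumes the input really is valid UTF-8, because the :utf8 layer does not validate it (the :encoding(UTF-8) layer is the stricter alternative).

```perl
use strict;
use warnings;

# Hypothetical helper: read CRLF-terminated records from a UTF-8 file,
# replacing any lone LF inside a record with a space.
sub read_records {
    my ($path) = @_;
    open(my $fh, '<:raw:utf8', $path) or die("open<: $!\n");
    local $/ = "\x0D\x0A";
    my @records;
    while (defined(my $line = <$fh>)) {
        chomp($line);           # removes the CRLF held in $/
        $line =~ s/\x0A/ /g;    # any LF still present is a rogue one
        push @records, $line;
    }
    close($fh);
    return @records;
}

# Scratch file (name is an assumption): "caf" plus e-acute encoded as
# two UTF-8 bytes, then a second line containing a lone LF.
my $path = 'utf8_demo.txt';
open(my $out, '>:raw', $path) or die("open>: $!\n");
print $out "caf\xC3\xA9\x0D\x0Ade\x0Afg\x0D\x0A";
close($out);

my @records = read_records($path);
printf "%d records; first has %d characters\n",
    scalar(@records), length($records[0]);
```

If both layers take effect, the first record should have 4 characters rather than 5 bytes, and the second should come back as the single record "de fg".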