I am using ActiveState Perl in a Windows environment to read and process a file. The file uses CRLF (carriage return, line feed) line terminators. The problem is that some of the lines contain a rogue LF. I have no control over how this data is delivered to me.

If the file is simply opened and read using while (<IN>), each line with a rogue LF is split into two separate lines. As a result, that data is not processed, and every line count used for later error reporting is thrown off.

To correct for this, I set the $/ variable to "\x0D\x0A". While processing the file, I then ran a quick s/\x0A/ /g on each record to "fix" the data. When tested on Linux, this worked perfectly. Unfortunately, this program must run on Windows.
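
For reference, the approach just described can be sketched roughly as follows. The sample string, the in-memory filehandle, and the @records array are assumptions made purely for illustration; the real program reads from a file on disk, so this shows only the intended behavior, not the Windows problem.

```perl
use strict;
use warnings;

# Hypothetical sample data: two CRLF-terminated records,
# the first of which contains a rogue LF.
my $data = "alpha\x0Abeta\x0D\x0Agamma\x0D\x0A";

# Read from an in-memory string (an assumption for this sketch)
# so that the example is self-contained.
open(my $in, '<', \$data) or die "Cannot open in-memory handle: $!";

my @records;
{
    # Treat only CRLF as the record separator.
    local $/ = "\x0D\x0A";
    while (my $line = <$in>) {
        chomp $line;           # strips the trailing CRLF (the current $/)
        $line =~ s/\x0A/ /g;   # replace any rogue LF with a space
        push @records, $line;
    }
}
close $in;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

With this sample data the loop should yield two records, "alpha beta" and "gamma", which is the result I was seeing on Linux.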

When I ran the program under ActiveState in a Windows command window, however, I found that while (<IN>) would try to read the entire file at once, as if $/ had been set to undef. Investigating further, I commented out the $/ assignment and found that each line read by while (<IN>) in this environment was terminated by LF alone, with the CR completely removed. The length was one character less than the true line length.

I am not sure at what point the CR is being removed. I tried opening the file in different modes (currently I am using utf8, since the file contains Unicode). I found that under the :raw mode, the while loop would break on lines if $/ was set to "\x0D\x0A". The problem was that it would break on both a CRLF and a bare LF, which brought me right back to my original problem.

I am aware that the C function read() will remove the CR from a line when reading it in, while the function _read() does not. Could this be the problem that I am dealing with? If so, is there a way to force Perl to use _read()? If I can't force it to do so, does anyone know another way around the read() difficulty? And if this is not the problem at all, but something else, could someone point me in the right direction?

Thanks!


In reply to Windows file read by thedoe
