icg has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I want to process and load a huge file (around 2GB). The file is a stream of characters with no line breaks; a new value starts at every 81st character, so each value is 80 characters wide. I was using the Text::Wrap module to reformat the file into 80-column lines, storing the result in another file and then loading the reformatted file. This takes a lot of time, so I tried using read in a while loop instead:
while (read S, $contents, 80)

which would eliminate the above-mentioned step. This works, but if the input contains a special character, such as the accented letters that appear in French names, the program fails.

Please help.


Thank you,
ICG

Replies are listed 'Best First'.
Re: Process input file
by reasonablekeith (Deacon) on Dec 13, 2005 at 15:14 UTC
    From the documentation for read:
    Note the *characters*: depending on the status of the filehandle, either (8-bit) bytes or characters are read. By default all filehandles operate on bytes, but for example if the filehandle has been opened with the ":utf8" I/O layer (see "open", and the "open" pragma, open), the I/O will operate on UTF-8 encoded Unicode characters, not bytes. Similarly for the ":encoding" pragma: in that case pretty much any characters can be read.
    Sounds like your problem to me.
    ---
    my name's not Keith, and I'm not reasonable.
      In other words, it sounds like the OP should binmode the filehandle (to ":raw" or to the proper encoding).
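To illustrate the point about layers, here is a minimal sketch of reading fixed-width records with an explicit I/O layer. The helper name each_record, the sample names, and the assumption that the data is UTF-8 are all made up for the example; nothing in the thread says what encoding the OP's file actually uses. With ":raw", read counts bytes; with ":encoding(UTF-8)", it counts decoded characters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Call $callback once per record of $width units read from $path.
# $layer is an assumption about the data: ":raw" counts bytes,
# ":encoding(UTF-8)" counts decoded characters.
sub each_record {
    my ($path, $width, $layer, $callback) = @_;
    open my $fh, "<$layer", $path or die "Cannot open $path: $!";
    while (read($fh, my $record, $width)) {
        $callback->($record);
    }
    close $fh;
    return;
}

# Build a small sample file: two 80-character records with accented letters,
# written as UTF-8 (hypothetical data, for illustration only).
my ($out, $sample) = tempfile(UNLINK => 1);
binmode $out, ":encoding(UTF-8)";
print {$out} sprintf("%-80s", "Ren\x{e9}e"), sprintf("%-80s", "Fran\x{e7}ois");
close $out;

my (@as_chars, @as_bytes);
each_record($sample, 80, ":encoding(UTF-8)", sub { push @as_chars, $_[0] });
each_record($sample, 80, ":raw",             sub { push @as_bytes, $_[0] });

printf "character reads: %d records\n", scalar @as_chars;
printf "byte reads:      %d chunks\n",  scalar @as_bytes;
```

On this sample the character-based read yields two 80-character records, while the byte-based read drifts out of alignment because each accented letter occupies two bytes in UTF-8. If the real file is in a single-byte encoding such as Latin-1, the ":raw" variant would be the one that keeps records aligned.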
Re: Process input file
by holli (Abbot) on Dec 13, 2005 at 15:11 UTC
    Try this:
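    {
        # A reference to an integer makes <HANDLE> return fixed-size records.
        local $/ = \80;
        open HANDLE, "<", "largefile" or die "Cannot open largefile: $!";
        while (<HANDLE>) {
            print "$_\n";
        }
        close HANDLE;
    }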


    holli, /regexed monk/
Re: Process input file
by CountOrlok (Friar) on Dec 13, 2005 at 15:11 UTC
    If the 80th character is always the same unique character, e.g. "^", you can use:
    $/ = "^";
    and then read the file as you would a normal line-oriented file.
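A minimal sketch of the delimiter approach, assuming the values really are terminated by a character that never occurs inside a value. The helper name split_on_delimiter and the sample data are made up for illustration, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Read delimiter-terminated values from an open filehandle.
# Assumes $delim never occurs inside a value.
sub split_on_delimiter {
    my ($fh, $delim) = @_;
    local $/ = $delim;          # input record separator for this scope only
    my @values;
    while (my $record = <$fh>) {
        chomp $record;          # chomp strips the current value of $/
        push @values, $record;
    }
    return @values;
}

# Hypothetical sample data held in memory instead of a 2GB file.
my $data = "alpha^beta^gamma^";
open my $fh, "<", \$data or die "Cannot open in-memory handle: $!";
my @values = split_on_delimiter($fh, "^");
close $fh;

print join(",", @values), "\n";   # prints "alpha,beta,gamma"
```

Using local keeps the changed $/ from leaking into other code that reads lines normally.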

    Otherwise, read the file 80 chars at a time with read() and scan the input for non-ASCII characters. If you find, for example, 2 non-ASCII chars, read in two more chars (and check again for more non-ASCII chars).

    -imran