Process input file

icg has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I want to process and load a huge file (around 2 GB). The file is a stream of characters with no line breaks; a new value begins at every 81st character. I was using the Text::Wrap module to reformat the file into 80-column lines, storing the result in another file, and then loading the reformatted file. This takes a lot of time, so I tried using read in a while loop instead:

while (read S, $contents, 80)

which would eliminate the above-mentioned step. This works, but if the input contains a special character, like those that appear in French names, the program fails.

Please help.

Thank you,
ICG

Replies are listed 'Best First'.
Re: Process input file by reasonablekeith (Deacon) on Dec 13, 2005 at 15:14 UTC
From the documentation for read: `Note the characters: depending on the status of the filehandle, either (8-bit) bytes or characters are read. By default all filehandles operate on bytes, but for example if the filehandle has been opened with the ":utf8" I/O layer (see "open", and the "open" pragma, open), the I/O will operate on UTF-8 encoded Unicode characters, not bytes. Similarly for the ":encoding" pragma: in that case pretty much any characters can be read.` Sounds like your problem to me. --- my name's not Keith, and I'm not reasonable.
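To make the quoted passage concrete, here is a small sketch of the byte/character difference. The sample string and the helper name `read_four` are made up for illustration, and an in-memory filehandle stands in for the real file. It reads four units from the same data twice, once through a `:raw` layer and once through `:encoding(UTF-8)`.

```perl
use strict;
use warnings;

# "caf" followed by e-acute encoded as UTF-8, which takes two bytes (0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Hypothetical helper: read up to 4 units through the given I/O layer.
sub read_four {
    my ($layer) = @_;
    open my $fh, "<$layer", \$bytes or die "open failed: $!";
    my $count = read $fh, my $buf, 4;
    die "read failed: $!" unless defined $count;
    close $fh;
    return $buf;
}

my $raw  = read_four(':raw');              # 4 bytes: stops mid-character
my $utf8 = read_four(':encoding(UTF-8)');  # 4 decoded characters

printf "raw:   length %d, last unit 0x%X\n", length($raw),  ord(substr($raw,  -1));
printf "utf-8: length %d, last unit 0x%X\n", length($utf8), ord(substr($utf8, -1));
```

On a byte-oriented handle the fourth unit is only the first half of the accented character, which is the kind of split that would throw off an 80-unit read.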
Re^2: Process input file by ikegami (Patriarch) on Dec 13, 2005 at 15:26 UTC
In other words, sounds like the OP should binmode the file handle (to ":raw" or to the proper encoding).
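A minimal sketch of that suggestion, assuming the file turns out to be UTF-8 (the thread never says which encoding it is, so substitute the real one). The helper `each_record` and the sample data are hypothetical, and the demo uses a record width of 4 instead of 80 to keep it short.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per record of $width characters.
# $source may be a file name or a reference to a scalar (in-memory file).
sub each_record {
    my ($source, $width, $encoding, $callback) = @_;
    open my $fh, "<:encoding($encoding)", $source
        or die "Cannot open input: $!";
    while (1) {
        my $count = read $fh, my $record, $width;
        die "read failed: $!" unless defined $count;
        last if $count == 0;              # end of file
        $callback->($record);             # handle one record at a time
    }
    close $fh;
    return;
}

# Demo data: 13 characters, two of them accented (two bytes each in UTF-8).
my $sample = "Ren\xC3\xA9 Fran\xC3\xA7ois";
my @seen;
each_record(\$sample, 4, 'UTF-8', sub { push @seen, shift });

printf "%d records, first has %d characters\n", scalar(@seen), length($seen[0]);
```

For the real file the call would look something like `each_record('largefile', 80, 'UTF-8', \&load_value)`, where `load_value` is whatever loads one value. Passing a callback instead of returning a list avoids holding 2 GB of records in memory.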
Re: Process input file by holli (Abbot) on Dec 13, 2005 at 15:11 UTC
Try this: `{ local $/ = \80; open HANDLE, "<", "largefile" or die "Cannot open largefile: $!"; while (<HANDLE>) { print "$_\n"; } close HANDLE; }` holli, /regexed monk/
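For anyone unfamiliar with the trick: setting `$/` to a reference to an integer makes the readline operator return fixed-size records instead of lines. A self-contained sketch follows, using made-up ASCII data, an in-memory filehandle, and a record size of 5 in place of 80.

```perl
use strict;
use warnings;

my $data = 'AAAAABBBBBCC';   # made-up sample: two full records and a short one
my @records;

{
    local $/ = \5;           # reference to an integer: read fixed-size records
    open my $fh, '<', \$data or die "open failed: $!";
    while (my $record = <$fh>) {
        push @records, $record;
    }
    close $fh;
}

print "$_\n" for @records;
```

The final record is simply shorter when the data does not divide evenly. How the record size interacts with an encoding layer (bytes versus characters) is worth checking in perlvar for the Perl version in use, since accented characters are the whole problem here.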
Re: Process input file by CountOrlok (Friar) on Dec 13, 2005 at 15:11 UTC
If the 80th character is always the same unique character, e.g. "^", you can use: $/ = "^"; and then read the file as you would a normal line-oriented file. Otherwise, read the file 80 chars at a time with read() and scan the input for non-ASCII characters. If you find, for example, 2 non-ASCII chars, read in two more chars (and check again for more non-ASCII chars). -imran
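A sketch of the first suggestion, assuming the data really does use a single delimiter character that never appears inside a value (the original post does not confirm this). The sample data is made up; `chomp` strips the current value of `$/`, so it removes the trailing "^".

```perl
use strict;
use warnings;

my $data = 'alpha^beta^gamma';   # made-up sample with "^" between values
my @values;

{
    local $/ = '^';              # treat "^" as the record separator
    open my $fh, '<', \$data or die "open failed: $!";
    while (my $value = <$fh>) {
        chomp $value;            # removes the trailing "^", if present
        push @values, $value;
    }
    close $fh;
}

print "$_\n" for @values;
```

Because the records are found by delimiter rather than by counting, multi-byte characters inside a value do not shift the record boundaries.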