PhilFromIndy has asked for the wisdom of the Perl Monks concerning the following question:

I've got a 680+ MB fixed-width text file. Normally no big deal, but this file is one long line. I want to put a newline at the end of every record so I can deal with this monster in a reasonable manner. Each record is 164 characters long; I wrote the following script to put in the newlines, but after running for several hours it just gave me an "Out of memory!" error and provided no other output.
open(INPUT,"filename.txt") or die "Will not open input: $!";
open(OUTPUT,">output.txt"); or die "Will not open output: $!";
my $input = <INPUT>;
my $line;
my $i;
my $length = length($input);

for ($i=0; $i<$length; $i=$i+164){
    $line = substr($input,$i,164);
    print "$line\n";          # Output to screen to make sure it works
    print OUTPUT "$line\n";
}

Is there another way to go about this, or do I need a machine with more available RAM? I ran this on a newish Core 2 Duo with 3+ GB of RAM.

Replies are listed 'Best First'.
Re: Dealing with huge text string
by BrowserUk (Patriarch) on Mar 28, 2008 at 13:25 UTC

    Read it one record at a time:

    open(INPUT,"filename.txt") or die "Will not open input: $!";
    open(OUTPUT,">output.txt") or die "Will not open output: $!";

    local $/ = \164;

    while( <INPUT> ) {   ## Updated per jwkrahn's post below
        print OUTPUT "$_\n";
    }

    close OUTPUT;
    close INPUT;

    BTW: The semicolon after open(OUTPUT,">output.txt"); is kind of a giveaway that you didn't have the error checking :)

    As a one-liner:

    perl -ple"BEGIN{$/=\164}" filename.txt >output.txt

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Change  while( <IN> ) { to  while( <INPUT> ) { and it may work better.

      Took a few seconds to run, thanks!
      The superfluous semicolon ended up in the opening of the output file while I was changing the names of the files to protect the innocent.
      That will certainly fail in many circumstances. See ikegami's post and my example. If you are unlucky enough to break a wide character in half with 164-byte reads, well, that would suck.
Re: Dealing with huge text string
by ikegami (Patriarch) on Mar 28, 2008 at 13:54 UTC
    read would be the alternative to setting $/
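    A minimal sketch of that read-based alternative, using an in-memory filehandle and a 6-byte record length so it is self-contained (substitute the real file and 164 for the actual problem):

```perl
use strict;
use warnings;

# In-memory stand-in for the fixed-width file: three 6-byte records.
my $data = 'AAAAAA' . 'BBBBBB' . 'CCCCCC';
open my $in, '<', \$data or die "cant open input: $!";

my $reclen = 6;    # 164 for the original problem
while ( ( my $got = read $in, my $record, $reclen ) ) {
    die "short record ($got bytes)" if $got != $reclen;
    print "$record\n";
}
close $in;
```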
Re: Dealing with huge text string
by locked_user sundialsvc4 (Abbot) on Mar 28, 2008 at 13:42 UTC

    Yeah, don't forget that “memory” is virtual. In other words, it is backed by a disk file. So your program tried to copy 680+ megabytes from one disk file to another, the hard way. It probably never succeeded in doing just that.

    There are several ways to do it (of course), but yes, the bottom line is that you need to read n bytes at a time and write each piece out followed by a newline.
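    That bottom line can be sketched with buffered reads of several records at a time; the in-memory input and tiny sizes here are illustrative (for the real file, open filename.txt and make the chunk a large multiple of 164):

```perl
use strict;
use warnings;

my $reclen = 4;                     # 164 in the original problem
my $chunk  = $reclen * 3;           # keep the chunk a multiple of the record size
my $input  = 'aaaabbbbccccdddd';    # stands in for the 680+ MB file
open my $in,  '<', \$input     or die "cant open input: $!";
open my $out, '>', \my $output or die "cant open output: $!";

while ( ( my $got = read $in, my $buf, $chunk ) ) {
    for ( my $i = 0; $i < $got; $i += $reclen ) {
        print $out substr( $buf, $i, $reclen ), "\n";
    }
}
close $out;

print $output;    # four 4-byte records, one per line
```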

Re: Dealing with huge text string
by mobiusinversion (Beadle) on Mar 28, 2008 at 19:16 UTC
    BrowserUk should have a look at ikegami's post and the perldoc entry on $/

    If by "each record is 164 characters long", our friend Phil really meant that each record is 164 bytes long, then BrowserUk's solution would be fine. If, on the other hand, Phil's file had a wide character in it (that is, a single logical character that requires more than one byte of storage, for example the pound sign £ or the trademark sign ™), he'd be smoked.

    The most general way for Phil to feed fixed width fields from a file is as follows.
    use strict;
    use warnings;

    my $length = 164;
    my $file   = 'path/to/filename.txt';

    # you will supply the right encoding for your data;
    # 'UTF-8' is one common example
    open(my $F, '<:encoding(UTF-8)', $file) or die "cant open $file\n$!\n";

    while( read( $F, my $record, $length ) ){
        # do something with $record
    }
    If Phil was sure that his text file contained no wide characters, he could omit the ':encoding' portion of the open mode; read operates on bytes unless otherwise informed by the status of the filehandle in question.
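    A small self-contained illustration of that point, assuming UTF-8 data: once an :encoding layer is on the handle, read counts characters rather than bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Four pound signs: 4 characters, but 8 bytes when encoded as UTF-8.
my $bytes = encode( 'UTF-8', "\x{A3}" x 4 );
open my $in, '<:encoding(UTF-8)', \$bytes or die "cant open: $!";

read $in, my $record, 2;    # 2 *characters*, i.e. 4 bytes from the stream
print length($record);      # prints 2
```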

    A related issue:

    To test if an in-memory scalar contains wide-characters, use the bytes pragma and the following trick:
    my $c = 'some_scalar_data';

    test_for_wide_chars: {
        require bytes;
        if (   bytes::length($c) > length($c)
            || ( $] >= 5.008 && $c =~ /[^\0-\xFF]/ ) )
        {
            print "i found a wide character!";
        }
    }
      If by "each record is 164 characters long", our friend Phil really meant that each record is 164 bytes long, then BrowserUk's solution would be fine.

      It was

      Six hours of research to find a semantic quibble for a problem that was solved five hours ago?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Ah young Browser, wide characters are no laughing matter!

        Have you ever done any web programming? If so then you'll have run into wide characters when using HTML entities.

        How about this, try this code and witness the power of wide characters, which really do exist!
        use strict;
        use warnings;
        use LWP::UserAgent;

        open(my $F, '>:utf8', 'wide-chars-example.html') or die "cant open: $!";
        my $url  = 'http://www.w3schools.com/tags/ref_symbols.asp';
        my $html = LWP::UserAgent->new()->get($url)->content;
        print $F $html;
        Now open the newly made file using your method... Try:
        your_method: {
            local $/ = \2;  # note the reference: \2 means two-byte records,
                            # while $/ = 2 would set the separator to the string "2"
            # open 'wide-chars-example.html' and process the 'records'
        }
        You'll get an interesting surprise!
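        You can see the surprise without fetching the page at all; assuming the file holds UTF-8 bytes, two-byte records split the pound sign's two-byte encoding down the middle:

```perl
use strict;
use warnings;

# Raw UTF-8 bytes of "a£b": the pound sign is the two bytes 0xC2 0xA3.
my $bytes = "a\xC2\xA3b";
open my $in, '<', \$bytes or die "cant open: $!";

local $/ = \2;              # fixed two-byte records
my @records = <$in>;
# $records[0] is "a\xC2" -- an 'a' plus half a pound sign: the character is broken.
```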

        Take care!