coldy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a large comma-separated file (220937474 characters) that does not have any line breaks. I thought breaking this file into lines of 80 characters would be an easy task for Perl:
    my $line = <IN>;
    my @probs = split ',', $line;
    $i = 0;
    foreach (@probs) {
        ++$i;
        if ( $i % 80 != 0 ) { print OUT "$_" }
        else                { print OUT "\n"; }
    }
Firstly, it takes a while to do the split. Is there a faster method, or is that the best I can get out of Perl? Secondly, it doesn't loop through the @probs array. Any ideas where I've gone wrong? Thanks.

Replies are listed 'Best First'.
Re: break up file with no line breaks
by BrowserUk (Patriarch) on Apr 22, 2009 at 03:04 UTC

    Try this, it'll take around 10 seconds:

    perl -ple"BEGIN{ $/ = \80 }" theBigFile > theNewFile
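For anyone who prefers a script to a one-liner, here is a minimal sketch of the same fixed-size-record idea. The helper name break_fixed and its filehandle arguments are made up for illustration and do not come from the thread; the only thing it relies on is that setting $/ to a reference to an integer makes each read return at most that many characters.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): copy $in to $out,
# writing a newline after every $width characters.
sub break_fixed {
    my ( $in, $out, $width ) = @_;

    # A reference to an integer selects fixed-size records instead of lines.
    local $/ = \$width;

    while ( my $chunk = <$in> ) {
        print {$out} $chunk, "\n";
    }
    return;
}
```

Called as break_fixed( $in, $out, 80 ) with handles opened on the big file and the new file, this should behave like the one-liner above, including the newline after the final short chunk.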

    If you only want to break lines after a comma, then it's a tad more complicated:

    perl -e"BEGIN{$/=\80}" -nle"$x.=$_; print $1 while length($x)>80 and $x=~s[(.{1,80},)(?!,)][]; END{ print $x if length $x }" theBigFile > theNewFile

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: break up file with no line breaks
by ikegami (Patriarch) on Apr 22, 2009 at 02:32 UTC
    Easy task indeed.
    use Text::Wrap qw( wrap );
    local $Text::Wrap::columns = 80;
    print wrap('', '', <IN>);

    Text::Wrap

    Update: I misunderstood the question and missed the magnitude of the file!

    use Text::Wrap qw( wrap );
    local $Text::Wrap::columns = 80;
    local $Text::Wrap::break = qr/(?<=,)/;
    local $/ = ',';
    print wrap('', '', $_) while <IN>;

    Downside: Will break on every comma, even those in the middle of a field.

    You can get rid of the Text::Wrap entirely if none of the fields are overly large.
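One way the Text::Wrap-free variant might look is sketched below. It is only an illustration under assumptions: the helper name wrap_on_commas and its arguments are invented here, and it assumes no single field is longer than the target width (a longer field simply ends up on an over-long line of its own). Like the version above, it treats every comma as a break point, including commas inside a field.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): wrap comma-terminated
# fields so that lines stay within $width columns where possible.
sub wrap_on_commas {
    my ( $in, $out, $width ) = @_;

    # Each read returns one field together with its trailing comma.
    local $/ = ',';

    my $line = '';
    while ( my $field = <$in> ) {
        # Flush the current line if this field would push it past $width.
        if ( length($line) && length($line) + length($field) > $width ) {
            print {$out} $line, "\n";
            $line = '';
        }
        $line .= $field;
    }

    # Write whatever is left after the last field.
    print {$out} $line, "\n" if length $line;
    return;
}
```

Because only one field and one output line are held at a time, memory use stays small regardless of the file size.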

        I noticed and updated the node while you were replying.
Re: break up file with no line breaks
by roboticus (Chancellor) on Apr 22, 2009 at 02:57 UTC
    coldy:

    You can read fixed-size chunks from your file handle. Then do your splitting similar to what you're doing now. Just set your input record size like so:

    local $/ = \500;

    Then you can chop it up into approximately 80-character lines, breaking on the commas like you're doing now. When you need more text (say, when you have fewer than 100 characters left), read another chunk and append it to your text buffer. Here's an example of how you might approach it. (Note: rather than worry about 80-character lines, I'm just writing 10 values on each line.)

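    #!/usr/bin/perl -w
    use strict;
    use warnings;

    local $/ = \500;    # read fixed-size 500-character records

    open my $in, '<', 'bigfile.csv' or die $!;

    my @values;
    my $partial = '';   # field cut off at the end of the previous chunk
    while ( my $chunk = <$in> ) {
        # The -1 limit keeps trailing empty fields, so a chunk that ends
        # in a comma leaves $partial empty.
        my @fields = split /,/, $partial . $chunk, -1;
        $partial = pop @fields;
        push @values, @fields;

        # Write ten values per line.
        print join( ',', splice( @values, 0, 10 ) ), "\n" while @values >= 10;
    }

    # Flush the last field and any values left over.
    push @values, $partial if length $partial;
    print join( ',', @values ), "\n" if @values;

    close $in or die $!;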

    ...roboticus

    UPDATE: D'oh! ikegami's solution looks much better... Ah, well. I guess I need to read Text::Wrap now...

Re: break up file with no line breaks
by graff (Chancellor) on Apr 22, 2009 at 06:28 UTC
    If the idea is to insert line-breaks after certain commas, such that the line-breaks end up being about 80 characters apart, something like this would suffice:
    perl -ne 'BEGIN{$/=","} $o.=$_; if(length($o)>=80){ print "$o\n";$o="" } END{print "$o\n"}' < file.no-breaks > file.with-breaks
    (The particular use of quotes there assumes a bash or similar Bourne-like shell.)

    Then again, when the purpose of a data file is simply to store a list, I personally don't see the point of using commas at all. It's much better to use line-breaks as delimiters between list elements, because then you can use lots of handy unix tools to good effect in order to do things like (de)select, count, sort or otherwise process the elements. And going from comma-delimited (with no line-breaks) to line-break delimited is really simple:

    perl -pe 'tr/,/\n/' < comma.delimited > newline.delimited
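Since the input has no line breaks, -p with the default record separator reads the whole file as a single record, so the one-liner holds all of it in memory at once. If that is a concern, a chunked version along these lines might help. It is a sketch only: the helper name commas_to_newlines and the 64 KiB chunk size are arbitrary choices for illustration, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): replace every comma
# with a newline, reading the input in fixed-size chunks.
sub commas_to_newlines {
    my ( $in, $out ) = @_;

    # 64 KiB records; the size is an arbitrary choice for this example.
    local $/ = \65536;

    while ( my $chunk = <$in> ) {
        $chunk =~ tr/,/\n/;    # single-character translation, chunk by chunk
        print {$out} $chunk;
    }
    return;
}
```

Because tr works one character at a time, it does not matter where the chunk boundaries fall.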