PerlMonks  

split file into smaller chunks

by Anonymous Monk
on Jul 17, 2009 at 08:48 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
I have a file with multiple records that are separated by "//", in the following format:

    This is the first chunk.
    //
    This is the second chunk.
    //
    This is the third chunk.
    //

Can I split the file into smaller files that contain, say, 1000 records each?
I have started writing something like:
$/="//\n"; $count_records=0; open OUT1, ">>0_999"; open OUT2, ">>1000_2000"; open OUT3, ">>2001_3000"; while(<>) { $record=$_; $count_records++; }
Would it be efficient to use an "if" clause, like:
"If $count_records<=999 --> print OUT1 $record"<br> "If 1000<=$count_records<=1999 --> print OUT2 $record"<br> "If 2000<=$count_records<=2999 --> print OUT3 $record"<br>
or is there a quicker way to do it?
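For reference, here is a minimal, untested completion of the counter-based approach above. The chunk size of 1000 follows the question; the second and third file names are regularized here to "1000_1999" and "2000_2999" so each file holds exactly 1000 records.

    use strict;
    use warnings;

    $/ = "//\n";    # read one record at a time

    open my $out1, '>', '0_999'     or die "Could not open 0_999: $!";
    open my $out2, '>', '1000_1999' or die "Could not open 1000_1999: $!";
    open my $out3, '>', '2000_2999' or die "Could not open 2000_2999: $!";

    my $count_records = 0;
    while (my $record = <>) {
        if    ($count_records <= 999)  { print $out1 $record }
        elsif ($count_records <= 1999) { print $out2 $record }
        else                           { print $out3 $record }
        $count_records++;
    }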
Thanks!

Replies are listed 'Best First'.
Re: split file into smaller chunks
by davorg (Chancellor) on Jul 17, 2009 at 08:57 UTC

    I don't think this is necessarily quicker, but it only ever has one filehandle open.

    {
        local $/ = "//\n";

        my $file_no = 1;
        open my $fh, '>', "file$file_no" or die $!;

        while (<>) {
            print $fh $_;
            unless ($. % 1000) {
                close $fh;
                $file_no++;
                open $fh, '>', "file$file_no" or die $!;
            }
        }
    }

Re: split file into smaller chunks
by moritz (Cardinal) on Jul 17, 2009 at 08:54 UTC
    I don't think that it really matters, because your script will be limited by I/O, not by CPU.

    However, you can get rid of a bit of duplication, for example by storing the file handles in an array, something along these lines:

    use strict;
    use warnings;
    use autodie qw(open);

    my @handles;
    open $handles[0], '>>', '0_999';
    open $handles[1], '>>', '1000_1999';
    open $handles[2], '>>', '2000_2999';

    $/ = "//\n";

    while (<>) {
        print { $handles[ int( ($. - 1) / 1000 ) ] } $_;
    }

    (untested).

Re: split file into smaller chunks
by ELISHEVA (Prior) on Jul 17, 2009 at 09:03 UTC

    Just to echo davorg - you don't really need to have more than one file stream open. The advantage of using a loop in the fashion he suggests is that you can handle an arbitrary number of chunks.

    A side comment as well. I notice you are using the two-parameter open and not checking for errors. Best practice is to use a three-parameter open and to check for errors. Also, ">>" appends to the existing contents of the file; if you want to write a file from scratch, you should use ">" rather than ">>". See perlopentut. Putting this all together, your open statement should look something like this:

     open(OUT, '>', $file_name) or die "Could not open $file_name: $!";

    Best, beth

      Yes, but if I don't use >>, how will it add the record each time the counter increases?
      Also, I have seen the "die -> Can't open the file" check many times, but I don't really understand what the use of it is. Is there a chance that the file will not be created by the "open" function?

        Concerning > vs. >>: once you open a file, records are appended to the end whether you used >> or >. The difference between the two only affects what happens to records already in the file. >> preserves the current contents and only adds new records at the end, so if you use >> and run your program a second time, the file "0_999" will end up with 2000 records: the 1000 records from the first run, followed by the same 1000 records repeated a second time. > avoids this problem by clearing out the old contents; it lets you start the file fresh, as if it had been created for the very first time.
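        A tiny sketch to see the difference for yourself (the file names are hypothetical; run the script twice and compare the line counts):

        use strict;
        use warnings;

        # >> appends: this file gains one more line on every run.
        open my $append, '>>', 'demo_append.txt' or die "Could not open demo_append.txt: $!";
        print $append "a line\n";
        close $append;

        # > truncates first: this file always ends up with exactly one line.
        open my $fresh, '>', 'demo_fresh.txt' or die "Could not open demo_fresh.txt: $!";
        print $fresh "a line\n";
        close $fresh;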

        Concerning die: yes, there is a chance that the file will not be created. Here are some typical reasons (a short sketch of the check in action follows the list):

        • You may not have permission to create files in the directory where you want to place the file.
        • The disk you want to store the file on may be full, or you may be using an account with a disk quota and have hit your assigned space limit.
        • Someone else is trying to create a file with the same name and the system has "locked" that file name.
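        To make that concrete, here is roughly what the check looks like when one of those failures happens. The path below is a hypothetical example of a directory most accounts cannot write to; the point is that $! carries the O/S reason for the failure.

        use strict;
        use warnings;

        # Hypothetical path: most non-root accounts cannot create files here,
        # so open fails and die reports e.g. "Permission denied".
        my $file_name = '/root/forbidden.txt';
        open my $fh, '>', $file_name
            or die "Could not open $file_name: $!";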

        Best, beth

Re: split file into smaller chunks
by Marshall (Canon) on Jul 18, 2009 at 07:10 UTC
    It could be that you are making this way too complicated for your application!

    This is the first chunk.
    //
    This is the second chunk.
    //
    This is the third chunk.
    //
    Some strategies:
    1. Build a memory-resident structure with all the data you need in one pass through the data file: fancy hash table structures, etc.
    2. Search the file again and again and let the O/S do the "dirty work": use a regex and just do something that "appears stupid".
    3. Create a DB (which is expensive) and then query that DB.

    I mean, how big is this file? If it is "small" (like 250 MB), it all winds up memory-resident after the first search anyway, and subsequent searches (even linear ones) are 10x+ as fast.

    I recommend option (2): do something stupid and let the O/S do the work. If that is not "fast enough", then start thinking about option 1 or 3.
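    As a sketch of what option (2) might look like for the //-separated records in the original question (untested; the file name and search pattern are placeholders):

    use strict;
    use warnings;

    my $pattern = qr/whatever you are looking for/;   # placeholder pattern
    $/ = "//\n";                                      # one record per read

    # Re-scan the whole file linearly and let the O/S do the work;
    # after the first pass the file cache makes repeat scans much faster.
    open my $in, '<', 'records.txt' or die "Could not open records.txt: $!";
    while (my $record = <$in>) {
        print $record if $record =~ $pattern;
    }
    close $in;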

    If, say, there are only 5,000 files and the total DB size is 500 MB, do something easy; this is actually considered "small"! Don't get complex until you need to!

    Update: Anyway, you will be amazed at how quickly even a linear regex search runs on a huge file once you have done it once before (on Win XP, for file sizes < 1 GB).
