Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh Wise ones,
I'm looking for an efficient way to split a 1GB log file (text, not binary) into more manageable chunks, e.g. 400MB.
This way at least they're easy to open in an editor.
I'm not sure what the most efficient approach would be, as I have hundreds of such logs to go through and split up.
Any suggestions?
Thanks

Replies are listed 'Best First'.
Re: File splitting help
by kyle (Abbot) on Jan 20, 2009 at 21:23 UTC

    If you have Unix, the split utility was made for this kind of thing. If not, a little Perl should suffice. What have you tried?

      I have used the split utility, but I wanted to keep it strictly Perl, as it needs
      to be able to run on Windows as well, hence my question.

      Thanks
        So get split (and other Unix utilities) for Windows; MSYS is the one I use.
Re: File splitting help
by monarch (Priest) on Jan 20, 2009 at 21:47 UTC
    You possibly want to investigate the use of the seek or sysseek functions.

    Either way you're going to have to read chunks of data and write chunks of data.

    Suggestion, then: use read or sysread to pull large chunks of the file, say 64 kilobytes at a time, into a buffer, and keep a counter of your position in the file. Write that buffer to your chunk file; if the counter exceeds your chunk length (e.g. 400MB), scan backwards for the last newline character using rindex. Flush the buffer up to that newline, close the current chunk file, write the remainder of the buffer to a new chunk file, reset your chunk length counter, and continue.

    Some pseudo-code (this is _not_ Perl):

    chunknum = 0;
    while ( ! eof ) {
        chunklen = 0;
        open( FOUT, ">chunk" . chunknum++ );
        # read into buffer, but at end of buffer in case of leftovers
        while ( len = read( FIN, buffer, 64000, length(buffer) ) ) {
            if ( chunklen + len > 400MB ) {
                # got to end of chunk, deal with newline
                lastnewline = rindex( buffer, "\n" );
                if ( lastnewline >= 0 ) {
                    # flush up to the last found newline
                    write( FOUT, substr( buffer, 0, lastnewline ) );
                    substr( buffer, 0, lastnewline ) = "";
                    close( FOUT );
                    last;  # skip to next chunk file
                } else {
                    # flush entire buffer (no newline found)
                    write( FOUT, buffer );
                    buffer = "";
                }
            } else {
                # not at end of chunk, just write buffer
                write( FOUT, buffer );
                buffer = "";
                chunklen += len;
            }
        } # while we've got something to read
    } # while not at eof of input

    Update: had to ensure read was to end of buffer, close chunk file when done
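
    For reference, here is a minimal runnable Perl sketch of the same idea (buffered reads, rindex to cut at the last newline, a new chunk file whenever the size threshold is reached). The 400MB threshold, 64KB buffer, and ".partN" output names are illustrative choices, not anything the original poster specified:

    use strict;
    use warnings;

    # Split one large text log into line-aligned chunks of roughly $max_chunk bytes.
    my $max_chunk = 400 * 1024 * 1024;    # target chunk size (bytes) - illustrative
    my $buf_size  = 64 * 1024;            # bytes to read per pass    - illustrative
    my $infile    = shift @ARGV or die "Usage: $0 logfile\n";

    open my $in, '<', $infile or die "Can't read $infile: $!";
    binmode $in;                          # copy bytes verbatim, even on Windows

    my ( $out, $chunknum, $chunklen );

    my $next_chunk = sub {
        close $out if $out;
        $chunknum++;
        $chunklen = 0;
        open $out, '>', "$infile.part$chunknum"
            or die "Can't write $infile.part$chunknum: $!";
        binmode $out;
    };
    $next_chunk->();

    my $buffer = '';
    while ( read( $in, $buffer, $buf_size, length $buffer ) ) {
        if ( $chunklen + length($buffer) >= $max_chunk ) {
            # Chunk is full: cut at the last newline so no line is split.
            my $cut = rindex( $buffer, "\n" );
            if ( $cut >= 0 ) {
                print {$out} substr( $buffer, 0, $cut + 1 );
                substr( $buffer, 0, $cut + 1 ) = '';
                $next_chunk->();          # leftover goes into the next chunk
            }
            # If no newline was found yet, just keep reading into the buffer.
        }
        else {
            print {$out} $buffer;
            $chunklen += length $buffer;
            $buffer = '';
        }
    }
    print {$out} $buffer if length $buffer;   # flush whatever is left at EOF
    close $out;
    close $in;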

Re: File splitting help
by davido (Cardinal) on Jan 20, 2009 at 22:45 UTC

    You can set the input record separator ($/) to a byte quantity rather than an EOL character, and read in chunks that way. Then write each chunk into a different file. Refer to perlvar.

    You can use File::Find to get all your filenames, or just readdir. It would be easy to generate filenames along the lines of original_name.1, original_name.2, .3, .4, etc.
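
    A minimal sketch of that approach, assuming fixed-size chunks are acceptable (note that a chunk read this way can end mid-line); the 400MB size and the .1/.2 suffixes are just examples:

    use strict;
    use warnings;

    my $chunk_size = 400 * 1024 * 1024;   # bytes per chunk - illustrative
    my $infile     = shift @ARGV or die "Usage: $0 logfile\n";

    open my $in, '<', $infile or die "Can't read $infile: $!";

    $/ = \$chunk_size;    # readline now returns fixed-size blocks, not lines

    my $n = 0;
    while ( my $block = <$in> ) {
        $n++;
        open my $out, '>', "$infile.$n" or die "Can't write $infile.$n: $!";
        print {$out} $block;              # note: a block may end mid-line
        close $out;
    }
    close $in;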

    I'm wondering though, if there's not a better solution. Rather than splitting up hundreds of log files into thousands of chunks, why not devise a Perlish solution to scan through the files for specific things you're looking for? It shouldn't matter how big the data set is, as long as you come up with an efficient way of finding what you're looking for within that data set. Are you looking for a particular event? Use Perl to scan your hundreds of files and index where the events are recorded.
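
    For what it's worth, a rough sketch of that kind of scan; the pattern and the *.log filter here are made-up placeholders for whatever events actually matter:

    use strict;
    use warnings;
    use File::Find;

    my $pattern = qr/error|reset/i;              # placeholder for the real events
    my $logdir  = shift @ARGV or die "Usage: $0 logdir\n";

    find(
        sub {
            return unless -f && /\.log\z/;       # only plain *.log files
            my $path = $File::Find::name;
            open my $fh, '<', $_ or do { warn "Can't read $path: $!"; return };
            while ( my $line = <$fh> ) {
                # Print an index entry: file, line number, and the matching line.
                print "$path:$.: $line" if $line =~ $pattern;
            }
            close $fh;
        },
        $logdir,
    );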


    Dave

      Unfortunately the logs come from multiple devices,
      and there is not much uniqueness to the things being looked for.
      At times, locating a problem a device has encountered requires going through the log manually.
      I know parsing the logs for specific things is the way to go, and I'm already doing that; however, as much as I want to automate the process, some things require manual intervention.
      Obviously, after splitting a log into chunks, I remove the original.