Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh Wise ones,
I'm looking for an efficient way to split a 1GB log file (text, not binary) into more manageable chunks, e.g. 400MB.
This way at least they're easy to open in an editor.
I'm not sure what the most efficient approach would be, as I have hundreds of such logs to go through and split up.
Any suggestions?
Thanks

Replies are listed 'Best First'.
Re: File splitting help
by kyle (Abbot) on Jan 20, 2009 at 21:23 UTC

    If you have Unix, the split utility was made for this kind of thing. If not, a little Perl should suffice. What have you tried?

      I have used the split utility, but I wanted to keep it strictly Perl, as it needs
      to be able to run on Windows as well, hence my question.

      Thanks
        So get split (and other Unix utilities) for Windows; MSYS is the one I use.
Re: File splitting help
by monarch (Priest) on Jan 20, 2009 at 21:47 UTC
    You possibly want to investigate the use of the seek or sysseek functions.

    Either way you're going to have to read chunks of data and write chunks of data.

    Suggestion, then: use read or sysread to pull large chunks of the file, say 64 kilobytes at a time, into a buffer, and keep a counter of your position in the file. Write that buffer to your chunk file; if the counter exceeds your chunk length (e.g. 400MB), scan backwards for the last newline character using rindex. Flush the buffer up to that newline, close the current chunk file, write the remainder of the buffer to a new chunk file, reset your chunk length counter, and continue.

    Some pseudo-code (this is _not_ Perl):

    chunknum = 0;
    while ( ! eof ) {
        chunklen = 0;
        open( FOUT, ">chunk" . chunknum++ );
        # read into buffer, but at end of buffer in case of leftovers
        while ( len = read( FIN, buffer, 64000, length(buffer) ) ) {
            if ( chunklen + len > 400MB ) {
                # got to end of chunk, deal with newline
                lastnewline = rindex( buffer, "\n" );
                if ( lastnewline >= 0 ) {
                    # flush up to the last found newline
                    write( FOUT, substr( buffer, 0, lastnewline ) );
                    substr( buffer, 0, lastnewline ) = "";
                    close( FOUT );
                    last;  # skip to next chunk file
                } else {
                    # flush entire buffer (no newline found)
                    write( FOUT, buffer );
                    buffer = "";
                }
            } else {
                # not at end of chunk, just write buffer
                write( FOUT, buffer );
                buffer = "";
                chunklen += len;
            }
        } # while we've got something to read
    } # while not at eof of input

    Update: had to ensure read was to end of buffer, close chunk file when done
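
    For reference, here is a minimal runnable Perl sketch of the same idea (buffered reads, rindex to cut at the last newline, a new chunk file whenever the size threshold is reached). The 400MB threshold, 64KB buffer, and ".partN" output names are illustrative choices, not anything the original poster specified:

    use strict;
    use warnings;

    # Split one large text log into line-aligned chunks of roughly $max_chunk bytes.
    my $max_chunk = 400 * 1024 * 1024;    # target chunk size (bytes) - illustrative
    my $buf_size  = 64 * 1024;            # bytes to read per pass    - illustrative
    my $infile    = shift @ARGV or die "Usage: $0 logfile\n";

    open my $in, '<', $infile or die "Can't read $infile: $!";
    binmode $in;                          # copy bytes verbatim, even on Windows

    my ( $out, $chunknum, $chunklen );

    my $next_chunk = sub {
        close $out if $out;
        $chunknum++;
        $chunklen = 0;
        open $out, '>', "$infile.part$chunknum"
            or die "Can't write $infile.part$chunknum: $!";
        binmode $out;
    };
    $next_chunk->();

    my $buffer = '';
    while ( read( $in, $buffer, $buf_size, length $buffer ) ) {
        if ( $chunklen + length($buffer) >= $max_chunk ) {
            # Chunk is full: cut at the last newline so no line is split.
            my $cut = rindex( $buffer, "\n" );
            if ( $cut >= 0 ) {
                print {$out} substr( $buffer, 0, $cut + 1 );
                substr( $buffer, 0, $cut + 1 ) = '';
                $next_chunk->();          # leftover goes into the next chunk
            }
            # If no newline was found yet, just keep reading into the buffer.
        }
        else {
            print {$out} $buffer;
            $chunklen += length $buffer;
            $buffer = '';
        }
    }
    print {$out} $buffer if length $buffer;   # flush whatever is left at EOF
    close $out;
    close $in;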

Re: File splitting help
by davido (Cardinal) on Jan 20, 2009 at 22:45 UTC

    You can set the input record separator ($/) to a byte quantity rather than an EOL character, and read in chunks that way. Then write each chunk into a different file. Refer to perlvar.

    You can use File::Find to get all your filenames, or just readdir. It would be easy to generate filenames along the lines of original_name.1, original_name.2, .3, .4, etc.
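
    A minimal sketch of that approach, assuming fixed-size chunks are acceptable (note that a chunk read this way can end mid-line); the 400MB size and the .1/.2 suffixes are just examples:

    use strict;
    use warnings;

    my $chunk_size = 400 * 1024 * 1024;   # bytes per chunk - illustrative
    my $infile     = shift @ARGV or die "Usage: $0 logfile\n";

    open my $in, '<', $infile or die "Can't read $infile: $!";

    $/ = \$chunk_size;    # readline now returns fixed-size blocks, not lines

    my $n = 0;
    while ( my $block = <$in> ) {
        $n++;
        open my $out, '>', "$infile.$n" or die "Can't write $infile.$n: $!";
        print {$out} $block;              # note: a block may end mid-line
        close $out;
    }
    close $in;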

    I'm wondering though, if there's not a better solution. Rather than splitting up hundreds of log files into thousands of chunks, why not devise a Perlish solution to scan through the files for specific things you're looking for? It shouldn't matter how big the data set is, as long as you come up with an efficient way of finding what you're looking for within that data set. Are you looking for a particular event? Use Perl to scan your hundreds of files and index where the events are recorded.
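
    For what it's worth, a rough sketch of that kind of scan; the pattern and the *.log filter here are made-up placeholders for whatever events actually matter:

    use strict;
    use warnings;
    use File::Find;

    my $pattern = qr/error|reset/i;              # placeholder for the real events
    my $logdir  = shift @ARGV or die "Usage: $0 logdir\n";

    find(
        sub {
            return unless -f && /\.log\z/;       # only plain *.log files
            my $path = $File::Find::name;
            open my $fh, '<', $_ or do { warn "Can't read $path: $!"; return };
            while ( my $line = <$fh> ) {
                # Print an index entry: file, line number, and the matching line.
                print "$path:$.: $line" if $line =~ $pattern;
            }
            close $fh;
        },
        $logdir,
    );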


    Dave

      Unfortunately the logs come from multiple devices,
      and there is not much uniqueness to the things being looked for.
      At times, locating a problem a device has encountered requires going through the log manually.
      I know parsing the logs for specific things is the way to go, and I'm already doing that; however, as much as I want to automate the process, some things require manual intervention.
      Obviously, after splitting a log into chunks, I remove the original.