in reply to Merge and split files based on number of lines
Like Grandfather, I am having some trouble understanding the overall objective/workflow. From what you describe, it sounds like you want to send the first N lines of a file to a pipe and save the "leftover" lines (if any exist) to another file for later processing?
Physically on the disk, no matter what tools you use, this means reading the entire input file: the first N lines get sent to the pipe for processing by another program, and the remaining TotalInputFileLines - N lines need to be written back to the disk.
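To make that concrete, here is a minimal sketch in Python (the thread itself is language-agnostic). The function name, file paths, and the idea of collecting the "piped" lines in a list instead of an actual pipe are all my own stand-ins, not anything from your post:

```python
def split_first_n(src_path, leftover_path, n):
    """Hand the first n lines to a consumer; write the rest back to disk."""
    sent = []
    with open(src_path) as src, open(leftover_path, "w") as rest:
        for i, line in enumerate(src):
            if i < n:
                sent.append(line)   # stand-in for writing into the pipe
            else:
                rest.write(line)    # TotalInputFileLines - n lines go back out
    return sent
```

Note that the loop still touches every line of the input, which is exactly the cost described above.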
You can determine the number of bytes in the input file without reading it (this is a number the file system already knows). But counting the lines requires reading the data and looking for line endings.
My first question is: why save TotalInputFileLines - N lines back to the disk at all? Why not just process them now? That way you read all of the data only once and you don't have to save raw, unprocessed data back to the disk.
Another question: what percentage of the input file is typically processed? This could matter. If the percentage is "small", then it might make sense to: a) determine the current byte offset "X" after reading the first N lines, b) close the input file, re-open it in binary mode, skip the first X bytes, and copy all remaining bytes to the new file. This would require some experimentation, but binary file operations are faster than text-mode operations because there is no scanning for line endings.
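A rough sketch of that offset-then-bulk-copy idea, again in Python with hypothetical names (here the file is simply opened once in binary mode rather than closed and re-opened, which amounts to the same thing):

```python
import shutil

def split_at_offset(src_path, leftover_path, n):
    """Read the first n lines, note the byte offset "X",
    then copy the remaining bytes without any line scanning."""
    with open(src_path, "rb") as src:
        for _ in range(n):
            if not src.readline():      # the first n lines (sent to the pipe)
                break
        offset = src.tell()             # "X": bytes already consumed
        with open(leftover_path, "wb") as rest:
            shutil.copyfileobj(src, rest)   # bulk byte copy of the tail
    return offset
```

The tail copy moves large binary chunks instead of individual lines, which is where the speedup would come from.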
It could also be faster if the files you are writing and the ones you are reading are on different physical disk drives.
Any performance data or other info would help us help you. A few thousand files and 60 million lines is not particularly intimidating.
Update with another comment: there can be performance issues in your processing pipeline. A pipe has finite capacity, so the sender can't spew data out any faster than the receiver can take it. There are solutions to this sort of problem, but more info is needed.
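That backpressure is handled for you by the OS: writes into a full pipe simply block until the reader catches up. A small Python illustration (the child command here is just an echo-style script I made up for the example; `communicate` also avoids the classic deadlock of writing while the child's output buffer fills):

```python
import subprocess
import sys

def feed_lines(lines, cmd):
    """Write lines into a child process's stdin and collect its stdout.
    The OS pipe buffer is finite, so the sender is throttled to the
    receiver's pace automatically."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    out, _ = proc.communicate("".join(lines))  # handles both ends safely
    return out

# Example: a trivial pass-through child process.
echo_cmd = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read())"]
```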