Hi Chanti!

Like Grandfather, I am having some trouble understanding the overall objective and workflow. From what you describe, it sounds like you want to send the first N lines of a file to a pipe and save the "leftover" lines (if there are any) to another file for later processing. Is that right?

Physically, on the disk, this means reading the entire input file no matter what tools you use. The first N lines would be sent to the pipe for processing by another program, and the remaining TotalInputFileLines - N lines would need to be written back to the disk.
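To make sure we are talking about the same thing, here is a minimal sketch of that workflow as I understand it. The sub name split_first_n and its arguments are my own invention, not anything from your code, and I am assuming plain newline-terminated text.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the first $n lines of $in_path to $out_fh
# (a pipe or any other open handle) and write whatever is left over
# to $rest_path. The leftover file is only created if leftovers exist.
# Returns the number of leftover lines.
sub split_first_n {
    my ($in_path, $n, $out_fh, $rest_path) = @_;

    open my $in, '<', $in_path or die "Cannot open $in_path: $!";

    my ($seen, $leftover, $rest) = (0, 0, undef);
    while (my $line = <$in>) {
        if (++$seen <= $n) {
            print {$out_fh} $line or die "Write to output handle failed: $!";
            next;
        }
        # First leftover line: open the leftover file lazily.
        unless ($rest) {
            open $rest, '>', $rest_path or die "Cannot open $rest_path: $!";
        }
        print {$rest} $line or die "Write to $rest_path failed: $!";
        $leftover++;
    }

    if ($rest) {
        close $rest or die "Cannot close $rest_path: $!";
    }
    close $in;
    return $leftover;
}
```

You would call it with something like a handle from open(my $pipe, '|-', 'your_program') as the third argument. Note that the whole input file is still read once, which is the point I am making above.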

You can determine the number of bytes in the input file without reading it (this is a number that the file system already knows). But counting the lines requires reading the data and looking for line endings.
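A small sketch of the difference, with sub names (file_size_bytes, count_lines) that I made up for illustration. The size comes straight from stat, while the line count has to read every byte. I am assuming "\n" line endings here; a final line with no trailing newline would not be counted.

```perl
use strict;
use warnings;

# Size in bytes: answered from file system metadata, no data is read.
sub file_size_bytes {
    my ($path) = @_;
    my @st = stat($path) or die "Cannot stat $path: $!";
    return $st[7];    # element 7 of stat() is the size in bytes
}

# Line count: there is no shortcut, the whole file must be scanned.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my ($count, $buf) = (0, '');
    while (1) {
        my $got = read($fh, $buf, 65536);
        die "Read from $path failed: $!" unless defined $got;
        last if $got == 0;
        $count += ($buf =~ tr/\n//);    # tr counts newlines in this chunk
    }
    close $fh;
    return $count;
}
```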

My first question is: Why save the TotalInputFileLines - N leftover lines back to the disk? Why not just process them now? That way you only read all of the data once and you don't have to save raw unprocessed data back to the disk.

Another question: What percentage of the input file is typically processed? This could matter. If the percentage is "small", then it might make sense to a) determine the current byte offset, "X", after the first N lines have been read, b) close the input file and re-open it in binary mode, c) skip the first X bytes (for example with seek), and d) copy all remaining bytes to the new file in large blocks. This would require some experimentation, but binary file operations are faster than text mode operations because there is no searching for line endings.
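Roughly what I have in mind, as a sketch only. The sub names and the 1 MB block size are my own guesses, and you would have to benchmark this against your real files before trusting it. In real use you would take tell() on the handle you are already feeding to the pipe, rather than re-reading the first N lines as the first helper does here.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset just past the first $n lines.
sub offset_after_n_lines {
    my ($path, $n) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $seen = 0;
    while ($seen < $n && defined(my $line = <$fh>)) {
        $seen++;
    }
    my $offset = tell($fh);    # current position in bytes
    close $fh;
    return $offset;
}

# Hypothetical helper: copy everything from byte $offset to the end of
# $in_path into $out_path, in binary mode, without looking at line endings.
sub copy_from_offset {
    my ($in_path, $offset, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot open $out_path: $!";
    seek($in, $offset, 0) or die "Cannot seek in $in_path: $!";

    my $buf;
    while (1) {
        my $got = read($in, $buf, 1 << 20);    # 1 MB blocks, an arbitrary choice
        die "Read from $in_path failed: $!" unless defined $got;
        last if $got == 0;
        print {$out} $buf or die "Write to $out_path failed: $!";
    }
    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}
```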

It could also be faster if the files you are writing and the files you are reading are on different physical disk drives.

Any performance data or other info could help us help you. A few thousand files and 60m lines is not particularly intimidating.

Update with another comment: There can be some performance issues with your processing pipeline. The pipe has a finite capacity, so the sender can't write to it any faster than the receiver can read from it. There are solutions to these sorts of problems, but more info is needed.

In reply to Re: Merge and split files based on number of lines by Marshall
in thread Merge and split files based on number of lines by Sekhar Reddy
