Re: Read Some lines in Tera byte file
by BrowserUk (Patriarch) on Oct 13, 2010 at 07:12 UTC
- If you need to read line 100 once, read from the beginning.
- If you know you'll need to read line 100 again, remember where (tell) in the file it was located.
And while you're at it, you may as well remember where lines 1 through 99 were located also.
- If you need to read lines at random and often, and quickly: see Re: help reading from large file needed.
It will take a local file, of ( 8*number-of-lines ) bytes, to hold the index, and roughly the same amount of time as wc -l.
With a slightly more sophisticated indexer, the size of the index file can be reduced by half.
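A minimal sketch of that tell-based index, assuming a 64-bit Perl (for pack 'Q') and placeholder file names (sample.txt, sample.idx):

```perl
use strict;
use warnings;

# Build a small sample file with variable-length lines, standing in
# for the real terabyte file.
open my $out, '>', 'sample.txt' or die "open: $!";
print $out "line $_: " . ( 'x' x ( $_ % 37 ) ) . "\n" for 1 .. 200;
close $out;

# Pass 1: record tell() before reading each line, packed as a 64-bit
# offset -- 8 bytes per line, as described above.
open my $fh,  '<',     'sample.txt' or die "open: $!";
open my $idx, '>:raw', 'sample.idx' or die "open: $!";
while (1) {
    my $pos = tell $fh;                 # where the next line starts
    defined( my $line = <$fh> ) or last;
    print $idx pack 'Q', $pos;          # 'Q' needs a 64-bit Perl
}
close $fh;
close $idx;

# Later: jump straight to line 100 via the index, no rescanning.
my $want = 100;
open my $in, '<:raw', 'sample.idx' or die "open: $!";
seek $in, ( $want - 1 ) * 8, 0 or die "seek: $!";
read( $in, my $packed, 8 ) == 8 or die "short read";

open my $data, '<', 'sample.txt' or die "open: $!";
seek $data, unpack( 'Q', $packed ), 0 or die "seek: $!";
my $line100 = <$data>;
print $line100;
```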
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Read Some lines in Tera byte file
by ikegami (Patriarch) on Oct 13, 2010 at 06:36 UTC
Unless you happen to know the position at which the 100th line starts, you'll have to read from the start of the file. You can stop reading any time, though.
for (1..100) {
    <$fh>;              # read and discard a line
}
for (1..100) {
    my $line = <$fh>;   # read and keep a line
    ...
}
With

for (1..100) {
    <$fh>;
}

we read (and discard) lines 1 through 100; then with

for (1..100) {
    my $line = <$fh>;
    ...
}

we save the lines we actually need (lines 101 through 200) into a variable.
So the first loop reads the first 100 lines unnecessarily. Is it possible to read directly from the 100th line?
Already answered. Unless you happen to know the position at which the 100th line starts, you'll have to read from the start of the file. If you do, you can jump straight to that position using seek.
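A minimal sketch of that tell/seek round trip, using a small throwaway demo.txt in place of the real file:

```perl
use strict;
use warnings;

# Stand-in for the big file: 150 short lines.
open my $out, '>', 'demo.txt' or die "open: $!";
print $out "line $_\n" for 1 .. 150;
close $out;

open my $fh, '<', 'demo.txt' or die "open: $!";
<$fh> for 1 .. 99;       # first pass: skip lines 1..99
my $pos = tell $fh;      # byte offset where line 100 starts
my $first_read = <$fh>;  # line 100

# Later: no rescanning needed, jump straight back to line 100.
seek $fh, $pos, 0 or die "seek: $!";
my $again = <$fh>;
print $again;
```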
Re: Read Some lines in Tera byte file
by cdarke (Prior) on Oct 13, 2010 at 08:32 UTC
Is it possible, without reading the entire file, only to fetch the required lines from the file?
Well, you do not say what type of file it is, whether the lines are a fixed length or not, and which operating system you are on.
Back in the olden days file formats were many and varied, and often supported indexes, even on lines in a file containing text. That is not generally done these days on UNIX or Windows. A text file no longer contains physical line records; it is just a stream of bytes. So when a file looks like this in a text editor or file viewer:

This is line 1
This is line 2
This is line 3

the file really looks like this (on UNIX):

This is line 1\nThis is line 2\nThis is line 3\n

where "\n" is a newline character. Windows text files by convention have "\r\n" between lines, and may be terminated by ^Z (control-Z).
So, a text file is just a stream of bytes. Saying that you want to seek to line 100 means that you need the position of the start of line 100 in the file; there is no index of line positions attached to the file unless you construct one yourself. If the lines are of fixed length then it is easy to derive that position. Some log files do have fixed-length lines, but most do not.
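A sketch of the fixed-length case, assuming 81-byte records (80 characters of payload plus "\n") in a hypothetical fixed.txt; with fixed-length records the start of line N is a simple product, so no scan and no index are needed:

```perl
use strict;
use warnings;

# Build a demo file where every record is exactly 81 bytes:
# 80 characters of space-padded payload plus "\n".
my $RECLEN = 81;
open my $out, '>', 'fixed.txt' or die "open: $!";
printf $out "%-80s\n", "This is line $_" for 1 .. 200;
close $out;

# The start of line N is simply (N - 1) * $RECLEN.
my $want = 100;    # 1-based line number
open my $fh, '<', 'fixed.txt' or die "open: $!";
seek $fh, ( $want - 1 ) * $RECLEN, 0 or die "seek: $!";
my $line = <$fh>;
print $line;
```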
One possibility to improve performance, particularly if the file is accessed over a network, is to zip it up, then use an unzip program to pipe the data to you, for example:

open (my $zip, '-|', 'gzip -dc compressed_file.gz')
    || die "Can't run gzip: $!";
while (<$zip>) {
    # do some stuff
}
close $zip;

There are modules on CPAN that will do this as well, but I don't have any experience of them. How much I/O this will save depends on how much compression can be done, and that is data dependent. It might even be slower; you will have to experiment.
Re: Read Some lines in Tera byte file
by ctilmes (Vicar) on Oct 13, 2010 at 11:12 UTC
Displaying/buffering huge text files discusses some index techniques to speed up picking random lines out of a huge file.
The OP in that thread was able to index a 100MB file with ~6 million lines in seconds and pick out random lines in milliseconds by pre-indexing every 1000 lines.
If your 1TB file had an average of 80 characters per line, you'd have roughly 13 billion lines. If you indexed every 1000 lines, your index would have about 13 million entries. At 8 bytes per entry, that's about 100MB for the index.
You pre-index by reading the entire file, and every 1000 lines, write the file position to your index file.
Then, for example, to find line 12345, you divide that by 1000 to get 12, and read the 12th entry of the index (at byte offset 12*8). Then seek to that position in the big file and start reading. After reading 345 lines you're at your target.
You can tune the 1000 to trade off speed and space.
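A hedged sketch of that scheme (file names and the 1000-line granularity are illustrative), shown here on a small 5000-line file with a lookup of line 2345 rather than 12345:

```perl
use strict;
use warnings;

my $EVERY = 1000;    # index granularity; tune for speed vs space

# Stand-in for the big file: 5000 lines of varying length.
open my $out, '>', 'big.txt' or die "open: $!";
print $out "row $_ " . ( '.' x ( $_ % 17 ) ) . "\n" for 1 .. 5000;
close $out;

# Pass 1: entry k of the index holds the byte offset of line
# k * $EVERY + 1 (entry 0 = line 1, entry 1 = line 1001, ...).
open my $fh,  '<',     'big.txt' or die "open: $!";
open my $idx, '>:raw', 'big.idx' or die "open: $!";
my $n = 0;
while (1) {
    print $idx pack( 'Q', tell $fh ) if $n % $EVERY == 0;
    defined( my $line = <$fh> ) or last;
    $n++;
}
close $fh;
close $idx;

# Lookup: line 2345 -> entry int(2344 / 1000) = 2, which holds the
# offset of line 2001; skip 344 lines, then the next read is 2345.
my $want  = 2345;
my $entry = int( ( $want - 1 ) / $EVERY );
open my $in, '<:raw', 'big.idx' or die "open: $!";
seek $in, $entry * 8, 0 or die "seek: $!";
read( $in, my $packed, 8 ) == 8 or die "short read";

open my $data, '<', 'big.txt' or die "open: $!";
seek $data, unpack( 'Q', $packed ), 0 or die "seek: $!";
<$data> for 1 .. ( $want - 1 ) - $entry * $EVERY;    # skip 344 lines
my $target = <$data>;
print $target;
```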
Re: Read Some lines in Tera byte file
by sundialsvc4 (Abbot) on Oct 13, 2010 at 11:40 UTC
An approximate index (e.g. the position of every thousandth line of data) is probably a very reasonable approach to use here. (SQLite is amazingly useful for such things.) You really only have to get the computer “into the general neighborhood,” because when it does the disk-seek it's going to bring in several sectors' worth of data.
Another very useful technique, if you can manage it, is to first sort your update (or search) keys into the same order as the file itself. Now you can move through the data one time, sequentially. Whatever updates or changes you need to make to any particular region of the file, you will be able to do “all at once, and then move on.”
These strategies were, of course, absolutely necessary when the only “mass” storage devices we possessed were digital reel-to-reel tapes that stored a few hundred bytes per inch, but they are still surprisingly apropos to this day. Although we have high-density disks that rotate at thousands of RPM, many of our “ruling constraints” when dealing with large data sets are still physical ones: “seek time” and “rotational latency.”
Or, in this case ... network time and bandwidth! Is it possible, for instance, to do this work on the server computer directly? When dealing with a huge network-based file, you really, really want to do that... because otherwise, every one of those trillions of bytes is going to be transmitted down the pipe between the two computers. Z-z-z-z-z-z-z....
Re: Read Some lines in Tera byte file
by marto (Cardinal) on Oct 13, 2010 at 06:21 UTC
Isn't this the same question as How to read 1 td file? Perhaps mount the remote disk and use something like Tie::File, or run your script on the remote server.
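For what it's worth, a minimal Tie::File sketch (tied.txt is a placeholder; note that Tie::File still scans the file up to the requested line on first access, so it can be very slow at terabyte scale):

```perl
use strict;
use warnings;
use Tie::File;

# Stand-in for the remote file: 300 short lines.
open my $out, '>', 'tied.txt' or die "open: $!";
print $out "entry $_\n" for 1 .. 300;
close $out;

# The tied array presents the file line by line; the record
# separator is stripped from each element.
tie my @lines, 'Tie::File', 'tied.txt' or die "tie: $!";
my $line100 = $lines[99];    # line 100, 0-based index
print "$line100\n";
untie @lines;
```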
I posted from my phone; I couldn't realistically try this at the time. It is, as I suppose you're suggesting, pretty darn slow. I'd used Tie::File in the past, but not with files this large.
Cheers
Re: Read Some lines in Tera byte file
by jethro (Monsignor) on Oct 13, 2010 at 11:17 UTC
If the file doesn't get changed often and you have the ability to seek in it, then I would create an index of where every hundredth or thousandth line starts, i.e. where line 1000, line 2000, line 3000... starts. That way my index file would be relatively small and I would have to read at most 99 or 999 unnecessary lines before reaching the payload.
But as I said, this depends on the data being unchanging or only getting updated by appending.