Read Some lines in Tera byte file

by Anonymous Monk
on Oct 13, 2010 at 06:14 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I want to read only some lines in a file.

The file is located on a server.

1) The file size is in terabytes.

2) My hard disk size is in gigabytes.

Let's say, for example, I want to read from the 100th line to the 200th line of the file.
So, is it possible to fetch only the required lines without reading the entire file?

Replies are listed 'Best First'.
Re: Read Some lines in Tera byte file
by BrowserUk (Patriarch) on Oct 13, 2010 at 07:12 UTC
    • If you need to read line 100 once, read from the beginning.
    • If you know you'll need to read line 100 again, remember where (tell) in the file it was located.

      And while you're at it, you may as well remember where lines 1 through 99 were located, too.

    • If you need to read lines at random and often, and quickly: see Re: help reading from large file needed.

      It will take a local file of ( 8 * number-of-lines ) bytes to hold the index, and building it takes roughly the same amount of time as wc -l. (A minimal sketch of the idea follows this list.)

      With a slightly more sophisticated indexer, the size of the index file can be reduced by half.
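
    A minimal sketch of that approach (this is not the code from the linked node; the file names are invented, and pack 'Q' assumes a 64-bit Perl):

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my ($data_file, $index_file) = ('huge.txt', 'huge.idx');

        # Pass 1: record the byte offset of every line, 8 bytes each,
        # for an index of ( 8 * number-of-lines ) bytes.
        open my $in,  '<', $data_file  or die "Can't open $data_file: $!";
        open my $idx, '>', $index_file or die "Can't create $index_file: $!";
        binmode $idx;
        my $pos = 0;
        while (defined(my $line = <$in>)) {
            print {$idx} pack 'Q', $pos;   # offset at which this line started
            $pos = tell $in;
        }
        close $idx;

        # Later: jump straight to line 100 (1-based) without rereading 1..99.
        open my $lut, '<', $index_file or die "Can't open $index_file: $!";
        binmode $lut;
        seek $lut, (100 - 1) * 8, SEEK_SET;
        read $lut, my $packed, 8;
        seek $in, unpack('Q', $packed), SEEK_SET;
        print scalar <$in>;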


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Read Some lines in Tera byte file
by ikegami (Patriarch) on Oct 13, 2010 at 06:36 UTC
    Unless you happen to know the position at which the 100th line starts, you'll have to read from the start of the file. You can stop reading any time, though.
    for (1..100) { <$fh>; }                  # read and discard lines 1..100
    for (1..100) { my $line = <$fh>; ... }   # then handle each of the next 100 lines
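
    Spelled out as a complete sketch (the file name huge.txt is invented; this keeps lines 100 through 200 inclusive, so it skips only the first 99):

        use strict;
        use warnings;

        open my $fh, '<', 'huge.txt' or die "Can't open huge.txt: $!";
        readline($fh) for 1 .. 99;               # read and discard lines 1..99
        for (100 .. 200) {                       # lines 100..200 inclusive
            defined(my $line = <$fh>) or last;   # stop early if the file runs out
            print $line;
        }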

      Thanks for the reply

      Here ..

      for (1..100) { <$fh>; }

      we are reading from the 1st line to the 100th line, then

      for (1..100) { my $line = <$fh>; ... }

      we are saving the required lines to a variable, i.e. from the 100th line to the 200th line

      So the first loop reads the first 100 lines unnecessarily. Is it possible to read directly from the 100th line?

        Already answered. Unless you happen to know the position at which the 100th line starts, you'll have to read from the start of the file. If you do, you can jump straight to that position using seek.

        No.

Re: Read Some lines in Tera byte file
by cdarke (Prior) on Oct 13, 2010 at 08:32 UTC
    Is it possible to fetch only the required lines without reading the entire file?

    Well, you do not say what type of file it is, whether the lines are fixed length or not, or which operating system you are on.

    Back in the olden days file formats were many and varied, and often supported indexes, even on lines in a file containing text. That is not generally done these days on UNIX or Windows. A text file no longer contains physical line records; it is just a stream of bytes. So when a file looks like this in a text editor or file viewer:
    This is line 1
    This is line 2
    This is line 3
    in fact the file really looks like this (on UNIX):
    This is line 1\nThis is line 2\nThis is line 3\n
    where "\n" is a newline character. Windows text files by convention have "\r\n" between each line, and might be terminated by ^Z (control-Z).

    So, a text file is just a stream of bytes. Saying that you want to seek to line 100 means that you need the position of the start of line 100 in the file; there is no index of line positions attached to the file unless you construct one yourself. If the lines are of fixed length, it is easy to derive that position. Some log files do have fixed-length lines, but most do not.
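
    For the fixed-length case, a minimal sketch (the record length and file name are assumptions, and each record is assumed to end in "\n"):

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my $rec_len = 81;    # 80 data bytes plus the trailing newline
        open my $fh, '<', 'fixed.log' or die "Can't open fixed.log: $!";

        # Line K starts at byte (K - 1) * $rec_len, so no index is needed.
        seek $fh, (100 - 1) * $rec_len, SEEK_SET or die "seek failed: $!";
        for (100 .. 200) {
            defined(my $line = <$fh>) or last;
            print $line;
        }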

    One possibility to improve performance, particularly if the file is accessed over a network, is to zip it up then use an unzip program to pipe the data to you, for example:
    open (my $zip, '-|', 'gzip -dc compressed_file.gz')
        || die "Can't run gzip: $!";
    while (<$zip>) {
        # do some stuff
    }
    close $zip;
    There are modules on CPAN that will do this as well, but I don't have any experience of them. How much I/O this will save depends on how much compression can be achieved, and that is data dependent. It might even be slower; you will have to experiment.
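
    One such module, as a hedged sketch (IO::Uncompress::Gunzip, which ships with modern Perls; the file name is an assumption):

        use strict;
        use warnings;
        use IO::Uncompress::Gunzip qw($GunzipError);

        # A line-oriented handle over the compressed file, without
        # shelling out to an external gzip.
        my $z = IO::Uncompress::Gunzip->new('compressed_file.gz')
            or die "gunzip failed: $GunzipError";
        while (defined(my $line = $z->getline())) {
            # do some stuff
        }
        $z->close();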
Re: Read Some lines in Tera byte file
by ctilmes (Vicar) on Oct 13, 2010 at 11:12 UTC
    Displaying/buffering huge text files discusses some index techniques to speed up picking random lines out of a huge file.

    The OP in that thread was able to index a 100MB file with ~6 million lines in seconds and pick out random lines in milliseconds by pre-indexing every 1000 lines.

    If your 1TB file had an average of 80 characters per line, you'd have roughly 13 billion lines. If you indexed every 1000th line, you'd have about 13 million index points. At 8 bytes per point, that's roughly 100MB for the index.

    You pre-index by reading the entire file once and, every 1000 lines, writing the file position to your index file.

    Then, for example, to find line 12345, you divide by 1000 and read the 12th entry in the index (at offset 12 * 8). Then seek to that position in the big file and start reading. After 345 lines you're at your position.

    You can tune the 1000 to trade off speed and space.
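
    A minimal sketch of both passes (the file names are invented; checkpoints here fall on lines 1, 1001, 2001, ..., and pack 'Q' assumes a 64-bit Perl):

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my $every = 1000;    # tune this to trade index size against lines skipped

        # Pass 1: write an 8-byte offset for every $every-th line.
        open my $in,  '<', 'huge.txt' or die "Can't open huge.txt: $!";
        open my $idx, '>', 'huge.idx' or die "Can't create huge.idx: $!";
        binmode $idx;
        my ($n, $pos) = (0, 0);
        while (defined(my $line = <$in>)) {
            print {$idx} pack 'Q', $pos if $n % $every == 0;   # lines 1, 1001, 2001, ...
            $n++;
            $pos = tell $in;
        }
        close $idx;

        # Lookup: line 12345 is 344 lines past the checkpoint at line 12001.
        my $want  = 12345;
        my $entry = int(($want - 1) / $every);
        open my $lut, '<', 'huge.idx' or die "Can't open huge.idx: $!";
        binmode $lut;
        seek $lut, $entry * 8, SEEK_SET;
        read $lut, my $packed, 8;
        seek $in, unpack('Q', $packed), SEEK_SET;
        readline($in) for 1 .. ($want - 1) % $every;   # skip at most $every - 1 lines
        print scalar <$in>;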

Re: Read Some lines in Tera byte file
by sundialsvc4 (Abbot) on Oct 13, 2010 at 11:40 UTC

    An approximate index (e.g. the position of every thousandth line of data) is probably a very reasonable approach to use here. (SQLite is amazingly useful for such things.) You really only have to get the computer “into the general neighborhood,” because when it does the disk-seek it's going to bring in several sectors' worth of data.

    Another very useful technique, if you can manage it, is to first sort your update (or search) keys into the same order as the file itself. Now you can move through the data one time, sequentially. Whatever updates or changes you need to make to any particular region of the file, you will be able to do “all at once, and then move on.”

    These strategies were, of course, absolutely necessary when the only “mass” storage devices we possessed were digital reel-to-reel tapes that stored a few hundred bytes per inch, but they are still surprisingly apropos to this day. Although we have high-density disks that rotate at thousands of RPMs, many of our “ruling constraints” when dealing with large data sets are still physical ones: “seek time” and “rotational latency.”

    Or, in this case ... network time and bandwidth! Is it possible, for instance, to do this work on the server computer directly? When dealing with a huge network-based file, you really, really want to do that... because otherwise, every one of those trillions of bytes is going to be transmitted down the pipe between the two computers. Z-z-z-z-z-z-z....
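
    A hedged sketch of the SQLite variant (standard DBI/DBD::SQLite calls; the database, table, and file names are all invented):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=line_index.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS idx (line INTEGER PRIMARY KEY, pos INTEGER NOT NULL)');

        # One pass: store the byte offset of every thousandth line.
        my $ins = $dbh->prepare('INSERT OR REPLACE INTO idx (line, pos) VALUES (?, ?)');
        open my $in, '<', 'huge.txt' or die "Can't open huge.txt: $!";
        my ($n, $pos) = (0, 0);
        while (defined(my $line = <$in>)) {
            $ins->execute($n + 1, $pos) if $n % 1000 == 0;   # lines 1, 1001, 2001, ...
            $n++;
            $pos = tell $in;
        }
        $dbh->commit;

        # Later: the nearest checkpoint at or below the wanted line.
        my ($from_line, $from_pos) = $dbh->selectrow_array(
            'SELECT line, pos FROM idx WHERE line <= ? ORDER BY line DESC LIMIT 1',
            undef, 12345);
        $dbh->disconnect;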

Re: Read Some lines in Tera byte file
by marto (Cardinal) on Oct 13, 2010 at 06:21 UTC

    Isn't this the same question as How to read 1 td file? Perhaps mount the remote disk and use something like Tie::File, or run your script on the remote server.

      use something like Tie::File,

      Please try this yourself (say, read the middle line in a 1 or 2GB file) before suggesting it to others.

        I posted from my phone, so I couldn't realistically try this at the time. It is, as I suppose you're suggesting, pretty darn slow. I'd used Tie::File in the past, but not with files this large.

        Cheers

Re: Read Some lines in Tera byte file
by jethro (Monsignor) on Oct 13, 2010 at 11:17 UTC

    If the file doesn't get changed often and you have the ability to seek in it, then I would create an index of where every hundredth or thousandth line starts, i.e. where line 1000, line 2000, line 3000, ... starts. That way my index file would be relatively small, and I would have to read at most 99 or 999 unnecessary lines before reaching the payload.

    But as I said, this depends on the data being unchanging, or only getting updated by appending.
