in reply to file size limit with Tie::File

In order to print the number of lines in the file, Tie::File has to read the whole file. To do this, it has to page the file through its buffer, which by default is only 2MB. Hence it does a lot of memory shuffling, which causes the symptoms you are seeing.

Try increasing the buffer size using the memory option. See the docs for details.
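
For example, a minimal sketch along those lines (the file name and the 100MB figure are only illustrative), raising the cache above the 2MB default via the memory option:

use Tie::File;
use Fcntl 'O_RDONLY';

# Tie the file read-only and let Tie::File cache up to ~100MB of
# records instead of the default 2MB.
tie my @lines, 'Tie::File', 'big.log',
    mode   => O_RDONLY,
    memory => 100_000_000
    or die "Cannot tie big.log: $!";

# Counting the lines still means reading to end-of-file once,
# but with a larger cache there is far less paging in and out.
print scalar(@lines), " lines\n";

untie @lines;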



Re^2: file size limit with Tie::File
by alw (Sexton) on Apr 08, 2007 at 21:16 UTC
    Thanks for the quick reply. I had tried different combinations of the memory and cache options. Actually it eventually does return after about 5 minutes, so it's not hung, just slow. I thought I would avoid reading the entire file into memory by using this module (wrong). I am trying to display 1000 lines or so at a time in a Tk ROText widget and I thought this tie could work. I'll go back to while ( <FH> ) with tells and seeks and see if that gives me better performance.

      Unfortunately, getting the number of lines in a text file without reading the file is an impossibility, no matter what module you use. Lines in a plain text file are of variable length, so there is no simple calculation that says a file of xxx kilobytes must be yyy lines long. That means the only way for any program or module to determine how many lines you have is to count how many "newline" characters are found in the file. And that's the same as counting any other character; you've got to read through the file to find out.

      Quick, how many lines are there in the Camel book? Until you've counted them, you'll never know. There's no magic here. If you need a quick solution, do a line count once, save it, and update it as the file changes.
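
      A minimal sketch of that "count it once" idea, with an illustrative file name: read the file in fixed-size chunks and count newlines with tr///, so only one chunk sits in memory at a time.

      use strict;
      use warnings;

      my $file = 'big.log';                        # illustrative name
      open my $fh, '<', $file or die "Cannot open $file: $!";

      my ( $count, $buf ) = ( 0, '' );
      while ( read( $fh, $buf, 1_048_576 ) ) {     # 1MB chunks
          $count += ( $buf =~ tr/\n// );           # tr/// just counts the newlines
      }
      close $fh;

      print "$file has $count lines\n";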

      Tie::File is a convenience module, and it provides this convenience with what is usually a very minor performance penalty. You have stumbled into a situation where the module doesn't excel, but regardless of the solution you come up with, you're going to have to read the file at least once.


      Dave

      I am trying to display 1000 lines or so at a time in a Tk ROText widget and I thought this tie could work. I'll go back to while ( <FH> ) with tells and seeks and see if that gives me better performance.

      You might find some relevant ideas in this older thread: Displaying/buffering huge text files.

      If you decide to take the time to index the byte offsets of all the line endings in your log file, that will surely end up providing much better performance, but if the log file changes over time, you'll be updating the index constantly. Of course, that will just be a matter of appending more byte offsets as more lines are added, but the index itself is likely to become unwieldy (maybe the line count is such that indexing all the lines is already unwieldy).
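
      A rough sketch of such an index (the file name and line numbers are only illustrative): record tell() after each line read, then seek() straight to any saved offset later.

      use strict;
      use warnings;

      my $file = 'big.log';
      open my $fh, '<', $file or die "Cannot open $file: $!";

      # $offset[$n] is the byte offset where line $n (0-based) starts.
      my @offset = (0);
      while ( <$fh> ) {
          push @offset, tell($fh);   # where the *next* line begins
      }
      pop @offset;                   # drop the final entry (end of file)

      # Jump straight to line 50_000 and grab the next 1000 lines
      # (assuming the file is long enough).
      seek( $fh, $offset[50_000], 0 );
      my @chunk;
      push @chunk, scalar <$fh> for 1 .. 1000;

      If the log keeps growing, the same loop can be rerun from the last saved offset to append the new line offsets to @offset.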

      If the goal is simply to be able to show a good-sized chunk of lines in a Tk ROText window, maybe you don't really need accurate info about where the line endings are. Just use reasonable estimates where necessary, along the following lines:

      $requested_start = ...;   # a value between 0 and 1
      $avg_line_len    = ...;   # make a guess or read a small sample to estimate this
      $file_size   = -s $filename;
      $read_length = $avg_line_len * 1000;
      seek( FH, $file_size * $requested_start, 0 );
      read( FH, $text, $read_length );
      $text =~ s/^.*\n//;       # trim initial and final
      $text =~ s/.*$//;         # line fragments from $text
      alw,
      "I thought I would avoid reading the entire file into memory by using this module(wrong)."

      You have misquoted BrowserUk, who said Tie::File must read the whole file. Tie::File works by indexing the locations of $/ (default = "\n"). To do this, it must read the whole file, but as BrowserUk points out, it only reads so much into memory at a time. The other advice in this thread applies, but I wanted to point out that you were inaccurate.
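
      A small sketch of that behaviour, with an illustrative file name: fetching an early record is cheap because the index is built lazily, while anything that needs the record count forces a scan to the end of the file.

      use Tie::File;
      use Fcntl 'O_RDONLY';

      tie my @records, 'Tie::File', 'app.log',
          mode   => O_RDONLY,
          recsep => "\n"        # the $/ equivalent that Tie::File indexes
          or die "Cannot tie app.log: $!";

      print $records[0], "\n";               # cheap: indexes only as far as the first record
      print scalar(@records), " records\n";  # expensive: must scan to EOF to count records

      untie @records;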

      Update: Typo corrected thanks to BrowserUk++

      Cheers - L~R