Esteemed monks,

In Displaying/buffering huge text files I described the need for a buffering module that allows smooth, read-only display of huge text files in GUIs. A very interesting and lively discussion followed, and I settled on a solution: an internal buffer plus a decimated index that records one in every 1000 lines, which gave good performance with minimal memory consumption.
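For anyone who missed that thread, here is a rough sketch of the decimated index idea. It is written in Python for brevity and is not the actual module: the function names, the 0-based line numbering and the binary-mode file handle are all assumptions made for illustration.

```python
import io

def build_sparse_index(f, step=1000):
    """Scan a binary file once; remember the byte offset of every step-th line.

    Returns (offsets, total_lines). Line numbers are 0-based (an assumption).
    """
    offsets = []
    lineno = 0
    f.seek(0)
    while True:
        pos = f.tell()          # offset of the line about to be read
        line = f.readline()
        if not line:            # end of file
            break
        if lineno % step == 0:
            offsets.append(pos)
        lineno += 1
    return offsets, lineno

def get_line(f, offsets, n, step=1000):
    """Fetch line n by seeking to the nearest indexed line and reading forward."""
    f.seek(offsets[n // step])
    for _ in range(n % step):   # skip at most step - 1 lines
        f.readline()
    return f.readline()
```

Memory use is one integer per 1000 lines, and a lookup never reads more than 1000 lines from disk.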

But real life being what it is, complications tend to spring up unexpectedly. An additional requirement for this module now raises some serious questions about the design.

The new requirement is, in essence, simple: there should be a way to filter certain lines out of a file, e.g. never show lines that start with "Foobar:".

At first this doesn't look tough, but given some thought it complicates matters enormously. The most annoying thing about such requirements is that they actually make sense (filtering is important on very big files).

I can assume that all filters are known beforehand. Say I know that a user might want to filter out "Foobar:" lines. At any point the user may ask the GUI to enable or disable the filter.

I'm now thinking of making the filtering transparent to the GUI, inside the buffer. Say the GUI requests line 115. If no filter is enabled, the buffer knows this is the real line 115 of the file and acts according to its original algorithm. If filtering is enabled, it should instead provide the 115th line that survives the filter.

That probably means that, on startup, I need to create a separate index for each filter. That alone is not enough, however, because the "real" distance between two adjacent unfiltered lines can be 5000 lines in the file, and I wouldn't want to wade through them all just to find the next line to show.
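To make the per-filter index concrete, here is a minimal sketch, again in Python and again with made-up names (`is_hidden`, `build_filtered_index`, `get_visible_line`). It assumes the filter is a fixed prefix test and that line numbers are 0-based. The index stores the byte offset of every 1000th line that survives the filter, so the GUI's line number maps straight to a checkpoint.

```python
import io

def is_hidden(line, prefix=b"Foobar:"):
    """Hypothetical filter: hide lines that start with the given prefix."""
    return line.startswith(prefix)

def build_filtered_index(f, hidden, step=1000):
    """Record the offset of every step-th *visible* line. Returns (offsets, visible_count)."""
    offsets = []
    visible = 0
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if hidden(line):
            continue            # filtered lines do not advance the visible counter
        if visible % step == 0:
            offsets.append(pos)
        visible += 1
    return offsets, visible

def get_visible_line(f, offsets, hidden, n, step=1000):
    """Return the n-th visible line, or b'' if the file ends first."""
    f.seek(offsets[n // step])
    remaining = n % step
    while True:
        line = f.readline()
        if not line:
            return b""
        if hidden(line):
            continue            # still has to be read and tested, which is the weak spot
        if remaining == 0:
            return line
        remaining -= 1
```

This does not cure the long-gap problem: between two checkpoints the lookup still reads and tests every hidden line in its way, so a region that is mostly "Foobar:" lines stays slow.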

In addition, indexing the filtered lines on startup imposes a severe performance hit. Instead of simply reading in each line and counting, I must now also apply a regular expression to every line.
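One way the startup cost might be contained, sketched under the assumption that every filter is a fixed prefix like "Foobar:": build all the indexes in a single pass over the file and use a plain prefix comparison instead of a regular expression. The function name and the dictionary layout are hypothetical, and I have not measured how much this saves on a real file.

```python
import io

def build_all_indexes(f, prefixes, step=1000):
    """Single pass over a binary file.

    prefixes maps a filter name to the byte prefix it hides.
    Returns (indexes, counts); the key None holds the unfiltered view.
    """
    indexes = {None: []}
    counts = {None: 0}
    for name in prefixes:
        indexes[name] = []
        counts[name] = 0
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        # Unfiltered view: every line counts.
        if counts[None] % step == 0:
            indexes[None].append(pos)
        counts[None] += 1
        # One filtered view per filter: only surviving lines count.
        for name, prefix in prefixes.items():
            if line.startswith(prefix):
                continue
            if counts[name] % step == 0:
                indexes[name].append(pos)
            counts[name] += 1
    return indexes, counts
```

Each filter costs one extra offset list of roughly (visible lines / 1000) integers. Views that combine several filters at once would each need their own list, which this sketch does not cover.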

Any ideas? I guess I could make it fast by sacrificing a lot of space, but that is not really an option for me.


In reply to Further on buffering huge text files by spurperl
