Hi all, I'm trying to find a fast way to manipulate fairly large files (anything from about 100 KB to 2 GB).

As a quick rundown: the files contain a mini markup language for driving laser printers. Each line in the file (\n delimited, MS Windows) is a separate instruction. The lines are grouped into the commands that create a specific page, and the pages are then grouped into sets of related pages. (These are all represented by objects that cache the data as it's discovered and make data extraction easier.)

To cut a long story short, I need a way to navigate around the file as quickly and efficiently as possible. Speed is probably more of a consideration than efficiency (memory usage et al.) in this case.

Currently I'm using Tie::File, but I'm not sure it's the best way. The problem is that a line near the start of the file is returned pretty quickly, but a line near the end takes a fair amount of time.

I was thinking about IO::File, but to be able to get a line directly I'd need to index the file first; otherwise I don't know where to seek to, since the lines are all variable in length.
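To make the indexing idea concrete, here is a minimal, untested-on-5.6.1 sketch of what I have in mind: scan the file once, record the byte offset where each line starts, then seek straight to any line. The names build_line_index and fetch_line are made up for illustration and are not from any module. It assumes the file is opened with binmode so that tell() and seek() agree on Windows, and that the Perl build can handle offsets as large as the file.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the file once and record the byte offset
# at which each line starts. Returns a reference to the offset list.
sub build_line_index {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                  # raw bytes, so offsets match what seek() expects
    my @offsets = (0);             # the first line starts at byte 0
    while (<$fh>) {
        push @offsets, tell($fh);  # position after this line = start of the next
    }
    pop @offsets;                  # the last entry is end-of-file, not a line start
    close($fh);
    return \@offsets;
}

# Hypothetical helper: fetch line $n (0-based) through an open, binmode'd
# filehandle, using the index built above. Returns undef if out of range.
sub fetch_line {
    my ($fh, $offsets, $n) = @_;
    return undef if $n < 0 || $n > $#$offsets;
    seek($fh, $offsets->[$n], 0) or die "seek failed: $!";
    my $line = <$fh>;
    $line =~ s/\r?\n\z//;          # strip a trailing LF or CRLF
    return $line;
}

# Usage (the path is a placeholder):
#   my $idx = build_line_index('job.prn');
#   open(my $fh, '<', 'job.prn') or die $!;
#   binmode($fh);
#   print fetch_line($fh, $idx, 41), "\n";   # the 42nd line
```

The catch is that the index is only valid until a line changes length, so any write that grows or shrinks a line means rebuilding the offsets from that point on. A plain array of offsets could also get large for a file with many millions of lines.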

There are a few likely-looking modules on CPAN, but never having used them I'm not familiar with their strengths and weaknesses, so I'd value some opinions.

Any code that can read the file also needs to be able to write to it so that the file can be amended. Currently this gets done by hand in something like UltraEdit and is fairly clunky, so I'm hoping what I'm developing will take some of the pain out of it :)

If I haven't covered something here adequately, just let me know and I'll try to clarify :)

This is all based on MS Windows 2000/XP desktops and servers running ActivePerl 5.6.1 (build 633).

Thanks in advance,

Quick aside:
Just wondering if there's any reason why all my replies just got downvoted? :-?

Thanks all for the advice so far though. It looks like sticking with Tie::File means getting into some kind of indexing. Is Tie::File the best solution here (short of reading the thing into a db, which I would if I could :)), or are there modules out there more suited to the task? I saw File::RandomAccess, but it doesn't appear to be available via ActiveState PPM, so it'd be a nightmare getting it onto machines here.

--- Jay

All code is untested unless otherwise stated.
All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and, as Abigail said, "Think for yourself."
If in doubt ask.


In reply to How to get fast random access to a large file? by gothic_mallard
