Hi BrowserUk
Your indexing program is very neat. I never realized that it can be so simple and effective (in perl). Thanks!
I tried it on some large files (50 to 500 MB, with average line length about 150 characters).
I quickly spotted speed a problem with
$index .= pack 'd', tell FILE while <FILE>;
... it has to copy all previously packed data in every .= operation, so that the time grows with the square of number of lines. A big oh, O(x^2) to be precise.
Here is my (almost) drop-in replacement which trades memory space for indexing time
my @index = ( pack 'd', 0 );
push @index, pack 'd', tell FILE while <FILE>;
pop @index;
my $index = join '', @index;
... and the timing that shows roughly O(x) times for mine, and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).
In the last test case (the 527 MB file) with my script version the process memory usage peaked at +270 MB for a final index size of 27.5 MB.
Indexing mine : time 1 s, size 19949431, lines 136126
Indexing mine : time 2 s, size 40308893, lines 258457
Indexing mine : time 5 s, size 95227350, lines 634392
Indexing mine : time 29 s, size 527423877, lines 3441911
Indexing yours: time 2 s, size 19949431, lines 136127
Indexing yours: time 6 s, size 40308893, lines 258458
Indexing yours: time 31 s, size 95227350, lines 634393
Indexing yours: time 809 s, size 527423877, lines 3441912
I also added
pop @index;, to get rid of the last index - it points to the end of the file, after the last line.
Rudif
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.