comment on

As promised I did a brute force calculation for the odds that this algorithm is fast.

Extrapolating the example from the OP in a 1.5 GB File and randomly skippping 1 out of 4 lines it seems you'll miss the searched line in average for more than 1200 entries before you start searching linearly.

I stopped after 2 test runs since my laptop with 1 GB RAM was heavily swapping for 30 minutes then* ... I'll post the completed results tomorrow morning. : )

Temporary conclusion, over 1200 seeks doesn't even look close to O(1) for me, could it be your testdata was a little optimistic? Hope you understand my critic now.

Please Note: This is the very optmistic best case for only a 25% gap probability. I didn't even try to insert any worsening conditions now!

That much about the normal average case.

Cheers Rolf

(*) I could boost the testperformance by avoiding to build up the @data array and just approximating the accuracy of the interpolation within the loop, but I'm already too bored now!

CODE:

use strict;
use warnings;

$\="\n";
$_="";

#--- Settings
my $debug=0;
my $runs=5;
my $gap_probability=.25;
my $min=0;
my $max=35_000_000;

#--- 
my $lines_per_skip=1/$gap_probability;
my $miss_avg_runs;

for my $run (1..$runs){
    print "\n","="x10 ."  run:".$run,"\n";
    my (@data,%gap,%miss);
    my $entry=0;
    my $last;
    my $x;


    for my $n ($min..$max) {
    if ( ($x=rand $lines_per_skip) > 1 ) {
      print "$entry:$n  ($x)" if($debug);
      $data[$entry++]=$n;
      $gap{$n-$last-1}++ if defined $last;
      $last=$n;
    }

    }

    my $factor=($data[-1]-$data[0])/$#data;
    print " gap_probability: $gap_probability \n interpolation_factor:
+ $factor ";

    $miss{ round( $_ * $factor + $data[0] - $data[$_] ) } ++   for (0.
+.$#data) ;


    print " Entries: $entry";
    print " =Data @data" if ($entry <20);
    print_histogram(\%gap,"Gaps");
    print_histogram(\%miss,"Misses");

    my $miss_sum;
    $miss_sum += abs($_) * $miss{$_} for (keys %miss );
    my $miss_avg = $miss_sum / $entry;  

    print "miss_avg:$miss_avg";


    $miss_avg_runs+=$miss_avg;
    print "miss_avg/runs:",$miss_avg_runs/$run;
}






sub round { 
  my $x = $_[0]; 
  my $d= $x<0 ? -0.5 : 0.5;
  return int($x+$d) ;

  my $test=sub {
    for (-3 .. 3) { 
      my $x= $_/5 ;
      print $x, round($x);
    }
  }
}


sub print_histogram{
  my ($href,$title)=@_;

  local ($\,$,);
  my $partition=8;

  my @keys=sort {$a <=> $b} keys %$href ;
  my $range=$keys[-1]-$keys[0];
  my $interval=$range/$partition;
  my $delimiter= $keys[-1] - $interval*($partition);
  my $partstart;#=$keys[0];

  print "\n-$title: ";
  if ( $interval < 1) {
    print "$_: $href->{$_} " for (@keys );
    print "\n";
    return;
  }

  print " for $partition Intervals\n";
  my $sum;
  for my $key (@keys) {
    my $value=$href->{$key};
    $sum+=$value;
    if ($key>=$delimiter) {
      print "],$key]: $sum  ";
      $partstart="";#$key;
      $delimiter+=$interval;
      $sum=0;
    }
  }
  print "\n";
  return;
}
[download]

OUTPUT:

Compilation started at Fri Jun 12 13:48:34

/usr/bin/perl -w /home/lanx/gauss.pl 

==========  run:1

 gap_probability: 0.25 
 interpolation_factor: 1.33330078809507 
 Entries: 26250641

-Gaps:  for 8 Intervals
],0]: 19686681  ],2]: 6154875  ],4]: 383422  ],5]: 19281  ],7]: 5975  
+],9]: 386  ],10]: 14  ],13]: 6  

-Misses:  for 8 Intervals
],-2700]: 10  ],-1987]: 1569713  ],-1274]: 5017419  ],-561]: 2169821  
+],152]: 4382814  ],865]: 4367173  ],1578]: 3316195  ],2291]: 3248721 
+ ],3003]: 2178775  
miss_avg:1232.34795784027
miss_avg/runs:1232.34795784027

==========  run:2

 gap_probability: 0.25 
 interpolation_factor: 1.33346175204987 
 Entries: 26247473

-Gaps:  for 8 Intervals
],0]: 19683096  ],2]: 6153987  ],4]: 384541  ],5]: 19528  ],7]: 5925  
+],9]: 371  ],10]: 17  ],12]: 5  ],13]: 2  

-Misses:  for 8 Intervals
],-4136]: 10  ],-3439]: 1459001  ],-2743]: 2947745  ],-2047]: 3788571 
+ ],-1351]: 996276  ],-655]: 3238221  ],41]: 7697843  ],737]: 4381294 
+ ],1433]: 1738512  
miss_avg:1286.98638348867
miss_avg/runs:1259.66717066447

Compilation interrupt at Fri Jun 12 14:22:47
[download]

In reply to Re^2: Rapid text searches (complexity test) by LanX
in thread Rapid text searches by joomanji

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.