record separators

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I need some help with this one. I have a file that needs to be searched several times. The file is a set of multiline records separated by a >#W# marker. My idea is to first sweep through the whole file once (I need to do that anyway to identify the parts of the records that interest me) and, while doing so, create a smaller file that holds those indicative parts together with a record id (every time I see >#W# I assign an id to that record).

Why am I doing this when there is a perfectly good SQLite db engine built for this purpose, which I use frequently? Because I was asked to do it without a db engine of any sort. Since speed is an issue, I figured I could speed up my searches by reading record by record: if I know which record holds the interesting thing, I can just iterate to that record and deal with it. Looping through the file by the record separator >#W# is faster than looping line by line, as the benchmark shows:






use strict;
use warnings;
use Benchmark qw(:all);

my $x = $ARGV[0];    # path to the BLAST output file

my $r = timethese( 5, {
    # a: read one ">#W#"-delimited record per iteration
    a => sub {
        open( BOUT, "<", $x ) || die "There were some problems with opening BLAST output: $!";
        local $/ = ">#W#";    # restored automatically when the sub returns
        my $count = 0;
        while (<BOUT>) {
            $count++ if m/>/;
        }
        close BOUT;
        #print "$count \n";
    },
    # b: read one newline-terminated line per iteration
    b => sub {
        open( BOUT, "<", $x ) || die "There were some problems with opening BLAST output: $!";
        my $count = 0;
        while (<BOUT>) {
            $count++ if m/>/;
        }
        close BOUT;
        #print "$count \n";
    },
} );
cmpthese $r;


Benchmark: timing 5 iterations of a, b...
         a: 43 wallclock secs (23.06 usr +  1.86 sys = 24.92 CPU) @  0.20/s (n=5)
         b: 169 wallclock secs (160.39 usr +  1.34 sys = 161.74 CPU) @  0.03/s (n=5)
  s/iter    b    a
b   32.3   -- -85%
a   4.98 549%   --

So now the questions are:

1. Does anyone have a better suggestion for how to do the whole thing?
2. How do I deal with a record once I have found it? The record then has to be searched line by line, so I would have to switch from $/=">#W#"; to $/="\n"; and then do something like:

$/ = ">#W#";   
        my $count = 0 ;
        while(<BOUT>){ 
        $count++ if (m/>/g)  ;
        
        $/="\n";

          while(){}
        $/=">#W#";


        }              
        $/="\n";

But of course this doesn't work. How can I make it work?

thnx


Replies are listed 'Best First'.
Re: record separators by mzedeler (Pilgrim) on Jul 16, 2009 at 22:45 UTC
To me it looks like you are spending a lot of time reinventing the wheel (and not a very useful version at that). Why aren't you just using one of the many indexed file formats readily available, such as DB_File?
Re: record separators by mr_mischief (Monsignor) on Jul 16, 2009 at 22:29 UTC
One way to do it would be to use a byte index into the file as part of your index rather than just assigning record numbers. That way, you could seek into the file, where you'd be at the beginning of a record, then process the record from there.
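
A minimal sketch of the byte-offset idea, not a definitive implementation. The sub names build_index, fetch_record and record_lines are made up for illustration and do not come from the thread. It assumes the file is read as raw bytes with ">#W#" as the separator, and that an offset recorded with tell can later be passed to seek on the same unchanged file. For the line-by-line part of the question, it reads the fetched record through an in-memory filehandle with $/ localized to "\n", so the record separator never has to be toggled inside the outer loop.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# One pass over the file: remember the byte offset at which each
# separator-terminated chunk starts. $index->[$id] is the offset of record $id.
sub build_index {
    my ($path, $sep) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/ = $sep;                 # read up to and including the next separator
    my @index;
    my $start = tell $fh;            # offset of the chunk about to be read
    while (defined(my $chunk = <$fh>)) {
        push @index, $start;
        $start = tell $fh;
    }
    close $fh;
    return \@index;
}

# Jump straight to a known offset and read exactly one record from there.
sub fetch_record {
    my ($path, $sep, $offset) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    seek $fh, $offset, SEEK_SET or die "Cannot seek to $offset: $!";
    local $/ = $sep;
    my $record = <$fh>;
    close $fh;
    return unless defined $record;
    chomp $record;                   # chomp strips the trailing $/, i.e. the separator
    return $record;
}

# Split a record that is already in memory into lines, without touching
# the separator used by any outer read loop.
sub record_lines {
    my ($record) = @_;
    local $/ = "\n";
    open my $rfh, '<', \$record or die "Cannot open in-memory handle: $!";
    my @lines = <$rfh>;
    close $rfh;
    chomp @lines;
    return @lines;
}
```

With these in place, something like my $idx = build_index($file, ">#W#"); followed by record_lines(fetch_record($file, ">#W#", $idx->[$id])) would return the lines of record $id without rereading the records before it. If every record begins with the marker rather than ending with it, the first chunk is empty and each stored offset points just past a marker. The index itself could be written to the smaller side file mentioned in the question.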