joomanji has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a question about rapid text searches in Perl. I have a text file with 3 columns, as shown below. The file can be very large, exceeding 1.5GB in size. Currently I'm using grep to look for the data I need. Say I need to know the pair for 1101077781160: I would use grep to look for it. However, when the input gets bigger, say 1000 entries to grep for, the process becomes very slow!

The question is: if I use Perl to store the text in a hash, can I dramatically speed up the search?

And if I'm only looking for a single entry, will the Perl script still load everything into memory before it can look for the string I want?
Thank you for any input; I would really like to know how I can speed up the text searches... thank you!
1101077781160 1101077783656 bothChaff
1101077781161 1101077783657 bothChaff
1101077781162 1101077783658 bothChaff
1101077781163 1101077783659 bothChaff
1101077781164 1101077783660 oneChaff
1101077781165 1101077783661 oneChaff
1101077781166 1101077783662 bothChaff
1101077781167 1101077783663 bothChaff
1101077781168 1101077783664 bothChaff
1101077781171 1101077783667 bothChaff
1101077781172 1101077783668 bothChaff
1101077781173 1101077783669 bothChaff
1101077781175 1101077783671 bothChaff
1101077781176 1101077783672 bothChaff
1101077781177 1101077783673 bothChaff
1101077781179 1101077783675 bothChaff
1101077781180 1101077783676 oneDegen
1101077781181 1101077783677 bothChaff
1101077781182 1101077783678 oneChaff
1101077781184 1101077783680 bothChaff

Replies are listed 'Best First'.
Re: Rapid text searches ( O(1) space and time)
by BrowserUk (Patriarch) on Jun 06, 2009 at 03:38 UTC

    If, as your sample data suggests, the key numbers are strictly ascending, though not sequential, and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!), by calculating the approximate position and then seeking forward or backward from that starting point.
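
    One possible shape for that "fast one-pass one-liner" (only a sketch; the field widths and file names below are illustrative, and you would pick widths that fit the real data):

    # Pad every field to a constant width so each record, newline included,
    # has the same length.
    perl -ane 'printf "%-13s %-13s %-13s\n", @F' 768941.raw > 768941.dat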

    The following crude demo code:

    #! perl -sw
    use 5.010;
    use strict;
    use Time::HiRes qw[ time ];

    my $start = time;
    my $recLen = 43;

    open I, '<', '768941.dat' or die $!;
    my $first = ( split ' ', scalar <I> )[ 0 ];
    seek I, -$recLen, 2;                                ## SEEK_END
    my $last = ( split ' ', scalar <I> )[ 0 ];
    say "$first - $last";

    my $recs   = tell( I ) / $recLen;
    my $factor = $recs / ( $last - $first );

    my $success = 0;
    while( my $target = <> ) {
        chomp $target;
        warn "Input outside range" if $target > $last or $target < $first;

        my $offset = int( ( $target - $first ) * $factor ) * $recLen;
        my $rec;
        seek I, $offset, 0 or die "Seek failed\n";

        my $found = ( split ' ', $rec = <I> )[ 0 ];
        if( $found != $target ) {
            my $direction = $found < $target;
            $offset = $direction ? $recLen : ( -$recLen * 2 );
            seek I, $offset, 1 or die "Seek failed"     ## SEEK_CUR
                while ( $found = ( split ' ', $rec = <I> )[ 0 ] )
                    and $direction ? ( $found < $target ) : ( $found > $target );
        }
        if( $found == $target ) {
            print "Found: $rec";
            ++$success;
        }
        else {
            warn "Target $target not found\n";
        }
    }
    close I;

    printf STDERR "Found $success records in %.3f seconds\n", time() - $start;

    Looks up 1000 (existing) records:

    c:\test>wc -l 768941.search
    1000 768941.search

    c:\test>head 768941.search
    1101077781160
    1101077781161
    1101077781162
    1101077781163
    1101077781164
    1101077781165
    1101077781166
    1101077781167
    1101077781168
    1101077781169

    in an approximation of your data file:

    In two or three hundredths of a second:

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.02 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.03 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.03 seconds

    But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.

    The trouble with people who have sledgehammers, everything looks like a tough nut.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That is hot and I am, as usual, amazed at your comp-sci chops. I cobbled together a DB_File example because I was curious if it would be slower.

      It's not a 1:1 comparison because it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need. I only built my test file up to .75GB-ish because of lack of headroom on my drive (I only have 2GB left and I had to build two of these files so...).

      Fake data generator-

      my @what = qw( bothChaff bothDegen oneChaff oneDegen );
      open my $fh, ">", "data.data" or die;
      for ( 10000000 .. 99999999 ) {
          next if rand(10) > 6.7;
          print $fh sprintf("11010%d\t11010%d\t%s\n", $_, $_ + 3333, $what[rand @what]);
          last if -s $fh > 1_000 * 1_000 * 1_000;
      }

      Search + search DB builder-

      use DB_File;
      use strict;
      use Time::HiRes qw[ time ];

      my $start = time;

      my %A;
      tie %A, "DB_File", "db.file", O_CREAT|O_RDWR, 0666, $DB_HASH or die;

      if ( $ARGV[0] eq "build" ) {
          open my $fh, "<", "data.data" or die;
          while (<$fh>) {
              chomp;
              my ( $key, $val ) = split /\s+/, $_, 2;
              $A{$key} = $val;
          }
      }
      else {
          print $A{$ARGV[0]} || "nothing found", $/;
          printf "Took %.2f seconds\n", time() - $start;
      }
      --
      moo@cow[331]~/bin>pm-768941 1101078637800
      1101078641133 oneChaff
      Took 0.02 seconds

      I got 0.02 seconds on almost every run, and this is on a nine-year-old computer and, I think, a five-year-old copy of the related Berkeley binaries. I turned the printf to %.3f and it mostly ranged from 0.016 to 0.019 seconds. A caveat: it took a loooooooong time to build the DB file. I wasn't clocking it, but it was edging up on an hour.

      <rejoinder type="in good fun!"> The trouble with engineers who are smarter than most everyone else is that given Y they solve for X where X is a more interesting problem than Y. </rejoinder>

        it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need.

        I think the OP just needs to find the appropriate records in a timely fashion; whether that's done by indexing the file or by searching the file doesn't really matter. Though the hour to build the index could be critical if the file changes with any frequency; if, say, it's the output from some other process or system.

        A rather more important difference (assuming I'm reading your code and console log correctly), is that your 0.016 - 0.019 seconds is the time taken to lookup one value.

        My timings:

        c:\test>768941 768941.search >nul
        Found 1000 records in 0.032 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.020 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.036 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.020 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.023 seconds

        are the time to find 1000 values!

        given Y they solve for X where X is a more interesting problem than Y.

        Given 1000 keys, he wanted to find the corresponding values in a file containing ~38e6 pairings as quickly as possible. That's all my code does. I don't understand why you think I'm doing something different?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      It's been a long time since I learned big-O notation, but IMHO your code can only be O(1) if there are no gaps within the data.

      Just consider the worst case of IDs within (1, (m/2) .. m), i.e. n = m/2 + 2 entries, where you're looking for the second entry x = m/2, right after the maximum possible gap. The interpolation factor is roughly n/m, about 1/2, so the first guess lands about m/4 records in, near the value 3/4 m, and you then seek linearly back down to m/2. That's about n/2 steps, which means O(n).

      If you improve your code to recursively interpolate the next position within the narrowed range, instead of seeking linearly, you'll get O(log(n)), as in a binary search.
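
      A rough sketch of that interpolation-within-the-narrowed-range idea, reusing the fixed-record-length assumption from above (the record length, file name, and helper names here are only illustrative, not the code under discussion):

      use strict;
      use warnings;

      my $recLen = 43;                                  # assumed fixed record length
      open my $fh, '<', '768941.dat' or die $!;
      my $lastRec = int( ( -s '768941.dat' ) / $recLen ) - 1;

      # Read the record at a given record number and return (key, whole line).
      sub key_at {
          my ( $fh, $rec ) = @_;
          seek $fh, $rec * $recLen, 0 or die "Seek failed";
          my $line = <$fh>;
          return ( split ' ', $line )[ 0 ], $line;
      }

      # Interpolation search: guess a position from the key distribution, then
      # narrow the [lo, hi] range around the guess instead of scanning linearly.
      sub interp_search {
          my ( $fh, $target, $lo, $hi ) = @_;
          while ( $lo <= $hi ) {
              my ( $loKey ) = key_at( $fh, $lo );
              my ( $hiKey ) = key_at( $fh, $hi );
              return undef if $target < $loKey or $target > $hiKey;
              my $guess = $lo;
              $guess += int( ( $hi - $lo ) * ( $target - $loKey ) / ( $hiKey - $loKey ) )
                  if $hiKey > $loKey;
              my ( $key, $line ) = key_at( $fh, $guess );
              return $line if $key == $target;
              if   ( $key < $target ) { $lo = $guess + 1 }
              else                    { $hi = $guess - 1 }
          }
          return undef;                                 # not found
      }

      my $hit = interp_search( $fh, 1101077781172, 0, $lastRec );
      print $hit // "not found\n";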

      Furthermore, you're only comparing the entries from the first column, aren't you?

      Cheers Rolf

        The headline claim of O(1) time is based on a comparison with Perl's hashes, which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).

        For your "worst case scenario to be so, it would require the file to contain 38e6-1 consequetive numbers and a single outlier 38e6 away from one end. My conclusion was that is an unlikely scenario for real data, My assumption is for a reasonably even distribution of the values through the range.

        Furthermore you're only comparing the entries from the first column aren't you?

        I do not understand the question?

        I take the OP's "Say I need to know the pair for 1101077781160" to mean that, given the input '1101077781160', he requires the other two values on that line, e.g. '1101077783656 bothChaff'.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      As promised, I did a brute-force calculation of the odds that this algorithm is fast.

      Extrapolating the example from the OP to a 1.5 GB file and randomly skipping 1 out of 4 lines, it seems you'll miss the searched-for line by more than 1200 entries on average before you start searching linearly.

      I stopped after 2 test runs since my laptop with 1 GB RAM had by then been heavily swapping for 30 minutes*... I'll post the completed results tomorrow morning. : )

      Temporary conclusion: over 1200 seeks doesn't look even close to O(1) to me. Could it be your test data was a little optimistic? Hope you understand my criticism now.

      Please note: this is the very optimistic best case, with only a 25% gap probability. I didn't even try to add any worsening conditions yet!

      So much for the normal average case.

      Cheers Rolf

      (*) I could boost the test performance by not building up the @data array and just approximating the accuracy of the interpolation within the loop, but I'm too bored for that now!

      CODE:

      OUTPUT:

Re: Rapid text searches
by Your Mother (Archbishop) on Jun 05, 2009 at 22:47 UTC

    If this is a write-once / read-many situation, I'd recommend Berkeley DBs: BerkeleyDB, DB_File. It would be quite slow to build, but after that look-ups are, as a coworker used to say, greasy-fast.
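
    A minimal sketch of that setup using the BerkeleyDB module (the file name and key below are just examples): build the tied hash once from the text file, then look keys up directly.

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie a hash to an on-disk Berkeley DB hash file (created if it doesn't exist).
    tie my %pairs, 'BerkeleyDB::Hash',
        -Filename => 'pairs.db',
        -Flags    => DB_CREATE
        or die "Cannot open pairs.db: $BerkeleyDB::Error";

    # Build phase (slow, done once): key on the first column.
    # while (<$data>) { my ( $k, $rest ) = split ' ', $_, 2; $pairs{$k} = $rest; }

    # Lookup phase: a straight hash fetch.
    print $pairs{1101077781160} // "not found\n";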

Re: Rapid text searches
by GrandFather (Saint) on Jun 05, 2009 at 23:30 UTC

    Your Mother's suggestion of using a database is likely to be the best answer, but if you want to speed up a search for multiple records using the text file itself, you can do something like:

    use strict;
    use warnings;

    my @wanted = qw(1101077781172 1101077781180 1101077781183);
    my %match = map {$_ => 1} @wanted;

    while (<DATA>) {
        chomp;
        my (@fields) = split;
        next unless @fields >= 3;
        print "$_\n" if exists $match{$fields[0]};
    }

    __DATA__
    1101077781160 1101077783656 bothChaff
    1101077781161 1101077783657 bothChaff
    1101077781162 1101077783658 bothChaff
    ...

    Prints:

    1101077781172 1101077783668 bothChaff
    1101077781180 1101077783676 oneDegen

    True laziness is hard work
Re: Rapid text searches
by LanX (Saint) on Jun 06, 2009 at 00:23 UTC
    OK, let me describe how I see your problem: you have consecutive IDs in two columns, and you're looking for the lines where a particular ID occurs in one of the columns?

    I assume it's more complicated than in your example and gaps are possible, so you can't just trivially calculate the right lines to look for?

    And a binary search isn't an option either?

    With a strict ordering, it's like a telephone book: no need to start reading from the first page; you first look for the right city and then the starting letters.

    I would build up a hash of hashes for indexing ranges of line numbers for each column.

    With 1101077781160 you'd look up $hash1{11010}, which gives you a second hash in which to look up $hash2{77781}, which tells you the range to search for "160".

    The first-level hash (the city) should have a size that's easily kept in RAM; the second-level hashes should be loaded on demand (and some, maybe the last hundred, kept cached).

    Of course you could have more levels of hashes, and I'm not sure about the best way to make hashes persistent and quickly loadable, but these are CPAN details.
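
    A rough sketch of how such a two-level index might look in plain Perl, assuming 13-digit IDs split 5 + 5 + 3 as above and indexing only the first column; the persistence / on-demand loading part is left out, and the file name is just an example:

    use strict;
    use warnings;

    my %index;    # $index{city}{district} = [ first byte offset, last byte offset ]

    # Build the index: one pass over the file, remembering where each
    # "city"/"district" prefix group starts and ends.
    open my $fh, '<', 'data.txt' or die $!;
    while ( 1 ) {
        my $pos  = tell $fh;
        my $line = <$fh>;
        last unless defined $line;
        my ( $id ) = split ' ', $line;
        my ( $city, $district ) = ( substr( $id, 0, 5 ), substr( $id, 5, 5 ) );
        my $range = $index{$city}{$district} //= [ $pos, $pos ];
        $range->[1] = $pos;
    }

    # Lookup: seek straight to the small range and scan only that slice.
    sub lookup {
        my ( $fh, $target ) = @_;
        my ( $city, $district ) = ( substr( $target, 0, 5 ), substr( $target, 5, 5 ) );
        my $range = $index{$city}{$district} or return;
        seek $fh, $range->[0], 0 or die "Seek failed";
        while ( my $line = <$fh> ) {
            my ( $id, $rest ) = split ' ', $line, 2;
            return $rest if $id eq $target;
            last if tell( $fh ) > $range->[1];
        }
        return;
    }

    print lookup( $fh, '1101077781160' ) // "not found\n";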

    Cheers Rolf

Re: Rapid text searches
by JavaFan (Canon) on Jun 06, 2009 at 16:32 UTC
    It seems that you want to search columns for exact matches. If you know all the things you are looking for beforehand, I would create a hash with the items you are searching for. So, for instance, if you are looking for 1101077781160, 1101077783669, and 1101077781182, I'd start off with
    my %queries;
    $queries{$_} = 1 for qw(1101077781160 1101077783669 1101077781182);
    Then I'd iterate over the file, split each line into fields, and test each field against the hash previously built.

    This avoids multiple passes over the file (as you would have with grep), and avoids loading the entire file into a hash.
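
    A minimal sketch of that loop (the file name and query values are just examples):

    use strict;
    use warnings;

    my %queries;
    $queries{$_} = 1 for qw(1101077781160 1101077783669 1101077781182);

    open my $fh, '<', 'data.txt' or die $!;
    while ( my $line = <$fh> ) {
        my @fields = split ' ', $line;
        # Print the line if any field matches one of the queries.
        print $line if grep { $queries{$_} } @fields;
    }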

    And if I'm only looking for a single entry, will the Perl script still load everything into memory before it can look for the string I want?
    You tell us instead of asking! You wrote the code. If you wrote your code so it first slurps in the entire file, then the answer is yes. If you wrote it so it doesn't, then the answer is no.