in reply to Searching text files

Different recommendations:
1.- Only open the file if you actually want to read it; then you don't have to close a handle you never needed to open in the first place.
2.- If you want to keep the data in the text files and parse them every time, consider File::Slurp, which is the fastest way to slurp a file that I know of.
3.- If you want it really fast, have a look at the first chapter of Programming Pearls by Jon Bentley. The basic idea is to use a bit string (initialized with zeroes) and toggle the bit that represents a telephone number to 1 (plain old binary OR). As telephone numbers are unique and you don't have any data associated with them, checking a single bit is a valid way to test for the existence of a number. If you want, you can create that bit string once and only read it in afterwards (you could use Storable, or plain spewing with File::Slurp). This should drastically reduce your memory consumption, and speed should be lightning fast. A short sketch follows this list.
Update: Using Bit::Vector::Array would probably be the simplest way to build the bit string (simply use the telephone number as an index). That way a lookup needs only one operation: no searching, just check whether the bit at that index is 1 or 0. Initializing the bit vector is just as easy; simply toggle the bits with the correct indices to 1.
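
To make #3 a bit more concrete, here is a minimal sketch using only Perl's built-in vec() and File::Slurp (not Bit::Vector::Array); the file names and the 7-digit example number are just assumptions:

use strict;
use warnings;
use File::Slurp qw(read_file write_file);

# Build the bit string once from the plain text list (one number per line).
my $bits = '';
for my $num ( read_file('numbers.txt') ) {
    chomp $num;
    next unless $num =~ /^\d+$/;
    vec( $bits, $num, 1 ) = 1;            # plain old binary OR on bit $num
}
write_file( 'numbers.bin', { binmode => ':raw' }, $bits );

# Later: slurp the precomputed bit string and test a single bit.
my $lookup = read_file( 'numbers.bin', binmode => ':raw' );
my $phone  = 5551234;                      # 7-digit example number
print vec( $lookup, $phone, 1 ) ? "on the list\n" : "not on the list\n";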

Hope this gives you some ideas.

Cheers Roland

Re^2: Searching text files
by Skeeve (Parson) on Sep 14, 2006 at 23:52 UTC

    #3 is the idea I like most

    I don't know much about American phone numbers, but if they all have a fixed length of 10 digits, there are 10^10 possible numbers, so you'd just need slightly more than 1 GB of disk space to store one bit for each possible number.

    I wouldn't create this bit vector in memory. Just create a big enough file, initialized with zeros, then go through your text file, seek to offset $phone_num >> 3 and set bit number $phone_num & 7 in that byte.

    Do the same positioning for read access, but just check the bit.

    I think searching will be done in less than a second.
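
    A rough, untested sketch of what I mean (the file name and the pre-created, zero-filled file are assumptions of the example):

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    # 'dnc.bits' is an example name: a pre-created file, big enough,
    # filled with zero bytes (one bit per possible phone number).
    open my $bits, '+<:raw', 'dnc.bits' or die "open dnc.bits: $!";

    # Set the bit for one phone number while walking through the text file.
    sub set_number {
        my ($num) = @_;
        sysseek $bits, $num >> 3, SEEK_SET or die "seek: $!";
        sysread $bits, my $byte, 1;
        $byte = chr( ord($byte) | ( 1 << ( $num & 7 ) ) );
        sysseek  $bits, $num >> 3, SEEK_SET;
        syswrite $bits, $byte, 1;
    }

    # Same positioning for read access, but only check the bit.
    sub has_number {
        my ($num) = @_;
        sysseek $bits, $num >> 3, SEEK_SET or die "seek: $!";
        sysread $bits, my $byte, 1;
        return ord($byte) & ( 1 << ( $num & 7 ) );
    }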

    Update: Of course you can couple this with the idea of splitting the list by area code. This should reduce the summed size of your three files to 1/333 of that (about 4 MB) if the area code has 3 digits.

    Update #2: If each phone number has 10 digits and you have 2 million numbers, you already use about 21 MB of disk space for the text, so the bit vector on disk will save you about 16 MB.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Update: I misread your post; your 4 MiB are for 3 area codes, so we get the same result (1.2 MiB per area code). Being able to read clearly is an advantage. Sorry.

      As you stated, the most practical approach would be to split it by area code. The point where I disagree is your claim that it would eat up 4 MB of space per area code.
      rminner@Rosalinde:~$ bc
      bc 1.06
      Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
      This is free software with ABSOLUTELY NO WARRANTY.
      For details type `warranty'.
      obase=1024
      10^7-1
       0009 0549 0639
      last/8
       0001 0196 0719
      I get 10 million bits (minus 1) for 7 digits. That means roughly 1.2 MiB, not 4 MiB. Depending on the amount of memory available, you could load only a limited number of area codes; the data structure for all do-not-call numbers in one area code should be just a little more than those 1.2 MiB, so 5 area codes would only eat up 6 MiB. And as I said, a lookup would be instantaneous (from a user's perspective), as it only requires checking one bit. One could allocate a limited number of slots for area codes and free them with whatever replacement algorithm one prefers (for example LRU or LFU).

      Loading should also be fast with File::Slurp, as directly slurping 1 MiB into memory with sysread should be really quick when DMA is active. (You could also seek directly in the file, as stated by skeeve; reducing a lookup to a single seek is possible as well, keeping memory consumption even lower and requiring just a single HD seek.)
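
      A rough sketch of the per-area-code lookup (the file names are just an example, and the cache below simply grows; a real LRU/LFU would cap the number of slots):

      use strict;
      use warnings;
      use File::Slurp qw(read_file);

      my %cache;    # area code => bit string (cap this with LRU/LFU in practice)

      sub on_list {
          my ($area, $local) = @_;    # e.g. (212, 5551234)
          $cache{$area} = read_file( "dnc_$area.bin", binmode => ':raw' )
              unless exists $cache{$area};
          return vec( $cache{$area}, $local, 1 );
      }

      print on_list( 212, 5551234 ) ? "do not call\n" : "ok to call\n";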
      Caching the bit string can be done very easily. Simply store the bit string in a file with the same name but, for example, the extension .bin, and afterwards set the same mtime on the .bin file as on the .txt file. Later, if the mtimes are identical, you can use your precomputed bit string; if they differ, the .txt file has been modified and the .bin file can be recreated from scratch (which also shouldn't take more than 1-2 seconds). Like this your data would always be up to date, using just plain .txt files, but speed should still be more or less instantaneous.
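
      The mtime check could look roughly like this (only a sketch; the names are made up, and the rebuild loop assumes one number per line in the .txt file):

      use strict;
      use warnings;
      use File::Slurp qw(read_file write_file);

      sub load_area {
          my ($txt) = @_;                       # e.g. 'dnc_212.txt'
          ( my $bin = $txt ) =~ s/\.txt$/.bin/;
          my $txt_mtime = ( stat $txt )[9];

          # Reuse the precomputed bit string only if its mtime still matches the .txt.
          if ( -e $bin && ( stat $bin )[9] == $txt_mtime ) {
              return read_file( $bin, binmode => ':raw' );
          }

          # Otherwise recreate it from scratch and stamp it with the .txt's mtime.
          my $bits = '';
          for my $num ( read_file($txt) ) {
              chomp $num;
              vec( $bits, $num, 1 ) = 1 if $num =~ /^\d+$/;
          }
          write_file( $bin, { binmode => ':raw' }, $bits );
          utime $txt_mtime, $txt_mtime, $bin;
          return $bits;
      }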