In reply to: search a large text file

I have a very large text file (~5 GB)

Is this a single 5GB file used over and over? Or a new 5GB file each time?

I need to search each time

How many searches do you need to do? How often? What is your target time?

I want to search for 'text2' and retrieve 2 and 3,

How long are the texts? Are they ascii or unicode?

but the sorting process takes ages

How long is "ages"?

The more clearly you explain your task, the more likely that someone will see a viable solution.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: search a large text file
by perl_lover_always (Acolyte) on Feb 08, 2011 at 13:01 UTC
    The file is a single file; it is created once and does not change anymore, so I need to use it as a kind of dictionary! It will not be updated.
    The searches are frequent and would be in loops: once I access my file, I have source terms that I have to search for in the big file.
    The texts are max two words long, in Unicode format!
    I've run it for a few days and it is still running! I even split it into small portions (less than 200 MB each), and after 24 hours they are still running.

      So, in short, you have a static 5 GB dataset that you need to search frequently.

      I think your best bet would be to use a database to index the data, and let it worry about how to create an optimised index.

      I would put the entire file contents into the database and discard the original file. If each line also contains lots of other stuff that you will not be searching on, then I would still keep it in the database, but I would put it in a different column without an index so as not to bloat the database too much.

      This really does sound like a perfect application for a database, especially if you are generating the file and can load it directly into the DB and cut out the middleman file.

      That said, loading the DB via the tool's bulk loader is often faster than loading it via DBI one record at a time.
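      As a rough illustration of that approach, here is a minimal sketch using DBI with SQLite. The file name, the tab-separated layout, and the table/column names are assumptions for the example, not the poster's actual data:

      use strict;
      use warnings;
      use DBI;

      # Hypothetical names; adjust to the real file and desired schema.
      my $infile = 'bigfile.txt';
      my $dbh    = DBI->connect('dbi:SQLite:dbname=bigfile.sqlite', '', '',
                                { RaiseError => 1, AutoCommit => 0 });

      # One indexed column for the search term, one unindexed column for the rest.
      $dbh->do('CREATE TABLE IF NOT EXISTS entries (term TEXT, payload TEXT)');
      my $ins = $dbh->prepare('INSERT INTO entries (term, payload) VALUES (?, ?)');

      open my $fh, '<:encoding(UTF-8)', $infile or die "Cannot open $infile: $!";
      while (my $line = <$fh>) {
          chomp $line;
          # Assumed layout: search term first, then everything else, tab-separated.
          my ($term, $payload) = split /\t/, $line, 2;
          $ins->execute($term, $payload);
      }
      close $fh;

      # Build the index once, after the bulk insert, so loading stays fast.
      $dbh->do('CREATE INDEX IF NOT EXISTS idx_term ON entries (term)');
      $dbh->commit;

      # Each lookup is then a single indexed query, independent of file size.
      my $rows = $dbh->selectall_arrayref(
          'SELECT payload FROM entries WHERE term = ?', undef, 'some term');
      print "$_->[0]\n" for @$rows;

      For a 5 GB load, keeping AutoCommit off and committing once (as above), or using sqlite3's .import bulk loader instead of DBI, makes a large difference in load time.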



      This is the ideal application for a hash tied to a file. You might like to take a look at DBM::Deep, a well-tested and well-liked implementation of a disk-based hash.

      Just use a script to generate your hash once (that will take a while); after that, any search will be nearly as fast as a single disk access. Store multiple values either concatenated as a string or, better, use an array for that. Since DBM::Deep is multilevel, storing a hash of arrays is no problem.
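      For what it's worth, a minimal sketch of that one-time build; the file names and the tab-separated layout are assumptions for illustration:

      use strict;
      use warnings;
      use DBM::Deep;

      # The hash lives on disk, so memory use stays small even for 5 GB of input.
      my $db = DBM::Deep->new('terms.db');    # hypothetical output file

      open my $fh, '<:encoding(UTF-8)', 'bigfile.txt' or die "Cannot open: $!";
      while (my $line = <$fh>) {
          chomp $line;
          # Assumed layout: the search term first, then the values to retrieve later.
          my ($term, @values) = split /\t/, $line;
          # DBM::Deep is multilevel, so a hash of arrays works directly.
          push @{ $db->{$term} }, @values;
      }
      close $fh;

      # Later, in the search loop, each lookup is roughly one disk access.
      my $hits = $db->{'some term'};
      print "@$hits\n" if $hits;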

        Building the hash runs out of memory! That was the main problem: I could not generate the hash in the first place! Any sample code as a clue?
        Well, I have no clue how I can manage that! I have my data in this format:
        pleasant 3 festive 2 period 2 i declare 5 declare resumed 7 resumed the 15 the session 9 session of 13
        How can I do that?
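        One possible reading of that sample, assuming it is a stream of one- or two-word terms each followed by its numeric count, could be loaded into a DBM::Deep hash roughly like this. The file names and the format interpretation are guesses, not a confirmed description of the data:

        use strict;
        use warnings;
        use DBM::Deep;

        my $db = DBM::Deep->new('terms.db');    # hypothetical on-disk hash

        open my $fh, '<:encoding(UTF-8)', 'bigfile.txt' or die "Cannot open: $!";
        while (my $line = <$fh>) {
            my @words;
            for my $tok (split ' ', $line) {
                if ($tok =~ /^\d+$/) {
                    # A numeric token closes the current term and is its value.
                    push @{ $db->{"@words"} }, $tok if @words;
                    @words = ();
                }
                else {
                    push @words, $tok;
                }
            }
        }
        close $fh;

        # With the sample line above, $db->{'i declare'} would then hold one value, 5.
        print "@{ $db->{'i declare'} }\n" if $db->{'i declare'};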