
Am I the only one who interprets this question to relate to finding duplicates in a single file?

...a given text file....anywhere in the file...The file contains...

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Wanted!

Re: Re: Re^2: Are two lines in the text file equal (!count)
by prostoalex (Scribe) on Nov 13, 2003 at 07:05 UTC
    Yes, the goal is to find the duplicates (or triplicates), if any. Unfortunately, I don't have a machine with 1-4 GB of memory at my disposal, so building a giant in-memory hash is not really an option.
      You keep saying that, but it's not true: you don't need a machine with 1-4 GB. Just use DB_File, which keeps the hash on disk.
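
      For what it's worth, here is a minimal sketch of that approach, not a tuned solution. The sub name find_duplicates and the scratch file path are made up for illustration, and it assumes the set of duplicated lines (as opposed to all lines) is small enough to collect in an ordinary hash. Note that the return value of tie is checked.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # exports $DB_HASH

# Hypothetical helper: returns a hash ref mapping each duplicated line
# to the list of line numbers on which it occurs. $db_path is a scratch
# file that holds the on-disk index while the scan runs.
sub find_duplicates {
    my ($data_path, $db_path) = @_;

    unlink $db_path;    # start from an empty index
    tie my %seen, 'DB_File', $db_path, O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie $db_path: $!";

    open my $in, '<', $data_path or die "Cannot open $data_path: $!";

    my %dups;    # assumption: duplicated lines are few enough for memory
    while (my $line = <$in>) {
        chomp $line;
        if (defined(my $first = $seen{$line})) {
            # Seen before: record the first occurrence once, then this one.
            $dups{$line} ||= [$first];
            push @{ $dups{$line} }, $.;
        }
        else {
            # First sighting: remember its line number on disk.
            $seen{$line} = $.;
        }
    }

    close $in;
    untie %seen;
    unlink $db_path;    # the index is only needed during the scan
    return \%dups;
}
```

      Calling find_duplicates('test.dat', 'seen.db') should give back something like { 'some line' => [ 3, 17, 42 ] }. Expect it to be much slower than an in-memory hash, since every lookup may touch the disk.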

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.

        Update: Ignore this, my error. I should have checked the return code from tie. If the tie fails, you just end up with an ordinary in-memory hash. The extra memory used is just the overhead of loading the module.

        I'm probably doing something wrong, but I just tried the following code to detect duplicates in my 80 MB file (1_000_000 lines x 80 chars), and it took close to half an hour to hash the whole file.

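        #! perl -slw
        use strict;
        use DB_File;

        # Note: the return value of tie is not checked here (see the update above).
        tie my %h, 'DB_File', 'test.db';
        open IN, '<', 'test.dat' or warn $!;

        print scalar localtime;
        # Append each line number to the entry keyed by the line's text.
        $h{ $_ } .= ' ' . $. while $_ = <IN>;
        print scalar localtime;

        exit;

        __END__
        Thu Nov 13 20:55:30 2003
        Thu Nov 13 21:23:31 2003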

        That wasn't much of a surprise, but the fact that it consumed 190 MB of memory doing so was, as this is considerably more than building a straight hash in memory.

        Is there some way of limiting the memory use?
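
        One knob that may help is the cachesize field that DB_File exposes through DB_File::HASHINFO. The sketch below is only an illustration: the sub name tie_with_cache and the 4 MB figure are assumptions, and cachesize is a suggested maximum passed to the underlying Berkeley DB library rather than a hard limit. It also checks the return value of tie, so a failed tie cannot silently leave you with a plain in-memory hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;

# Hypothetical helper: tie %$hash to an on-disk hash at $db_path and
# ask Berkeley DB to keep its memory cache near $cache_bytes.
sub tie_with_cache {
    my ($hash, $db_path, $cache_bytes) = @_;

    my $info = DB_File::HASHINFO->new;
    $info->{cachesize} = $cache_bytes;    # a hint, not a hard cap

    my $db = tie %$hash, 'DB_File', $db_path, O_RDWR | O_CREAT, 0666, $info
        or die "Cannot tie $db_path: $!";

    return $db;    # the underlying DB_File object
}
```

        Usage would be along the lines of tie_with_cache(\my %h, 'test.db', 4 * 1024 * 1024) before the read loop. How much this changes actual memory use depends on the Berkeley DB version installed, so measure before relying on it.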

