in reply to Force perl to release memory back to the operating system

It seems to me that this is a rather inefficient way to do it. In fact, 20 million lines of data in flat files is a very solid indication that it is time to put the data into an RDBMS. Given that MySQL and Postgres are pretty good and also free, and that Perl has the excellent DBI module, there are not many reasons not to. If you just put the customer records into a real DB you would get the following benefits:

  1. You would not need to parse the flat text files every time you wanted to check the results.
  2. Once the initial load is done all you need to do is add the incremental changes.
  3. You would not need gigabytes of memory. Ever.
  4. It would be a hell of a lot faster. You can load 10,000+ records/sec into most DBs using their native text load facility. The actual query and dump will probably only take a few seconds (perhaps much faster depending on how you structure the DB).
  5. When the bosses decide they want top 10/20/30/50 by state/zip code/hair color or whatever, you only have to write a single line of SQL to get the answer. You can even dump this to a tab-separated text file ready to import straight into Excel for a PPT by the PHB to the visiting VP :-) with a line like SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30 INTO OUTFILE '/tmp/less_work_is_good.txt' (a minimal DBI sketch of the same query follows below the list).
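For the Perl side of that, here is a minimal DBI sketch of the same top-30 query and dump. It assumes a hypothetical MySQL database called finance and a customers table with customer and balance columns; adjust the DSN, credentials, and column names to your setup.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# connect to a hypothetical MySQL database holding the customer records
my $dbh = DBI->connect( 'dbi:mysql:database=finance', 'user', 'secret',
    { RaiseError => 1 } );

# top 30 customers by balance
my $sth = $dbh->prepare(
    'SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30' );
$sth->execute;

# dump tab-separated, ready to import straight into Excel
open my $out, '>', '/tmp/less_work_is_good.txt' or die $!;
while ( my ( $customer, $balance ) = $sth->fetchrow_array ) {
    print $out join( "\t", $customer, $balance ), "\n";
}
close $out;

$dbh->disconnect;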

You could use a hash tied to an underlying file, but this limits you to simple key/value pairs. All that really does is solve your memory issue (at the expense of losing a lot of speed), so I will let someone else suggest that. Here is one of the many intros to DBI, A short guide to DBI, to get you started.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: Force perl to release memory back to the operating system
by demerphq (Chancellor) on Sep 25, 2003 at 11:45 UTC

    In fact 20 million lines of data in flat files is a very solid indication it is time to put the data into a RDBMS.

    On the contrary. Not putting them in a DB can make things _much_ more efficient. The real deciding factor is not the data load size, but the data access requirements and volatility of the data.

    For instance, I receive about 2 gigs worth of records every day that I have to process. Never in their life do these records see a DB. Just loading them into a DB and indexing them is significantly slower than the processing I need to do, and I only need to do that processing once (or very rarely twice). RDBMSs are not, IMO, suitable for managing large volumes of data that are only going to be accessed once or twice, are never changed, and can be disposed of once processed.
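    To make that concrete, a one-pass job over a daily feed looks something like the sketch below. The id/amount record layout and the tab separator are hypothetical; memory use is proportional to the number of distinct ids rather than to the size of the feed, and nothing needs to be kept afterwards.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # single pass over the daily feed; assumes one "id<TAB>amount"
    # record per line (hypothetical layout)
    my %total;
    while ( my $line = <> ) {
        chomp $line;
        my ( $id, $amount ) = split /\t/, $line;
        $total{$id} += $amount;
    }

    # report the per-id totals and throw everything away
    printf "%s\t%.2f$/", $_, $total{$_} for sort keys %total;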

    Anyway, just thought I'd throw that in there.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


      Yes, I did make a couple of assumptions about what look like financial transactions in multiple currencies that are accumulated and then indexed against customer details to get the most active customer list by transaction value. I can't think why that looked like a DB task :-)

      But of course you are right. I won't be putting my web or squid logs into a DB anytime soon, although I do rotate them daily and use YAML to serialize the data we parse out into flat files so we can get it back as required (rarely, and for static HTML anyway).

      As always, tools and tasks...

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Yes I did make a couple of assumptions

        Er, I hope you didn't misunderstand me. My reply wasn't to your whole node or the analysis you offered; it was more of a heads-up to people that DBs are not the universal panacea for handling large volumes of data. (Not so much for you, as I'm familiar with you and was confident you knew the caveats, but for other monks out there who might be tempted to bulk load a few million records just to sum one field in them.)

        I find there's a tendency amongst DB-ish types to treat a DBMS as the only way to solve problems. There are huge classes of problems where using a DBMS is a no-brainer (transactions, distribution, integrity, locking, etc.), but there are also huge classes of problems where a DB is probably not the best solution.

        As you said. Tools for tasks...

        Anyway, I'm hoping this exchange opens one or two minds to alternative approaches, which can only be a good thing.

        Cheers, :-)


        ---
        demerphq

          First they ignore you, then they laugh at you, then they fight you, then you win.
          -- Gandhi


      I agree. I have done match/merge processing with large lists much faster in flat files with Unix commands than by loading them into Oracle, for example, and trying to do the processing there. The data was very thin, fewer than 10 columns, but fairly large, about 1M records.
Re: Re: Force perl to release memory back to the operating system
by Roger (Parson) on Sep 25, 2003 at 06:19 UTC
    Hi Tachyon, thanks for your suggestion with the database approach! And yes, I have access to a Sybase database.

    Unfortunately I can't load them into the database. All these files come from who-knows-what systems, and I have no control over their creation (and I don't want to know either). And I specifically avoided loading them into the database because I was told not to.

    Sorry, I forgot to mention that I am only allowed to work on the flat files, and the extraction is run only once per day. Why can't these data be loaded into a database in the first place? I don't know. That's probably an old business decision.

      Are you not allowed to use any DB, or is the decision just to not allow you to use the company Sybase, so you don't bother the DBA? If you use the DBI with DBD::SQLite there will be no need for a DB server (it's included in the module) and the whole DB will be a single file. This removes any administration, the DBA is happy, your boss should be happy too, and you gain scalability and the convenience of using SQL to work on your data.

      SQLite is really pretty fast, and ideally suited for this kind of single-user application. Just check that the DB file doesn't grow past your OS's file size limit, if there is one, and you should be fine.
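      A minimal sketch of that approach, assuming a hypothetical balances.db file and a $cust_bal hashref built from the flat files (the names and sample data are made up):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use DBI;

      # stand-in for the customer => balance data parsed from the flat files
      my $cust_bal = { alice => 100.25, bob => 99.50, carol => 250.00 };

      # the whole database lives in one file - no server, no DBA
      unlink 'balances.db';    # start fresh each run (sketch only)
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=balances.db', '', '',
          { RaiseError => 1, AutoCommit => 0 } );

      $dbh->do( 'CREATE TABLE customers ( customer TEXT, balance REAL )' );

      my $ins = $dbh->prepare( 'INSERT INTO customers ( customer, balance ) VALUES ( ?, ? )' );
      $ins->execute( $_, $cust_bal->{$_} ) for keys %$cust_bal;
      $dbh->commit;

      # top 30 with a single query instead of an in-memory sort
      my $top30 = $dbh->selectall_arrayref(
          'SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30' );
      print "$_->[0]\t$_->[1]\n" for @$top30;

      $dbh->disconnect;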

      It might be time to rethink that business decision. Anyway, you can save some memory if you think about what you are doing.

      # get the customer => balance hash mapping - if you can't use an RDBMS
      # you are stuck with this
      $cust_bal = ......

      # now all we are interested in is the top 30
      # we don't need to sort them all using a Schwartzian transform
      # we only need to sort the values. This saves us a lot of memory as
      # [ cust, bal ] will take up a lot more space than just bal
      # we can also exit at this point and write a new temp flat file
      # with all the cust => bal mappings and this reclaims the memory.
      # either way we just want those top 30 so just sort the balances:
      my @bals = sort { $b <=> $a } values %$cust_bal;

      # we are only interested in the top 30, which is any balance greater than ....
      my $gt = $bals[30];

      # so now we iterate over our hash again (or the flat file)
      my $top_30;
      for my $cust ( keys %$cust_bal ) {
          next unless $cust_bal->{$cust} > $gt;
          $top_30->{$cust} = $cust_bal->{$cust};
      }
      # now the top_30 are in a hash ready to be sorted for mapping to cust details and output

      This will save you roughly 25%, and up to 50% (by using a temp file), of the memory that the original algorithm used, which might get you home. If not, you will have to tie your hash to a file. dbmopen might be the DB you have when you are not having a DB :-) If you can't do it in memory and have to tie to the disk it will be really slow, but I guess you have got all day :-)
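      For completeness, a minimal sketch of the tied-hash fallback using DB_File (dbmopen works the same way in spirit); the file name and the update line are hypothetical:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Fcntl;
      use DB_File;

      # tie the customer => balance hash to a file on disk; the data
      # lives in cust_bal.db rather than in memory
      my %cust_bal;
      tie %cust_bal, 'DB_File', 'cust_bal.db', O_RDWR | O_CREAT, 0644, $DB_HASH
          or die "Cannot tie cust_bal.db: $!";

      # accumulate balances exactly as before, e.g. (hypothetical update):
      $cust_bal{'some_customer'} += 42;

      untie %cust_bal;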

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Hi brother Tachyon, thanks for your suggestion! It has opened up my mind. I have never thought about the memory savings from using a straight sort. Too bad I am back home already; I can't wait to get back to the office tomorrow morning to try it out.

        Thanks again for your prompt reply! I really appreciate it. I will let you know how I go tomorrow!

        I don't think this is solving anything, because (as I understand the OP) the files are of transactions, not customers and balances.

        If each entry in the file were a customer and a balance, one run through would be sufficient, just keeping the top 30 customers (a one-pass sketch of that simple case follows below).

        It's more complicated than that, and a nightmare without an RDB.
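        For that simple case, a one-pass top-30 sketch would look something like this (the "customer<TAB>balance" layout is hypothetical); as noted above, the real transaction files would need to be aggregated per customer first:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # keep only the 30 largest ( customer, balance ) pairs seen so far;
        # assumes one "customer<TAB>balance" record per line
        my @top;
        while ( my $line = <> ) {
            chomp $line;
            my ( $customer, $balance ) = split /\t/, $line;
            push @top, [ $customer, $balance ];
            @top = ( sort { $b->[1] <=> $a->[1] } @top )[ 0 .. 29 ] if @top > 30;
        }

        printf "%s\t%s$/", @$_ for sort { $b->[1] <=> $a->[1] } @top;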

      I vote for the database as well, even if you have to install MySQL yourself. But if you must, you can try something like this.

      #!/usr/bin/perl
      #
      # foo - called by bar. Receives files to search on its
      #       input, calculates top N, retrieves the details for
      #       the top N and prints them to STDOUT in some easily
      #       parseable format
      #
      use strict;
      use warnings;

      $|++;

      my @files = <>;
      chomp @files;

      for (@files) {              # process files
          warn "processing $_$/";
      }

      my @top = (                 # get top N
          [ a => 10 ],
          [ b => 5 ],
          [ c => 3 ],
      );

      my %detail = (              # and detail info
          a => [ 1, 1, 4, 2, 1, 1 ],
          b => [ 1, 2, 2 ],
          c => [ 1, 2 ],
      );

      for (@top) {
          printf "%s %d$/", @$_;
          print "@{$detail{$_->[0]}}$/";
      }

      exit;

      #!/usr/bin/perl
      #
      # bar - calls foo as a child process and collects tidbits
      #       of data for future use. foo's memory will go back
      #       to the system soon after the answers are read.
      #
      use strict;
      use warnings;
      use IPC::Open2;

      my @files = qw( xxx xxy xxz );

      my ($read, $write, $child);
      $child = open2($read, $write, './foo');

      print $write $_, $/ for @files;
      close $write;

      my @info = <$read>;
      close $read;

      waitpid $child, 0;

      print @info;