in reply to Force perl to release memory back to the operating system

It seems to me that this is a rather inefficient way to do it. In fact, 20 million lines of data in flat files is a very solid indication that it is time to put the data into an RDBMS. Given that MySQL and Postgres are pretty good and also free, and that Perl has the excellent DBI module, there are not many reasons not to. If you just put the customer records into a real DB you would get the following benefits:

  1. You would not need to parse the flat text files every time you wanted to check the results.
  2. Once the initial load is done all you need to do is add the incremental changes.
  3. You would not need gigabytes of memory. Ever.
  4. It would be a hell of a lot faster. You can load 10,000+ records/sec into most DBs using their native text load facility. The actual query and dump will probably only take a few seconds (perhaps much faster depending on how you structure the DB).
  5. When the bosses decide they want top 10/20/30/50 by state/zip code/hair color or whatever, you only have to write a single line of SQL to get the answer. You can even dump this to a tab-separated text file ready to import straight into Excel for a PPT by the PHB to the visiting VP :-) with a line like SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30 INTO OUTFILE '/tmp/less_work_is_good.txt' (a minimal DBI sketch of the same query follows below the list).
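For the Perl side of that, here is a minimal DBI sketch of the same top-30 query and dump. It assumes a hypothetical MySQL database called finance and a customers table with customer and balance columns; adjust the DSN, credentials, and column names to your setup.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# connect to a hypothetical MySQL database holding the customer records
my $dbh = DBI->connect( 'dbi:mysql:database=finance', 'user', 'secret',
    { RaiseError => 1 } );

# top 30 customers by balance
my $sth = $dbh->prepare(
    'SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30' );
$sth->execute;

# dump tab-separated, ready to import straight into Excel
open my $out, '>', '/tmp/less_work_is_good.txt' or die $!;
while ( my ( $customer, $balance ) = $sth->fetchrow_array ) {
    print $out join( "\t", $customer, $balance ), "\n";
}
close $out;

$dbh->disconnect;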

You could use a hash tied to an underlying file, but this limits you to simple key/value pairs. All that really does is solve your memory issue (at the expense of losing a lot of speed), so I will let someone else suggest that. Here is one of the many intros to DBI, A short guide to DBI, to get you started.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: Force perl to release memory back to the operating system
by demerphq (Chancellor) on Sep 25, 2003 at 11:45 UTC

    In fact 20 million lines of data in flat files is a very solid indication it is time to put the data into a RDBMS.

    On the contrary. Not putting them in a DB can make things _much_ more efficient. The real deciding factor is not the data load size, but the data access requirements and volatility of the data.

    For instance, I receive about 2 gigs worth of records every day that I have to process. Never in their life do these records see a DB. Just loading them into a DB and indexing them is significantly slower than the processing I need to do, and I only need to do that processing once (or very rarely twice). RDBMSs are not, IMO, suitable for managing large volumes of data that are only going to be accessed once or twice, are never changed, and can be disposed of once processed.
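    To make that concrete, a one-pass job over a daily feed looks something like the sketch below. The id/amount record layout and the tab separator are hypothetical; memory use is proportional to the number of distinct ids rather than to the size of the feed, and nothing needs to be kept afterwards.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # single pass over the daily feed; assumes one "id<TAB>amount"
    # record per line (hypothetical layout)
    my %total;
    while ( my $line = <> ) {
        chomp $line;
        my ( $id, $amount ) = split /\t/, $line;
        $total{$id} += $amount;
    }

    # report the per-id totals and throw everything away
    printf "%s\t%.2f$/", $_, $total{$_} for sort keys %total;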

    Anyway, just thought I'd throw that in there.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


      Yes, I did make a couple of assumptions about what look like financial transactions in multiple currencies that are accumulated and then indexed against customer details to get the most active customer list by transaction value. I can't think why that looked like a DB task :-)

      But of course you are right. I won't be putting my web or squid logs into a DB anytime soon, although I do rotate them daily and use YAML to serialize the data we parse out into flat files so we can get it back as required (rarely, and for static HTML anyway).

      As always, tools and tasks...

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Yes I did make a couple of assumptions

        Er, I hope you didn't misunderstand me. My reply wasn't to your whole node or the analysis you offered; it was more of a heads-up to people that DBs are not the universal panacea for handling large volumes of data. (Not so much for you, as I'm familiar with you and was confident you knew the caveats, but for other monks out there who might be tempted to bulk load a few million records just to sum one field in them.)

        I find there's a tendency amongst DB-ish types to treat a DBMS as the only way to solve problems. There are huge classes of problems where using a DBMS is a no-brainer (transactions, distribution, integrity, locking, etc.), but there are also huge classes of problems where a DB is probably not the best solution.

        As you said. Tools for tasks...

        Anyway, I'm hoping this exchange opens one or two minds to alternative approaches, which can only be a good thing.

        Cheers, :-)


        ---
        demerphq

          First they ignore you, then they laugh at you, then they fight you, then you win.
          -- Gandhi


      I agree. I have done match/merge processing with large lists much faster in flat files with Unix commands than by loading them into Oracle, for example, and trying to do the processing there. The data was very thin, fewer than 10 columns, but fairly large, about 1M records.
Re: Re: Force perl to release memory back to the operating system
by Roger (Parson) on Sep 25, 2003 at 06:19 UTC
    Hi Tachyon, thanks for your suggestion with the database approach! And yes, I have access to a Sybase database.

    Unfortunately I can't load them into the database. All these files come from who-knows-what systems, and I have no control over their creation (and I don't want to know either). And I specifically avoided loading them into the database because I was told not to.

    Sorry, I forgot to mention that I am only allowed to work on the flat files, and the extraction is run only once per day. Why can't these data be loaded into a database in the first place? I don't know. That's probably an old business decision.

      Are you not allowed to use any DB, or is the decision just to not allow you to use the company Sybase, so you don't bother the DBA? If you use the DBI with DBD::SQLite there will be no need for a DB server (it's included in the module) and the whole DB will be a single file. This removes any administration, the DBA is happy, your boss should be happy too, and you gain scalability and the convenience of using SQL to work on your data.

      SQLite is really pretty fast, and ideally suited for this kind of single-user application. Just check that the DB file doesn't grow past your OS's file size limit, if there is one, and you should be fine.
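      A minimal sketch of that approach, assuming a hypothetical balances.db file and a $cust_bal hashref built from the flat files (the names and sample data are made up):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use DBI;

      # stand-in for the customer => balance data parsed from the flat files
      my $cust_bal = { alice => 100.25, bob => 99.50, carol => 250.00 };

      # the whole database lives in one file - no server, no DBA
      unlink 'balances.db';    # start fresh each run (sketch only)
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=balances.db', '', '',
          { RaiseError => 1, AutoCommit => 0 } );

      $dbh->do( 'CREATE TABLE customers ( customer TEXT, balance REAL )' );

      my $ins = $dbh->prepare( 'INSERT INTO customers ( customer, balance ) VALUES ( ?, ? )' );
      $ins->execute( $_, $cust_bal->{$_} ) for keys %$cust_bal;
      $dbh->commit;

      # top 30 with a single query instead of an in-memory sort
      my $top30 = $dbh->selectall_arrayref(
          'SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30' );
      print "$_->[0]\t$_->[1]\n" for @$top30;

      $dbh->disconnect;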

      It might be time to rethink that business decision. Anyway, you can save some memory if you think about what you are doing.

      # get the customer => balance hash mapping - if you can't use an RDBMS
      # you are stuck with this
      $cust_bal = ......

      # now all we are interested in is the top 30
      # we don't need to sort them all using a Schwartzian transform
      # we only need to sort the values. This saves us a lot of memory as
      # [ cust, bal ] will take up a lot more space than just bal
      # we can also exit at this point and write a new temp flat file
      # with all the cust => bal mappings and this reclaims the memory.
      # either way we just want those top 30 so just sort the balances:
      my @bals = sort { $b <=> $a } values %$cust_bal;

      # we are only interested in the top 30, which is any balance greater than ....
      my $gt = $bals[30];

      # so now we iterate over our hash again (or the flat file)
      my $top_30;
      for my $cust ( keys %$cust_bal ) {
          next unless $cust_bal->{$cust} > $gt;
          $top_30->{$cust} = $cust_bal->{$cust};
      }
      # now the top_30 are in a hash ready to be sorted for mapping to cust details and output

      This will save you roughly 25%, and up to 50% (by using a temp file), of the memory that the original algorithm used, which might get you home. If not, you will have to tie your hash to a file. dbmopen might be the DB you have when you are not having a DB :-) If you can't do it in memory and have to tie to the disk it will be really slow, but I guess you have got all day :-)
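      For completeness, a minimal sketch of the tied-hash fallback using DB_File (dbmopen works the same way in spirit); the file name and the update line are hypothetical:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Fcntl;
      use DB_File;

      # tie the customer => balance hash to a file on disk; the data
      # lives in cust_bal.db rather than in memory
      my %cust_bal;
      tie %cust_bal, 'DB_File', 'cust_bal.db', O_RDWR | O_CREAT, 0644, $DB_HASH
          or die "Cannot tie cust_bal.db: $!";

      # accumulate balances exactly as before, e.g. (hypothetical update):
      $cust_bal{'some_customer'} += 42;

      untie %cust_bal;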

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Hi brother Tachyon, thanks for your suggestion! It has opened up my mind. I have never thought about the memory savings from using a straight sort. Too bad I am back home already; I can't wait to get back to the office tomorrow morning to try it out.

        Thanks again for your prompt reply! I really appreciate it. I will let you know how I go tomorrow!

        I don't think this is solving anything, because (as I understand the OP) the files are of transactions, not customers and balances.

        If each entry in the file were a customer and a balance, one run through would be sufficient, just keeping the top 30 customers (a one-pass sketch of that simple case follows below).

        It's more complicated than that, and a nightmare without an RDB.
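        For that simple case, a one-pass top-30 sketch would look something like this (the "customer<TAB>balance" layout is hypothetical); as noted above, the real transaction files would need to be aggregated per customer first:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # keep only the 30 largest ( customer, balance ) pairs seen so far;
        # assumes one "customer<TAB>balance" record per line
        my @top;
        while ( my $line = <> ) {
            chomp $line;
            my ( $customer, $balance ) = split /\t/, $line;
            push @top, [ $customer, $balance ];
            @top = ( sort { $b->[1] <=> $a->[1] } @top )[ 0 .. 29 ] if @top > 30;
        }

        printf "%s\t%s$/", @$_ for sort { $b->[1] <=> $a->[1] } @top;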

      I vote for the database as well, even if you have to install MySQL yourself. But if you must, you can try something like this.

      #!/usr/bin/perl
      #
      # foo - called by bar. Receives files to search on its
      #       input, calculates top N, retrieves the details for
      #       the top N and prints them to STDOUT in some easily
      #       parseable format
      #
      use strict;
      use warnings;

      $|++;

      my @files = <>;
      chomp @files;

      for (@files) {              # process files
          warn "processing $_$/";
      }

      my @top = (                 # get top N
          [ a => 10 ],
          [ b => 5 ],
          [ c => 3 ],
      );

      my %detail = (              # and detail info
          a => [ 1, 1, 4, 2, 1, 1 ],
          b => [ 1, 2, 2 ],
          c => [ 1, 2 ],
      );

      for (@top) {
          printf "%s %d$/", @$_;
          print "@{$detail{$_->[0]}}$/";
      }

      exit;

      #!/usr/bin/perl
      #
      # bar - calls foo as a child process and collects tidbits
      #       of data for future use. foo's memory will go back
      #       to the system soon after the answers are read.
      #
      use strict;
      use warnings;
      use IPC::Open2;

      my @files = qw( xxx xxy xxz );

      my ($read, $write, $child);
      $child = open2($read, $write, './foo');

      print $write $_, $/ for @files;
      close $write;

      my @info = <$read>;
      close $read;

      waitpid $child, 0;

      print @info;