ttown1079 has asked for the wisdom of the Perl Monks concerning the following question:

Okay, so I have pruned my original script down to a very simplistic one that is completely straightforward. It, along with an example of the input file, is below. It just reads in data, splits it, and stores it in a hash. At some point (around 8 million records in the hash) I get a seg fault. I've reordered the data and it produces different results; by this I mean it won't break on one specific line, so it seems to be a matter of volume. Also, if I reorder the key levels of the hash, it will work. If I delete the last one, it will work. I am now at a point where I don't know where to turn. Any help would be GREATLY appreciated.
----
#!/usr/bin/perl -w -Iperl
use strict;

$| = 1;

my $globalCounter = 0;
my %mon_log;

my $month = "03";
my $year  = "2004";

open(REPORTFILE, "inputfile") or die "$0: Can't open data file : $!\n";   # open file
open(DEBUGFILE, "> debug") or die "$0: Can't open debug file : $!\n";      # open file

while(<REPORTFILE>){
    my($fmt_proto, $fmt_dest_ip, $fmt_src_ip, $fmt_dest_port, $fmt_src_port,
       $fmt_drp_packets, $fmt_country) = split(/\s+/);

    print DEBUGFILE "{$month.$year}{$fmt_proto}{$fmt_dest_ip}{$fmt_dest_port}{$fmt_src_ip}{$fmt_src_port}\n";

    if(!exists($mon_log{$month.$year}{$fmt_proto}{$fmt_dest_ip}{$fmt_dest_port}{$fmt_src_ip}{$fmt_src_port})){
        $globalCounter += 1;
        print DEBUGFILE $globalCounter,"\n";
    }

    $mon_log{$month.$year}{$fmt_proto}{$fmt_dest_ip}{$fmt_dest_port}{$fmt_src_ip}{$fmt_src_port} += $fmt_drp_packets;
}

close(REPORTFILE);
close(DEBUGFILE);
exit(0);
----

input file format:

icmp 123.45.67.89  123.76.243.210 0 11 1 United States
icmp 142.156.23.54 12.254.215.10  0 11 1 United States

Re: puzzling seg fault
by dragonchild (Archbishop) on Jun 30, 2004 at 19:00 UTC
    How much RAM is in your machine? It sounds like you're running out. Anything that requires 8 million records in a hash is better done in a database. Period.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: puzzling seg fault
by BrowserUk (Patriarch) on Jun 30, 2004 at 19:15 UTC

    It sounds like you are running out of memory.

    A 1_000_000 key hash with a reference to an empty hash as the value requires around 160 MB. For 8_000_000 keys, that would be 1.2 GB. Even with duplicates at each level, that is still a lot of hashes.
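
    A quick way to check figures like that for yourself (a minimal sketch; it assumes the Devel::Size module is installed, and the exact numbers vary with perl version and platform):

    #!/usr/bin/perl -w
    use strict;
    use Devel::Size qw(total_size);

    # Build a 1_000_000-key hash with an empty hashref as each value,
    # then report how much memory the whole structure occupies.
    my %h;
    $h{$_} = {} for 1 .. 1_000_000;
    printf "%.1f MB\n", total_size( \%h ) / ( 1024 * 1024 );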

    How much memory do you have in your machine? Have you monitored the memory consumption as the program runs?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      The machine is an SGI Origin 16-processor HPC with 1 GB of RAM per processor. When running job accounting, nothing of note seems to happen. I understand a database is desirable, and I intend to go that route eventually, but I would like to have both options: database and text-file manipulation.

        Nice hardware, but ... 1 GB/processor? I know nothing of that hardware, but it (again) sounds as though any given process will be limited to 1 GB minus any OS overhead. It very much depends upon the distribution of the contents of the file, but I could quite see 8 million lines building a hash structure > 1 GB.

        Most times when Perl runs out of RAM on my machine I get Perl's "Out of memory" error, but occasionally I get a segfault.

        Maybe your job accounting would tell you if memory were a problem--I haven't the vaguest clue what it contains--and you could rule it out; but if you have access to a top-like live monitoring program, it would be worth watching the process with it.

        I think I would try filtering the input file into smaller files, say by protocol (assuming they're not all icmp), and then process those separately and combine the results.
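
        A minimal sketch of that filtering step (the filenames and field layout are assumptions based on your sample input):

        #!/usr/bin/perl -w
        use strict;

        # Split the big input file into one smaller file per protocol.
        my %fh;
        open(REPORTFILE, "inputfile") or die "$0: Can't open data file : $!\n";
        while (<REPORTFILE>) {
            my ($proto) = split ' ', $_, 2;    # first whitespace-separated field
            unless ($fh{$proto}) {
                open($fh{$proto}, "> inputfile.$proto")
                    or die "$0: Can't open inputfile.$proto : $!\n";
            }
            print { $fh{$proto} } $_;
        }
        close REPORTFILE;
        close $_ for values %fh;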

        You don't actually show what you're doing with that monster structure. It looks like you're just counting the number of dropped packets per date/proto/dst:port/src:port.

        If that is all you are doing, then there is little point in building the deep structure. You would achieve the same results by concatenating all those values into a string and using a single level of hash.

        That said, if the problem is memory, that may not help much.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re: puzzling seg fault
by tachyon (Chancellor) on Jul 01, 2004 at 04:25 UTC

    I agree with the others that you are running out of memory. I would note that the top-level key "$month.$year" appears redundant, as it is a constant, so you could lose that level, although it should not make any significant difference. For the specific task you show (counting), you don't need a hash of hash of hash..... just to increment a counter. You could just stringify the key:

    my $key = join '|', $month.$year, $fmt_proto, $fmt_dest_ip, $fmt_dest_port,
                        $fmt_src_ip, $fmt_src_port;
    $mon_log{$key} += $fmt_drp_packets;
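
    For example, to break a key back down and report on it later (a sketch, using the same '|' separator as above):

    for my $key (keys %mon_log) {
        my ($date, $proto, $dest_ip, $dest_port, $src_ip, $src_port)
            = split /\|/, $key;
        print "$proto $dest_ip:$dest_port <- $src_ip:$src_port ",
              "dropped $mon_log{$key}\n";
    }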

    This removes all those expensive levels of keys but still gets you your count. You can break down the key with split as required (as sketched above). This may use less memory even though the keys are longer and contain redundant data; memory consumption will depend on how many keys you end up with. It is a crappy way to do it compared to a database. It looks like tab-separated data (or it could be), so in MySQL you could do something like:

    create table stuff (
        proto       char(4),
        src_ip      char(15),
        etc....
        drp_packets int,
        index(proto),
        index(src_ip),
        etc....
    );

    load data local infile '/blah/blah.dat' into table stuff;

    select sum(drp_packets) from stuff where src_ip = '1.2.3.4' and .....

    You can then make any queries you want.....

    What you are actually doing could be handled in a different way. If you sort the input file (unix sort will handle it fine and do it fastest), then you can simply iterate over it, emitting a count with a line-merge strategy every time you find a new proto/src/dest/port combo:

    my $current_rec   = '';
    my $current_count = 0;
    my($proto, $dest_ip, $src_ip, $dest_port, $src_port, $drp_packets, $country, $rec);

    while(<REPORTFILE>){
        # limit to 7 fields so 'United States' stays intact in $country
        ($proto, $dest_ip, $src_ip, $dest_port, $src_port, $drp_packets, $country)
            = split ' ', $_, 7;
        $rec = join "\t", $proto, $dest_ip, $src_ip, $dest_port, $src_port;
        if ( $rec eq $current_rec ) {
            $current_count += $drp_packets;
        }
        else {
            # skip the empty record before the first real one
            print OUTFILE $current_rec, "\t", $current_count, "\n"
                if $current_rec ne '';
            $current_rec   = $rec;
            $current_count = $drp_packets;
        }
    }
    # now print any hanging rec
    print OUTFILE $current_rec, "\t", $current_count, "\n" if $current_rec ne '';

    This will also probably be an order of magnitude or two faster than using a hash. You can then do sort -nrk6 outfile > drop_sort to sort by dropped packets. BTW, your split will split 'United States' into two tokens; use split ' ', $_, 7 (as above) to get what you expect.

    cheers

    tachyon

Re: puzzling seg fault
by samtregar (Abbot) on Jun 30, 2004 at 20:37 UTC
    What version of Perl are you running? How much memory does the process use before it dies? Does it die when you run it under the debugger (perl -d) and if so, does the backtrace tell you anything? How about under gdb?
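
    For example (a sketch; adjust the script name and perl path for your system):

    $ perl -d yourscript.pl      # Perl debugger; 'T' prints a stack trace
    $ gdb /usr/bin/perl          # or run it under gdb
    (gdb) run yourscript.pl
    (gdb) bt                     # backtrace once it segfaults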

    -sam

Re: puzzling seg fault
by Anonymous Monk on Jul 01, 2004 at 18:53 UTC
    Thanks for all your help. You all make good points, which I will consider. However, after talking to the SGI admin, I have access to all of the available memory on the machine, so whatever is left of the 16 GB from other processes is available to me. At any rate, I knew our version of perl was an old one (5.004_05), and he fixed me up with a newer one that I was unaware was installed. Works fine now. First place I should have looked, I guess. And yes, I agree a database would be beneficial, but I would like to have both methods as options. Thanks.
Re: puzzling seg fault
by ttown1079 (Initiate) on Jul 01, 2004 at 18:57 UTC
    Tachyon, I just went back and reread your post; it brings up a quick question. I was using a long string as a key instead of {key}{key2}..., and a local perl guy told me the latter method would be better. In hopes of figuring out my problem, I changed it. Do you think the former is better? The only thing I could think of is that string keys allow the data to be stored in chunks that don't have to be contiguous, as opposed to forcing the whole structure to be together. Any opinions on this?