Hello! I'm hitting a brick wall here. I have a script that loads a CSV file of around 800k lines of firewall logs, and I'm trying to pull out the source IP address and the URL it's hitting. For each line I take the IP and check the hash to see if we've seen that IP already. If not, I create a new entry in the hash for it, holding an array that will collect the list of URLs. If we have seen the IP before, I pull its array of URLs from the hash, add the next URL to it, stick it back in the hash, and move on.

It's fine for a few thousand lines, then it slows to a crawl; around 100k lines it comes almost to a halt. CPU is high; memory usage is fairly low, around 7% of the system. I let it run with the full dataset and after 30 minutes it still hadn't finished. 50k entries take about 60 seconds, 100k take about 180 seconds. I feel like it's the 'exists' check on the hash, but how can I make it faster? Here's the code:
foreach (@list) {
#   my $entry=time();
    $linecounter++;
    #split the log entry up into an array; source IP is field 7; URL is 31;
    #PALO URL LOGS ONLY!
    my @message=split(',',$_);
    my $ip=$message[7];
    my $url=$message[31];
    #Check if we've seen this IP already in the Hash, if not add it to the hash;
    if (!(exists $ipURL{$ip})) {
#       print "Doesn't Exist... adding\n";
        my @urlList;
        push(@urlList,$url);
        $ipURL{$ip}= \@urlList;
    } else {
#       print "Defined\n";
        my @urlList=@{$ipURL{$ip}};
        push (@urlList,$url);
        $ipURL{$ip}=\@urlList;
    }
    if (!($linecounter % 50000)) {
        print "Lines: $linecounter\n";
    }
}
formatOutput(\%ipURL);
# print Dumper \%ipURL;
Here's what the structure looks like with a very small dataset (4 lines):
perl urlListbyIP.pl
List Length:5
Formatting Output...
$VAR1 = {
          '192.168.102.120' => [
                                 '"autodiscover-s.outlook.com/"',
                                 '"outlook.office365.com/"'
                               ],
          'Source address' => [
                                'URL/Filename'
                              ],
          '192.168.101.208' => [
                                 '"logmeinrescue.com/"',
                                 '"logmeinrescue.com/"'
                               ]
        };
List End:7
Execution Time: 0.01 s
I have another similar script that loads 3.5m lines and compares each line against a few if $_=~/REGEX/ checks, and that finishes in 25-30 seconds, so I don't get why this one is so much slower. The delay is definitely in the foreach loop over @list, as it never gets to the formatOutput() sub. Please help!
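For comparison, here is a minimal sketch of the same loop that pushes onto the array through its reference instead of copying it out and back in. In the else branch of the code above, `my @urlList=@{$ipURL{$ip}};` copies every URL already stored for that IP on each new line, which is probably a bigger cost than the `exists` check. The `make_row` helper and the sample rows are made up for illustration; only the field positions (7 for the source IP, 31 for the URL) come from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a 32-field CSV row with filler values,
# placing the IP in field 7 and the URL in field 31 as in the original script.
sub make_row {
    my ($ip, $url) = @_;
    my @fields = ('x') x 32;
    @fields[7, 31] = ($ip, $url);
    return join ',', @fields;
}

# Made-up sample data standing in for the real @list of log lines.
my @list = (
    make_row('192.168.102.120', '"autodiscover-s.outlook.com/"'),
    make_row('192.168.101.208', '"logmeinrescue.com/"'),
    make_row('192.168.102.120', '"outlook.office365.com/"'),
    make_row('192.168.101.208', '"logmeinrescue.com/"'),
);

my %ipURL;
foreach my $line (@list) {
    my @message = split /,/, $line;
    my ($ip, $url) = @message[7, 31];

    # Autovivification creates the anonymous array the first time an IP
    # is seen, so no exists check is needed. push appends in place to the
    # array behind the reference; nothing is copied.
    push @{ $ipURL{$ip} }, $url;
}

# Summarize how many URLs were collected per IP.
printf "%s => %d URLs\n", $_, scalar @{ $ipURL{$_} } for sort keys %ipURL;
```

With this shape each line costs one hash lookup and one append, regardless of how many URLs that IP has already collected, so the total work should grow roughly linearly with the number of lines.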

In reply to Hash Search is VERY slow by rtjensen
