in reply to Re: Memory utilization and hashes
in thread Memory utilization and hashes

It turns out that the Unix sort was exactly the missing prior step needed to speed this up. With a correct choice of sort keys, the file is now in sequential order by "ID". When a new Query comes in, it is easy to check whether the current "ID" differs from the prior "ID"; if it does, I flush any accumulated hash entries and continue. In testing, this keeps the hash to no more than 3-7 'extra' keys for each set of "ID"s in the file before the set is dumped.

Memory usage has stayed small, and processing now takes approximately 1/4 of the total time of the prior runs.
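For anyone following along, here is a minimal sketch of the flush-on-ID-change loop described above. It is written in Python rather than Perl because the actual script is not shown in this thread, and it assumes the `type;ID;field;value` layout of the sample data. The function name `flush_on_id_change` and the sample lines are made up for illustration.

```python
from collections import defaultdict


def flush_on_id_change(lines):
    """Yield (id, entries) once per run of consecutive lines sharing an ID.

    Assumes the input is already sorted so that all lines for one ID
    are adjacent (for example by an external sort beforehand).
    """
    prior_id = None
    pending = defaultdict(list)  # the small, short-lived hash
    for line in lines:
        line = line.strip()
        if not line:
            continue
        kind, rec_id, _field, value = line.split(";", 3)
        # The ID changed: nothing more can arrive for the prior ID,
        # so dump what was accumulated and empty the hash.
        if prior_id is not None and rec_id != prior_id:
            yield prior_id, dict(pending)
            pending.clear()
        pending[kind].append(value)
        prior_id = rec_id
    # Flush whatever is left once the input ends.
    if prior_id is not None:
        yield prior_id, dict(pending)


if __name__ == "__main__":
    # Hypothetical, already ID-sorted input.
    sample = [
        "Query;1;host;www.example.com",
        "Answer;1;ip;1.2.3.4",
        "Query;3;host;www.google.com",
        "Answer;3;ip;3.4.5.6",
    ]
    for rec_id, entries in flush_on_id_change(sample):
        print(rec_id, entries)
```

Because the hash only ever holds the entries for one ID at a time, its size depends on the largest single group rather than on the size of the file.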

Re^3: Memory utilization and hashes
by poj (Abbot) on Jan 19, 2018 at 13:34 UTC

    What does the sample data you provided look like after the *nix sort?

    Query;1;host;www.example.com
    Answer;1;ip;1.2.3.4
    Query;2;host;www.cnn.com
    Query;3;host;www.google.com
    Answer;2;ip;2.3.4.5
    Answer;2;ip;2.3.4.5
    Query;4;host;www.google.com
    Answer;4;ip;3.4.5.6
    Answer;3;ip;3.4.5.6
    Query;2;host;www.example2.com
    Answer;4;ip;1.2.4.5
    Answer;2;ip;2.3.4.5

    poj

      The sample data is actually missing a field. In the real data file, each entry also includes its date and time.

      Once the file is sorted by date and ID, I can be sure that if the date changes and the ID changes as well, there are no more answers to be had for that query, so I can dump the data, empty the hash and move on.

      The real file is more like this once sorted:

      2018-01-25 01:01:01;Query;1;host;www.example.com
      2018-01-25 01:01:01;Answer;1;ip;1.2.3.4
      2018-01-25 01:01:05;Query;2;host;www.cnn.com
      2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
      2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
      2018-01-25 01:01:06;Query;3;host;www.google.com
      2018-01-25 01:01:06;Answer;3;ip;3.4.5.6
      2018-01-25 01:01:08;Query;4;host;www.google.com
      2018-01-25 01:01:08;Answer;4;ip;3.4.5.6
      2018-01-25 01:01:08;Answer;4;ip;1.2.4.5
      2018-01-25 01:01:11;Query;2;host;www.example2.com
      2018-01-25 01:01:11;Answer;2;ip;2.3.4.5

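A rough sketch of how the "date changed and ID changed" rule could be applied to lines in the timestamped layout above. It is in Python rather than Perl since the real script is not posted here, and the function name `group_sorted_log` is hypothetical. It takes the rule literally: the accumulated data is dumped only when both the timestamp and the ID differ from the previous line.

```python
def group_sorted_log(lines):
    """Yield (id, host, ips) for each query in a date/ID-sorted log.

    Expects lines shaped like 'timestamp;type;ID;field;value'.
    """
    prior_stamp = prior_id = None
    host, ips = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp, kind, rec_id, _field, value = line.split(";", 4)
        # Both the date and the ID changed: no more answers can follow
        # for the previous query, so dump it and start afresh.
        if prior_id is not None and stamp != prior_stamp and rec_id != prior_id:
            yield prior_id, host, ips
            host, ips = None, []
        if kind == "Query":
            host = value
        elif kind == "Answer":
            ips.append(value)
        prior_stamp, prior_id = stamp, rec_id
    # Dump the final group at end of input.
    if prior_id is not None:
        yield prior_id, host, ips


if __name__ == "__main__":
    sample = [
        "2018-01-25 01:01:08;Query;4;host;www.google.com",
        "2018-01-25 01:01:08;Answer;4;ip;3.4.5.6",
        "2018-01-25 01:01:08;Answer;4;ip;1.2.4.5",
        "2018-01-25 01:01:11;Query;2;host;www.example2.com",
        "2018-01-25 01:01:11;Answer;2;ip;2.3.4.5",
    ]
    for rec_id, host, ips in group_sorted_log(sample):
        print(rec_id, host, ips)
```

Note that a reused ID (such as 2 in the sample) is kept apart here only because its timestamp differs as well, which is why sorting on the date as well as the ID matters.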
Re^3: Memory utilization and hashes
by bfdi533 (Friar) on Jan 18, 2018 at 23:46 UTC

    For what it is worth, and if anyone is interested, here are some stats from the processing after I introduced the *nix sort before my Perl script.

     elapsed time    | type      |rows after| rows before| pct of| rows/second 
                     |           |processing| processing | before| (rows before)
     00:03:05.98667  | dns       |  1791555 |    4614653 | 38.82 | 24811.7405403301
     00:03:50.106203 | dns       |  2262736 |    5822777 | 38.86 |  25304.737221708
     00:04:51.91195  | dns       |  2733705 |    7039758 | 38.83 | 24116.0322487654
     00:05:36.348691 | dns       |  3208365 |    8266995 | 38.81 | 24578.6447850335
     00:06:33.947878 | dns       |  3683419 |    9490938 | 38.81 | 24091.8622234589
     00:07:35.58667  | dns       |  4155971 |   10705249 | 38.82 | 23497.7221787459
     00:08:25.086565 | dns       |  4633553 |   11946401 | 38.79 | 23652.1852447214
     00:09:07.952743 | dns       |  5109618 |   13183845 | 38.76 | 24060.1861536808
     00:10:16.250404 | dns       |  5596902 |   14441405 | 38.76 | 23434.3132373833
     00:10:54.578348 | dns       |  6070888 |   15662586 | 38.76 | 23927.7483709253
     00:11:39.012952 | dns       |  6547181 |   16896184 | 38.75 | 24171.4891714911
     00:12:43.13814  | dns       |  7019314 |   18113219 | 38.75 | 23735.1772249255
     00:13:34.23578  | dns       |  7499659 |   19365386 | 38.73 | 23783.5114541392
     00:14:35.939246 | dns       |  7973633 |   20591767 | 38.72 | 23508.2137191967
     00:15:12.223167 | dns       |  8448494 |   21815382 | 38.73 | 23914.5231004641
     00:15:52.951662 | dns       |  8923786 |   23043433 | 38.73 | 24181.1142357817
     00:17:45.637116 | dns       |  9402613 |   24278649 | 38.73 | 22783.2238906363
     00:17:52.402055 | dns       |  9880079 |   25516948 | 38.72 | 23794.1990888856
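As a sanity check on how the last two columns relate to the others, the arithmetic can be reproduced from any row. The figures appear to work out as rows after divided by rows before (as a percentage) and rows before divided by elapsed seconds. This is a small Python sketch, with `parse_elapsed` and `row_stats` being illustrative names only, not part of the original script.

```python
def parse_elapsed(text):
    """Convert an 'HH:MM:SS.fraction' string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def row_stats(elapsed, rows_after, rows_before):
    """Return (percent of input rows remaining, input rows per second)."""
    seconds = parse_elapsed(elapsed)
    pct = 100.0 * rows_after / rows_before
    rate = rows_before / seconds
    return pct, rate


if __name__ == "__main__":
    # First row of the table above.
    pct, rate = row_stats("00:03:05.98667", 1791555, 4614653)
    print(f"{pct:.2f}% of input rows remain, {rate:.1f} rows/second")
```

Read this way, the output holds a little under 39% of the input rows in every run, and throughput stays roughly flat as the input grows.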