ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

So I have an array of hashes like this one:
{
    'number_of_subs'    => 5,
    'report_map'        => 'test5',
    'number_of_mains'   => 2,
    'system_start_time' => '1564574677317',
    'system_id'         => '453521412412',
    'timestamp'         => 1564574676,
    'user'              => 'asdasvb',
    'mains'             => [
        {
            'main_path' => 'play_ground/MAIN',
            'subs'      => [
                {
                    'info' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path' => 'GROUP1/Test1',
                    'sub_name' => 'Test1',
                },
                {
                    'data' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path' => 'GROUP2',
                    'sub_name' => 'GROUP2',
                },
                {
                    'info' => [
                        { 'version' => [ '3.14' ], 'group' => 'A' },
                        { 'version' => [ '2.56' ], 'group' => 'B' },
                        { 'version' => [ '6.13.00', '6.14.00' ], 'group' => 'C' },
                    ],
                    'sub_path' => 'Test1',
                    'sub_name' => 'Test1',
                },
            ],
            'main_name' => 'MAIN',
        },
        {
            'main_path' => 'play_ground/MAIN1',
            'subs'      => [
                {
                    'info' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path'  => 'TEST2/SUB1',
                    'sub_group' => 'SUB1',
                },
                {
                    'info' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path' => 'TEST2/SUB2',
                    'sub_name' => 'SUB2',
                },
            ],
            'main_name' => 'MAIN1',
        },
    ],
}
As you can see, the 'mains' level contains an array of hashes, each of which holds a 'subs' array plus 'main_name' and 'main_path' fields.
'subs' is in turn an array of hashes, each containing 'sub_name', 'sub_path', and an 'info' array.
I'm trying to build a hash which contains only the latest version of each block. To explain, I will use the following examples (I mark the mains as main<index> and the subs as subs<index>):

First report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Second report:
    main1:
        subs1:
            sub_name: sub2
            sub_path: path/to/sub2
            info: { group = "ABC", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Third report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Fourth report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Fifth report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }
        main_name: ROOT_OTHER
        main_path: /PATH/TO/ROOT
Then the merge will be as follows:
Merge of first and second (explanation: they have the same main_name and main_path, but different sub_name and sub_path):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        subs2:
            sub_name: sub2
            sub_path: path/to/sub2
            info: { group = "ABC", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Merge of first and third (explanation: same main, same subs, and same info level, so the result is simply the latest, which here is the first report):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Merge of first and fourth (explanation: same main and same subs, but not the same info level):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }, { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Merge of first and fifth (explanation: they have different main_name):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
    main2:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }
        main_name: ROOT_OTHER
        main_path: /PATH/TO/ROOT
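In other words, the merge rules identify a main by its (main_name, main_path) pair and a sub by its (sub_name, sub_path) pair. In Perl that can be expressed as composite hash keys, for example (the "\0" separator is just an illustrative choice):

    # Composite keys for duplicate detection: a main is identified by
    # (main_name, main_path), a sub by (sub_name, sub_path).
    my $main_key = join "\0", $main->{main_name}, $main->{main_path};
    my $sub_key  = join "\0", $sub->{sub_name},  $sub->{sub_path};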
Each report contains a 'timestamp', so I thought of iterating through each block and taking the latest, but that does not feel efficient;
I would also have to keep a 'timestamp' field on each block and then remove it at the end (because I need to compare the timestamp on each iteration).
I would love to hear suggestions for an algorithm on how to approach this issue.
I also tried to solve this issue from the DB side (link: https://www.perlmonks.org/?node_id=11103616), but I understood that it's better to get the report, convert it to a data structure, and parse that.
The idea I thought about: for each main block, check whether it already exists in the output hash (by main_path and main_name); if not, insert it as-is and record the timestamp.
If it does exist, add all the subs blocks that are not already included, and for the blocks that are the same, take the latest according to the timestamp we saved.
It feels like a bad algorithm with bad efficiency. Any ideas?
Thank you all. EDIT: What I have done so far:
my %output_reports;
foreach my $main (@{ $data->{"mains"} }) {
    my $uniq_main = 1;
    foreach my $new_main (@{ $output_reports{"mains"} }) {
        if (   $main->{"main_path"} eq $new_main->{"main_path"}
            && $main->{"main_name"} eq $new_main->{"main_name"} )
        {
            $uniq_main = 0;
            my $uniq_sub = 1;
            foreach my $sub (@{ $main->{"subs"} }) {
                foreach my $new_sub (@{ $new_main->{"subs"} }) {
                    if (   $sub->{"sub_path"} eq $new_sub->{"sub_path"}
                        && $sub->{"sub_name"} eq $new_sub->{"sub_name"} )
                    {
                        # Stuck here - I need the timestamp
                    }
                }
            }
        }
    }
    if ($uniq_main) {
        push(@{ $output_reports{"mains"} }, $main);
    }
}
I'm stuck because inside the "subs" loop I need to compare timestamps, and I don't have one at that point.
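A minimal sketch of one way to carry the timestamp, assuming every report hash has the top-level 'timestamp' field shown in the dump above (the %merged layout and the merge_reports name are illustrative, not existing code):

use strict;
use warnings;

# Merge a list of report hashes, keeping only the newest copy of each sub.
# Mains are keyed by (main_name, main_path), subs by (sub_name, sub_path).
sub merge_reports {
    my (@reports) = @_;
    my %merged;    # main_key => { main => {...}, subs => { sub_key => { ts => ..., sub => {...} } } }

    for my $report (@reports) {
        my $ts = $report->{timestamp};    # one timestamp per report
        for my $main (@{ $report->{mains} }) {
            my $main_key = join "\0", $main->{main_name}, $main->{main_path};
            my $slot = $merged{$main_key} //= { main => $main, subs => {} };
            for my $sub (@{ $main->{subs} }) {
                # '// ""' guards against subs that lack a field
                # (one sub in the dump has sub_group instead of sub_name).
                my $sub_key = join "\0", $sub->{sub_name} // '', $sub->{sub_path} // '';
                my $seen = $slot->{subs}{$sub_key};
                # Keep this sub only if it is new or newer than what we have.
                if ( !$seen || $ts > $seen->{ts} ) {
                    $slot->{subs}{$sub_key} = { ts => $ts, sub => $sub };
                }
            }
        }
    }

    # Rebuild the output structure; the per-sub timestamps are dropped here.
    my @mains;
    for my $slot ( values %merged ) {
        my %main = %{ $slot->{main} };
        $main{subs} = [ map { $_->{sub} } values %{ $slot->{subs} } ];
        push @mains, \%main;
    }
    return { mains => \@mains };
}

Because duplicate detection is a hash lookup, each sub is visited once per report instead of being rescanned in nested loops. Note this sketch only covers the "take the latest" cases; the first-and-fourth rule (combining differing info blocks for the same sub) would additionally need a deep comparison of the two info arrays, e.g. via a canonical JSON encoding, before deciding to append rather than replace.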

Re: Parsing output
by 1nickt (Canon) on Aug 02, 2019 at 21:47 UTC

    Hi there again,

    "Any ideas?"

    1. Take everything you've learned so far about working with Perl data structures and put it back in your newly-expanded toolbox. You'll never not need it, good job.
    2. Throw out most of your program so far.
    3. Set up a relational database for your data. It's time.

    For the type of query that you want, being able to say select * from sub where subname = 'foo' order by date_modified desc limit 1, or similar simple SQL, is about 95% easier than trying to morph nested Perl data structures from one format into another.

    I understand that you are working with a Mongo DB project, but I am not familiar with Mongo, and it looks like you are storing nested structures yet are not able to query them? If that's so, and you don't just need to read some more of the Mongo docs, set up an SQLite DB to crunch your data.

    You can even set it up in memory in your process, populate it as you read in data, query it to analyze the data and make reports, and throw it away when the process is finished. I use exactly this technique to build nightly reports from a commercial website, reading in raw data from two or three places, building a temporary SQLite DB, and producing the reports from there.
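    For example, a minimal sketch of that in-memory approach with DBI and DBD::SQLite (the table and column names here are invented for illustration, not taken from your data):

        use strict;
        use warnings;
        use DBI;

        # Throwaway in-memory SQLite database; it vanishes with $dbh.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
            { RaiseError => 1, AutoCommit => 1 } );

        $dbh->do(q{
            CREATE TABLE sub (
                main_name TEXT,
                main_path TEXT,
                sub_name  TEXT,
                sub_path  TEXT,
                info      TEXT,     -- e.g. the info array as a JSON blob
                timestamp INTEGER
            )
        });

        # Populate it while reading the reports in ...
        my $ins = $dbh->prepare('INSERT INTO sub VALUES (?, ?, ?, ?, ?, ?)');

        # ... and then "latest block per sub" is a single query.
        my $latest = $dbh->selectrow_hashref(q{
            SELECT * FROM sub
            WHERE sub_name = ?
            ORDER BY timestamp DESC LIMIT 1
        }, undef, 'Test1');

    When $dbh is disconnected or goes out of scope, the whole database disappears with the process, which is exactly what you want for a crunch-and-report job.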

    The time it takes to figure out how to do that will not greatly exceed the time it takes to build up a huge ugly Perl program to make the same queries. Your code will be a tenth as long and 100 times more maintainable, readable, and robust.

    Hope this helps!


    The way forward always starts with a minimal test.
      Thanks for the suggestion. We already use MongoDB, and it would take a lot of time to migrate to an SQL DB because we already have a lot of reports in MongoDB.
      I will still try my luck here waiting for a Perl suggestion.
      Thank you for this post and all of the others as well!
        You will probably have more luck if you narrow things down into short example(s) of the pieces you have problems with. We are here to help with language and programming problems, not to solve your business logic.


        holli

        You can lead your users to water, but alas, you cannot drown them.

        Well, good luck. I'll just note that I was not suggesting dumping Mongo. In fact I mentioned that you should be able to query your structured data there (https://docs.mongodb.com/manual/tutorial/query-embedded-documents/). I suggested marshalling your data for insertion into Mongo using a temporary SQL DB instead of the complex and hard-to-maintain Perl data manipulation routines you are struggling with.
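        For instance, with the MongoDB Perl driver an embedded-document query could look something like this (the database name, collection name, and filter values are assumed for illustration):

            use strict;
            use warnings;
            use MongoDB;

            # Connect and grab the collection (names assumed, not from the OP).
            my $client  = MongoDB->connect('mongodb://localhost');
            my $reports = $client->get_database('mydb')->get_collection('reports');

            # Dot notation reaches into the embedded 'mains' documents;
            # sort by timestamp descending and take the newest match.
            my $cursor = $reports->find(
                { 'mains.main_name' => 'ROOT' },
                { sort => { timestamp => -1 }, limit => 1 },
            );
            my $latest = $cursor->next;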


        The way forward always starts with a minimal test.
Re: Parsing output
by AnomalousMonk (Archbishop) on Aug 02, 2019 at 20:48 UTC