ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

So I have an array of hashes like this one:
{
    'number_of_subs'    => 5,
    'report_map'        => 'test5',
    'number_of_mains'   => 2,
    'system_start_time' => '1564574677317',
    'system_id'         => '453521412412',
    'timestamp'         => 1564574676,
    'user'              => 'asdasvb',
    'mains'             => [
        {
            'main_path' => 'play_ground/MAIN',
            'subs'      => [
                {
                    'info' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path' => 'GROUP1/Test1',
                    'sub_name' => 'Test1',
                },
                {
                    'data' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path' => 'GROUP2',
                    'sub_name' => 'GROUP2',
                },
                {
                    'info' => [
                        { 'version' => [ '3.14' ], 'group' => 'A' },
                        { 'version' => [ '2.56' ], 'group' => 'B' },
                        { 'version' => [ '6.13.00', '6.14.00' ], 'group' => 'C' },
                    ],
                    'sub_path' => 'Test1',
                    'sub_name' => 'Test1',
                },
            ],
            'main_name' => 'MAIN',
        },
        {
            'main_path' => 'play_ground/MAIN1',
            'subs'      => [
                {
                    'info' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path'  => 'TEST2/SUB1',
                    'sub_group' => 'SUB1',
                },
                {
                    'info' => [
                        { 'version' => [ 'tcsh' ],    'group' => 'pkgs' },
                        { 'version' => [ '6.13.00' ], 'group' => 'tcsh' },
                    ],
                    'sub_path' => 'TEST2/SUB2',
                    'sub_name' => 'SUB2',
                },
            ],
            'main_name' => 'MAIN1',
        },
    ],
}
As you can see, the 'mains' level contains an array of hashes, each of which holds a 'subs' array plus 'main_name' and 'main_path' fields.
'subs' is in turn an array of hashes, each containing 'sub_name', 'sub_path', and an 'info' array.
I'm trying to build a hash which contains only the latest version of each block. To explain, I will use the following examples (I mark the mains as main<index> and the subs as subs<index>):

First report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Second report:
    main1:
        subs1:
            sub_name: sub2
            sub_path: path/to/sub2
            info: { group = "ABC", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Third report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Fourth report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Fifth report:
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }
        main_name: ROOT_OTHER
        main_path: /PATH/TO/ROOT
Then the merge will be as follows:
Merge of first and second (explanation: they have the same main_name and main_path, but different sub_name and sub_path):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        subs2:
            sub_name: sub2
            sub_path: path/to/sub2
            info: { group = "ABC", version = "1.5.6","4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Merge of first and third (explanation: same main, same subs, and same info level, so the result is simply the latest, which here is the first report):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Merge of first and fourth (explanation: same main and same subs, but not the same info level):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }, { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
Merge of first and fifth (explanation: they have different main_name):
    main1:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "ABC", version = "4.2.1" }
        main_name: ROOT
        main_path: /PATH/TO/ROOT
    main2:
        subs1:
            sub_name: sub1
            sub_path: path/to/sub1
            info: { group = "XYZ", version = "1.5.6","4.2.1" }
        main_name: ROOT_OTHER
        main_path: /PATH/TO/ROOT
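In other words, the merge rules identify a main by its (main_name, main_path) pair and a sub by its (sub_name, sub_path) pair. In Perl that can be expressed as composite hash keys, for example (the "\0" separator is just an illustrative choice):

    # Composite keys for duplicate detection: a main is identified by
    # (main_name, main_path), a sub by (sub_name, sub_path).
    my $main_key = join "\0", $main->{main_name}, $main->{main_path};
    my $sub_key  = join "\0", $sub->{sub_name},  $sub->{sub_path};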
Each report contains a 'timestamp', so I thought of iterating through each block and taking the latest, but that does not feel efficient;
I would also have to keep a 'timestamp' field on each block and then remove it at the end (because I need to compare the timestamp on each iteration).
I would love to hear suggestions for an algorithm on how to approach this issue.
I also tried to solve this issue from the DB side (link: https://www.perlmonks.org/?node_id=11103616), but I understood that it's better to get the report, convert it to a data structure, and parse that.
The idea I thought about: for each main block, check whether it already exists in the output hash (by main_path and main_name); if not, insert it as-is and record the timestamp.
If it does exist, add all the subs blocks that are not already included, and for the blocks that are the same, take the latest according to the timestamp we saved.
It feels like a bad algorithm with bad efficiency. Any ideas?
Thank you all. EDIT: What I have done so far:
my %output_reports;
foreach my $main (@{ $data->{"mains"} }) {
    my $uniq_main = 1;
    foreach my $new_main (@{ $output_reports{"mains"} }) {
        if (   $main->{"main_path"} eq $new_main->{"main_path"}
            && $main->{"main_name"} eq $new_main->{"main_name"} )
        {
            $uniq_main = 0;
            my $uniq_sub = 1;
            foreach my $sub (@{ $main->{"subs"} }) {
                foreach my $new_sub (@{ $new_main->{"subs"} }) {
                    if (   $sub->{"sub_path"} eq $new_sub->{"sub_path"}
                        && $sub->{"sub_name"} eq $new_sub->{"sub_name"} )
                    {
                        # Stuck here - I need the timestamp
                    }
                }
            }
        }
    }
    if ($uniq_main) {
        push(@{ $output_reports{"mains"} }, $main);
    }
}
I'm stuck because inside the "subs" loop I need to compare timestamps, and I don't have one at that point.
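A minimal sketch of one way to carry the timestamp, assuming every report hash has the top-level 'timestamp' field shown in the dump above (the %merged layout and the merge_reports name are illustrative, not existing code):

use strict;
use warnings;

# Merge a list of report hashes, keeping only the newest copy of each sub.
# Mains are keyed by (main_name, main_path), subs by (sub_name, sub_path).
sub merge_reports {
    my (@reports) = @_;
    my %merged;    # main_key => { main => {...}, subs => { sub_key => { ts => ..., sub => {...} } } }

    for my $report (@reports) {
        my $ts = $report->{timestamp};    # one timestamp per report
        for my $main (@{ $report->{mains} }) {
            my $main_key = join "\0", $main->{main_name}, $main->{main_path};
            my $slot = $merged{$main_key} //= { main => $main, subs => {} };
            for my $sub (@{ $main->{subs} }) {
                # '// ""' guards against subs that lack a field
                # (one sub in the dump has sub_group instead of sub_name).
                my $sub_key = join "\0", $sub->{sub_name} // '', $sub->{sub_path} // '';
                my $seen = $slot->{subs}{$sub_key};
                # Keep this sub only if it is new or newer than what we have.
                if ( !$seen || $ts > $seen->{ts} ) {
                    $slot->{subs}{$sub_key} = { ts => $ts, sub => $sub };
                }
            }
        }
    }

    # Rebuild the output structure; the per-sub timestamps are dropped here.
    my @mains;
    for my $slot ( values %merged ) {
        my %main = %{ $slot->{main} };
        $main{subs} = [ map { $_->{sub} } values %{ $slot->{subs} } ];
        push @mains, \%main;
    }
    return { mains => \@mains };
}

Because duplicate detection is a hash lookup, each sub is visited once per report instead of being rescanned in nested loops. Note this sketch only covers the "take the latest" cases; the first-and-fourth rule (combining differing info blocks for the same sub) would additionally need a deep comparison of the two info arrays, e.g. via a canonical JSON encoding, before deciding to append rather than replace.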

Re: Parsing output
by 1nickt (Canon) on Aug 02, 2019 at 21:47 UTC

    Hi there again,

    "Any ideas?"

    1. Take everything you've learned so far about working with Perl data structures and put it back in your newly-expanded toolbox. You'll never not need it, good job.
    2. Throw out most of your program so far.
    3. Set up a relational database for your data. It's time.

    For the type of query that you want, being able to say select * from sub where subname = 'foo' order by date_modified desc limit 1, or similar simple SQL, is about 95% easier than trying to morph nested Perl data structures from one format into another.

    I understand that you are working with a Mongo DB project, but I am not familiar with Mongo, and it looks like you are storing nested structures yet are not able to query them? If that's so, and you don't just need to read some more of the Mongo docs, set up an SQLite DB to crunch your data.

    You can even set it up in memory in your process, populate it as you read in data, query it to analyze the data and make reports, and throw it away when the process is finished. I use exactly this technique to build nightly reports from a commercial website, reading in raw data from two or three places, building a temporary SQLite DB, and producing the reports from there.
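    For example, a minimal sketch of that in-memory approach with DBI and DBD::SQLite (the table and column names here are invented for illustration, not taken from your data):

        use strict;
        use warnings;
        use DBI;

        # Throwaway in-memory SQLite database; it vanishes with $dbh.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
            { RaiseError => 1, AutoCommit => 1 } );

        $dbh->do(q{
            CREATE TABLE sub (
                main_name TEXT,
                main_path TEXT,
                sub_name  TEXT,
                sub_path  TEXT,
                info      TEXT,     -- e.g. the info array as a JSON blob
                timestamp INTEGER
            )
        });

        # Populate it while reading the reports in ...
        my $ins = $dbh->prepare('INSERT INTO sub VALUES (?, ?, ?, ?, ?, ?)');

        # ... and then "latest block per sub" is a single query.
        my $latest = $dbh->selectrow_hashref(q{
            SELECT * FROM sub
            WHERE sub_name = ?
            ORDER BY timestamp DESC LIMIT 1
        }, undef, 'Test1');

    When $dbh is disconnected or goes out of scope, the whole database disappears with the process, which is exactly what you want for a crunch-and-report job.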

    The time it takes to figure out how to do that will not greatly exceed the time it takes to build up a huge ugly Perl program to make the same queries. Your code will be a tenth as long and 100 times more maintainable, readable, and robust.

    Hope this helps!


    The way forward always starts with a minimal test.
      Thanks for the suggestion. We already use MongoDB, and it would take a lot of time to migrate to an SQL DB because we already have a lot of reports in MongoDB.
      I will still try my luck here waiting for a Perl suggestion.
      Thank you for this post and all of the others as well!
        You will probably have more luck if you narrow things down into short example(s) of the pieces you have problems with. We are here to help with language and programming problems, not to solve your business logic.


        holli

        You can lead your users to water, but alas, you cannot drown them.

        Well, good luck. I'll just note that I was not suggesting dumping Mongo. In fact I mentioned that you should be able to query your structured data there (https://docs.mongodb.com/manual/tutorial/query-embedded-documents/). I suggested marshalling your data for insertion into Mongo using a temporary SQL DB instead of the complex and hard-to-maintain Perl data manipulation routines you are struggling with.
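        For instance, with the MongoDB Perl driver an embedded-document query could look something like this (the database name, collection name, and filter values are assumed for illustration):

            use strict;
            use warnings;
            use MongoDB;

            # Connect and grab the collection (names assumed, not from the OP).
            my $client  = MongoDB->connect('mongodb://localhost');
            my $reports = $client->get_database('mydb')->get_collection('reports');

            # Dot notation reaches into the embedded 'mains' documents;
            # sort by timestamp descending and take the newest match.
            my $cursor = $reports->find(
                { 'mains.main_name' => 'ROOT' },
                { sort => { timestamp => -1 }, limit => 1 },
            );
            my $latest = $cursor->next;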


        The way forward always starts with a minimal test.
Re: Parsing output
by AnomalousMonk (Archbishop) on Aug 02, 2019 at 20:48 UTC