pr33 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to build a complex data structure (a hash of hashes) by reading a large file.

In the data, each zone in the stack has some clusters, and each cluster has some hosts, listed with their status, resource capacity, etc.

I am trying to bundle the data into a nested hash. The first hash is keyed by zone, with the corresponding clusters in that zone as values; the second hash is keyed by cluster name and holds the host status and CPU/memory capacity.

Below is my input data

List of Zones in this Stack……..

Zone ID         ZONE Name
-------------------------------------------------------------
8f-bx-33        SVM-Zone
72-0f-163       K2PHB
11x-223a-44f    K2B-Zone1

SVM-Zone
List of HVM Clusters, Hosts Status and its Capacity in this Zone....

Cluster ID    Cluster Name      Cluster Type    Memory OverCommit Ratio    CPU OverCommit Ratio
-----------------------------------------------------------------------------------------------
6500b1        PO01-Cluster1     HVM             3.0                        4.2
b2732096      PO046-Cluster1    HVM             1.0                        2.25
9ff0d432      PO26-CLUSTER01    HVM             1.0                        3.25

PO01-Cluster1

Host Name             No. of Running VMs    CS Host Status    CS Resource State
--------------------------------------------------------------------------------
cork.example.com      37                    Up                Enabled
soy.example.com       31                    Up                Enabled
bot.example.com       25                    Down              Enabled
bunker.example.com    28                    Maintenance       Enabled

Total No. of Hosts in this Cluster: 4
No. of HOSTS Up: 3
No. of HOSTS Down: 1

Listing the current capacity in the cluster

Resource Type         Total Capacity    Available Capacity    Used Capacity    Used Percentage
-----------------------------------------------------------------------------------------------
CPU:                  3949740 MHz       592740 MHz            3357000 MHz      84.99%
MEMORY:               10014 GB          979 GB                9035 GB          90.22%
ACTUAL STORAGE:       41279 GB          24731 GB              16547 GB         40.09%
ALLOCATED STORAGE:    81920 GB          24840 GB              57079 GB         69.68%

PO046-Cluster1

Host Name               No. of Running VMs    CS Host Status    CS Resource State
----------------------------------------------------------------------------------
fort.example.com        20                    Up                Enabled
server1.example.com     20                    Up                Enabled
bolverk.example.com     25                    Up                Enabled
rand.example.com        0                     Down              Enabled
keystone.example.com    20                    Up                Enabled

Total No. of Hosts in this Cluster: 5
No. of HOSTS Up: 4
No. of HOSTS Down: 1

Listing the current capacity in the cluster

Resource Type         Total Capacity    Available Capacity    Used Capacity    Used Percentage
-----------------------------------------------------------------------------------------------
CPU:                  3949200 MHz       216325 MHz            3732875 MHz      94.52%
MEMORY:               10077 GB          1381 GB               8696 GB          86.29%
ACTUAL STORAGE:       40960 GB          21700 GB              19259 GB         47.02%
ALLOCATED STORAGE:    81920 GB          15361 GB              66558 GB         81.25%

PO26-CLUSTER01

Host Name               No. of Running VMs    CS Host Status    CS Resource State
----------------------------------------------------------------------------------
cedar.example.com       19                    Up                Enabled
kentucky.example.com    21                    Down              Enabled
rose.example.com        19                    Up                Enabled
melt.example.com        15                    Down              Enabled
henry.example.com       23                    Up                Enabled
rant.example.com        23                    Up                Enabled
rosalind.example.com    26                    Do                Enabled

Total No. of Hosts in this Cluster: 7
No. of HOSTS Up: 4
No. of HOSTS Down: 3

Listing the current capacity in the cluster

Resource Type         Total Capacity    Available Capacity    Used Capacity    Used Percentage
-----------------------------------------------------------------------------------------------
CPU:                  3949740 MHz       637740 MHz            3312000 MHz      83.85%
MEMORY:               10077 GB          977 GB                9100 GB          90.3%
ACTUAL STORAGE:       41779 GB          15963 GB              25815 GB         61.79%
ALLOCATED STORAGE:    83558 GB          9049 GB               74508 GB         89.17%

K2PHB
List of HVM Clusters, Hosts Status and its Capacity in this Zone....

Cluster ID       Cluster Name        Cluster Type    Memory OCR    CPU OCR
---------------------------------------------------------------------------
a95630a82bf      PC1-P01-Cluster1    HVM             1.0           2
441fd92c-163e    PC1-P02-Cluster1    HVM             1.0           2.25

PC1-P01-Cluster1

Host Name               No. of Running VMs    CS Host Status    CS Resource State
----------------------------------------------------------------------------------
pc-lhv01.example.com    22                    Up                Enabled
pc-lhv02.example.com    20                    Up                Enabled
pc-lhv03.example.com    25                    Up                Enabled

Total No. of Hosts in this Cluster: 3
No. of HOSTS Up: 3
No. of HOSTS Down: 0

Listing the current capacity in the cluster

Resource Type         Total Capacity    Available Capacity    Used Capacity    Used Percentage
-----------------------------------------------------------------------------------------------
CPU:                  3510400 MHz       739136 MHz            2771264 MHz      78.94%
MEMORY:               10109 GB          2773 GB               7336 GB          72.56%
ACTUAL STORAGE:       41180 GB          25535 GB              15645 GB         37.99%
ALLOCATED STORAGE:    81920 GB          33547 GB              48372 GB         59.05%

PC1-P02-Cluster1

Host Name                          No. of Running VMs    CS Host Status    CS Resource State
---------------------------------------------------------------------------------------------
nwk-pci-pod02-lhv08.example.com    1                     Up                Enabled
nwk-pci-pod02-lhv11.example.com    20                    Up                Enabled

Total No. of Hosts in this Cluster: 2
No. of HOSTS Up: 2
No. of HOSTS Down: 0

Listing the current capacity in the cluster

Resource Type         Total Capacity    Available Capacity    Used Capacity    Used Percentage
-----------------------------------------------------------------------------------------------
CPU:                  3950280 MHz       1234155 MHz           2716125 MHz      68.76%
MEMORY:               10085 GB          2937 GB               7148 GB          70.87%
ACTUAL STORAGE:       41976 GB          18896 GB              23079 GB         54.98%
ALLOCATED STORAGE:    81920 GB          24708 GB              57211 GB         69.84%

K2B-Zone1
List of KVM Clusters, Hosts Status and its Capacity in this Zone....

Cluster ID          Cluster Name        Cluster Type    Memory OverCommit Ratio    CPU OverCommit Ratio
-------------------------------------------------------------------------------------------------------
08d-b0c9acd8887e    K2B-PD1-Cluster1    HVM             1.0                        4

K2B-PD1-Cluster1

Host Name                 No. of Running VMs    CS Host Status    CS Resource State
------------------------------------------------------------------------------------
k2b-lhv-01.example.com    0                     Up                Enabled
k2b-lhv-02.example.com    4                     Up                Enabled
k2b-lhv-03.example.com    0                     Disconnected      Enabled
k2b-lhv-04.example.com    0                     Disconnected      Enabled

Total No. of Hosts in this Cluster: 4
No. of HOSTS Up: 2
No. of HOSTS Down: 2
No. of HOSTS in Disconnected State: 2

Listing the current capacity in the cluster

Resource Type         Total Capacity    Available Capacity    Used Capacity    Used Percentage
-----------------------------------------------------------------------------------------------
CPU:                  8073920 MHz       7584920 MHz           489000 MHz       6.06%
MEMORY:               5801 GB           4632 GB               1169 GB          20.16%
ACTUAL STORAGE:       46206 GB          44527 GB              1678 GB          3.63%
ALLOCATED STORAGE:    81920 GB          24708 GB              57211 GB         69.84%

Below is my Code

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

#######################
my ($zoneid, $zonename);
my ($clusid, $clusname);
my ($host, $hoststatus);
my ($cpu, $cpuusage);
my ($mem, $memusage);
my ($storage, $storage_used);
my %ClusInfo;
my @zones;
my $clushashref = {};
########################

sub getZoneInfo {
    my $file = shift;
    open my $fh, '<', $file or die "Unable to Open the File $file for reading: $!\n";
    my $on;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^Zone/i) {
            $on = 1;
        }
        elsif ($on) {
            last if $line =~ /^$/;
            $line =~ s/-{2,}//g;
            ($zonename) = (split /\s+/, $line)[1];
            push @zones, $zonename if defined ($zonename);
        }
    }
    close($fh);
    # print Dumper \@zones;
    return \@zones;
}

sub get_Kvm_Clusters_of_Zones {
    my $file = shift;
    my $ZoneAR = &getZoneInfo($file);
    open my $fh, '<', $file or die "Unable to Open the File $file for reading: $!\n";
    while (my $line = <$fh> ) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;
        $line =~ s/^\s.*//g;
        next if $line =~ /^$/;
        foreach my $zone (@$ZoneAR) {
            if ($line =~ /(^$zone)/) {
                $zonename= $1;
            }
            elsif ($line =~ /HVM\s+\w+\.\w+/) {
                ($clusid, $clusname) = (split /\s+/, $line)[0,1];
                $ClusInfo{$zonename}{$clusname} = {};
            }
            elsif ($line =~ /No HVM Clusters/) {
                $ClusInfo{$zonename}{'No HVM'} = 0;
            }
        }
    }
    close($fh);
    #print Dumper \%ClusInfo;
    return \%ClusInfo;
}

sub get_HostInfo_of_Clusters {
    my $file = shift;
    $clushashref = &get_Kvm_Clusters_of_Zones($file);
    open my $fh, '<', $file or die "Unable to Open the File $file for reading: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;
        next if $line =~ /^$/;
        foreach my $zone (keys %$clushashref) {
            foreach my $cname (keys %{ $clushashref->{$zone} } ) {
                next if $cname =~ /No HVM/;
                if ($line =~ /^($cname)/) {
                    $clusname = $1
                }
                elsif ($line =~ /example\.com/) {
                    ($host, $hoststatus) = (split /\s+/, $line)[0, 2];
                    chomp $host;
                    $clushashref->{$zone}->{$clusname}->{$host} = $hoststatus;
                    chomp $hoststatus;
                }
                elsif ($line =~ /^CPU/) {
                    ($cpu, $cpuusage) = (split /\s+/, $line)[0, -1];
                    $cpuusage =~ s/%//g;
                    chomp $cpuusage;
                    $clushashref->{$zone}->{$clusname}->{$cpu} = $cpuusage if (defined($cpuusage));
                }
                elsif ($line =~ /^MEMORY/) {
                    ($mem, $memusage) = (split /\s+/, $line)[0, -1];
                    $memusage =~ s/%//g;
                    chomp $memusage;
                    $clushashref->{$zone}->{$clusname}->{$mem} = $memusage if (defined($memusage));
                }
                elsif ($line =~ /^ALLOCATED\s+STORAGE/) {
                    ($storage, $storage_used) = (split /\s+/, $line)[1, -1];
                    $storage_used =~ s/%//g;
                    chomp $storage_used;
                    $clushashref->{$zone}->{$clusname}->{$storage} = $storage_used if (defined($storage_used));
                }
            }
        }
    }
    close($fh);
    print Dumper \%$clushashref;
    # return $clushashref;
}

#&getZoneInfo('hvm.txt');
#&get_Kvm_Clusters_of_Zones('hvm.txt');
&get_HostInfo_of_Clusters('hvm.txt');

The first two subroutines return the right results. The issue is the last subroutine: its output repeats all the cluster names under every zone, instead of listing only the clusters/hosts associated with that zone.

Output of &getZoneInfo('hvm.txt'), as expected:

$VAR1 = [
          'SVM-Zone',
          'K2PHB',
          'K2B-Zone1'
        ];
------

Output of &get_Kvm_Clusters_of_Zones('hvm.txt'):

$VAR1 = {
          'SVM-Zone' => {
                          'PO26-CLUSTER01' => {},
                          'PO01-Cluster1' => {},
                          'PO046-Cluster1' => {}
                        },
          'K2PHB' => {
                       'PC1-P01-Cluster1' => {},
                       'PC1-P02-Cluster1' => {}
                     },
          'K2B-Zone1' => {
                           'K2B-PD1-Cluster1' => {}
                         }
        };
---------

Output from the third subroutine. I am only providing the output for one zone here. SVM-Zone should have only 3 clusters, but the subroutine returns a hash containing all the clusters for each of the zones.

'SVM-Zone' => {
    'PO26-CLUSTER01' => {
        'cedar.example.com' => 'Up',
        'kentucky.example.com' => 'Down',
        'melt.example.com' => 'Down',
        'rant.example.com' => 'Up',
        'rose.example.com' => 'Up',
        'MEMORY:' => '90.3',
        'rosalind.example.com' => 'Do',
        'henry.example.com' => 'Up',
        'STORAGE:' => '89.17',
        'CPU:' => '83.85'
    },
    'PC1-P02-Cluster1' => {
        'MEMORY:' => '70.87',
        'nwk-pci-pod02-lhv08.example.com' => 'Up',
        'nwk-pci-pod02-lhv11.example.com' => 'Up',
        'CPU:' => '68.76',
        'STORAGE:' => '69.84'
    },
    'PO01-Cluster1' => {
        'MEMORY:' => '90.22',
        'cork.example.com' => 'Up',
        'bunker.example.com' => 'Maintenance',
        'CPU:' => '84.99',
        'soy.example.com' => 'Up',
        'bot.example.com' => 'Down',
        'STORAGE:' => '69.68'
    },
    'PC1-P01-Cluster1' => {
        'CPU:' => '78.94',
        'pc-lhv01.example.com' => 'Up',
        'STORAGE:' => '59.05',
        'pc-lhv03.example.com' => 'Up',
        'pc-lhv02.example.com' => 'Up',
        'MEMORY:' => '72.56'
    },
    'PO046-Cluster1' => {
        'server1.example.com' => 'Up',
        'fort.example.com' => 'Up',
        'keystone.example.com' => 'Up',
        'rand.example.com' => 'Down',
        'bolverk.example.com' => 'Up',
        'MEMORY:' => '86.29',
        'CPU:' => '94.52',
        'STORAGE:' => '81.25'
    },
    'K2B-PD1-Cluster1' => {
        'CPU:' => '6.06',
        'k2b-lhv-01.example.com' => 'Up',
        'STORAGE:' => '69.84',
        'k2b-lhv-02.example.com' => 'Up',
        'k2b-lhv-03.example.com' => 'Disconnected',
        'MEMORY:' => '20.16',
        'k2b-lhv-04.example.com' => 'Disconnected'
    }

I want to store this in a hash of hashes, so I can generate a report such as the one below for each host within the zone.

Zone => Zonename, Cluster => Cluster_Name, Host => Hostname, HostStatus => Up/Down, CPU => 50, Memory => 50

Replies are listed 'Best First'.
Re: Generate Hash of hashes by reading a large Input file
by haukex (Archbishop) on Apr 12, 2017 at 09:19 UTC

    It's good that you are following some of the "best practices" like splitting your code into subroutines, using three-argument opens with proper error messages, and using Data::Dumper. There are still some other points that could be improved:

    • Your code really needs to be indented properly. See perltidy for a tool to help you with this.
    • You should declare your variables in the scope where they are needed; putting them all at the top of the file makes them only a little better than globals. For example, my ($mem, $memusage) can be moved into the innermost elsif where they are used, and the same goes for most of the other variables.
    • You use the older &foo() calling style for subroutines; nowadays it's recommended to call subroutines without the &, as in my $clushashref = get_Kvm_Clusters_of_Zones($file);
    • You use chomp a bit too often. Using it once, on the input line immediately after reading it, is enough; all your chomps afterwards have no effect.
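
    The points in the list above about scoping, calling style and chomp could look roughly like this. It is only a sketch: parse_memory_line is a hypothetical helper that does not exist in the original script, and the sample line is copied from the input in the question and read through an in-memory filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns (label, usage)
# for a MEMORY line, or an empty list for any other line.
sub parse_memory_line {
    my ($line) = @_;
    return unless $line =~ /^MEMORY/;
    # Declared right where they are needed instead of at the top of the file.
    my ($mem, $memusage) = (split /\s+/, $line)[0, -1];
    $memusage =~ s/%//g;
    return ($mem, $memusage);
}

# Sample line taken from the question, read via an in-memory filehandle.
my $input = "MEMORY: 10014 GB 979 GB 9035 GB 90.22%\n";
open my $fh, '<', \$input or die "Unable to open in-memory file: $!\n";

my @parsed;
while (my $line = <$fh>) {
    chomp $line;                            # once, right after reading
    my @fields = parse_memory_line($line);  # called without the & sigil
    push @parsed, \@fields if @fields;
}
close $fh;

print "$_->[0] $_->[1]\n" for @parsed;
```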

    Although not the source of your problems, this bit of code jumped out at me: $line =~ s/^\s+|\s+$//g; $line =~ s/^\s.*//g;. You're first removing the whitespace from the beginning and end of the line, then the second regex would delete the entire contents of the line if it begins with whitespace, which at this point it does not. In combination, this means the second regex will never do anything, but on its own, the second regex doesn't make much sense to me. If you want to skip lines that begin with whitespace, I think it's easier to just do next if $line=~/^\s/;.
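
    To see the effect of the two substitutions discussed above, here is a small sketch. The two helper subs and the sample string are made up for illustration; only the regexes themselves come from the posted code.

```perl
use strict;
use warnings;

# Both substitutions in the order used in the posted code.
sub trim_then_second {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;   # strip leading and trailing whitespace
    $line =~ s/^\s.*//g;       # can no longer match: no leading whitespace left
    return $line;
}

# The second substitution on its own.
sub second_only {
    my ($line) = @_;
    $line =~ s/^\s.*//g;       # wipes any line that starts with whitespace
    return $line;
}

my $sample = "   PO01-Cluster1   ";   # made-up sample line
printf "both:        '%s'\n", trim_then_second($sample);
printf "second only: '%s'\n", second_only($sample);
```

    With the trim in front, the cluster name survives; the second regex alone empties the line, which is why next if $line =~ /^\s/; states the intent more directly if skipping such lines is what is wanted.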

    Anyway, on to the main issues. First of all, this looks like a software-generated report. Do you really need to parse the textual representation, or can this software also generate reports in a machine-readable format, like maybe JSON or XML?

    Second, you parse the file in multiple passes, and then, on the final pass, you loop over both the zones and the clusters on every line. For a short input file like this one, the speed impact is probably not noticeable, but the work per input line grows with the number of zones times the number of clusters, so as you add zones and clusters you'll notice a huge performance degradation. This multiple looping is not necessary. Also, if you can be sure that the report is always in this order, a single pass should be all you need. Personally, I'd use a "state machine" kind of approach; I talk about this and give some examples in this thread. Also, as I was typing this, choroba posted an example.

    Third, the source of your problem in your current code is that in the innermost loop in get_HostInfo_of_Clusters, you always write all information into the $clushashref, without skipping those clusters that aren't part of the current zone. You need a conditional statement there to only save the information when appropriate. A quick fix would be to modify the innermost foreach my $cname loop in get_HostInfo_of_Clusters like so:

    foreach my $cname ( keys %{ $clushashref->{$zone} } ) {
        next if $cname =~ /No HVM/;
        if ( $line =~ /^($cname)/ ) { $clusname = $1; next; }
        next if !$clusname || !$clushashref->{$zone}->{$clusname};
        if ( $line =~ /example\.com/ ) {
        ...

    However, I strongly recommend rewriting the code into a single-pass state machine type approach instead of continuing to work with the current code, as I think you will only run into more problems (performance and maintenance) as you continue working with it.
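
    A minimal sketch of what such a single-pass state machine could look like is below. It is not the code posted elsewhere in this thread. It assumes the report has one record per line with a blank line between sections, parse_report and the layout of the resulting hash are my own invention, and the sample text is an abbreviated excerpt of the input from the question.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (not from the thread): parses the report text in one pass.
# Result: $data{zone}{cluster}{hosts}{name} = status, {usage}{RESOURCE} = percent.
sub parse_report {
    my ($text) = @_;
    my (%data, %is_zone);
    my ($section, $zone, $cluster) = ('');
    open my $fh, '<', \$text or die "Unable to open in-memory file: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;

        # Header lines switch the state; blank lines reset it.
        if    ($line eq '')               { $section = '';          next }
        elsif ($line =~ /^-+$/)           {                         next }
        elsif ($line =~ /^Zone ID/)       { $section = 'zones';     next }
        elsif ($line =~ /^Cluster ID/)    { $section = 'clusters';  next }
        elsif ($line =~ /^Host Name/)     { $section = 'hosts';     next }
        elsif ($line =~ /^Resource Type/) { $section = 'resources'; next }

        if ($is_zone{$line}) {                       # zone name on its own line
            ($zone, $cluster) = ($line, undef);
        }
        elsif (defined $zone && exists $data{$zone}{$line}) {
            $cluster = $line;                        # cluster of the current zone
        }
        elsif ($section eq 'zones') {
            my (undef, $name) = split /\s+/, $line;
            $is_zone{$name} = 1 if defined $name;
        }
        elsif ($section eq 'clusters' && defined $zone) {
            $data{$zone}{$1} = {}
                if $line =~ /^\S+\s+(\S+)\s+\S+\s+[\d.]+\s+[\d.]+$/;
        }
        elsif ($section eq 'hosts' && defined $cluster) {
            $data{$zone}{$cluster}{hosts}{$1} = $2
                if $line =~ /^(\S+)\s+\d+\s+(\S+)\s+\S+$/;
        }
        elsif ($section eq 'resources' && defined $cluster) {
            $data{$zone}{$cluster}{usage}{$1} = $2
                if $line =~ /^(CPU|MEMORY|ALLOCATED STORAGE):.*\s([\d.]+)%$/;
        }
    }
    close $fh;
    return \%data;
}

# Abbreviated excerpt of the input from the question (layout assumed).
my $report = <<'END';
Zone ID        ZONE Name
-------------------------
8f-bx-33       SVM-Zone
72-0f-163      K2PHB

SVM-Zone
List of HVM Clusters, Hosts Status and its Capacity in this Zone....
Cluster ID   Cluster Name     Cluster Type  Memory OverCommit Ratio  CPU OverCommit Ratio
------------------------------------------------------------------------------------------
6500b1       PO01-Cluster1    HVM           3.0                      4.2

PO01-Cluster1
Host Name           No. of Running VMs  CS Host Status  CS Resource State
--------------------------------------------------------------------------
cork.example.com    37                  Up              Enabled
bot.example.com     25                  Down            Enabled

Resource Type  Total Capacity  Available Capacity  Used Capacity  Used Percentage
----------------------------------------------------------------------------------
CPU:           3949740 MHz     592740 MHz          3357000 MHz    84.99%
MEMORY:        10014 GB        979 GB              9035 GB        90.22%

K2PHB
List of HVM Clusters, Hosts Status and its Capacity in this Zone....
Cluster ID   Cluster Name      Cluster Type  Memory OCR  CPU OCR
-----------------------------------------------------------------
a95630a82bf  PC1-P01-Cluster1  HVM           1.0         2

PC1-P01-Cluster1
Host Name             No. of Running VMs  CS Host Status  CS Resource State
----------------------------------------------------------------------------
pc-lhv01.example.com  22                  Up              Enabled
END

my $data = parse_report($report);
$Data::Dumper::Sortkeys = 1;
print Dumper($data);
```

    Because the parser always knows the current zone and only accepts cluster names that were listed for that zone, host and usage lines end up under exactly one zone, and each line of the file is looked at once.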

      Thank you for your suggestions on improving the code. The files are generated by an external API which I don't have access to.

      This script just parses the text file and generates a report on the things we are interested in.

      I haven't tried your solution yet. I am working with the code choroba provided, which seems much simpler and better.

Re: Generate Hash of hashes by reading a large Input file
by choroba (Cardinal) on Apr 12, 2017 at 08:59 UTC
    I'd create a state machine to parse the file. The state ($section in the below code) tells me what the parser expects to find. I also need to store the current zone and cluster to be able to attach the new information to the correct part of the structure.
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thank you, choroba. I tried your code, and execution time has improved a lot when parsing multiple files of the same input format.

      I was more interested in the total CPU/memory/storage allocation than in the overcommit ratio, so I added one more section to the code.

      .... some lines ....
      .......
      } elsif(/^Resource Type/) {
          $section = 'resources';
      ......
      .....
      elsif ('resources' eq $section) {
          if (my ($resource, $usage) = /^(CPU:|MEMORY:|ALLOCATED STORAGE:)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)$/ ) {
              $usage =~ s/%//g;
              $zone{$current_zone}{cluster}{$current_cluster}{$resource} = { usage => $usage};
          } elsif (/^$/) {
              $section = 0;
          }
      ......
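
      Once the data sits in a structure shaped like the one in the snippet above, the per-host report asked for in the original question could be produced along these lines. This is a sketch only: the {host}{...}{status} layout and the report_lines helper are assumptions made for illustration, since the thread does not show how hosts are stored, and the sample hash is built by hand from values in the question. The // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hand-built sample. The {host} layout is an assumption for illustration only;
# the resource keys follow the snippet above ('CPU:' => { usage => ... }).
my %zone = (
    'SVM-Zone' => {
        cluster => {
            'PO01-Cluster1' => {
                'CPU:'    => { usage => '84.99' },
                'MEMORY:' => { usage => '90.22' },
                host      => {
                    'cork.example.com' => { status => 'Up' },
                    'bot.example.com'  => { status => 'Down' },
                },
            },
        },
    },
);

# Hypothetical helper: one report line per host, sorted for stable output.
sub report_lines {
    my ($zones) = @_;
    my @lines;
    for my $zname (sort keys %$zones) {
        my $clusters = $zones->{$zname}{cluster} || {};
        for my $cname (sort keys %$clusters) {
            my $c     = $clusters->{$cname};
            my $hosts = $c->{host} || {};
            # Fall back to 'n/a' when a resource was not recorded.
            my $cpu = ($c->{'CPU:'}    || {})->{usage} // 'n/a';
            my $mem = ($c->{'MEMORY:'} || {})->{usage} // 'n/a';
            for my $hname (sort keys %$hosts) {
                push @lines, sprintf(
                    'Zone => %s, Cluster => %s, Host => %s, HostStatus => %s, CPU => %s, Memory => %s',
                    $zname, $cname, $hname, $hosts->{$hname}{status}, $cpu, $mem,
                );
            }
        }
    }
    return @lines;
}

print "$_\n" for report_lines(\%zone);
```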