Re: XML data extraction (updated x2)

Whenever I hear "big XML file" I think of XML::Twig, as it can efficiently process the XML file record by record without loading the whole thing into memory. The following gives you the desired output. As for your example code, I don't think you can mix XML::XPath with XML::LibXML; I think it'd be best if you tried to use only the operations provided by XML::XPath.

use warnings;
use strict;
use XML::Twig;
use Data::Dumper;

my $file = 'OpenApi.xml';
my @records;
XML::Twig->new(
    twig_roots => {
        '/nodes/node/children/node/children/node' => sub {
            my ($t, $elt) = @_;
            my $dim = $elt->first_child('dimension');
            push @records, {
                name => $elt->att('name'),
                citype => $elt->att('ciType'),
                status => $dim->att('status'),
                Time => $dim->first_child('body')
                    ->first_child('entry[@key="Last Status Change"]')
                    ->text  };
            $t->purge;
        },
    },
)->parsefile($file);
print Dumper(\@records);

Update: As for your code, it's just a matter of getting the XPath expression right. This also gives the desired output:

use strict;
use warnings;
use XML::XPath;
use Data::Dumper;

my $bamxml = 'OpenApi.xml';
my $bamxp = XML::XPath->new(filename => $bamxml);
my $bamxpath = $bamxp->findnodes('//nodes/node/children/node/children/node');

my @records;
foreach my $bamnode ($bamxpath->get_nodelist) {
    my $name   = $bamxp->find('./@name',$bamnode)->string_value;
    my $citype = $bamxp->find('./@ciType',$bamnode)->string_value;
    my $status = $bamxp->find('./dimension/@status',$bamnode)->string_value;
    my $time = $bamxp->find('./dimension/body/entry[@key="Last Status Change"]',$bamnode)->string_value;
    s/^\s+|\s+$//g for $name,$citype,$status,$time;
    push @records, {
        name => $name,
        citype => $citype,
        status => $status,
        Time => $time
    };
}
print Dumper(\@records);

Update 2: Oops, missed your requirement "want to read node only where ciType='application'". The same XPath that choroba showed works in my code samples: '/nodes/node/children/node/children/node[@ciType="application"]'
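As a minimal sketch of that filter in the XML::Twig variant: the inline XML below is a made-up stand-in for OpenApi.xml (its element and attribute values are assumptions, since the real file isn't shown in this thread), and only the handler path comes from the update above.

```perl
use warnings;
use strict;
use XML::Twig;

# Hypothetical sample data; the real OpenApi.xml is not shown in this thread.
my $xml = <<'XML';
<nodes>
  <node name="root">
    <children>
      <node name="group">
        <children>
          <node name="app1" ciType="application">
            <dimension status="OK"/>
          </node>
          <node name="host1" ciType="host">
            <dimension status="CRITICAL"/>
          </node>
        </children>
      </node>
    </children>
  </node>
</nodes>
XML

my @names;
XML::Twig->new(
    twig_roots => {
        # Same path as before, restricted to ciType="application"
        '/nodes/node/children/node/children/node[@ciType="application"]' => sub {
            my ($t, $elt) = @_;
            push @names, $elt->att('name');
            $t->purge;    # free memory used by already-processed records
        },
    },
)->parse($xml);

print "$_\n" for @names;
```

With this sample input only the node named app1 should reach the handler; the host node is skipped by the attribute condition.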

Re^2: XML data extraction (updated x2) by snehit.ar (Beadle) on Oct 12, 2017 at 06:47 UTC
Thanks, haukex, for the help. I would be grateful if you could help with the two questions below.

First, I want to calculate the time difference between now and timereceived. I am getting a "pattern not matching" error. Can you share the correct pattern for "10/10/2017 11:35 PM"?

#Begin###Calculate the time difference
my $dtnow = DateTime->now;
my $timereceived = "10/10/2017 11:35 PM";
my $strp = DateTime::Format::Strptime->new(on_error=>'croak',pattern => '%m/%d/%Y %H:%M %t', time_zone=>'UTC');
my $dtevent = $strp->parse_datetime($timereceived);
my $diff_sec = $dtnow->subtract_datetime_absolute($dtevent)->in_units('seconds');
my $diff_hours = sprintf("%.0f" , $diff_sec/(60*60));
#End###Calculate the time difference

Second, a question about a regular expression:

my $name = 'greenfield (Glossary) (100)';
foreach ( $name =~ /\((.*?)\)/ ) { $appID = $1; }

The variable $name contains two values in two different sets of brackets, (Glossary) and (100). With the regular expression above I get 'appid' => 'Glossary', but I want 'appid' => '100': it should skip the first bracketed value (Glossary) and pick only the last one (100).

-Thanks.
Re^3: XML data extraction by haukex (Archbishop) on Oct 12, 2017 at 07:47 UTC
Since these are new questions unrelated to the rest of the thread, it would be best to post them in a new SoPW thread (but please don't re-post now).

> pattern for "10/10/2017 11:35 PM" ... `'%m/%d/%Y %H:%M %t'`

Have a look at the DateTime::Format::Strptime docs: instead of `%t` you need to use the pattern that matches AM/PM, and instead of `%H` for 24-hour time you need to use the pattern that matches 12-hour times.

> expression formatting ... two different brackets (Glossary) and (100)

Sorry, but a single example is not enough to help with a regular expression. For example, can you be sure there will always be exactly two sets of parens in the string? Might there be characters after the second set of parens? Might there even be nested parens? And what strings shouldn't match the regex? Please see How to ask better questions using Test::More and sample data as well as my post here.

Since this question is relatively basic, now might be a good time to review perlrequick and/or perlretut. You might find anchors (like `^` and `$`) to be useful, but again, that depends on what the various strings you're matching against look like. Also, regex101 can be a useful tool. Note that it is not compatible with some of Perl's more advanced features, but for basic things it can be very useful.

Minor edits for clarity.
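To make the date hint concrete, here is a sketch using `%I` (12-hour clock) and `%p` (AM/PM), the two conversion specifiers the DateTime::Format::Strptime docs list for this purpose. The UTC time zone is carried over from the posted code and is an assumption about what the timestamp actually means.

```perl
use strict;
use warnings;
use DateTime;
use DateTime::Format::Strptime;

my $strp = DateTime::Format::Strptime->new(
    pattern   => '%m/%d/%Y %I:%M %p',   # %I = 12-hour clock, %p = AM/PM
    time_zone => 'UTC',                 # assumption: the timestamp is in UTC
    on_error  => 'croak',
);

my $dtevent = $strp->parse_datetime('10/10/2017 11:35 PM');
print $dtevent->iso8601, "\n";          # 2017-10-10T23:35:00

# Elapsed time since the event, rounded to whole hours
my $dtnow      = DateTime->now( time_zone => 'UTC' );
my $diff_sec   = $dtnow->subtract_datetime_absolute($dtevent)->in_units('seconds');
my $diff_hours = sprintf '%.0f', $diff_sec / ( 60 * 60 );
print "$diff_hours\n";
```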
Re^4: XML data extraction by snehit.ar (Beadle) on Oct 12, 2017 at 09:11 UTC
I was able to get the correct date pattern. Thanks.

For the regex, there will mostly be two sets of parens in a single string, and I always want to pick the value from the last set of parens, i.e. 100.

my $name = "greenfield (Glossary) (100)";
my $appID;
foreach ( $name =~ /\((.*?)\)/ ) { $appID = $1; }
print $appID;

The above code gives me the output: Glossary
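A match without `/g` stops at the first parenthesized group, which is why Glossary comes back. One possible approach for the stated requirement is to anchor the match at the end of the string. This is only a sketch, and it assumes the wanted group is the last thing in the string (apart from trailing whitespace) and that parens are never nested.

```perl
use strict;
use warnings;

my $name = 'greenfield (Glossary) (100)';

# Assumptions: the last set of parens sits at the end of the string
# (optional trailing whitespace allowed) and parens are not nested.
my ($appID) = $name =~ /\(([^()]*)\)\s*$/;

print defined $appID ? "$appID\n" : "no match\n";    # prints 100
```

If text can follow the last set of parens, the `$` anchor will not fit; collecting all groups with `my @all = $name =~ /\(([^()]*)\)/g;` and taking `$all[-1]` is an alternative under the same no-nesting assumption.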
Re^5: XML data extraction by hippo (Archbishop) on Oct 12, 2017 at 10:34 UTC
Re^5: XML data extraction by haukex (Archbishop) on Oct 12, 2017 at 10:35 UTC