in reply to How can I keep or discard certain blocks of an XML file based on first line of block?

I've added some attribute names and corrected the last tag in your example to get valid XML.

#!perl use strict; use XML::Twig; use Time::Piece; #open my $IN ,'<','origfile.xml' or die "$!"; open my $OUT,'>','newfile.xml' or die "$!"; my $twig = XML::Twig->new( twig_roots => { 'label_a' => \&label_a }, # process label_a blocks twig_print_outside_roots => $OUT, # print rest ); $twig->set_pretty_print('indented'); $twig->parse(\*DATA); # or $IN # process sub label_a { my ($twig,$e) = @_; my $timea = $e->att('timea'); my $id = $e->att('id'); my $t = Time::Piece->strptime($timea,'%Y%m%d%H%M%S %z'); my $hr = $t->hour; my $day = $t->day; my $keep = 0; $keep=1 if (( $id =~ /match this/ ) && ( $hr >=18 && $hr <=20 ) && ( $day =~ /Mon|Tue|Wed/i )); # $keep=1 if ... another condition if ($keep == 1) { $twig->flush($OUT); # save } else { $twig->purge(); # discard print STDOUT $t->strftime." $id $hr $day skipped\n"; } } __DATA__
<?xml version="1.0" encoding="ISO-8859-1"?> <label_x data1="somevalue" data2="someothervalue" data3="anothervalue" +> <label_y y="somevalue"> <label_z>a value</label_z> <label_z>a value</label_z> <label_z>a value</label_z> <label_z>a value</label_z> </label_y> <label_y y="somevalue"> <label_z>a value</label_z> <label_z>a value</label_z> <label_z>a value</label_z> <label_z>a value</label_z> </label_y> <label_a timea="20140623203000 -0400" timeb="20140623210000 -0400" + id="must_match_this"> <label_b b="data">data_of_variable_number_of_lines_and_indenta +tions</label_b> <label_b b="more_data">data_of_variable_number_of_lines_and_in +dentations</label_b> <label_c> <label_d>Some_data_may_be_indented_further</label_d> </label_c> <label_b b="still_more_data">data_of_variable_number_of_lines_ +and_indentations</label_b> </label_a> <label_a timea="20140623210000 -0400" timeb="20140623220000 -0400" + id="must_match_this"> <label_b b="data">data_of_variable_number_of_lines_and_indenta +tions</label_b> <label_b b="more_data">data_of_variable_number_of_lines_and_in +dentations</label_b> <label_c> <label_d>Some_data_may_be_indented_further</label_d> </label_c> <label_b b="still_more_data">data_of_variable_number_of_lines_ +and_indentations</label_b> </label_a> </label_x>
poj
  • Comment on Re: How can I keep or discard certain blocks of an XML file based on first line of block?
  • Select or Download Code

Replies are listed 'Best First'.
Re^2: How can I keep or discard certain blocks of an XML file based on first line of block?
by Anonymous Monk on Jun 26, 2014 at 20:03 UTC

    Wow, thank you so much, that is almost perfect. After uncommenting the initial OPEN line and setting $twig->parse to use $IN rather than \*DATA, I have noticed only three things I can't explain. The first is that if I add a print statement to print the value of $hr it appears to print the hour exactly matching the hour field in the string, yet in the comparison it appears to be using the GMT equivalent. In other words, if the string shows timea as 20140623200000 -0400 then if I print $hr it shows 20 but if I am doing the comparison I have to specify 0 (four hours later) and the next day!

    I can easily live with that, but there are a couple other bits of weirdness. One is that when the label_a line is written out, the three data elements are not in the original order. Where originally there was timea, timeb, and id, now it is id, timea, timeb. It appears to be putting the data elements in alphabetical order. While in theory that shouldn't be an issue, I'll need to do some experimentation to see whether it is or not.

    The final thing is that it is writing a LOT of extra blank lines to the output file. The original file contains no blank lines except for one near the top of the file and one near the bottom, but it appears that whenever XML::Twig outputs a label_a block it adds a blank line at the beginning, and (here is what I REALLY can't understand) whenever it skips a label_a block it also leaves a blank line in the new file. Very strange and I really don't understand why it happens, but I will need to see if it makes any difference. If anyone can explain that to me, I would really like to know how to eliminate the excess blank lines.

    But this is miles ahead of where I was last night and you have introduced me to a couple of handy Perl modules, so thank you very much, this was MUCH appreciated!

        poj, thank you again. I read the node you mentioned but wasn't sure if you were implying there's no good solution, or if you were simply telling me to replace twig_roots with twig_handlers. The last person to post there seemed to be taking a shot in the dark, and no one confirmed whether that would work.

        As for keep_atts_order I did discover that in the module docs but what confuses me on that is that if I add

        keep_atts_order => 'TRUE',

        under Windows (using ActivePerl) I get an error message, yet if I do the exact same thing on a Linux box it works with no complaints. Anyway, I did some more testing and found that neither the re-ordering of the data elements nor the blank lines seem to have any adverse affect on operation. The reordered data element don't bother me as much as the blank lines simply because if I am trying to look at the output they make it harder to see everything at a glance, but that's not all that big a deal either because this file isn't really meant to be read by humans, it's just an intermediate step in a process.

        So thank you again and if you have any thoughts on what I can do about those initial blocks, as I explained in my previous post, I would much appreciate hearing them.

      For the attributes order see this section in XML::Twig
      keep_atts_order Setting this option to a true value causes the attribute hash to b +e tied to a Tie::IxHash object. This means that Tie::IxHash needs to +be installed for this option to be available. It also means that the +hash keeps its order, so you will get the attributes in order. This a +llows outputting the attributes in the same order as they were in the + original document.
      poj
        Sorry, I COMPLETELY overlooked the part about needing to install Tie::IxHash. Now keep_atts_order => 'TRUE', works great on the Windows box also! Thanks again and sorry I didn't read more carefully.

      I discovered one more wrinkle. I didn't think the label_y blocks at the top of the file were important, but it turns out they are. Remember the format of those is:

      <label_y="somevalue"> <label_z>a value</label_z> <label_z>a value</label_z> <label_z>a value</label_z> <label_z>a value</label_z> </label_y>

      What I need to do is this:

      Discard all but the first (or last, doesn't matter) of these, so there is only one such block in the file. Or, discard all of them and create a replacement - it would not really matter, just so there is only one.

      If keeping one of the original blocks, save "somevalue" from the first line in a variable such as $somevalue.

      Then, lower in the program, before writing the "kept" label_a blocks to the output file, I need to do this:

      $e->set_att( 'channel' => $somevalue );

      Since what's been posted so far is far more elegant than anything I would have come up with, I'd be interested in knowing a good way to do discard all (or all but one) of those initial blocks, and insert a replacement if needed, so there is only one.

      By the way, relative to my previous post I did discover keep_atts_order => 'TRUE', which works in Linux but for some reason not in ActivePerl in Windows.

        What do you have in <...other lines that must be copied...> ?, and what is the attribute name in <label_y="somevalue">, that is not valid XML.
        To remove the duplicate blocks, add a handler for the label_y tags to cut the element and then print at the end using an end_tag_handler.
        my $label_y; my $twig = XML::Twig->new( twig_roots => { 'label_a' => \&label_a , 'label_y' => \&label_y , }, , end_tag_handlers => { 'label_x' => \&label_x, }, twig_print_outside_roots => $OUT, # print rest ); sub label_y { my ($twig,$e) = @_; $label_y = $e->cut; } sub label_x { my ($twig,$e) = @_; my $y = $label_y->att('y'); $label_y->set_att( channel => $y); $label_y->print; }
        poj
        Okay, I'm an idiot - completely missed the part about needing to install Tie::IxHash. It must have been already installed on the Linux box. After installing it on the Windows box, adding keep_atts_order => 'TRUE', works great! Thank you again!!!!!