How can I keep or discard certain blocks of an XML file based on first line of block?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm getting on in years and to be honest I'm much more familiar with BASIC than Perl (cut my teeth on a TRS-80 Model I)! But I have come up against a problem and am wondering if Perl might provide a simple solution.

To simplify the issue as much as possible, I have an XML file that I want to read line-by-line and selectively copy to another target XML file. The file will have a format something like this (pay attention to the <label_a ...> ... </label_a> blocks):

<?xml version="1.0" encoding="ISO-8859-1"?>
<...other lines that must be copied...>

<label_x data1="somevalue" data2="someothervalue" data3="anothervalue">
    <label_y="somevalue">
        <label_z>a value</label_z>
        <label_z>a value</label_z>
        <label_z>a value</label_z>
        <label_z>a value</label_z>
    </label_y>
    <label_y="somevalue">
        <label_z>a value</label_z>
        <label_z>a value</label_z>
        <label_z>a value</label_z>
        <label_z>a value</label_z>
    </label_y>
    <label_a timea="20140623203000 -0400" timeb="20140623210000 -0400" id="must_match_this">
        <label_b="data">data_of_variable_number_of_lines_and_indentations</label_b>
        <label_b="more_data">data_of_variable_number_of_lines_and_indentations</label_b>
        <label_c>
            <label_d>Some_data_may_be_indented_further</label_d>
        </label_c>
        <label_b="still_more_data">data_of_variable_number_of_lines_and_indentations</label_b>
    </label_a>
    <label_a timea="20140623210000 -0400" timeb="20140623220000 -0400" id="must_match_this">
        <label_b="data">data_of_variable_number_of_lines_and_indentations</label_b>
        <label_b="more_data">data_of_variable_number_of_lines_and_indentations</label_b>
        <label_c>
            <label_d>Some_data_may_be_indented_further</label_d>
        </label_c>
        <label_b="still_more_data">data_of_variable_number_of_lines_and_indentations</label_b>
    </label_a>
</labelx>

So with that in mind, here's what I need to have happen:

Everything BEFORE the first label_a block must be copied to the new file.

If a label_a block is to be copied, then everything between the <label_a ...> and </label_a> (including the lines with those tags) must be copied. If the block is not to be copied, then nothing from that block should be copied. The data between those tags may vary and have different indentation levels.

The decision to copy the block should be based on two things. The first is the "timea" value in each <label_a ...> line, which looks something like this:

20140623203000 -0400

This breaks down to 2014-06-23 20:30:00 with a GMT offset of -04:00.

The second is the "id" value which is on the same line and will contain one of a few specific strings. What I want to do is this:

IF the ID string matches a particular value then IF the hour is within a certain range (say 18 to 20) then IF it is one of certain specific days of the week (say Monday, Wednesday, or Friday) then I want to copy the entire label_a block to the other file. If any of those conditions is not met, then I want to try another similar test using different ID, hour, and day of week values. If NONE of the tests result in the block being copied, then I want the entire label_a block to be skipped (not copied) and the same series of tests run on the next label_a block, looping until the end of the file.
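
The chain of tests described above could be written as a small rule table, as in the sketch below. This is only an illustration: the second rule, the id `another_id`, and the names `@rules` and `should_keep` are made up, and the hour and three-letter weekday are assumed to have been extracted from timea already.

```perl
use strict;
use warnings;

# Hypothetical rule table: each rule names an id, an inclusive hour range,
# and the weekdays (three-letter names) on which a block should be kept.
my @rules = (
    { id => 'must_match_this', hours => [18, 20], days => [qw(Mon Wed Fri)] },
    { id => 'another_id',      hours => [ 6,  9], days => [qw(Sat Sun)]     },
);

# Return 1 if any rule accepts this id/hour/weekday combination, else 0.
sub should_keep {
    my ($id, $hour, $day) = @_;
    for my $rule (@rules) {
        next unless $id eq $rule->{id};
        my ($from, $to) = @{ $rule->{hours} };
        next unless $hour >= $from && $hour <= $to;
        next unless grep { $_ eq $day } @{ $rule->{days} };
        return 1;    # first matching rule wins
    }
    return 0;        # no rule matched: skip the block
}

print should_keep('must_match_this', 20, 'Mon') ? "keep\n" : "skip\n";
print should_keep('must_match_this', 21, 'Mon') ? "keep\n" : "skip\n";
```

Adding another test then means adding one more entry to the table rather than another nested IF.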

There will not be any stray lines between label_a blocks until the </labelx> block is reached at the end of the file.

The only thing about this that would be difficult in BASIC would be calculating the day of week from the "timea" string. Other than that, I could have this coded in BASIC in about an hour. But I am not that familiar with Perl and don't have any idea how to do the XML block selection and selective writing to the new file, and I definitely don't know how to extract the day of week from that date string. Could one of you kind monks please help get me started in the right direction?

Re: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 11:38 UTC
Have a look at XML::Twig, especially the section "XML::Twig 101". That has code examples and should get you started on how to parse the XML file, select the elements to handle, and only print those that you want to the output.

As for parsing the date, here's one of several ways to do that, using the DateTime and DateTime::Format::Strptime modules.

use DateTime;
use DateTime::Format::Strptime;
my $strp = DateTime::Format::Strptime->new(on_error=>'croak',
    pattern => '%Y%m%d%H%M%S %z');
my $input = "20140623203000 -0400";
my $dt = $strp->parse_datetime($input);
print $dt->iso8601, "\n";
print "Weekday: ", $dt->day_of_week, " (", $dt->day_name, ")\n";

Another thing to note: In your description you talk about line-based parsing, which is a dangerous thing with XML. Many, many applications that work with XML don't consider some or all whitespace, including newlines, to be significant. That means that you may suddenly find two tags that were previously on two lines showing up on the same line, or a tag and its attributes split over several lines, etc. That's one of the reasons XML should always be parsed with a real XML parser. Fortunately, there are lots available, with lots of different interfaces to suit different applications and different coding styles.
Re: How can I keep or discard certain blocks of an XML file based on first line of block? by poj (Abbot) on Jun 26, 2014 at 12:09 UTC
I've added some attribute names and corrected the last tag in your example to get valid XML.

#!perl
use strict;
use XML::Twig;
use Time::Piece;

#open my $IN ,'<','origfile.xml' or die "$!";
open my $OUT,'>','newfile.xml' or die "$!";

my $twig = XML::Twig->new(
  twig_roots => { 'label_a' => \&label_a },  # process label_a blocks
  twig_print_outside_roots => $OUT,          # print rest
);
$twig->set_pretty_print('indented');
$twig->parse(\*DATA); # or $IN

# process
sub label_a {
  my ($twig,$e) = @_;
  my $timea = $e->att('timea');
  my $id    = $e->att('id');

  my $t   = Time::Piece->strptime($timea,'%Y%m%d%H%M%S %z');
  my $hr  = $t->hour;
  my $day = $t->day;   # short weekday name, e.g. Mon

  my $keep = 0;
  $keep=1 if (( $id =~ /match this/ )
           && ( $hr >=18 && $hr <=20 )
           && ( $day =~ /Mon|Tue|Wed/i ));
  # $keep=1 if ... another condition

  if ($keep == 1) {
    $twig->flush($OUT); # save
  } else {
    $twig->purge(); # discard
    print STDOUT $t->strftime." $id $hr $day skipped\n";
  }
}
__DATA__

poj
Re^2: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 20:03 UTC
Wow, thank you so much, that is almost perfect. After uncommenting the initial OPEN line and setting $twig->parse to use $IN rather than \*DATA, I have noticed only three things I can't explain.

The first is that if I add a print statement to print the value of $hr it appears to print the hour exactly matching the hour field in the string, yet in the comparison it appears to be using the GMT equivalent. In other words, if the string shows timea as 20140623200000 -0400 then if I print $hr it shows 20 but if I am doing the comparison I have to specify 0 (four hours later) and the next day! I can easily live with that, but there are a couple other bits of weirdness.

One is that when the label_a line is written out, the three data elements are not in the original order. Where originally there was timea, timeb, and id, now it is id, timea, timeb. It appears to be putting the data elements in alphabetical order. While in theory that shouldn't be an issue, I'll need to do some experimentation to see whether it is or not.

The final thing is that it is writing a LOT of extra blank lines to the output file. The original file contains no blank lines except for one near the top of the file and one near the bottom, but it appears that whenever XML::Twig outputs a label_a block it adds a blank line at the beginning, and (here is what I REALLY can't understand) whenever it skips a label_a block it also leaves a blank line in the new file. Very strange and I really don't understand why it happens, but I will need to see if it makes any difference. If anyone can explain that to me, I would really like to know how to eliminate the excess blank lines.

But this is miles ahead of where I was last night and you have introduced me to a couple of handy Perl modules, so thank you very much, this was MUCH appreciated!
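
If the tests are meant to apply to the wall-clock time exactly as it is written in timea, one way to sidestep the offset question is to parse only the leading 14 digits and ignore the "-0400" part. The sketch below does that with the core Time::Piece module; the helper name `wall_clock` and the `@DAY_NAMES` list are made up for illustration, and it assumes every timea value starts with 14 digits.

```perl
use strict;
use warnings;
use Time::Piece;

# day_of_week returns 0 for Sunday, so index a fixed list of names with it.
my @DAY_NAMES = qw(Sun Mon Tue Wed Thu Fri Sat);

# Parse only the leading 14 digits of a timea value and ignore the offset,
# so the hour and weekday are the ones written in the attribute itself.
sub wall_clock {
    my ($timea) = @_;
    my ($stamp) = $timea =~ /^(\d{14})/
        or die "unexpected timea format: $timea";
    my $t = Time::Piece->strptime($stamp, '%Y%m%d%H%M%S');
    return ($t->hour, $DAY_NAMES[ $t->day_of_week ]);
}

my ($hr, $day) = wall_clock('20140623203000 -0400');
print "$hr $day\n";
```

With this approach a comparison such as $hr >= 18 && $hr <= 20 sees the same hour that appears in the file, whatever the offset says.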
Re^3: How can I keep or discard certain blocks of an XML file based on first line of block? by poj (Abbot) on Jun 26, 2014 at 20:30 UTC
For the blank lines 'problem' see this recent node [SOLVED] XML::Twig's twig_print_outside_roots adds extra blank lines to output.

poj
Re^4: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 22:19 UTC
Re^3: How can I keep or discard certain blocks of an XML file based on first line of block? by poj (Abbot) on Jun 26, 2014 at 20:42 UTC
For the attributes order see this section in XML::Twig

keep_atts_order

Setting this option to a true value causes the attribute hash to be tied to a Tie::IxHash object. This means that Tie::IxHash needs to be installed for this option to be available. It also means that the hash keeps its order, so you will get the attributes in order. This allows outputting the attributes in the same order as they were in the original document.

poj
Re^4: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 22:31 UTC
Re^3: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 21:42 UTC
I discovered one more wrinkle. I didn't think the label_y blocks at the top of the file were important, but it turns out they are. Remember the format of those is:

<label_y="somevalue">
    <label_z>a value</label_z>
    <label_z>a value</label_z>
    <label_z>a value</label_z>
    <label_z>a value</label_z>
</label_y>

What I need to do is this: Discard all but the first (or last, doesn't matter) of these, so there is only one such block in the file. Or, discard all of them and create a replacement - it would not really matter, just so there is only one. If keeping one of the original blocks, save "somevalue" from the first line in a variable such as $somevalue. Then, lower in the program, before writing the "kept" label_a blocks to the output file, I need to do this:

$e->set_att( 'channel' => $somevalue );

Since what's been posted so far is far more elegant than anything I would have come up with, I'd be interested in knowing a good way to discard all (or all but one) of those initial blocks, and insert a replacement if needed, so there is only one.

By the way, relative to my previous post I did discover keep_atts_order => 'TRUE', which works in Linux but for some reason not in ActivePerl in Windows.
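
One possible shape for the label_y handling is sketched below; it is an illustration, not a drop-in answer. It assumes the label_y value lives in an attribute named id (the <label_y="somevalue"> form in the question is not well-formed, so some attribute name has to be invented), and it loads the whole document into memory with twig_handlers instead of streaming with twig_roots. The sample document and the values "first" and "second" are made up.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample: two label_y blocks followed by one label_a block.
my $xml = <<'XML';
<label_x>
  <label_y id="first"><label_z>a value</label_z></label_y>
  <label_y id="second"><label_z>a value</label_z></label_y>
  <label_a timea="20140623203000 -0400" id="must_match_this"><label_b>data</label_b></label_a>
</label_x>
XML

my $somevalue;    # id of the first label_y seen

my $twig = XML::Twig->new(
    twig_handlers => {
        # Keep the first label_y and remember its id; delete any later ones.
        label_y => sub {
            my ($t, $e) = @_;
            if (defined $somevalue) { $e->delete }
            else                    { $somevalue = $e->att('id') }
        },
        # Handlers fire in document order, so $somevalue is set by now.
        label_a => sub {
            my ($t, $e) = @_;
            $e->set_att( 'channel' => $somevalue );
        },
    },
);
$twig->parse($xml);

my $result = $twig->sprint;    # the modified document as a string
print $result, "\n";
```

For a large file the same two handlers would need to be combined with the streaming twig_roots setup from the earlier reply, which this sketch does not attempt.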
Re^4: How can I keep or discard certain blocks of an XML file based on first line of block? by poj (Abbot) on Jun 27, 2014 at 06:56 UTC
Re^4: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 22:29 UTC
Re: How can I keep or discard certain blocks of an XML file based on first line of block? by Anonymous Monk on Jun 26, 2014 at 01:30 UTC
tl;dr: see perlintro#Files and I/O, perlintro#Parentheses for capturing, and perlrequick#Extracting matches. DateTime and DateTime::Format::Strptime are more portable than Time::Piece strptime. See also Re: parsing xml and xpather.pl/htmltreexpather.pl.
Re: How can I keep or discard certain blocks of an XML file based on first line of block? by 1s44c (Scribe) on Jun 26, 2014 at 07:46 UTC
BASIC on a TRS-80 :) Nice.

It doesn't sound too hard to convert the input to a scalar with join and do a non-greedy pattern match to cut out the parts you don't want.

Hacking XML in this way always seems to end up causing more pain than it's worth. You are better off using an XML module from CPAN that has been tested and will cover every corner case. Even if learning how to use that module takes longer, it's an investment in the speed of your future coding and the stability of your code.
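
For reference, the join-and-non-greedy-match idea might look roughly like the sketch below. It is shown only to make the approach concrete and carries exactly the fragility warned about above: it assumes label_a blocks are never nested and that no attribute value contains a ">" character. The names `keep_block` and `filter_blocks` and the sample text are made up, and the test here checks only the id.

```perl
use strict;
use warnings;

# Hypothetical predicate: decide from the opening tag alone whether to keep
# the block. Here it only checks the id attribute.
sub keep_block {
    my ($open_tag) = @_;
    return $open_tag =~ /\bid="must_match_this"/ ? 1 : 0;
}

# Remove every <label_a ...>...</label_a> block whose opening tag fails the
# test. The /s flag lets .*? span newlines; /e evaluates the replacement.
sub filter_blocks {
    my ($xml) = @_;
    $xml =~ s{([ \t]*<label_a\b[^>]*>)(.*?</label_a>[ \t]*\n?)}
             { keep_block($1) ? $1 . $2 : '' }gse;
    return $xml;
}

# Made-up sample; in real use this would be: my $sample = join '', <$IN>;
my $sample = <<'XML';
<label_x>
    <label_a timea="20140623203000 -0400" id="must_match_this">
        <label_b>kept</label_b>
    </label_a>
    <label_a timea="20140623210000 -0400" id="other">
        <label_b>dropped</label_b>
    </label_a>
</label_x>
XML

my $filtered = filter_blocks($sample);
print $filtered;
```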