Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a tricky situation. I have a data dump file that contains multiple records in an XML-based format. Each record starts with <ticket .... > and ends with </ticket>. The entire dump file is fed into this script and stored in an array.

example of a record entry:
<ticket ... state="...." ....> <version="1.23.4.8"> .... <task> ... </task> .... </ticket>

I have to remove certain records from this data dump which are not needed (based on some conditions) and then perform certain actions on the contents of the "<task> ... </task>" portion of each remaining record. The conditions to be met on each record are:

1. Each record that contains a certain "state" value needs to be removed (deleted, i.e. not printed in the output). The static list of states is fed into this Perl program from another file.

2. Each record that has an older version number (comparing only the first 2 dot-separated numbers) needs to be removed. Since the version data in the dump file is not always just numbers, I also need to delete all records that have a non-digit character in the "version" value. Once these 2 conditions are met, there is other processing done on the <task> entry of each record.
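Condition 1 amounts to a set-membership test, which in Perl is usually a hash lookup. The following is only a minimal sketch: the one-state-per-line file layout, the state names, and the sample ticket lines are assumptions made for illustration, and an in-memory filehandle stands in for the real state file.

```perl
use strict;
use warnings;

# Stand-in for the file of unwanted states; one state per line is an assumption.
my $state_file = "closed\nrejected\n";
open my $fh, '<', \$state_file or die $!;
chomp(my @states = <$fh>);
close $fh;

# A hash turns "is this state unwanted?" into a single lookup.
my %skip_state = map { $_ => 1 } @states;

# Hypothetical opening tags, used only to exercise the lookup.
for my $record ('<ticket id="1" state="closed">', '<ticket id="2" state="open">') {
    my ($state) = $record =~ /<ticket\b[^>]*\bstate="([^"]*)"/;
    print "$record\n" unless defined $state && $skip_state{$state};
}
```

Only the second sample line is printed, because "closed" is in the skip list.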

The problem I'm facing is that when I run my foreach loop on this dump file array, I am only able to delete either the first line of a record that matches condition 1, or the single line of a record that meets condition 2, but never the full record. I'm hoping someone can help me figure this out.

The base "version" number against which I need to compare each record's "version" number (first 2 dot-separated numbers only) is split and stored in an array variable @base_ver (for example, the value 1.20 is split into 1 and 20 and stored).
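The version test described here could be sketched roughly as follows. The helper name version_ok is hypothetical, and two things are assumed rather than stated in the post: that a version equal to the base counts as "not older", and that a usable version has at least two dot-separated numbers.

```perl
use strict;
use warnings;

my @base_ver = (1, 20);    # from the post: base version 1.20 split on "."

# Hypothetical helper: true when a record's version should be kept, i.e. it is
# digits and dots only and its first two numbers are not older than @base_ver.
sub version_ok {
    my ($version) = @_;
    return 0 unless $version =~ /^\d+(?:\.\d+)+$/;    # rejects any non-digit field
    my ($major, $minor) = split /\./, $version;
    # Compare numerically: major first, minor only when the majors are equal.
    my $cmp = $major <=> $base_ver[0] || $minor <=> $base_ver[1];
    return $cmp >= 0 ? 1 : 0;
}

for my $v ('1.23.4.8', '1.12.32.4', '1.20.0', '2.0', 'bad', '1.23b.4') {
    printf "%-10s %s\n", $v, version_ok($v) ? 'keep' : 'drop';
}
```

Comparing the two fields as numbers matters: as strings, "9" would sort after "20".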

My loop is:
    foreach (@code) {
        if (m/^<ticket .*state="(.*)"/) {
            $state = $1;
        }
        if (m/version="/) {
            my @vers = split(/="(.*)"/);
            $version_a = $vers[1];
            $version_a =~ s/(^[A-Za-z]+)|\.[A-Za-z]+$//;
            if ($version_a ne '') {
                my @vers = split(/\./, $version_a);
                #### Now I compare the 2 versions........
            }
        }
        ### Now I process the tasks
        if (m/<task/) {
            ## do some stuff
        }
        print;
        print "\n";
    }

This is not working. Perl Monks, please help.

Replies are listed 'Best First'.
Re: Pattern search in a Multiline Record, from a multi-record datafile.
by BrowserUk (Patriarch) on Mar 02, 2012 at 04:50 UTC

    Rather than reading the file line by line, and so having each logical record spread over multiple array elements, read the file record by record:

    open my $fh, '<', 'theDumpFile' or die $!;
    my @code;
    {
        local $/ = '</ticket>';
        @code = <$fh>;
    }
    close $fh;

    for my $record ( @code ) {
        my @linesOfRecord = split "\n", $record;
        ...
    }
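To see what setting the input record separator does, here is a small self-contained sketch of the same idea. The sample tickets and their id attributes are made up, and an in-memory filehandle replaces 'theDumpFile' so that it runs without a real file.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the dump file.
my $dump = join "\n",
    '<ticket id="1" state="open">', '<task> a </task>', '</ticket>',
    '<ticket id="2" state="closed">', '<task> b </task>', '</ticket>',
    '';

# Read from an in-memory filehandle instead of a file on disk.
open my $fh, '<', \$dump or die $!;
my @records;
{
    local $/ = '</ticket>';    # each read now ends at the closing tag
    @records = <$fh>;
}
close $fh;

# Whatever follows the last </ticket> (here a lone newline) comes back as one
# more chunk, so keep only the chunks that actually contain a ticket.
@records = grep { /<ticket\b/ } @records;

print scalar(@records), " records\n";
print "---\n$_\n" for @records;
```

Each element of @records now holds one complete ticket, so a single `next` skips the whole record rather than one line of it.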

Re: Pattern search in a Multiline Record, from a multi-record datafile.
by bitingduck (Deacon) on Mar 02, 2012 at 06:24 UTC

    Depending on how big the file is and how XML compliant it is, you might also want to use one of the XML handling modules like XML::Simple or XML::LibXML. They can simplify things so you don't have to write a bunch of regexes to handle the different tags. Some of them are also designed to handle huge files reasonably efficiently.

    edit:
    realized you aren't rewriting so removed that comment

    If it's not XML compliant, the XML parsers might fail on you (they're supposed to), but something like HTML::TreeBuilder or HTML::TokeParser might also be convenient. I've used both for what I think is a similar task, and they're pretty straightforward.

    more edit
    The XML modules or HTML::TreeBuilder will parse things and maintain all the structure that you're otherwise trying to reconstruct out of the array elements, as BrowserUk suggests.

Re: Pattern search in a Multiline Record, from a multi-record datafile.
by TJPride (Pilgrim) on Mar 02, 2012 at 18:20 UTC

    use strict;
    use warnings;

    my $bad_states = { 'bad' => 1 };

    {
        local $/ = '</ticket>';
        my $temp;
        while (<DATA>) {
            if (m/<ticket .*?state="(.*?)"/) {
                next if $bad_states->{$1};
            }
            if (m/<version="(.*?)"/) {
                $temp = $1;
                next if $temp =~ /[^\d\.]/;
            }
            if (m/<task>(.*?)<\/task>/) {
                $temp = $1;
                ### DO SOMETHING WITH TASK VALUE
            }
        }
    }

    __DATA__
    <ticket ... state="bad" ....> <version="1.23.4.8"> .... <task> ... </task> .... </ticket>
    <ticket ... state="good" ....> <version="bad"> .... <task> ... </task> .... </ticket>
    <ticket ... state="good" ....> <version="1.23.4.8"> .... <task>value here</task> .... </ticket>

      Thanks all, but my requirements have now slightly changed. (Issue A) Instead of using a file that contains the data dump, I have to read the data from a URL. I'm able to read it line by line but not as an array of multi-line records. This is what I have now:

      use LWP::UserAgent;

      my $url = "http://.......";
      my $user_agent = LWP::UserAgent->new;
      my $report = $user_agent->get($url);
      $report = $report->content;
      my @code = split("\n", $report);
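For Issue A, one possible approach is to pull whole records out of the downloaded string with a global match instead of splitting it on newlines. This is only a sketch: $content stands in for what $report->content would return, and the sample tickets are invented.

```perl
use strict;
use warnings;

# Stand-in for $report->content; the real text would come from LWP::UserAgent.
my $content = join "\n",
    '<ticket id="1" state="bad">', '<task />', '</ticket>',
    '<ticket id="2" state="good">', '<task />', '<task />', '</ticket>';

# /s lets "." match newlines so one record may span several lines;
# /g in list context returns the captured text of every match.
my @records = $content =~ m{(<ticket\b.*?</ticket>)}gs;

printf "%d records\n", scalar @records;
print "---\n$_\n" for @records;
```

Another option with the same effect would be to open an in-memory filehandle on the string and read it with $/ set to '</ticket>', as suggested earlier in the thread.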

      Another problem is that, as part of the pattern searching within each record from the content of the URL, for each <task ... /> I need to append an attribute, giving <task ... a="123"/>, then put this back into the parent record and print that record. So the input from the URL will look something like:

      <ticket ... state="bad" ....> <version="1.23.4.8"> .... <task ... /> .... </ticket>
      <ticket ... state="good" ....> <version="bad"> .... <task ... /> .... </ticket>
      <ticket ... state="good" ....> <version="1.12.32.4"> .... <task ... /> .... </ticket>
      <ticket ... state="good" ....> <version="1.1.3.9"> .... <task ... /> <task ... /> <task ... /> .... </ticket>

      So in this case, the output should contain only the last 2 records (tickets): the first 2 records fail the required conditions, while the last 2 pass, have their <task ... /> lines processed, and are printed accordingly. Also note that there may be more than one <task ... /> per record. The output will be dumped into an XML file and should look like:

      <ticket ... state="good" ....> <version="1.12.32.4"> .... <task ... a="1.1.3"/> .... </ticket>
      <ticket ... state="good" ....> <version="1.1.3.9"> .... <task ... a="1.2"/> <task ... a="5.1"/> <task ... a="8"/> .... </ticket>

      (Issue B) Currently, I'm able to append a=".." to the task, giving <task ... a="..">, but when I try to print it as part of the updated parent record, it's still printing as it was in the original record.
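A likely cause of Issue B, though the code is not shown, is that the attribute is appended to a copy of the matched text (for example a variable assigned from $1) rather than to the record string itself. A minimal sketch of editing the record in place follows. The helper task_value is hypothetical, since the thread does not say where the value of a=".." comes from, and the sample record is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: how the value of a=".." is derived is not stated in the thread.
sub task_value {
    my ($n) = @_;
    return "v$n";
}

my $record = '<ticket state="good"> <version="1.1.3.9"> <task id="x" /> <task id="y"/> </ticket>';

my $count = 0;
# Substitute inside $record itself, so the change is there when the record is printed.
# /g handles every <task ... /> in the record; /e evaluates the replacement as code.
$record =~ s{<task\b([^>]*?)\s*/>}{ '<task' . $1 . ' a="' . task_value(++$count) . '"/>' }ge;

print "$record\n";
```

The same applies when looping: in `for my $record (@records)` the loop variable is an alias to the array element, so a substitution on it updates the array, whereas a substitution on a separately assigned copy does not.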

      As you can see, I'm just starting in Perl. Please do help.