Re: Match on line, read backwards to opening xml tag then forward to closing tag
by toolic (Bishop) on Nov 14, 2011 at 15:34 UTC
|
If you are interested in using an XML parser instead of a regex, XML::Twig can give you the whole element you are looking for:
use warnings;
use strict;
use XML::Twig;
my $str = '
<DataEnd>
<Dataentry>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
</Dataentry>
<Dataentry>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
</Dataentry>
<Dataentry>
<Data>aaaaaa</Data>
<Data>bbbbbb</Data>
<Data>aaaaaa</Data>
</Dataentry>
<Dataentry>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
</Dataentry>
</DataEnd>
';
my $t = XML::Twig->new(
twig_handlers => { Dataentry => \&dentry }
);
$t->parse($str);
sub dentry {
my ($t, $ent) = @_;
my $found = 0;
for my $data ($ent->children('Data')) {
if ($data->text() eq 'bbbbbb') {
$found = 1;
next;
}
}
if ($found) {
# do something
}
}
| [reply] [d/l] |
|
|
I'll see your use of a XML parser, but raise you XML::LibXML. It is a pain to install on Windows systems but not impossible. I have used both alot and you can get significant speed increases with LibXML (talking seconds to nanoseconds)
| [reply] |
Re: Match on line, read backwards to opening xml tag then forward to closing tag
by jethro (Monsignor) on Nov 14, 2011 at 15:39 UTC
|
my @cache;
my $found=0;
while (<$file>) {
if ( ... found the match ...) {
$found++;
}
if (/^<DataStart>/) {
@cache=();
}
push(@cache, $_);
if (/^<DataEnd>/ and $found) {
Do_stuff_with_match(@cache);
$found=0;
}
}
| [reply] [d/l] |
|
|
Several interesting ideas floating around but I'd like to try one like this, jethro's being the closest to what I'd like to use. I realized my inital XML example was flawed, so let me try again with a more clear example.
<DataStore>
<DataRecord>
<Data>123456</Data>
<Data2>654321</Data2>
<Data>123456</Data>
</DataRecord>
<DataRecord>
<Data>123456</Data>
<Data>123456</Data>
<Data2>123456</Data2>
<Data>1234/3456</Data>
<Data>123456</Data>
<Data>1234/3456</Data>
<Data3>123456</Data3>
<Data>123456</Data>
</DataRecord>
<DataRecord>
<Data>123456</Data>
<Data>123456</Data>
<Data5>123456</Data5>
</DataRecord>
</DataStore>
# From that I want it to loop through and store each <DataRecord> ...
+</DataRecord>
# From then, if it matches on 4 digits followed by a forward slash I
# want it to output the whole <DataRecord> to screen, not just the mat
+ched lines from second filter.
# For that, I've tried this example
open(FILE, "< $FILE") or die "ERROR: $!";
while (<>) {
if (/<DataRecord>/ ... /<\/DataRecord>/) {
@cache=();
}
push(@cache, $_);
if (m/<Data>\d{4}\//){
print @cache;
}
}
close (FILE);
# The output of that is
<Data>1234/3456</Data>
<Data>1234/3456</Data>
# where I would prefer to see
<Data>123456</Data>
<Data>123456</Data>
<Data2>123456</Data2>
<Data>1234/3456</Data>
<Data>123456</Data>
<Data>1234/3456</Data>
<Data3>123456</Data3>
<Data>123456</Data>
I wrote it several different ways, and either it prints every <DataRecord> or only filtered <Data> lines, neither is what I need. I want it to print the entire <DataRecord> if that record matches on the second pattern. Clearly I'm doing it wrong but I'm not seeing what, so I assume its glaringly obvious. | [reply] [d/l] |
|
|
Move the push inside the if block - only cache the lines between the matches. Also, your condition for printing is tested for each line, so the program might print too early - only set a flag and print after the whole record was read if the flag is set.
| [reply] |
|
|
Probably you changed my script because I used "<DataStart>" and "<DataEnd>" instead of the correct "<Dataentry>" and "</Dataentry>" in my regexes. Sorry about that mistake but apart from that my script is working (I tested it just now to be sure). Just use the right strings in the regexes and the script will work, even with the new data you provided.
my @cache;
my $found=0;
while (<$file>) {
if ( /stringtobefound/) {
$found++;
}
if (/<start fo record>/) {
@cache=();
}
push(@cache, $_);
# print "-----------\nFound is $found, Cache is\n".@cache."-----------
+---";
if (/<\/end of record>/ and $found) {
print @cache;
$found=0;
}
}
A tip on general debugging: If something doesn't work, print out important variables and watch what your script is doing and find the first place where it does something different than it should. See the comment line for an example, with that you can see if the cache works or not
| [reply] [d/l] |
|
|
|
|
I want it to loop through and store each <DataRecord>...</DataRecord>. Then, if it matches on 4 digits followed by a forward slash I want it to output the whole <DataRecord> to screen, not just the matched lines from second filter.
Okay, try this. You were very close, but it seems a bit more complicated than necessary. Also, is the cache just to hold the matches until you print them? If so, you could eliminate that step entirely.
open(FILE, "< $FILE") or die "ERROR: $!";
my $data;
{
local $/=undef;
$data=<FILE>
}
while ( $data =~ m{<DataRecord>(.+?)</DataRecord>}sg ) {
my $rec = $1;
if ( $rec =~ m{\d+/\d+} ) {
push @cache, $rec;
print "$rec";
}
}
close (FILE);
Prints:
$ test.pl
<Data>123456</Data>
<Data>123456</Data>
<Data2>123456</Data2>
<Data>1234/3456</Data>
<Data>123456</Data>
<Data>1234/3456</Data>
<Data3>123456</Data3>
<Data>123456</Data>
| [reply] [d/l] [select] |
Re: Match on line, read backwards to opening xml tag then forward to closing tag
by RichardK (Parson) on Nov 14, 2011 at 15:38 UTC
|
If your data is proper XML always use one of the many XML parsing modules, it's lots easier than trying to get your code to handle all of the special cases. Try XML::Simple for a start.
| [reply] |
Re: Match on line, read backwards to opening xml tag then forward to closing tag
by choroba (Cardinal) on Nov 14, 2011 at 17:05 UTC
|
I usually use XML::XSH2 for XML manipulation. After fixing errors in your XML and wrapping it into a root node, I was able to find the corresponding Dataentry just with the following lines:
open 937973.xml ;
for //Data[.='bbbbbb'] ls .. ;
| [reply] [d/l] |
Re: Match on line, read backwards to opening xml tag then forward to closing tag
by Anonymous Monk on Nov 14, 2011 at 17:24 UTC
|
use Modern::Perl;
$_ = '
<Dataentry>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
</Dataentry>
<Dataentry>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
</Dataentry>
<Dataentry>
<Data>aaaaaa</Data>
<Data>bbbbbb</Data>
<Data>aaaaaa</Data>
</Dataentry>
<Dataentry>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
<Data>aaaaaa</Data>
</DataEntry>';
my ($f,@l);my $d='Dataentry';sub p {
push (@l,"$_[0]\n");$f ||= $_[0]=~
m@>bbbbbb<@s;@l = '' if ($_[0]=~
m@</$d>@s);if (($_[0]=~ m@<$d>@s)&&($f))
{pop @l;say "<$d>";print reverse @l;
say "</$d>";@l = '';$f = 0;}}
p $_ for reverse split '\n';
Who do __YOU__ know that writes regex's like that? | [reply] [d/l] |
|
|
my ($flag,@lines);
my $d='Dataentry';
sub process {
push (@lines,"$_[0]\n");
$flag ||= $_[0]=~ m@>bbbbbb<@s;
@l ='' if ($_[0]=~m@</$d>@s);
if (($_[0]=~ m@<$d>@s)&&($flag))
{
pop @lines;
say "<$d>";
print reverse @lines;
say "</$d>";
@lines = '';
$flag = 0;
}
}
process $_ for reverse split '\n';
| [reply] [d/l] |
Re: Match on line, read backwards to opening xml tag then forward to closing tag
by mrguy123 (Hermit) on Nov 15, 2011 at 07:44 UTC
|
Hi,
I also believe that it is best to use a proper XML parser if you can, but I have encountered similar problems with non XML data, and then you do need to use regexes.
I think this is a nice workaround that will solve your problem (I hope)
use strict;
use warnings;
{
my ($input_file) = @ARGV;
open XML, $input_file or die;
my $xml_data;
##Read the XML into xml_data
while (my $line = <XML>){
$xml_data .= $line;
}
##Parse xml_data
while ($xml_data =~ /<Dataentry>(.*?)<\/Dataentry>/s){
##Keep xml block to print if needed
my $xml_block_to_print = $1;
my $xml_block_to_parse = $xml_block_to_print;
$xml_data = $';
##Parse the xml block, and if we have a match, print!
while ($xml_block_to_parse =~ /<Data>(.*?)<\/Data>/){
my $data = $1;
$xml_block_to_parse = $';
if ($data eq 'bbbbbb'){
print "$xml_block_to_print\n";
last;
}
}
}
}
Please let me know if this works for you
mrguy
"Unix is user friendly - it's just a bit more choosy about who it's
friends are." | [reply] [d/l] |