Match on line, read backwards to opening xml tag then forward to closing tag

shadowfox has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Match on line, read backwards to opening xml tag then forward to closing tag by toolic (Bishop) on Nov 14, 2011 at 15:34 UTC
If you are interested in using an XML parser instead of a regex, XML::Twig can give you the whole element you are looking for: use warnings; use strict; use XML::Twig; my $str = ' <DataEnd> <Dataentry> <Data>aaaaaa</Data> <Data>aaaaaa</Data> <Data>aaaaaa</Data> </Dataentry> <Dataentry> <Data>aaaaaa</Data> <Data>aaaaaa</Data> <Data>aaaaaa</Data> </Dataentry> <Dataentry> <Data>aaaaaa</Data> <Data>bbbbbb</Data> <Data>aaaaaa</Data> </Dataentry> <Dataentry> <Data>aaaaaa</Data> <Data>aaaaaa</Data> <Data>aaaaaa</Data> </Dataentry> </DataEnd> '; my $t = XML::Twig->new( twig_handlers => { Dataentry => \&dentry } ); $t->parse($str); sub dentry { my ($t, $ent) = @_; my $found = 0; for my $data ($ent->children('Data')) { if ($data->text() eq 'bbbbbb') { $found = 1; next; } } if ($found) { # do something } } [download]	[reply] [d/l]
Re^2: Match on line, read backwards to opening xml tag then forward to closing tag by admiral_grinder (Pilgrim) on Nov 15, 2011 at 15:04 UTC
I'll see your use of a XML parser, but raise you XML::LibXML. It is a pain to install on Windows systems but not impossible. I have used both alot and you can get significant speed increases with LibXML (talking seconds to nanoseconds)	[reply]
Re: Match on line, read backwards to opening xml tag then forward to closing tag by jethro (Monsignor) on Nov 14, 2011 at 15:39 UTC
Cache the part you are interested in `my @cache; my $found=0; while (<$file>) { if ( ... found the match ...) { $found++; } if (/^<DataStart>/) { @cache=(); } push(@cache, $_); if (/^<DataEnd>/ and $found) { Do_stuff_with_match(@cache); $found=0; } }` [download]	[reply] [d/l]
Re^2: Match on line, read backwards to opening xml tag then forward to closing tag by shadowfox (Beadle) on Nov 14, 2011 at 20:41 UTC
Several interesting ideas floating around but I'd like to try one like this, jethro's being the closest to what I'd like to use. I realized my inital XML example was flawed, so let me try again with a more clear example. <DataStore> <DataRecord> <Data>123456</Data> <Data2>654321</Data2> <Data>123456</Data> </DataRecord> <DataRecord> <Data>123456</Data> <Data>123456</Data> <Data2>123456</Data2> <Data>1234/3456</Data> <Data>123456</Data> <Data>1234/3456</Data> <Data3>123456</Data3> <Data>123456</Data> </DataRecord> <DataRecord> <Data>123456</Data> <Data>123456</Data> <Data5>123456</Data5> </DataRecord> </DataStore> # From that I want it to loop through and store each <DataRecord> ... +</DataRecord> # From then, if it matches on 4 digits followed by a forward slash I # want it to output the whole <DataRecord> to screen, not just the mat +ched lines from second filter. # For that, I've tried this example open(FILE, "< $FILE") or die "ERROR: $!"; while (<>) { if (/<DataRecord>/ ... /<\/DataRecord>/) { @cache=(); } push(@cache, $_); if (m/<Data>\d{4}\//){ print @cache; } } close (FILE); # The output of that is <Data>1234/3456</Data> <Data>1234/3456</Data> # where I would prefer to see <Data>123456</Data> <Data>123456</Data> <Data2>123456</Data2> <Data>1234/3456</Data> <Data>123456</Data> <Data>1234/3456</Data> <Data3>123456</Data3> <Data>123456</Data> [download] I wrote it several different ways, and either it prints every <DataRecord> or only filtered <Data> lines, neither is what I need. I want it to print the entire <DataRecord> if that record matches on the second pattern. Clearly I'm doing it wrong but I'm not seeing what, so I assume its glaringly obvious.	[reply] [d/l]
Re^3: Match on line, read backwards to opening xml tag then forward to closing tag by choroba (Cardinal) on Nov 14, 2011 at 23:39 UTC
Move the push inside the if block - only cache the lines between the matches. Also, your condition for printing is tested for each line, so the program might print too early - only set a flag and print after the whole record was read if the flag is set.	[reply]
Re^3: Match on line, read backwards to opening xml tag then forward to closing tag by jethro (Monsignor) on Nov 15, 2011 at 10:32 UTC
Probably you changed my script because I used "<DataStart>" and "<DataEnd>" instead of the correct "<Dataentry>" and "</Dataentry>" in my regexes. Sorry about that mistake but apart from that my script is working (I tested it just now to be sure). Just use the right strings in the regexes and the script will work, even with the new data you provided. `my @cache; my $found=0; while (<$file>) { if ( /stringtobefound/) { $found++; } if (/<start fo record>/) { @cache=(); } push(@cache, $_); # print "-----------\nFound is $found, Cache is\n".@cache."----------- +---"; if (/<\/end of record>/ and $found) { print @cache; $found=0; } }` [download] A tip on general debugging: If something doesn't work, print out important variables and watch what your script is doing and find the first place where it does something different than it should. See the comment line for an example, with that you can see if the cache works or not	[reply] [d/l]
Re^4: Match on line, read backwards to opening xml tag then forward to closing tag by shadowfox (Beadle) on Nov 15, 2011 at 13:49 UTC
Re^3: Match on line, read backwards to opening xml tag then forward to closing tag by furry_marmot (Pilgrim) on Nov 16, 2011 at 20:08 UTC
I want it to loop through and store each <DataRecord>...</DataRecord>. Then, if it matches on 4 digits followed by a forward slash I want it to output the whole <DataRecord> to screen, not just the matched lines from second filter. Okay, try this. You were very close, but it seems a bit more complicated than necessary. Also, is the cache just to hold the matches until you print them? If so, you could eliminate that step entirely. `open(FILE, "< $FILE") or die "ERROR: $!"; my $data; { local $/=undef; $data=<FILE> } while ( $data =~ m{<DataRecord>(.+?)</DataRecord>}sg ) { my $rec = $1; if ( $rec =~ m{\d+/\d+} ) { push @cache, $rec; print "$rec"; } } close (FILE);` [download] Prints: `$ test.pl <Data>123456</Data> <Data>123456</Data> <Data2>123456</Data2> <Data>1234/3456</Data> <Data>123456</Data> <Data>1234/3456</Data> <Data3>123456</Data3> <Data>123456</Data>` [download]	[reply] [d/l] [select]
Re: Match on line, read backwards to opening xml tag then forward to closing tag by RichardK (Parson) on Nov 14, 2011 at 15:38 UTC
If your data is proper XML always use one of the many XML parsing modules, it's lots easier than trying to get your code to handle all of the special cases. Try XML::Simple for a start.	[reply]
Re: Match on line, read backwards to opening xml tag then forward to closing tag by choroba (Cardinal) on Nov 14, 2011 at 17:05 UTC
I usually use XML::XSH2 for XML manipulation. After fixing errors in your XML and wrapping it into a root node, I was able to find the corresponding Dataentry just with the following lines: `open 937973.xml ; for //Data[.='bbbbbb'] ls .. ;` [download]	[reply] [d/l]
Re: Match on line, read backwards to opening xml tag then forward to closing tag by Anonymous Monk on Nov 14, 2011 at 17:24 UTC
use Modern::Perl; $_ = ' <Dataentry> <Data>aaaaaa</Data> <Data>aaaaaa</Data> <Data>aaaaaa</Data> </Dataentry> <Dataentry> <Data>aaaaaa</Data> <Data>aaaaaa</Data> <Data>aaaaaa</Data> </Dataentry> <Dataentry> <Data>aaaaaa</Data> <Data>bbbbbb</Data> <Data>aaaaaa</Data> </Dataentry> <Dataentry> <Data>aaaaaa</Data> <Data>aaaaaa</Data> <Data>aaaaaa</Data> </DataEntry>'; my ($f,@l);my $d='Dataentry';sub p { push (@l,"$_[0]\n");$f \|\|= $_[0]=~ m@>bbbbbb<@s;@l = '' if ($_[0]=~ m@</$d>@s);if (($_[0]=~ m@<$d>@s)&&($f)) {pop @l;say "<$d>";print reverse @l; say "</$d>";@l = '';$f = 0;}} p $_ for reverse split '\n'; [download] Who do __YOU__ know that writes regex's like that?	[reply] [d/l]
Re^2: Match on line, read backwards to opening xml tag then forward to closing tag by Anonymous Monk on Nov 14, 2011 at 23:08 UTC
`my ($flag,@lines); my $d='Dataentry'; sub process { push (@lines,"$_[0]\n"); $flag \|\|= $_[0]=~ m@>bbbbbb<@s; @l ='' if ($_[0]=~m@</$d>@s); if (($_[0]=~ m@<$d>@s)&&($flag)) { pop @lines; say "<$d>"; print reverse @lines; say "</$d>"; @lines = ''; $flag = 0; } } process $_ for reverse split '\n';` [download]	[reply] [d/l]
Re: Match on line, read backwards to opening xml tag then forward to closing tag by mrguy123 (Hermit) on Nov 15, 2011 at 07:44 UTC
Hi, I also believe that it is best to use a proper XML parser if you can, but I have encountered similar problems with non XML data, and then you do need to use regexes. I think this is a nice workaround that will solve your problem (I hope) use strict; use warnings; { my ($input_file) = @ARGV; open XML, $input_file or die; my $xml_data; ##Read the XML into xml_data while (my $line = <XML>){ $xml_data .= $line; } ##Parse xml_data while ($xml_data =~ /<Dataentry>(.?)<\/Dataentry>/s){ ##Keep xml block to print if needed my $xml_block_to_print = $1; my $xml_block_to_parse = $xml_block_to_print; $xml_data = $'; ##Parse the xml block, and if we have a match, print! while ($xml_block_to_parse =~ /<Data>(.?)<\/Data>/){ my $data = $1; $xml_block_to_parse = $'; if ($data eq 'bbbbbb'){ print "$xml_block_to_print\n"; last; } } } } [download] Please let me know if this works for you mrguy "Unix is user friendly - it's just a bit more choosy about who it's friends are."	[reply] [d/l]