in reply to Match on line, read backwards to opening xml tag then forward to closing tag

Cache the part you are interested in

    my @cache;
    my $found = 0;
    while (<$file>) {
        if ( ... found the match ... ) { $found++; }
        if (/^<DataStart>/) { @cache = (); }
        push(@cache, $_);
        if (/^<DataEnd>/ and $found) {
            Do_stuff_with_match(@cache);
            $found = 0;
        }
    }

Re^2: Match on line, read backwards to opening xml tag then forward to closing tag
by shadowfox (Beadle) on Nov 14, 2011 at 20:41 UTC
    Several interesting ideas are floating around, but I'd like to try one like this, jethro's being the closest to what I'd like to use. I realized my initial XML example was flawed, so let me try again with a clearer example.
        <DataStore>
          <DataRecord>
            <Data>123456</Data>
            <Data2>654321</Data2>
            <Data>123456</Data>
          </DataRecord>
          <DataRecord>
            <Data>123456</Data>
            <Data>123456</Data>
            <Data2>123456</Data2>
            <Data>1234/3456</Data>
            <Data>123456</Data>
            <Data>1234/3456</Data>
            <Data3>123456</Data3>
            <Data>123456</Data>
          </DataRecord>
          <DataRecord>
            <Data>123456</Data>
            <Data>123456</Data>
            <Data5>123456</Data5>
          </DataRecord>
        </DataStore>

        # From that I want it to loop through and store each <DataRecord>...</DataRecord>
        # From then, if it matches on 4 digits followed by a forward slash I
        # want it to output the whole <DataRecord> to screen, not just the matched lines from second filter.

        # For that, I've tried this example
        open(FILE, "< $FILE") or die "ERROR: $!";
        while (<>) {
            if (/<DataRecord>/ ... /<\/DataRecord>/) {
                @cache=();
            }
            push(@cache, $_);
            if (m/<Data>\d{4}\//){
                print @cache;
            }
        }
        close (FILE);

        # The output of that is
        <Data>1234/3456</Data>
        <Data>1234/3456</Data>

        # where I would prefer to see
        <Data>123456</Data>
        <Data>123456</Data>
        <Data2>123456</Data2>
        <Data>1234/3456</Data>
        <Data>123456</Data>
        <Data>1234/3456</Data>
        <Data3>123456</Data3>
        <Data>123456</Data>
    I wrote it several different ways, and either it prints every <DataRecord> or only the filtered <Data> lines; neither is what I need. I want it to print the entire <DataRecord> if that record matches the second pattern. Clearly I'm doing it wrong but I'm not seeing what, so I assume it's glaringly obvious.
      Move the push inside the if block, so that you only cache the lines between the matches. Also, your condition for printing is tested for each line, so the program might print too early. Only set a flag when a line matches, and print after the whole record has been read if the flag is set.
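
The advice above can be sketched roughly as follows. This is only one possible reading of it, not the poster's own code: the helper name `matching_records`, the in-memory sample data, and the use of the flip-flop's "E0" suffix to detect the last line of a record are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every <DataRecord> block (one string each)
# that contains a <Data> line starting with four digits and a slash.
sub matching_records {
    my ($fh) = @_;
    my (@cache, @hits);
    my $found = 0;
    while (my $line = <$fh>) {
        # Flip-flop: true from the opening tag through the closing tag.
        if (my $seq = ($line =~ /<DataRecord>/ ... $line =~ m{</DataRecord>})) {
            push @cache, $line;                      # cache only lines inside a record
            $found = 1 if $line =~ m{<Data>\d{4}/};  # set a flag, do not print yet
            if ($seq =~ /E0$/) {                     # "E0" marks the last line of the range
                push @hits, join('', @cache) if $found;
                @cache = ();
                $found = 0;
            }
        }
    }
    return @hits;
}

# Sample input held in memory; the values are made up for illustration.
my $xml = <<'END';
<DataStore>
<DataRecord>
<Data>123456</Data>
</DataRecord>
<DataRecord>
<Data>123456</Data>
<Data>1234/3456</Data>
<Data3>123456</Data3>
</DataRecord>
</DataStore>
END

open(my $fh, '<', \$xml) or die "ERROR: $!";
my @records = matching_records($fh);
close($fh);
print @records;
```

With this sample, only the second record should come back, because the flag is checked once per record rather than once per line.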

      Probably you changed my script because I used "<DataStart>" and "<DataEnd>" instead of the correct "<Dataentry>" and "</Dataentry>" in my regexes. Sorry about that mistake but apart from that my script is working (I tested it just now to be sure). Just use the right strings in the regexes and the script will work, even with the new data you provided.

          my @cache;
          my $found = 0;
          while (<$file>) {
              if (/stringtobefound/) { $found++; }
              if (/<start of record>/) { @cache = (); }
              push(@cache, $_);
              # print "-----------\nFound is $found, Cache is\n", @cache, "--------------\n";
              if (/<\/end of record>/ and $found) {
                  print @cache;
                  $found = 0;
              }
          }

      A tip on general debugging: if something doesn't work, print out the important variables, watch what your script is doing, and find the first place where it does something different from what it should. See the comment line for an example; with that you can see whether the cache works or not.
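
As a small illustration of that debugging tip, here is a minimal sketch using the core Data::Dumper module. The values in `@cache` and `$found` are a made-up snapshot, not output from the real script. Note that concatenating an array with `.` gives its element count rather than its contents, so the sketch asks for the count explicitly and dumps the contents separately.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical snapshot of the loop state, for illustration only.
my @cache = ("<DataRecord>\n", "<Data>1234/3456</Data>\n");
my $found = 1;

# Build the report first so the same text can be logged or inspected.
my $report = "Found is $found, cache holds " . scalar(@cache) . " line(s):\n"
           . Dumper(\@cache);
print STDERR $report;
```

Writing to STDERR keeps the debug output separate from the records the script prints to STDOUT.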

        Thanks Jethro, that is exactly what I wanted. Also thanks choroba, you're right about where my logic failed; I was focusing too hard on it to see what the issue was. And thanks to everyone else who replied too. I know XML parsing is the better way to go when possible, and it looks much easier to use.

      I want it to loop through and store each <DataRecord>...</DataRecord>. Then, if it matches on 4 digits followed by a forward slash I want it to output the whole <DataRecord> to screen, not just the matched lines from second filter.

      Okay, try this. You were very close, but it seems a bit more complicated than necessary. Also, is the cache just to hold the matches until you print them? If so, you could eliminate that step entirely.

          open(FILE, "< $FILE") or die "ERROR: $!";
          my $data;
          {
              local $/=undef;
              $data=<FILE>
          }
          while ( $data =~ m{<DataRecord>(.+?)</DataRecord>}sg ) {
              my $rec = $1;
              if ( $rec =~ m{\d+/\d+} ) {
                  push @cache, $rec;
                  print "$rec";
              }
          }
          close (FILE);
      Prints:
          $ test.pl
          <Data>123456</Data>
          <Data>123456</Data>
          <Data2>123456</Data2>
          <Data>1234/3456</Data>
          <Data>123456</Data>
          <Data>1234/3456</Data>
          <Data3>123456</Data3>
          <Data>123456</Data>