one line regexp problem

hoodlooms has asked for the wisdom of the Perl Monks concerning the following question:

Hello:

Right now I have the following code:

#!/usr/bin/perl -w


my $str = q~
<IPDR>
<StartTime>1970-01-01T00:00:00.000Z</StartTime>
<EndTime>1970-01-01T00:00:00.000Z</EndTime>
<timeZoneOffset>0</timeZoneOffset>
<callCompletionCode>CC</callCompletionCode>
<originalDestinationId></originalDestinationId>
<hostName>GWC2@LCSTCHVT02R</hostName>
<subscriberId>E1/01/18/5@CIS01S03</subscriberId>
<uniqueCallId>14007010000532ca0d25</uniqueCallId>
<ipAddress>192.168.47.1</ipAddress>
<portNumber>2944</portNumber>
<seqNum>30756</seqNum>
<averagePacketLatency>0</averagePacketLatency>
<inboundByteCount>0</inboundByteCount>
<outboundByteCount>4</outboundByteCount>
<inboundPacketCount>0</inboundPacketCount>
<outboundPacketCount>1</outboundPacketCount>
<inboundLostPacketCount>0</inboundLostPacketCount>
<packetDelayVariation>0</packetDelayVariation>
</IPDR>
<IPDR>
<StartTime>1970-01-01T00:00:00.000Z</StartTime>
<EndTime>1970-01-01T00:00:00.000Z</EndTime>
<timeZoneOffset>0</timeZoneOffset>
<callCompletionCode>CC</callCompletionCode>
<originalDestinationId></originalDestinationId>
<hostName>GWC1@LCSTCHVT02R</hostName>
<subscriberId>N19Q01.test.cl</subscriberId>
<uniqueCallId>14007010000532ca0d25</uniqueCallId>
<ipAddress>10.100.9.27</ipAddress>
<portNumber>2427</portNumber>
<seqNum>30634</seqNum>
<averagePacketLatency>0</averagePacketLatency>
<inboundByteCount>4</inboundByteCount>
<outboundByteCount>9120</outboundByteCount>
<inboundPacketCount>1</inboundPacketCount>
<outboundPacketCount>57</outboundPacketCount>
<inboundLostPacketCount>0</inboundLostPacketCount>
<packetDelayVariation>0</packetDelayVariation>
</IPDR>
<IPDR>
<StartTime>1970-01-01T00:00:00.000Z</StartTime>
<EndTime>1970-01-01T00:00:00.000Z</EndTime>
<timeZoneOffset>0</timeZoneOffset>
<callCompletionCode>CC</callCompletionCode>
<originalDestinationId></originalDestinationId>
<hostName>GWC2@LCSTCHVT02R</hostName>
<subscriberId>E1/01/18/5@CIS01S03</subscriberId>
<uniqueCallId>14007010000532ca0d25</uniqueCallId>
<ipAddress>192.168.47.1</ipAddress>
<portNumber>2944</portNumber>
<seqNum>30756</seqNum>
<averagePacketLatency>0</averagePacketLatency>
<inboundByteCount>0</inboundByteCount>
<outboundByteCount>4</outboundByteCount>
<inboundPacketCount>0</inboundPacketCount>
<outboundPacketCount>1</outboundPacketCount>
<inboundLostPacketCount>0</inboundLostPacketCount>
<packetDelayVariation>0</packetDelayVariation>
</IPDR>
~;

$str =~ s/\n//g;

my @fetch = ($str =~
m/
(
\<IPDR\>
(?:.*?)
(?:test)+
(?:.*?)
\<\/IPDR\>
)
/xgi
);


print join("\n",@fetch);

What I'm trying to do is fill the array with the paragraphs that start with <IPDR> and end with </IPDR>, but only those that contain the word "test".

The problem is that, although I think my regexp should work, it doesn't separate the paragraphs; instead it prints them all on one big line.

Any help is appreciated.

THANKS

Replies are listed 'Best First'.
Re: one line regexp problem by ikegami (Patriarch) on Sep 24, 2004 at 22:19 UTC
How about:

    # Keeps newlines.
    my $str = q~ ... ~;
    my @fetch = grep { /test/ } ($str =~ m#(<IPDR>.*?</IPDR>)#sxgi);
    print(scalar(@fetch), "\n");
    print join("\n---\n",@fetch);

or:

    # Removes newlines.
    my $str = q~ ... ~;
    my @fetch = map { local $_=$_; s/\n+//g; $_ }
                grep { /test/ } ($str =~ m#(<IPDR>.*?</IPDR>)#sxgi);
    print(scalar(@fetch), "\n");
    print join("\n---\n",@fetch);
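A self-contained sketch of the split-then-grep approach above. The sample data is a hypothetical, shortened version of the original records, and the split is stored in an intermediate `@records` array purely for illustration. It uses `/test/i` to mirror the case-insensitive flag in the original question, which is an assumption about what the poster wants.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, shortened sample data: three records, only the second
# one contains "test".
my $str = <<'END';
<IPDR>
<hostName>GWC2@LCSTCHVT02R</hostName>
<subscriberId>E1/01/18/5@CIS01S03</subscriberId>
</IPDR>
<IPDR>
<hostName>GWC1@LCSTCHVT02R</hostName>
<subscriberId>N19Q01.test.cl</subscriberId>
</IPDR>
<IPDR>
<hostName>GWC2@LCSTCHVT02R</hostName>
<subscriberId>E1/01/18/5@CIS01S03</subscriberId>
</IPDR>
END

# Step 1: split into records. /s lets "." match newlines, and the lazy
# ".*?" stops at the first closing tag, so each capture is one record.
my @records = ($str =~ m{(<IPDR>.*?</IPDR>)}sgi);

# Step 2: keep only the records that mention "test".
my @fetch = grep { /test/i } @records;

print scalar(@records), " records, ", scalar(@fetch), " with 'test'\n";
print join("\n---\n", @fetch), "\n";
```

Separating the record split from the content filter keeps both patterns trivial, and neither one can run across a record boundary.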
Re: one line regexp problem by BUU (Prior) on Sep 25, 2004 at 08:23 UTC
Is it just me, or is that XML (or at least a portion of an XML document)? If it is XML, why don't you use an XML parser? Prewritten, very fast, easy to use, flexible, etc. Then you don't have to rely on clunky, fragile regexen.
Re: one line regexp problem by JediWizard (Deacon) on Sep 24, 2004 at 22:20 UTC
Try this expression instead.

    my @fetch = ($str =~
        m/
            (
                \<IPDR\>
                (?:(?!<\/IPDR>).)*?
                (?:test)+
                (?:(?!<\/IPDR>).)*?
                \<\/IPDR\>
            )
        /xgi);

May the Force be with you
Re^2: one line regexp problem by ikegami (Patriarch) on Sep 24, 2004 at 22:28 UTC
This is just JediWizard's solution with a bunch of junk removed:

    my @fetch = ($str =~
        m#
            (
                <IPDR>
                (?:(?!</IPDR>).)*
                test
                (?:(?!</IPDR>).)*
                </IPDR>
            )
        #sxgi);

And because of the 's' at the end, there's no need to remove the newlines beforehand.
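To make the difference concrete, here is a small sketch comparing the lazy `.*?` from the original question with the `(?:(?!</IPDR>).)*` construct used above. The two-record input string is hypothetical and has its newlines already removed, as in the original post.

```perl
use strict;
use warnings;

# Hypothetical two-record input with newlines already removed;
# only the second record contains "test".
my $str = '<IPDR><id>a</id></IPDR><IPDR><id>b.test.cl</id></IPDR>';

# Lazy ".*?" may cross a closing tag to reach "test", so the match
# starts at the first <IPDR> and swallows both records.
my @lazy = ($str =~ m{(<IPDR>.*?test.*?</IPDR>)}sgi);

# "(?:(?!</IPDR>).)" matches one character only if a closing tag does
# not start there, so a match can never extend beyond a single record.
my @tempered = ($str =~ m{(<IPDR>(?:(?!</IPDR>).)*test(?:(?!</IPDR>).)*</IPDR>)}sgi);

print "lazy:     $_\n" for @lazy;
print "tempered: $_\n" for @tempered;
```

The lazy version returns one element holding both records, which matches the "one big line" symptom in the question; the second pattern returns only the record that contains "test".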
Re^3: one line regexp problem by hv (Prior) on Sep 25, 2004 at 12:18 UTC
That'll work fine, but it will have efficiency problems on some types of data. Here's how the regexp engine would try to match that pattern when successful:

    find the first occurrence of '<IPDR>'         # fast
    walk the string to the '</IPDR>'              # fastish
    skip back to the last occurrence of 'test'    # fast
    walk forward to the '</IPDR>' again           # fastish
    signal success                                # done

However, if the tail were missing, e.g. if it had '<IPDR>' at the end by mistake instead of '</IPDR>', it would go like this:

    find the first occurrence of '<IPDR>'         # fast
    walk the string to the end                    # fastish
    skip back to the last occurrence of 'test'    # fast
    walk forward to end of string again           # fastish
    repeat last two steps for each additional occurrence of 'test'
                                                  # slow - quadratic in the number of occurrences of 'test'
    (no more 'test's to try) signal failure       # done

Now, this may not be a problem for the original poster: they may already have confirmed that their data is well-formed. However, it is an avoidable problem, either by checking for 'test' in a separate grep (as in your original solution), or by using the 'cut' operator to avoid the useless quadratic backtracking.

Here's some code to benchmark the difference, trying with and without the 'cut' operator, with strings that both do and don't match, and with 1, 10, 100 or 1000 copies of 'test' in the string: (benchmark code elided in the original: 1205 bytes)

And the (reformatted) results:

    uncut1g:     12799.05/s    cut1g:     12799.05/s
    uncut1b:     11821.50/s    cut1b:      8219.27/s
    uncut10g:    11486.54/s    cut10g:    11486.54/s
    uncut10b:     2559.05/s    cut10b:     7244.34/s
    uncut100g:    5743.27/s    cut100g:    5743.27/s
    uncut100b:     125.00/s    cut100b:    3381.13/s
    uncut1000g:    956.48/s    cut1000g:    956.48/s
    uncut1000b:      1.85/s    cut1000b:    532.38/s

From this it is clear that:

a) the cut operator costs nothing when it is not used (all the g(ood) matches are the same speed with and without the cut),
b) when there is just one 'test' to backtrack over, the cost of the extra bookkeeping is quite high (cut1b is 36% slower than the good case, while uncut1b is only 8% slower), but
c) the cost is linear rather than quadratic (as the number of 'test's increases, the 'cut<n>b' version continues to give away 36% to the 'cut<n>g' version, while the 'uncut<n>b' gets slower much more dramatically).

Hugo
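The benchmark code itself is not reproduced above, so the following is only a sketch of what a 'cut' version might look like, assuming the 'cut' operator refers to Perl's `(?>...)` independent subexpression. The variable names, the placement of the cut, and the test strings are all hypothetical; this is not the code that produced the numbers above.

```perl
use strict;
use warnings;

# Any run of characters that does not step over a closing tag.
my $any = qr{(?:(?!</IPDR>).)*}si;

# Without the cut: on failure the engine retries every earlier 'test'.
my $uncut = qr{<IPDR>${any}test${any}</IPDR>}si;

# With the cut (assumed placement): (?>...) commits to the last 'test'
# it finds, so a missing closing tag fails without retrying earlier ones.
my $cut = qr{<IPDR>(?>${any}test)${any}</IPDR>}si;

# Hypothetical data: 200 copies of 'test', with a good and a bad tail.
my $body = join '', map { "<f>test$_</f>" } 1 .. 200;
my $good = "<IPDR>$body</IPDR>";
my $bad  = "<IPDR>$body<IPDR>";    # malformed: opening tag instead of closing

for my $case ([good => $good], [bad => $bad]) {
    my ($name, $text) = @$case;
    my $u = $text =~ $uncut ? 1 : 0;
    my $c = $text =~ $cut   ? 1 : 0;
    print "$name: uncut=$u cut=$c\n";
}
```

Both patterns should agree on whether a string matches; only the amount of backtracking on the malformed string differs. To time them, the core Benchmark module's `cmpthese` could wrap the two matches.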
Re^3: one line regexp problem by JediWizard (Deacon) on Sep 24, 2004 at 23:40 UTC
Thank you for expanding on my answer. I hadn't even realized that the s flag wasn't on; I usually use smx on all my regexes.

May the Force be with you