A query on greedy regexps

sch has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I was hoping some of you could cast your experienced eyes over this.

I'm trying to process some text as shown below:

graphics   0  1          graph3      CLAIMED     INTERFACE    Graphics
ext_bus    0  2/0/1      c720        CLAIMED     INTERFACE    Built-in SCSI
disk       0  2/0/1.0.0  sdisk       CLAIMED     DEVICE       HP      C2247M1
                        /dev/dsk/c0t0d0   /dev/rdsk/c0t0d0
disk       1  2/0/1.2.0  sdisk       CLAIMED     DEVICE       TOSHIBA CD-ROM XM-3401TA
                        /dev/dsk/c0t2d0   /dev/rdsk/c0t2d0
disk       2  2/0/1.6.0  sdisk       CLAIMED     DEVICE       SEAGATE ST31200N
                        /dev/dsk/c0t6d0   /dev/rdsk/c0t6d0
ctl        0  2/0/1.7.0  sctl        CLAIMED     DEVICE       Initiator
                        /dev/rscsi/c0t7d0
lan        0  2/0/2      lan2        CLAIMED     INTERFACE    Built-in LAN
                        /dev/diag/lan0  /dev/ether0
tty        0  2/0/4      asio0       CLAIMED     INTERFACE    Built-in RS-232C
                        /dev/diag/mux0  /dev/mux0       /dev/tty0p0
ext_bus    1  2/0/6      CentIf      CLAIMED     INTERFACE    Built-in Parallel Interface
                        /dev/c1t0d0_lp
audio      0  2/0/8      audio       CLAIMED     INTERFACE    Built-in Audio
                        /dev/audio       /dev/audioEL_0   /dev/audioLL_0
                        /dev/audioBA     /dev/audioEU     /dev/audioLU
                        /dev/audioBA_0   /dev/audioEU_0   /dev/audioLU_0
                        /dev/audioBL     /dev/audioIA     /dev/audioNA
                        /dev/audioBL_0   /dev/audioIA_0   /dev/audioNA_0
                        /dev/audioBU     /dev/audioIL     /dev/audioNL
                        /dev/audioBU_0   /dev/audioIL_0   /dev/audioNL_0
                        /dev/audioCtl    /dev/audioIU     /dev/audioNU
                        /dev/audioCtl_0  /dev/audioIU_0   /dev/audioNU_0
                        /dev/audioEA     /dev/audioLA     /dev/audio_0
                        /dev/audioEA_0   /dev/audioLA_0
                        /dev/audioEL     /dev/audioLL
pc         0  2/0/10     fdc         CLAIMED     INTERFACE    Built-in Floppy Drive
ps2        0  2/0/11     ps2         CLAIMED

(some of you may recognise this as an hpux ioscan output)
and I have the following chunk of code:

($header,@devices) = m/(Class.+\n=+\n)(^\w.+\n(?:\s+.+\n)*)/mg

What I'm trying to do is split out each separate device and its associated text into an array.

On my journey to enlightenment, I've started to venture beyond the world of very simple regexps into slightly more involved versions. Reading through perlmonks I've seen mention of greedy regexps, and I think I understand why they're a bad thing.

I suspect that this regexp is a prime example of such a beast, but I was hoping for the views of more enlightened members of PM and any pointers on how the regexp could be improved if necessary.

Edited: ~Tue Oct 1 16:07:42 2002 (GMT) by footpad: Replaced <pre> tags with <code> tags, per Consideration


Replies are listed 'Best First'.
Re: A query on greedy regexps by broquaint (Abbot) on Oct 01, 2002 at 11:36 UTC
For a good start on greedy regexps see Ovid's classic Death to Dot Star!. Essentially, the greediness of a regexp relates to the fact that both `*` and `+` will match the preceding atom for as long as possible. However, you can alter this behaviour by appending a question mark to either quantifier, e.g.

    my $str = "123 foo 456 foo 789 foo 10";
    # will match everything up to the last 'foo'
    print "dot star greedy: ", $str =~ /(.*)foo/, $/;
    # will match everything up to the first 'foo'
    print "dot star non-greedy: ", $str =~ /(.*?)foo/, $/;
    __output__
    dot star greedy: 123 foo 456 foo 789
    dot star non-greedy: 123

HTH

_________
broquaint

update: made code more illustrative
Re: A query on greedy regexps by davis (Vicar) on Oct 01, 2002 at 12:23 UTC
Greedy operators, as explained by broquaint, gobble up as much of the input text as possible that still allows the regex to match. AIUI, if the match fails, perl's regular expression engine will force one character (unicode aside) from the text matched by the `.*` to be "given back", to see if the rest of the expression can match. (Caveat: I've realised I need to read O'Reilly's "Mastering Regular Expressions" again, so this might well be wrong.)

Looking at the regex you've created, I can't see any advantage to making the quantifiers non-greedy - they all need to match to the end of the line anyway.

If you're looking for a robust solution, then, as I mentioned in the CB, I'd recommend installing HPUX::Ioscan and using that, or at least looking at the code to see how it does it. (Hint: It does it by looping over each line)

Summary: I think your regex is fine (although you forgot the "Class\n======\n" header at the start of your example output, so the regex won't work), although I'd still advocate using the module.

cheers
davis

Is this going out live?
No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist
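The "looping over each line" hint can be sketched roughly as follows. This is not the HPUX::Ioscan code, only a minimal illustration of the idea. It assumes the first two lines are the title row and the `=====` rule, that a device entry starts with a word character in column one, and that indented lines belong to the device above them. The `Class Foo` header and the names `$ioscan`, `@header_lines` and `@devices` are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample, trimmed from the listing in the question; the
# "Class Foo" header is a stand-in for the real ioscan title row.
my $ioscan = <<'END';
Class Foo
=========
graphics   0  1          graph3      CLAIMED     INTERFACE    Graphics
disk       0  2/0/1.0.0  sdisk       CLAIMED     DEVICE       HP      C2247M1
                        /dev/dsk/c0t0d0   /dev/rdsk/c0t0d0
ctl        0  2/0/1.7.0  sctl        CLAIMED     DEVICE       Initiator
                        /dev/rscsi/c0t7d0
END

my (@header_lines, @devices);
for my $line (split /^/m, $ioscan) {      # split into lines, newlines kept
    if (@header_lines < 2) {
        push @header_lines, $line;        # title row, then the "=====" rule
    }
    elsif ($line =~ /^\w/) {
        push @devices, $line;             # word char in column one: new device
    }
    elsif (@devices) {
        $devices[-1] .= $line;            # indented line: belongs to current device
    }
}
my $header = join '', @header_lines;

print scalar(@devices), " devices\n";
print "--\n$_" for @devices;
```

On this sample the loop reports 3 devices, with the `/dev/...` lines attached to the `disk` and `ctl` entries. No quantifier has to span more than one line here, so greediness never comes into it.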
Re: Re: A query on greedy regexps by sch (Pilgrim) on Oct 01, 2002 at 12:29 UTC
Hi - and thanks for the pointer to HPUX::Ioscan. If I was looking to do anything heavyweight then I'd definitely use it - as it is I just want to have a list of devices I can show in different colours, which the module would do, but I just thought I'd save the memory :) Of course, it's also (for me) another learning exercise in the black art of regexes.
Re: A query on greedy regexps by Hofmator (Curate) on Oct 01, 2002 at 15:11 UTC
Well, your regex is not working, at least not as I'd expect it to work - apart from the fact that the 'Class\n====\n' header is missing. The array @devices gets just the first entry line ('graphics 0 ...'). The reason for this is that the /g modifier on the regex is not matching multiple times because the header is there only once. So do it in two steps like this:

    #!/usr/bin/perl
    $_ = do {local $/; <DATA>};
    s/(Class.+\n=+\n)//;
    $header = $1;
    @devices = m/(^\w.+\n(?:\s+.+\n)*)/mg;
    __DATA__
    Class Foo
    =========
    graphics   0  1          graph3      CLAIMED     INTERFACE    Graphics
    disk       0  2/0/1.0.0  sdisk       CLAIMED     DEVICE       HP      C2247M1
                            /dev/dsk/c0t0d0   /dev/rdsk/c0t0d0
    ctl        0  2/0/1.7.0  sctl        CLAIMED     DEVICE       Initiator
                            /dev/rscsi/c0t7d0

As far as I can tell this is not a problem with greedy matching. You just expected the /g modifier to dwim more than it does.

-- Hofmator
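To see the effect described above on something small, here is a hedged sketch that runs both approaches on the same text and counts what comes back. The sample text and the names `$text`, `@one_pass`, `$body` and `@two_step` are invented for the illustration, and the two-step variant uses `substr` rather than `s///` so the input is left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up miniature input: a header, then three entries, two of which
# have an indented continuation line.
my $text = "Class Foo\n=========\n"
         . "aaa 1\n  /dev/a\n"
         . "bbb 2\n"
         . "ccc 3\n  /dev/c\n";

# One pattern with /g: each match must start with the header, and the
# header occurs once, so there is one match and one captured entry.
my ($header, @one_pass) =
    $text =~ m/(Class.+\n=+\n)(^\w.+\n(?:\s+.+\n)*)/mg;

# Two steps: take the header off first, then let /g repeat over the rest.
my ($header2) = $text =~ m/\A(Class.+\n=+\n)/;
my $body      = substr($text, length $header2);
my @two_step  = $body =~ m/(^\w.+\n(?:\s+.+\n)*)/mg;

print "one pattern: ", scalar(@one_pass), " entry\n";    # 1
print "two steps:   ", scalar(@two_step), " entries\n";  # 3
```

The device part of the pattern is identical in both cases; only the requirement that every match begin with `Class...` stops /g from finding the second and third entries.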
Re: Re: A query on greedy regexps by sch (Pilgrim) on Oct 01, 2002 at 15:31 UTC
Oops, you are of course right - just re-ran the code, and it finds the first element but that's it. I actually had your suggestion originally (more or less) and just dropped it down to one regex and it seemed to work - yet another perfect example of why I shouldn't make assumptions about what seems simple code.