sch has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I was hoping some of you could cast your experienced eyes over this.

I'm trying to process some text as shown below:

graphics 0 1 graph3 CLAIMED INTERFACE Graphics
ext_bus 0 2/0/1 c720 CLAIMED INTERFACE Built-in SCSI
disk 0 2/0/1.0.0 sdisk CLAIMED DEVICE HP C2247M1
    /dev/dsk/c0t0d0 /dev/rdsk/c0t0d0
disk 1 2/0/1.2.0 sdisk CLAIMED DEVICE TOSHIBA CD-ROM XM-3401TA
    /dev/dsk/c0t2d0 /dev/rdsk/c0t2d0
disk 2 2/0/1.6.0 sdisk CLAIMED DEVICE SEAGATE ST31200N
    /dev/dsk/c0t6d0 /dev/rdsk/c0t6d0
ctl 0 2/0/1.7.0 sctl CLAIMED DEVICE Initiator
    /dev/rscsi/c0t7d0
lan 0 2/0/2 lan2 CLAIMED INTERFACE Built-in LAN
    /dev/diag/lan0 /dev/ether0
tty 0 2/0/4 asio0 CLAIMED INTERFACE Built-in RS-232C
    /dev/diag/mux0 /dev/mux0 /dev/tty0p0
ext_bus 1 2/0/6 CentIf CLAIMED INTERFACE Built-in Parallel Interface
    /dev/c1t0d0_lp
audio 0 2/0/8 audio CLAIMED INTERFACE Built-in Audio
    /dev/audio /dev/audioEL_0 /dev/audioLL_0
    /dev/audioBA /dev/audioEU /dev/audioLU
    /dev/audioBA_0 /dev/audioEU_0 /dev/audioLU_0
    /dev/audioBL /dev/audioIA /dev/audioNA
    /dev/audioBL_0 /dev/audioIA_0 /dev/audioNA_0
    /dev/audioBU /dev/audioIL /dev/audioNL
    /dev/audioBU_0 /dev/audioIL_0 /dev/audioNL_0
    /dev/audioCtl /dev/audioIU /dev/audioNU
    /dev/audioCtl_0 /dev/audioIU_0 /dev/audioNU_0
    /dev/audioEA /dev/audioLA /dev/audio_0
    /dev/audioEA_0 /dev/audioLA_0
    /dev/audioEL /dev/audioLL
pc 0 2/0/10 fdc CLAIMED INTERFACE Built-in Floppy Drive
ps2 0 2/0/11 ps2 CLAIMED

(some of you may recognise this as an hpux ioscan output)
and I have the following chunk of code:

 ($header,@devices) = m/(Class.+\n=+\n)(^\w.+\n(?:\s+.+\n)*)/mg

What I'm trying to do is split out each separate device and its associated text into an array.

On my journey to enlightenment, I've started to venture beyond the world of very simple regexps into slightly more involved versions. Reading through perlmonks, I've seen mention of greedy regexps, and I think I understand why they're a bad thing.

I suspect that this regexp is a prime example of such a beast, but I was hoping for the views of more enlightened members of PM and any pointers on how the regexp could be improved if necessary

Edited: ~Tue Oct 1 16:07:42 2002 (GMT) by footpad: Replaced <pre> tags with <code> tags, per Consideration

Replies are listed 'Best First'.
Re: A query on greedy regexps
by broquaint (Abbot) on Oct 01, 2002 at 11:36 UTC
    For a good start on greedy regexps see Ovid's classic Death to Dot Star!. Essentially, the greediness of a regexp refers to the fact that both * and + match the preceding atom as many times as possible. However, you can alter this behaviour by appending a question mark to either quantifier, e.g.
    my $str = "123 foo 456 foo 789 foo 10";
    # will match everything up to the last 'foo'
    print "dot star greedy: ", $str =~ /(.*)foo/, $/;
    # will match everything up to the first 'foo'
    print "dot star non-greedy: ", $str =~ /(.*?)foo/, $/;
    __output__
    dot star greedy: 123 foo 456 foo 789
    dot star non-greedy: 123

    HTH

    _________
    broquaint

    update: made code more illustrative

Re: A query on greedy regexps
by davis (Vicar) on Oct 01, 2002 at 12:23 UTC

    Greedy operators, as explained by broquaint, gobble up as much of the input text as possible that still allows the regex to match. AIUI, if the match fails, perl's regular expression engine will force one character (unicode aside) from the text matched by the .* to be "given back", to see if the rest of the expression can match. (Caveat: I've realised I need to read O'Reilly's "Mastering Regular Expressions" again, so this might well be wrong.)

    Looking at the regex you've created, I can't see any advantage to making the quantifiers non-greedy - they all need to match to the end of the line anyway.
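To illustrate the point about the quantifiers needing to reach the end of the line anyway, here is a minimal sketch. The two-line sample string is made up in the shape of the ioscan data, and the patterns are cut-down versions of the one in the question, not the original regex.

```perl
use strict;
use warnings;

# Made-up sample: one device line plus one indented device-file line.
my $text = "disk 0 2/0/1.0.0 sdisk CLAIMED DEVICE HP C2247M1\n"
         . "    /dev/dsk/c0t0d0 /dev/rdsk/c0t0d0\n";

# Without /s, '.' never matches "\n", and each pattern requires "\n" next,
# so the greedy and the non-greedy form have to stop at the same place.
my ($greedy) = $text =~ /^(\w.+)\n/m;
my ($lazy)   = $text =~ /^(\w.+?)\n/m;

print "same capture\n" if $greedy eq $lazy;
```

The non-greedy form only changes the order in which the engine tries lengths; when just one length can satisfy the rest of the pattern, both forms end up with the same capture.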

    If you're looking for a robust solution, then, as I mentioned in the CB, I'd recommend installing HPUX::Ioscan and using that, or at least looking at the code to see how it does it. (Hint: It does it by looping over each line)
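For anyone curious what a line-by-line approach might look like, here is a rough sketch. It is not the code from HPUX::Ioscan; the variable names, the sample data, and the assumption that the header consists of a "Class" line followed by a row of "=" are all mine.

```perl
use strict;
use warnings;

# Made-up sample in the shape of the question's data, header included.
my $ioscan = <<'END';
Class Foo
=========
graphics 0 1 graph3 CLAIMED INTERFACE Graphics
disk 0 2/0/1.0.0 sdisk CLAIMED DEVICE HP C2247M1
    /dev/dsk/c0t0d0 /dev/rdsk/c0t0d0
ctl 0 2/0/1.7.0 sctl CLAIMED DEVICE Initiator
    /dev/rscsi/c0t7d0
END

my ($header, @devices) = ('');
for my $line (split /^/m, $ioscan) {       # split into lines, keeping "\n"
    if (!@devices && $line =~ /^(?:Class|=)/) {
        $header .= $line;                  # header lines before the first device
    }
    elsif ($line =~ /^\w/) {
        push @devices, $line;              # a new device starts in column 0
    }
    elsif (@devices) {
        $devices[-1] .= $line;             # indented line: belongs to the last device
    }
}

print scalar(@devices), " devices\n";
```

Each element of @devices then holds one device line together with its indented device-file lines, which is the same shape the regex in the question was aiming for.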

    Summary: I think your regex is fine (although you forgot the "Class\n======\n" header at the start of your example output, so the regex won't work), although I'd still advocate using the module

    cheers
    davis
    Is this going out live?
    No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist

      Hi - and thanks for the pointer to HPUX::Ioscan. If I was looking to do anything heavyweight then I'd definitely use it - as it is I just want to have a list of devices I can show in different colours, which the module would do but I just thought I'd save the memory :)

      Of course, it's also (for me) another learning exercise in the black art of regexes.

Re: A query on greedy regexps
by Hofmator (Curate) on Oct 01, 2002 at 15:11 UTC

    Well, your regex is not working, at least not as I'd expect it to work - apart from the fact that the 'Class\n====\n' header is missing. The array @devices gets just the first entry line ('graphics 0 ...'). The reason for this is that the /g modifier cannot make the regex match multiple times, because every match has to include the header and the header is there only once.

    So do it in two steps like this:

    #!/usr/bin/perl
    $_ = do {local $/; <DATA>};
    s/(Class.+\n=+\n)//;
    $header = $1;
    @devices = m/(^\w.+\n(?:\s+.+\n)*)/mg;
    __DATA__
    Class Foo
    =========
    graphics 0 1 graph3 CLAIMED INTERFACE Graphics
    disk 0 2/0/1.0.0 sdisk CLAIMED DEVICE HP C2247M1
        /dev/dsk/c0t0d0 /dev/rdsk/c0t0d0
    ctl 0 2/0/1.7.0 sctl CLAIMED DEVICE Initiator
        /dev/rscsi/c0t7d0
    As far as I can tell this is not a problem with greedy matching. You just expected the /g modifier to dwim more than it does.
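A small side-by-side sketch of the one-regex and two-step versions may make the /g behaviour easier to see. The sample data and the variable names ($hdr, @devs_one, @devs_two) are made up for illustration; the patterns are the ones from the question and from the two-step code.

```perl
use strict;
use warnings;

my $text = <<'END';
Class Foo
=========
graphics 0 1 graph3 CLAIMED INTERFACE Graphics
disk 0 2/0/1.0.0 sdisk CLAIMED DEVICE HP C2247M1
    /dev/dsk/c0t0d0 /dev/rdsk/c0t0d0
ctl 0 2/0/1.7.0 sctl CLAIMED DEVICE Initiator
    /dev/rscsi/c0t7d0
END

# One regex: every /g iteration has to find the header again,
# so only the first iteration can succeed.
my ($hdr, @devs_one) = $text =~ /(Class.+\n=+\n)(^\w.+\n(?:\s+.+\n)*)/mg;

# Two steps: strip the header from a copy, then let the
# device pattern repeat on its own.
(my $rest = $text) =~ s/(Class.+\n=+\n)//;
my $header = $1;
my @devs_two = $rest =~ /(^\w.+\n(?:\s+.+\n)*)/mg;

print scalar(@devs_one), " vs ", scalar(@devs_two), "\n";   # prints "1 vs 3"
```

In list context, m//g returns the captures from every successful iteration of the whole pattern, so a part that occurs only once in the input (the header here) caps the number of iterations at one.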

    -- Hofmator

      Oops, you are of course right - just re-ran the code, and it finds the first element but that's it

      I actually had your suggestion originally (more or less) and just dropped it down to one regex and it seemed to work - yet another perfect example of why I shouldn't make assumptions about what seems simple code.