Is this the best regex?

drewk has asked for the wisdom of the Perl Monks concerning the following question:

I am using Perl to find all the PowerPC apps on my newly upgraded Snow Leopard laptop. (Love Perl 5.10 on Snow Leopard btw...)

system_profiler SPApplicationsDataType produces a list of all the applications on OS X. There is an XML output option, but I am working with the text version. system_profiler produces output in text like this:

    SystemUIServer:

      Version: 1.5.5
      Last Modified: 8/23/09 8:22 PM
      Kind: Universal
      Get Info String: SystemUIServer version 1.5.5, Copyright 2000-2009 Apple Computer, Inc.
      Location: /System/Library/CoreServices/SystemUIServer.app

    UserNotificationCenter:

      Version: 3.0.0
      Last Modified: 1/29/07 11:03 PM
      Kind: Universal
      Location: /System/Library/CoreServices/UserNotificationCenter.app

My perl script to parse this is:

#!/usr/bin/perl

$apps_rep = `system_profiler SPApplicationsDataType 2> /dev/null`;

@apps_lines = split(/\n/,$apps_rep) ;

@apps=();

$count = @apps_lines ;

$i=$j=$k=$p=0;

while ($j<$count) {
    $apps[$i] .= $apps_lines[$j] ;
    $apps[$i] .= "\n" ;
    $i++ if ($apps_lines[$j]) =~ /^\s\s\s\s\S.*:$/;
    $j++;
    }
    
print "$i apps\n" ;

while($k<$i) {
    $_ = $apps[$k++] ;
    if (/Kind: PowerPC/s) {print; $p++} ;
    }    
    
print "$i applications, $p PowerPC applications\n\n";

Each application record is delimited by a header line: four spaces, the app name, and a colon at the end of the line. The regex /^\s\s\s\s\S.*:$/ captures this delimiter, but I cannot use the regex in split(/^\s\s\s\s\S.*:$/,$apps_rep). Instead I have to read the output line by line, reassemble the lines into an array of records, and match /Kind: PowerPC/s on each resulting record.

Any wisdom on why I cannot use the split(/^\s\s\s\s\S.*:$/,$apps_rep); call? The regex works, but not in split? Any better way to do this?


Replies are listed 'Best First'.
Re: Is this the best regex? by almut (Canon) on Sep 08, 2009 at 22:18 UTC
"The regex `/^\s\s\s\s\S.*:$/` captures this delimiter, but I cannot use the regex in `split(/^\s\s\s\s\S.*:$/,$apps_rep);`"

Try `split(/^\s\s\s\s\S.*:$/m,$apps_rep);` (note the `/m`: it makes `^` and `$` match the start/end of lines within the string). You probably also want to put capturing parentheses in the pattern, as in `/^\s\s\s\s(\S.*):$/m`, in order to keep what was split on, i.e. you'd then get alternating entries for record title (app name) and record body.
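A minimal sketch of what the capturing split returns, assuming a made-up two-record sample in the same layout (OldTool is a hypothetical app, not taken from the real report):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample in the system_profiler text layout.
my $apps_rep = <<'END';
    SystemUIServer:

      Version: 1.5.5
      Kind: Universal

    OldTool:

      Version: 2.0
      Kind: PowerPC
END

# With /m, ^ and $ anchor at every line; the parentheses make split
# return the captured app name in between the record bodies.
my @parts = split /^\s\s\s\s(\S.*):$/m, $apps_rep;

# The first element is whatever precedes the first delimiter
# (an empty string for this sample), so drop it.
shift @parts;

# What remains alternates name, body, name, body, ...
my %body_for = @parts;

my @ppc = grep { $body_for{$_} =~ /Kind: PowerPC/ } sort keys %body_for;
print "$_\n" for @ppc;
```

Loading the alternating list into a hash is only one option; it would silently merge two applications that share a name, so a plain loop over pairs may be safer on real data.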
Re^2: Is this the best regex? by ikegami (Patriarch) on Sep 09, 2009 at 00:32 UTC
Alternative that splits the records without doing additional parsing: `split(/^(?=\s\s\s\s\S.*:$)/m, $apps_rep);`
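As a rough illustration of the lookahead variant, again on a hypothetical sample (OldTool is invented): the lookahead matches the position in front of each header line without consuming it, so each record keeps its own name line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample in the system_profiler text layout.
my $apps_rep = <<'END';
    SystemUIServer:

      Kind: Universal

    OldTool:

      Kind: PowerPC
END

# Zero-width split: nothing is removed from the string, it is only
# cut in front of every line that looks like a record header.
my @apps = split /^(?=\s\s\s\s\S.*:$)/m, $apps_rep;

# Each element is now a complete record, header line included.
my @ppc = grep { /Kind: PowerPC/ } @apps;

printf "%d applications, %d PowerPC applications\n", scalar @apps, scalar @ppc;
print @ppc;
```

On the real report, any text before the first header (if there is some) would come back as an extra leading element and would need to be skipped before counting.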
Re^2: Is this the best regex? by Anonymous Monk on Sep 09, 2009 at 04:54 UTC
Yes -- that fixes split, but my thinking was fuzzy. I got focused on why split did not work and forgot that if I did get it working, the delimiter would be thrown away. In this case, the delimiter is data because it has the program name! Thanks for the education on split. I am sure it will come in handy one day.
Re^3: Is this the best regex? by Anonymous Monk on Sep 09, 2009 at 04:59 UTC
Wait. I saw your comment re the capturing parentheses. How do I capture both the fields and the delimiters?
Re^4: Is this the best regex? by Anonymous Monk on Sep 09, 2009 at 05:03 UTC
Re: Is this the best regex? by jwkrahn (Abbot) on Sep 09, 2009 at 01:54 UTC
Perhaps this will work better for you:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open my $PIPE, '-|', 'system_profiler SPApplicationsDataType 2> /dev/null'
        or die "Cannot open pipe from 'system_profiler' $!";

    my ( $i, $p );
    while ( <$PIPE> ) {
        $i += /\A\s{4}\S.*:\Z/;
        $p += /\A\s+Kind: PowerPC\Z/;
    }

    close $PIPE or warn $!
        ? "Error closing 'system_profiler' pipe: $!"
        : "Exit status $? from 'system_profiler'";

    print "$i applications, $p PowerPC applications\n\n";
Re^2: Is this the best regex? by Anonymous Monk on Sep 09, 2009 at 04:51 UTC
Yes -- better! It is faster and seems easier to understand. I changed it to:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open my $PIPE, '-|', 'system_profiler SPApplicationsDataType 2> /dev/null'
        or die "Cannot open pipe from 'system_profiler' $!";

    my ( $i, $p, @apps );
    while ( <$PIPE> ) {
        $apps[$i] .= $_;
        $i += /\A\s{4}\S.*:\Z/;
        $p += /\A\s+Kind: PowerPC\Z/;
    }

    close $PIPE or warn $!
        ? "Error closing 'system_profiler' pipe: $!"
        : "Exit status $? from 'system_profiler'";

    foreach (@apps) {
        print if /Kind: PowerPC/s;
    }

    print "$i applications, $p PowerPC applications\n\n";

Question: is there a difference between `/\A\s{4}\S.*:\Z/` and `/^\s{4}\S.*:$/`?
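On that last question: without the `/m` modifier, `^` behaves like `\A` and `$` behaves like `\Z` (end of string, or just before a final newline), so on single lines read from the pipe the two patterns match the same things. They only diverge once `/m` is added and the string holds several lines. A small sketch, with made-up sample strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line  = "    OldTool:\n";                  # one line, as read from the pipe
my $multi = "Applications:\n    OldTool:\n";   # several lines in one string

# No /m, single line: both spellings agree.
my $caret_single = $line =~ /^\s{4}\S.*:$/   ? 1 : 0;
my $az_single    = $line =~ /\A\s{4}\S.*:\Z/ ? 1 : 0;

# With /m on a multi-line string: ^ and $ also match at inner line
# boundaries, while \A and \Z still anchor only at the string's ends.
my $caret_multi = $multi =~ /^\s{4}\S.*:$/m   ? 1 : 0;
my $az_multi    = $multi =~ /\A\s{4}\S.*:\Z/m ? 1 : 0;

printf "%-24s %d\n", 'single line, ^ and $:',   $caret_single;
printf "%-24s %d\n", 'single line, \A and \Z:', $az_single;
printf "%-24s %d\n", 'multi-line /m, ^ and $:', $caret_multi;
printf "%-24s %d\n", 'multi-line /m, \A, \Z:',  $az_multi;
```

Only the last case fails to match, because `\A` pins the pattern to the very start of the string, where the line `Applications:` does not begin with four spaces.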
Re: Is this the best regex? by halfcountplus (Hermit) on Sep 08, 2009 at 22:27 UTC
Well, a big issue is that your regex begins with ^ and ends with $, which is kind of oxymoronic for a split. If you remove the $, it will probably work, unless you were intending what almut recommends, in which case you could also use '\n' instead of $. I'm not sure what you are trying to do, but I would encourage you to go for a more complex data structure here, namely an array of hashes. That means a function which processes each app and returns a hash of its contents. If you want an example, just ask.
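One possible shape for the array-of-hashes idea, as a sketch only and not halfcountplus's own code: `parse_apps` is a hypothetical helper, the sample text and the OldTool entry are invented, and the six-space field indentation is an assumption taken from the sample output in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_apps (hypothetical name): turn the text report into a list of
# hashes, one per application, keyed by field name.
sub parse_apps {
    my ($report) = @_;
    my @apps;
    for my $line ( split /\n/, $report ) {
        if ( $line =~ /^\s{4}(\S.*):$/ ) {
            # A four-space header line starts a new record.
            push @apps, { Name => $1 };
        }
        elsif ( @apps and $line =~ /^\s{6}([^:]+):\s*(.*)$/ ) {
            # Deeper "Field: value" lines belong to the current record.
            $apps[-1]{$1} = $2;
        }
    }
    return @apps;
}

# Hypothetical sample in the system_profiler text layout.
my $sample = <<'END';
    SystemUIServer:

      Version: 1.5.5
      Last Modified: 8/23/09 8:22 PM
      Kind: Universal
      Location: /System/Library/CoreServices/SystemUIServer.app

    OldTool:

      Version: 2.0
      Kind: PowerPC
      Location: /Applications/OldTool.app
END

my @apps = parse_apps($sample);

# Records without a Kind field are treated as non-PowerPC.
my @ppc = grep { ( $_->{Kind} // '' ) eq 'PowerPC' } @apps;

print "$_->{Name}\t$_->{Location}\n" for @ppc;
printf "%d applications, %d PowerPC applications\n", scalar @apps, scalar @ppc;
```

With the fields in a hash, filtering on `Kind` becomes an exact comparison instead of a regex over the whole record, and other fields such as `Location` are available for the report. The `//` operator needs Perl 5.10 or later.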
Re^2: Is this the best regex? by Anonymous Monk on Sep 09, 2009 at 04:57 UTC
I don't understand. Aren't most split delimiters either a \n or \s?
Re: Is this the best regex? by toolic (Bishop) on Sep 08, 2009 at 23:24 UTC
"There is an XML output option"

There is a good chance that the XML output would be easier to parse than the plain text output that you have shown, since there are many great XML parsers available from CPAN (for example, XML::Twig). If you post a small sample of the XML output, I could try to give you some sample code.