Perl Regular Expressions

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am trying to split a block of text that looks like this:

ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line

Into pairs of text before the "-" and text following.

i.e.

$myHash{'ABD'} = "some text"
$myHash{'ACDB'} = "some more text"
etc.

The text before the "-" is always 4 characters wide: at least 2 uppercase letters (A-Z), padded with spaces, and there is always a space after the "-".

So far what I have is :

/^([A-Z]{2,4})(\s{0,2})\-\s((.*)\n)*/g

What I think this matches is :
Start of line
2 - 4 capital alpha characters
0 - 2 whitespaces characters
A hyphen
A whitespace
( Any character any number of times followed by a newline ) any number of times

But it isn't doing what I thought it would, so I must have something wrong :-)

Any help you could throw my way would be greatly appreciated.

Cheers,
rsiedl

"Sometimes I think the surest sign that intelligent life exists elsewhere in the universe is that none of it has tried to contact us." Calvin-The Indispensable Calvin and Hobbes

Update:

Thanks very much for your help guys.
This is what I ended up doing:

#!/usr/bin/perl
use strict;
use warnings;

# The data
my $data =<< "END";
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line
END

# Join continuation lines: delete each newline that is followed
# by four whitespace characters and a hyphen
$data =~ s/\n\s{4}-//g;

# Split the data into lines
my @lines = split(/\n/, $data);

# Loop through the lines
foreach my $line ( @lines ) { # start-foreach
        # Check that the line is formatted correctly and
        # capture the key and value in the same match
        if ( my ( $key, $value ) = $line =~ /^(\S+)\s*-\s+(.+)$/ ) { # start-if
                # Print the results
                print "$key, $value\n";
        } # end-if
        # Else line is not formatted correctly
        else { # start-else
                print "Badly formatted line!\n";
        } # end-else
} # end-foreach
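
The script above only prints each pair. Here is a minimal sketch of the same approach that fills the hash from the original question instead. The name %myHash comes from the question; the warn message and the final print loop are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied from the question
my $data = <<"END";
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line
END

# Join continuation lines (newline, four spaces, hyphen) onto the previous line
$data =~ s/\n {4}-//g;

# Build the hash: the key is the leading run of non-whitespace,
# the value is everything after the hyphen and the whitespace following it
my %myHash;
foreach my $line ( split /\n/, $data ) {
    if ( my ( $key, $value ) = $line =~ /^(\S+)\s*-\s+(.+)$/ ) {
        $myHash{$key} = $value;
    }
    else {
        warn "Badly formatted line: $line\n";
    }
}

# Show the result, one pair per line
print "$_ => $myHash{$_}\n" for sort keys %myHash;
```

Note that a repeated key would silently overwrite the earlier value here; whether that is acceptable depends on the real data.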

Cheers,
rsiedl.


Replies are listed 'Best First'.
Re: Perl Regular Expressions by kvale (Monsignor) on Mar 22, 2004 at 21:33 UTC
Here is a different approach that splits on whitespace and the dash:

my ($key, $value) = split /\s*-\s+/, $line, 2;

Update: Sorry, accidentally submitted too early. Anyway, to handle the continuation lines, we need a loop:

my ($key, $value, @pair);
while (<>) {
    # \s* (not \s) so that "ACDB- ..." with no space before the dash matches too
    if (/^(\S+)\s*-\s+(.+)$/) {
        push @pair, [$key, $value] if $key;   # save prev pair
        $key   = $1;
        $value = $2;
    }
    elsif (/^\s+-\s+(.+)$/) {
        $value .= " $1";                      # append the continuation text
    }
    else {
        print "Badly formatted line: $_\n";
    }
}
push @pair, [$key, $value] if $key;           # save the last pair

-Mark
Re: Perl Regular Expressions by cLive ;-) (Prior) on Mar 22, 2004 at 22:01 UTC
timtowtdi

#!/usr/bin/perl
use strict;
use warnings;

my %Hash = ();
my $thiskey = '';

while (<DATA>) {
    chomp;
    # \s* so that "ACDB- ..." (no space before the hyphen) also splits,
    # and so that a continuation line yields an empty first field
    my @line = split /\s*-\s/, $_, 2;
    $thiskey = $line[0] || $thiskey;
    $Hash{$thiskey} = defined $Hash{$thiskey} ? $Hash{$thiskey} . " $line[1]" : $line[1];
    print "$thiskey $line[1]\n";
}

__DATA__
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line

cLive ;-)
Re: Perl Regular Expressions by neniro (Priest) on Mar 22, 2004 at 22:06 UTC
I like it small and simple:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %hash;
my @array;
while (<DATA>) {
    chomp;
    push @array, split "- "
};
%hash = @array;
print Dumper(\%hash);

__DATA__
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line

best regards, neniro
Re: Re: Perl Regular Expressions by CountZero (Bishop) on Mar 22, 2004 at 22:34 UTC
The basic idea is nice, but in practice it doesn't work:

It doesn't handle the continuation line the way it should. A second continuation line would clobber the previous one.
If there is a hyphen+space in the text, the whole array/hash gets uncoordinated from there on. This can probably be corrected by only splitting on the first hyphen+space combo.

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
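
A minimal sketch of the correction suggested above, splitting only on the first hyphen+space and appending continuation lines rather than overwriting. The second continuation line in the sample data and the variable names (%hash, $current_key) are assumptions added to exercise both problems; they are not from the original data.

```perl
use strict;
use warnings;

# Sample lines from the question, plus one hypothetical extra
# continuation line that itself contains " - "
my @lines = (
    'ABD - some text',
    'ACDB- some more text',
    'WD  - more text',
    '    - which spills onto the next line',
    '    - and a second continuation - with a hyphen in it',
    'SD  - another line',
);

my %hash;
my $current_key;    # key of the most recent non-continuation line

for my $line (@lines) {
    # The limit of 2 makes split stop after the first hyphen+space,
    # so later " - " sequences stay inside the value.
    my ( $key, $value ) = split /\s*-\s/, $line, 2;
    next unless defined $value;    # no separator found; skip the line

    if ( length $key ) {
        # A non-empty key starts a new entry
        $current_key = $key;
        $hash{$current_key} = $value;
    }
    elsif ( defined $current_key ) {
        # An empty key means a continuation line: append, do not overwrite
        $hash{$current_key} .= " $value";
    }
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Because the pattern starts with \s*, the padding before the hyphen is consumed by the separator, so the keys come out without trailing spaces.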
Re: Perl Regular Expressions by tinita (Parson) on Mar 22, 2004 at 21:43 UTC
Add the /m modifier so that ^ matches the beginning of each "line" and not just the beginning of your whole string. If that doesn't help, show us more code, e.g. the part where you execute the regex on the string. For details see perlre.
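
To illustrate the /m suggestion, here is a rough sketch that runs one pattern over the whole string. It assumes the data sits in a single scalar as in the question, and it restructures the original regex so that continuation lines are captured together with the first line; that restructuring and the /x layout are assumptions, not the poster's pattern.

```perl
use strict;
use warnings;

# Sample data copied from the question
my $data = <<"END";
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line
END

my %myHash;

# /m lets ^ match at the start of every line, /g walks the whole string,
# and /x allows whitespace and comments inside the pattern.
while (
    $data =~ m{
        ^ ([A-Z]{2,4}) [ ]{0,2} - [ ]      # key, padding, hyphen, one space
        ( .* (?: \n [ ]{4} - [ ] .* )* )   # first line plus any continuation lines
    }mgx
  )
{
    my ( $key, $value ) = ( $1, $2 );
    $value =~ s/\n {4}- / /g;              # fold continuation lines into one line
    $myHash{$key} = $value;
}

print "$_ => $myHash{$_}\n" for sort keys %myHash;
```

Without /m, ^ would only match at the very start of $data, so only the first pair would be found.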
Re: Perl Regular Expressions by BUU (Prior) on Mar 22, 2004 at 21:52 UTC
Random micro ops: `([A-Z]{2,4})(\s{0,2})` is simpler to write as `(....)` unless there's some specific reason you want to exclude the characters currently being excluded ( `[^\sA-Z]` ). This `\s((.*)\n)*` could probably be better written `([^\n]+)\n`, which means "one or more non-newline characters, followed by a newline".
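
A small sketch of how the `(....)` idea might be applied line by line, assuming the four-character field is always followed directly by "- " as the question states. Treating an all-blank field as a continuation line, and the names $field, $text and $last_key, are assumptions for illustration.

```perl
use strict;
use warnings;

# Sample lines copied from the question
my @lines = (
    'ABD - some text',
    'ACDB- some more text',
    'WD  - more text',
    '    - which spills onto the next line',
    'SD  - another line',
);

my %myHash;
my $last_key;    # most recent real key, used for continuation lines

for my $line (@lines) {
    # (....) grabs the fixed four-character field,
    # ([^\n]+) grabs the rest of the line
    my ( $field, $text ) = $line =~ /^(....)- ([^\n]+)$/
        or next;    # skip lines that do not fit the layout

    ( my $key = $field ) =~ s/\s+$//;    # strip the padding from the key

    if ( length $key ) {
        $last_key = $key;
        $myHash{$key} = $text;
    }
    elsif ( defined $last_key ) {
        # An all-blank field marks a continuation line
        $myHash{$last_key} .= " $text";
    }
}

print "$_ => $myHash{$_}\n" for sort keys %myHash;
```

The trade-off is that `(....)` accepts any four characters, so a malformed key would pass through where `[A-Z]{2,4}` would have rejected it.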