rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am trying to split a block of text that looks like this:

ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line

Into pairs of text before the "-" and text following.

i.e.
$myHash{'ABD'} = "some text"
$myHash{'ACDB'} = "some more text"
etc.
The text before the "-" is always 4 characters, at least 2 A-Z's and always uppercase, and there is always a space after the "-".

So far what I have is :

/^([A-Z]{2,4})(\s{0,2})\-\s((.*)\n)*/g
What I think this matches is :
Start of line
2 - 4 capital alpha characters
0 - 2 whitespaces characters
A hyphen
A whitespace
( Any character any number of time followed by a newline ) any number of times

But it isn't doing what I thought it would, so I must have something wrong :-)
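A quick way to see what the pattern actually captures is to run it against the whole block and print each group. This is only a diagnostic sketch: it assumes the block is held in one string with the column layout shown above, and the sub name first_match is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return ($1, $2, $3, $4) from the first match
# of the pattern in the question (no /m, no /g).
sub first_match {
    my ($text) = @_;
    return $text =~ /^([A-Z]{2,4})(\s{0,2})\-\s((.*)\n)*/;
}

# Sample data; the exact padding is an assumption.
my $text = <<"END";
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line
END

my @captures = first_match($text);

# Print each capture in brackets so whitespace and newlines are visible.
print defined $_ ? "[$_]\n" : "[undef]\n" for @captures;
```

Without /m the ^ only matches at the very start of the string, and the trailing ((.*)\n)* runs on through every remaining line. A repeated group keeps only its last iteration, so the final capture holds the last line rather than the text that belongs to ABD.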

Any help you could throw my way would be greatly appreciated.

Cheers,
rsiedl

"Sometimes I think the surest sign that intelligent life exists elsewhere in the universe is that none of it has tried to contact us." Calvin-The Indispensable Calvin and Hobbes

Update:

Thanks very much for your help guys.
This is what I ended up doing:

#!/usr/bin/perl
use strict;
use warnings;

# The data
my $data = <<"END";
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line
END

# Search and replace a newline followed by four spaces
# followed by a hyphen with nothing
$data =~ s/\n(\s{4})-//ig;

# Split the data into lines
my @lines = split(/\n/, $data);

# Scroll through the lines
foreach my $line ( @lines ) { # start-foreach
    # Check to make sure our line is formatted correctly
    if ($line =~ /^(\S+)\s*-\s+(.+)$/) { # start-if
        # Get the values out of the regex
        my ( $key, $value ) = $line =~ /^(\S+)\s*-\s+(.+)$/;
        # Print the results
        print "$key, $value\n";
    } # end-if
    # Else line is not formatted correctly
    else { # start-else
        print "Badly formatted line!\n";
    } # end-else
} # end-foreach

Cheers,
rsiedl.

Re: Perl Regular Expressions
by kvale (Monsignor) on Mar 22, 2004 at 21:33 UTC
    Here is a different approach that splits on whitespace and the dash:
        my ($key, $value) = split /\s*-\s+/, $line;
    Update: Sorry, accidentally submitted too early. Anyway, to handle the continuation lines, we need a loop:
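        my (@pair, $key, $value);
        while (<>) {
            if (/^(\S+)\s*-\s+(.+)$/) {
                push @pair, [$key, $value] if $key;   # save prev pair
                $key   = $1;
                $value = $2;
            }
            elsif (/^\s+-\s+(.+)$/) {
                $value .= " $1";   # continuation line: append with a space
            }
            else {
                print "Badly formatted line: $_\n";
            }
        }
        push @pair, [$key, $value] if $key;   # save the last pair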

    -Mark

Re: Perl Regular Expressions
by cLive ;-) (Prior) on Mar 22, 2004 at 22:01 UTC
    timtowtdi
        #!/usr/bin/perl
        use strict;
        use warnings;

        my %Hash = ();
        my $thiskey = '';

        while (<DATA>) {
            chomp;
            my @line = (split /\s*-\s*/);
            $thiskey = $line[0] || $thiskey;
            $Hash{$thiskey} = defined $Hash{$thiskey} ? $Hash{$thiskey}.=" $line[1]" : $line[1];
            print "$thiskey $line[1]\n";
        }

        __DATA__
        ABD - some text
        ACDB- some more text
        WD  - more text
            - which spills onto the next line
        SD  - another line

    cLive ;-)

Re: Perl Regular Expressions
by neniro (Priest) on Mar 22, 2004 at 22:06 UTC
    I like it small and simple:
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my %hash;
        my @array;

        while (<DATA>) {
            chomp;
            push @array, split "- "
        };

        %hash = @array;
        print Dumper(\%hash);

        __DATA__
        ABD - some text
        ACDB- some more text
        WD  - more text
            - which spills onto the next line
        SD  - another line
    best regards, neniro
      The basic idea is nice, but in practice it doesn't work:
      • It doesn't handle the continuation line the way it should.
      • A second continuation line would clobber the previous one.
      • If there is a hyphen+space in the text, the whole array/hash gets uncoordinated from there on. This can probably be corrected by only splitting on the first hyphen+space combo.
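
      A minimal sketch of that correction, assuming the sample data from the question. The sub name lines_to_hash is hypothetical. It relies on split's optional third argument (LIMIT): a limit of 2 stops after the first hyphen+space, and continuation lines are appended to the previous value instead of overwriting it.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash from "KEY - text" lines.
sub lines_to_hash {
    my @lines = @_;
    my (%hash, $key);
    for my $line (@lines) {
        chomp $line;
        # LIMIT of 2: split only on the first hyphen+whitespace.
        my ($left, $right) = split /-\s/, $line, 2;
        next unless defined $right;       # no "- " on this line, skip it
        $left =~ s/\s+//g;                # strip the padding around the key
        if (length $left) {
            $key = $left;                 # a new record starts here
            $hash{$key} = $right;
        }
        elsif (defined $key) {
            $hash{$key} .= " $right";     # continuation line: append
        }
    }
    return %hash;
}

my %hash = lines_to_hash(
    "ABD - some text",
    "ACDB- some more text",
    "WD  - more text - with a hyphen",
    "    - which spills onto the next line",
    "    - and onto a third",
    "SD  - another line",
);
print "$_ => $hash{$_}\n" for sort keys %hash;
```

      The extra hyphen and the second continuation line in the usage data are additions of mine, there to exercise the second and third points above.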

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Perl Regular Expressions
by tinita (Parson) on Mar 22, 2004 at 21:43 UTC
    add the /m modifier so that ^ matches the beginning of a "line" and not the beginning of your whole string.
    if that doesn't help, show us more code, e.g. the part where you execute the regex on the string.
    for details see perlre
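
    A minimal sketch of the /m suggestion, assuming the sample data from the question sits in one string (the padding in the data and the sub name parse_pairs are assumptions for illustration). Note that /m alone is probably not enough here: the trailing ((.*)\n)* in the original pattern would still swallow every remaining line, so this version only lets the value run on across lines that begin with whitespace and a hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "KEY - text" records out of one string.
sub parse_pairs {
    my ($text) = @_;
    my %pairs;
    # /m lets ^ match at the start of every line, not just the string.
    # The value is the rest of the key line plus any continuation lines
    # (lines that begin with whitespace followed by a hyphen).
    while ($text =~ /^([A-Z]{2,4})\s{0,2}-\s(.*(?:\n\s+-\s.*)*)/mg) {
        my ($key, $value) = ($1, $2);
        $value =~ s/\n\s+-\s+/ /g;    # fold continuation lines into one line
        $pairs{$key} = $value;
    }
    return %pairs;
}

my $text = <<"END";
ABD - some text
ACDB- some more text
WD  - more text
    - which spills onto the next line
SD  - another line
END

my %pairs = parse_pairs($text);
print "$_ => $pairs{$_}\n" for sort keys %pairs;
```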
Re: Perl Regular Expressions
by BUU (Prior) on Mar 22, 2004 at 21:52 UTC
    Random micro-ops: ([A-Z]{2,4})(\s{0,2}) is simpler to write as (....) unless there's some specific reason you want to exclude the characters currently being excluded ( [^\sA-Z] ).

    This \s((.*)\n)* could probably be better written as ([^\n]+)\n which means "one or more non-newline characters, followed by a newline".