in reply to Can this be parsed ?

Can names have embedded spaces? Can the "haven't got a clue" part have embedded spaces?

Without knowing this, we can't know whether to break
  MOTEL GOLDEN LEEUW <A225>
into
  (MOTEL)()(GOLDEN LEEUW <A225>)
or
  (MOTEL GOLDEN LEEUW)()(<A225>)

Assuming for a moment that the latter is the correct way to divide the field, you could do something like:

while ( <DATA> ) {
    if ( /^(.*)\s*($Pred_re)\s*(.*)$/ ) {
        ($name,$pred,$unknown) = ($1, $2, $3);
    }
    else {
        ($name,$unknown,$pred) = /^(.*)\s*(\S*)$/;
    }
}

Updated: to tweak the whitespace matching.

Update 2: Bah. Forget the feeble effort above.

# Longer alternatives come first so that 'VAN DER' wins over 'VAN' and 'DE'.
my $Prep_Re=join '|',('VAN DER','VAN DE','DEN','DE','VAN');
while ( <DATA> ) {
    chomp;
    # Name, whole-word preposition, then whatever follows.
    if ( /^(.*)\s+($Prep_Re)\b\s*(.*)$/ ) {
        ($name, $prep, $other) = ($1, $2, $3);
    }
    # Name and preposition only, nothing after the preposition.
    elsif ( /^(.*)\s+($Prep_Re)\s*$/ ) {
        ($name, $prep, $other) = ($1, $2, "");
    }
    # No preposition: the last whitespace-separated token is the "other" part.
    elsif ( /^(.*)\b\s+(\S+)$/ ) {
        ($name, $prep, $other) = ($1, "", $2);
    }
    # Single token: the whole line is the name.
    else {
        ($name, $prep, $other) = ($_, "", "");
    }
    print "$name|$prep|$other\n";
}
__DATA__
WINTER DE <A240>
ZANDEN VAN DER «A»
JENSEN 230
WOODHEAD <D>
BRINK 130,-
HEYDIER DEN <240>
SMITSER (4X115PJ)
LINDEN VAN DER
MOTEL GOLDEN LEEUW <A225>
__END__
WINTER|DE|<A240>
ZANDEN|VAN DER|«A»
JENSEN||230
WOODHEAD||<D>
BRINK||130,-
HEYDIER|DEN|<240>
SMITSER||(4X115PJ)
LINDEN|VAN DER|
MOTEL GOLDEN LEEUW||<A225>
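
If the list of prepositions grows, the same idea might be easier to reuse as a sub. What follows is only a minimal sketch under a few assumptions: `split_record` is a made-up name, the longest-first sort is an addition so the result does not depend on the order the list is typed in, and the separate "preposition at end of line" branch is dropped because the trailing group in the first pattern is allowed to be empty.

```perl
use strict;
use warnings;

# Prepositions from the post. Sorting longest-first is an added assumption,
# so 'VAN DER' is tried before 'VAN' whatever order the list is typed in.
my @preps   = ( 'VAN DER', 'VAN DE', 'DEN', 'DE', 'VAN' );
my $prep_re = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @preps;

# split_record() is a hypothetical helper, not part of the original code.
# It returns (name, preposition, other) for one input line.
sub split_record {
    my ($line) = @_;
    chomp $line;

    # Name, whole-word preposition, optional trailing field. The trailing
    # group may be empty, so a preposition at end of line is covered too.
    if ( $line =~ /^(.*)\s+($prep_re)\b\s*(.*)$/ ) {
        return ( $1, $2, $3 );
    }

    # No preposition: treat the last whitespace-separated token as "other".
    if ( $line =~ /^(.*)\b\s+(\S+)$/ ) {
        return ( $1, '', $2 );
    }

    # Single token: everything is the name.
    return ( $line, '', '' );
}

for my $line ( 'WINTER DE <A240>', 'LINDEN VAN DER', 'MOTEL GOLDEN LEEUW <A225>' ) {
    print join( '|', split_record($line) ), "\n";
}
```

For those three lines this should print the same splits as the output listed above. The `\b` after the preposition is what keeps `DE` from matching the front of `DER`, and the required `\s+` before it is what keeps `DEN` from matching inside `GOLDEN`.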

Re: Re: Can this be parsed ?
by ChOas (Curate) on Jul 11, 2002 at 07:31 UTC
    Okay, I'll reply here, might be easier :) Results:
    WINTER|DE|<A240>                 <- Parsed correctly
    ZANDEN VAN |DE|R «A»             <- should be: ZANDEN|VAN DER|«A»
    JENSEN 230||                     <- should be: JENSEN||230
    WOODHEAD <D>||                   <- should be: WOODHEAD||<D>
    BRINK 130,-||                    <- should be: BRINK||130,-
    HEYDIER |DEN|<240>               <- Parsed correctly
    SMITSER (4X115PJ)||              <- should be: SMITSER||(4X115PJ)
    LINDEN VAN |DE|R                 <- should be: LINDEN|VAN DER|
    MOTEL GOL|DEN|LEEUW <A225>       <- should be: MOTEL GOLDEN LEEUW||<A225>
    Does this help? I will try to find a larger data set...

    btw, this is the result of my original code:
    WINTER|DE| <A240>
    ZANDEN|VAN DER| «A»
    JENSEN||230
    WOODHEAD||<D>
    BRINK||130,-
    HEYDIER|DEN| <240>
    SMITSER||(4X115PJ)
    LINDEN|VAN DER|
    MOTEL GOLDEN LEEUW||<A225>

    GreetZ!,
      ChOas

    print "profeth still\n" if /bird|devil/;