in reply to Can this be parsed ?
Without knowing this, we can't know whether to break MOTEL GOLDEN LEEUW <A225> into (MOTEL)()(GOLDEN LEEUW <A225>) or (MOTEL GOLDEN LEEUW)()(<A225>). Assuming for a moment that the latter is the correct way to divide the field, you could do something like:
while ( <DATA> ) {
    if ( /^(.*)\s*($Pred_re)\s*(.*)$/ ) {
        ($name,$pred,$unknown) = ($1, $2, $3);
    } else {
        ($name,$unknown,$pred) = /^(.*)\s*(\S*)$/;
    }
}
Updated: to tweak the whitespace matching.
Update 2: Bah. Forget the feeble effort above.
my $Prep_Re=join '|',('VAN DER','VAN DE','DEN','DE','VAN');
while ( <DATA> ) {
    chomp;
    if ( /^(.*)\s+($Prep_Re)\b\s*(.*)$/ ) {
        ($name, $prep, $other) = ($1, $2, $3);
    } elsif ( /^(.*)\s+($Prep_Re)\s*$/ ) {
        ($name, $prep, $other) = ($1, $2, "");
    } elsif ( /^(.*)\b\s+(\S+)$/ ) {
        ($name, $prep, $other) = ($1, "", $2);
    } else {
        ($name, $prep, $other) = ($_, "", "");
    }
    print "$name|$prep|$other\n";
}
__DATA__
WINTER DE <A240>
ZANDEN VAN DER ŤAť
JENSEN 230
WOODHEAD <D>
BRINK 130,-
HEYDIER DEN <240>
SMITSER (4X115PJ)
LINDEN VAN DER
MOTEL GOLDEN LEEUW <A225>
__END__
WINTER|DE|<A240>
ZANDEN|VAN DER|ŤAť
JENSEN||230
WOODHEAD||<D>
BRINK||130,-
HEYDIER|DEN|<240>
SMITSER||(4X115PJ)
LINDEN|VAN DER|
MOTEL GOLDEN LEEUW||<A225>
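One detail worth spelling out: the order of the alternatives in $Prep_Re matters. If 'VAN' came before 'VAN DER', a line like ZANDEN VAN DER would split after VAN and leave DER in the last field. Here is a rough sketch of the same idea wrapped in a sub, with the list sorted longest-first so the order can't be gotten wrong by hand. The name split_record and the lowercase variable names are mine, not from the code above, and the sketch drops the second branch, since the first pattern already matches a line that ends in a preposition (its last group just comes back empty).

```perl
use strict;
use warnings;

# Assumed list of prepositions; extend it to match the real data.
my @preps = ('VAN', 'DE', 'DEN', 'VAN DE', 'VAN DER');

# Sort longest first so "VAN DER" is tried before "VAN".
my $prep_re = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @preps;

# split_record is a made-up helper name for this sketch.
# Returns (name, preposition, other) for one input line.
sub split_record {
    my ($line) = @_;

    # Name, then a known preposition, then whatever is left (possibly nothing).
    if ( $line =~ /^(.*)\s+($prep_re)\b\s*(.*)$/ ) {
        return ($1, $2, $3);
    }

    # No preposition: treat the last whitespace-separated token as "other".
    if ( $line =~ /^(.*)\b\s+(\S+)$/ ) {
        return ($1, '', $2);
    }

    # A single token: it is all name.
    return ($line, '', '');
}

for my $line ('WINTER DE <A240>', 'LINDEN VAN DER', 'MOTEL GOLDEN LEEUW <A225>', 'BRINK') {
    print join('|', split_record($line)), "\n";
}
```

This still assumes that the last token of a line without a preposition is never part of the name, which is exactly the open question about MOTEL GOLDEN LEEUW <A225>.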
Replies are listed 'Best First'.
Re: Re: Can this be parsed ?
by ChOas (Curate) on Jul 11, 2002 at 07:31 UTC