in reply to Replacing Types

this is a very confusing question. do you want to turn "my name is George Bush and I live in America" into "my name is <name> and I live in <country>" or the other way around?

assuming the former, one way would be to first build up a regular expression for each 'type' you want to discover, then loop through them and perform substitutions on your sentence. consider something along these lines..

    use strict;
    use warnings;

    ## some sample 'types' and potential values
    my %types = (
        name    => [ 'George Bush', 'Osama Bin Laden', ],
        country => [ 'America', 'Your Pants', ],
    );

    ## turn those arrays into foo|bar|baz (quotemeta to deal with potential U.S.A.)
    my %regexes = map { $_ => join( '|', map quotemeta, @{ $types{$_} }) } keys %types;

    ## some sample sentences
    my @sentences = (
        'my name is George Bush and I live in America',
        'my name is Osama Bin Laden and I want to visit Your Pants',
    );

    ## discover 'types' in each sentence
    foreach my $sentence ( @sentences ) {
        print "before: $sentence\n";
        while ( my ($type, $pattern) = each %regexes ) {
            $sentence =~ s/$pattern/<$type>/g;
        }
        print " after: $sentence\n\n";
    }

produces:

    before: my name is George Bush and I live in America
     after: my name is <name> and I live in <country>

    before: my name is Osama Bin Laden and I want to visit Your Pants
     after: my name is <name> and I want to visit <country>

if you are trying to go the other way, then consider one of the many templating packages available..
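
for a feel of what "the other way" looks like before reaching for a templating package, here is a minimal core-perl sketch. it is not any particular package's API; the %values hash and the <key> placeholder syntax are assumptions made up for the example.

```perl
use strict;
use warnings;

## hypothetical values to plug in (made up for this example)
my %values = (
    name    => 'George Bush',
    country => 'America',
);

my $template = 'my name is <name> and I live in <country>';

## copy the template, then fill in each <key> we know about;
## placeholders with no matching key are left untouched
( my $sentence = $template )
    =~ s/<(\w+)>/exists( $values{$1} ) ? $values{$1} : "<$1>"/ge;

print "$sentence\n";
```

a real templating package adds things like loops, escaping and includes, which is why it is probably the better choice for anything beyond simple substitution.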

(update: and yeah, if you don't have a pre-defined list of names and countries you are searching for, this is a much different problem, as ayrnieu references above.)

Re^2: Replacing Types
by Anonymous Monk on Sep 21, 2006 at 19:12 UTC
    Thank you. This is exactly what I want. Are you sure that building it this way would be 'optimal' for 1000s of types? I would like to generate common patterns of informative sentences from millions of sentences.
      i am not at all sure it is the optimal solution, just one way to do it. it could surely be optimized for speed and/or memory consumption, typically one at the expense of the other. good luck! :-)
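
      one possible speed-oriented variation, as a sketch only: instead of one substitution pass per type, build a single alternation over every value plus a reverse lookup from value to type, so each sentence is scanned once. the %type_of hash and @tagged array are names invented for this example, and it assumes no value appears under two types. as far as i know, perl 5.10 and later compile alternations of literal strings into a trie, which should help with large lists, but i have not benchmarked this, so measure before trusting it.

```perl
use strict;
use warnings;

my %types = (
    name    => [ 'George Bush', 'Osama Bin Laden', ],
    country => [ 'America', 'Your Pants', ],
);

## reverse lookup: literal value => type name
## (assumption: a value belongs to only one type; otherwise the last one seen wins)
my %type_of;
for my $type ( keys %types ) {
    $type_of{$_} = $type for @{ $types{$type} };
}

## one alternation for everything, longest values first so a short value
## cannot win over a longer value that starts with the same text
my $alternation = join '|',
    map quotemeta,
    sort { length($b) <=> length($a) } keys %type_of;
my $regex = qr/($alternation)/;

my @sentences = (
    'my name is George Bush and I live in America',
    'my name is Osama Bin Laden and I want to visit Your Pants',
);

## single pass per sentence; the captured text picks the type via %type_of
my @tagged;
for my $sentence ( @sentences ) {
    ( my $copy = $sentence ) =~ s/$regex/<$type_of{$1}>/g;
    push @tagged, $copy;
}

print "$_\n" for @tagged;
```

      this trades memory (the lookup hash holds every value) for fewer passes over each sentence.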