Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have various types of text, and I would like to replace items with their types. For example, if I have "my name is George Bush and I live in America", it should become "my name is <name> and I live in <country>". I can replace each item by checking the entire sentence, one by one. How would I do this? Thanks.

Replies are listed 'Best First'.
Re: Replacing Types
by mreece (Friar) on Sep 21, 2006 at 18:13 UTC
    this is a very confusing question. do you want to turn "my name is George Bush and I live in America" into "my name is <name> and I live in <country>" or the other way around?

    assuming the former, one way would be to first build up a regular expression for each 'type' you want to discover, then loop through them and perform substitutions on your sentence. consider something along these lines..

    use strict;
    use warnings;

    ## some sample 'types' and potential values
    my %types = (
        name    => [ 'George Bush', 'Osama Bin Laden', ],
        country => [ 'America', 'Your Pants', ],
    );

    ## turn those arrays into foo|bar|baz (quotemeta to deal with potential U.S.A.)
    my %regexes = map { $_ => join( '|', map quotemeta, @{ $types{$_} }) } keys %types;

    ## some sample sentences
    my @sentences = (
        'my name is George Bush and I live in America',
        'my name is Osama Bin Laden and I want to visit Your Pants',
    );

    ## discover 'types' in each sentence
    foreach my $sentence ( @sentences ) {
        print "before: $sentence\n";
        while ( my ($type, $pattern) = each %regexes ) {
            $sentence =~ s/$pattern/<$type>/g;
        }
        print " after: $sentence\n\n";
    }
    produces:
    before: my name is George Bush and I live in America
     after: my name is <name> and I live in <country>

    before: my name is Osama Bin Laden and I want to visit Your Pants
     after: my name is <name> and I want to visit <country>

    if you are trying to go the other way, then consider one of the many templating packages available..

    (update: and yeah, if you don't have a pre-defined list of names and countries you are searching for, this is a much different problem, as ayrnieu references above.)

      Thank you. This is exactly what I want. Are you sure that building it this way would be 'optimal' for thousands of types? I would like to generate common patterns of informative sentences from millions of sentences.
        i am not at all sure it is the optimal solution, just one way to do it. it could surely be optimized for speed and/or memory consumption, typically one at the expense of the other. good luck! :-)
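One possible speed-up for thousands of types, offered as a rough sketch rather than a measured optimization: instead of running one substitution per type, invert the hash so each token maps to its type, and make a single pass with one combined alternation. The sample data and the names `%type_of` and `tag_sentence` are made up for illustration. Tokens are sorted longest first so that a longer token is tried before a shorter one that is its prefix.

```perl
use strict;
use warnings;

# Hypothetical sample data, same shape as the %types hash in the reply above.
my %types = (
    name    => [ 'George Bush', 'Osama Bin Laden' ],
    country => [ 'America', 'U.S.A.' ],
);

# Invert to token => type so a matched token can be looked up directly.
my %type_of;
for my $type ( keys %types ) {
    $type_of{$_} = $type for @{ $types{$type} };
}

# Build one alternation, longest tokens first, with metacharacters escaped.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %type_of;
my $regex = qr/($alternation)/;

# tag_sentence is a made-up helper name: one pass over the sentence,
# replacing every known token with its <type>.
sub tag_sentence {
    my ($sentence) = @_;
    $sentence =~ s/$regex/<$type_of{$1}>/g;
    return $sentence;
}

print tag_sentence('my name is George Bush and I live in America'), "\n";
```

Whether this actually beats the per-type loop on a given dataset is something to benchmark, not assume.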
Re: Replacing Types
by shmem (Chancellor) on Sep 21, 2006 at 18:18 UTC
    Go read How do I post a question effectively? first. Have you? Homework? Hm.

    Next, more accurately the sentence should read "my name is <name> and I live in <continent>", 'cause that Bush bloke doesn't live in e.g. Venezuela.

    Read up s/// for sub/sti/tute/.

    You have a list of patterns, and a list of replacements. To tie them together in a convenient way there's hashes (see perldata). You could write

    %hash = (
        America       => 'country',
        France        => 'country',
        'George Bush' => 'name',
    );

    and so on. Writing 'country' over and over is tiresome. It seems easier to key the hash with the replacement patterns and have anonymous arrays as values (see perlref):

    %hash = (
        country => [ 'USA', 'France', 'Austria', 'Israel' ],
        name    => [ 'George Bush', 'Marie LePen', 'Jörg Haider', 'Ehud Olmert' ],
        ...
    );

    Having set up that structure, let's go. Suppose we're reading from a file that is already open, with the filehandle FH ready for reading:

    while (defined(my $line = <FH>)) {           # read one line into $line
        while (my ($repl, $ary) = each %hash) {  # iterate over %hash
            foreach my $token (@$ary) {          # @$ary: dereference the array in $ary
                # substitute each token with its hash key;
                # \Q...\E keeps metacharacters in the token literal
                $line =~ s/\Q$token\E/<$repl>/g;
            }
        }
        print $line;                             # done
    }

    This is an example to start with; there are more and probably better ways to do it. Read up on the operators s///, m//, y// and tr// in perlop, and see perlretut and perlre for more about regular expressions.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      I think that your approach is better than the regexp one. Is it the best approach, or can we do anything better? I have millions of sentences and want to find common patterns of "informative" sentences this way.
        Any solution is quite fine until it has to scale. But that depends on the dataset and on the goals. You were talking of simple search and replace operations; now it's about finding interesting patterns via search operations through a huge dataset. This usually requires indexing of tokens / database-like operations / vectorizing terms.
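To make the "common patterns" goal concrete, here is a minimal sketch that counts how often each tagged sentence pattern occurs. It assumes the sentences have already had their tokens replaced by `<type>` markers, and that everything fits in memory; for millions of sentences that may well not hold, which is where the indexing and database-like approaches come in. `count_patterns` and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# count_patterns is a made-up helper: it takes already-tagged sentences
# and returns a hash ref mapping each distinct pattern to its frequency.
sub count_patterns {
    my @tagged = @_;
    my %count;
    $count{$_}++ for @tagged;
    return \%count;
}

# Hypothetical input: sentences after the <type> substitution has been applied.
my $count = count_patterns(
    'my name is <name> and I live in <country>',
    'my name is <name> and I live in <country>',
    'my name is <name> and I want to visit <country>',
);

# Most frequent patterns first; ties broken alphabetically.
for my $pattern ( sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count ) {
    printf "%d  %s\n", $count->{$pattern}, $pattern;
}
```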

        I begin to suspect an XY Problem... maybe you should use a search engine like Swish-E or Lucene.

        What are you really trying to do?

        --shmem

Re: Replacing Types
by ayrnieu (Beadle) on Sep 21, 2006 at 18:11 UTC

    First, you need to become a computational linguist.

    Articles on arXiv might help; or maybe you could go to college.

    I am not a Computational Linguist, but I would start out on this problem by supposing that I am only interested in nouns including famous personages, as in your two examples, and that I can therefore guess that adjacent capitalized words might be proper names to collect and check against (say) news.google, and check other words against a dictionary. Or maybe all of your sentences are so simple that you can more easily consider them as prolog-style predicates. Or maybe I have a small but difficult corpus and can better spend my time on making an interface nice enough that I can farm the markup out to bored humans, à la Amazon's Mechanical Turk.
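The "adjacent capitalized words" guess could be sketched roughly like this. It is only a heuristic under simple assumptions (ASCII letters, names of two or more words), and it will produce false positives such as capitalized words at the start of a sentence, so the candidates would still need checking against some external source. `candidate_names` is a made-up helper name.

```perl
use strict;
use warnings;

# candidate_names is a made-up helper: it returns every run of two or
# more adjacent capitalized words as a proper-name candidate.
sub candidate_names {
    my ($sentence) = @_;
    return $sentence =~ /\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b/g;
}

print "$_\n" for candidate_names('my name is George Bush and I live in America');
```

Note that a single capitalized word such as "America" is not picked up by this pattern; that case would fall to the dictionary check.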

    Or maybe you'll find a nice module under Lingua::EN

Re: Replacing Types
by cephas (Pilgrim) on Sep 21, 2006 at 17:35 UTC
    I would look into one of the myriad of templating systems. My preference is generally for Text::Template.
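For the opposite direction (filling `<name>` and `<country>` back in), a full templating module is probably the more robust choice. As a rough core-only illustration of the idea, and explicitly not the Text::Template API, a single substitution with a lookup hash could look like this; `fill_template` is a made-up helper name.

```perl
use strict;
use warnings;

# fill_template is a made-up helper, not part of any templating module.
sub fill_template {
    my ($template, $values) = @_;
    # Replace <key> only when the key is known; leave other placeholders alone.
    $template =~ s/<(\w+)>/exists $values->{$1} ? $values->{$1} : "<$1>"/ge;
    return $template;
}

print fill_template(
    'my name is <name> and I live in <country>',
    { name => 'George Bush', country => 'America' },
), "\n";
```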