Replacing Types

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Replacing Types by mreece (Friar) on Sep 21, 2006 at 18:13 UTC
this is a very confusing question. do you want to turn "my name is George Bush and I live in America" into "my name is <name> and I live in <country>", or the other way around?

assuming the former, one way would be to first build up a regular expression for each 'type' you want to discover, then loop through them and perform substitutions on your sentence. consider something along these lines:

    use strict;
    use warnings;

    ## some sample 'types' and potential values
    my %types = (
        name    => [ 'George Bush', 'Osama Bin Laden', ],
        country => [ 'America', 'Your Pants', ],
    );

    ## turn those arrays into foo|bar|baz (quotemeta to deal with potential U.S.A.)
    my %regexes = map { $_ => join( '|', map quotemeta, @{ $types{$_} } ) } keys %types;

    ## some sample sentences
    my @sentences = (
        'my name is George Bush and I live in America',
        'my name is Osama Bin Laden and I want to visit Your Pants',
    );

    ## discover 'types' in each sentence
    foreach my $sentence ( @sentences ) {
        print "before: $sentence\n";
        while ( my ($type, $pattern) = each %regexes ) {
            $sentence =~ s/$pattern/<$type>/g;
        }
        print " after: $sentence\n\n";
    }

produces:

    before: my name is George Bush and I live in America
     after: my name is <name> and I live in <country>

    before: my name is Osama Bin Laden and I want to visit Your Pants
     after: my name is <name> and I want to visit <country>

if you are trying to go the other way, then consider one of the many templating packages available.

(update: and yeah, if you don't have a pre-defined list of names and countries you are searching for, this is a much different problem, as ayrnieu references above.)
Re^2: Replacing Types by Anonymous Monk on Sep 21, 2006 at 19:12 UTC
Thank you. This is exactly what I want. Are you sure that building it this way would be 'optimal' for 1000s of types? I would like to generate common patterns of informative sentences from millions of sentences.
Re^3: Replacing Types by mreece (Friar) on Sep 21, 2006 at 20:26 UTC
i am not at all sure it is the optimal solution, just one way to do it. it could surely be optimized for speed and/or memory consumption, typically one at the expense of the other. good luck! :-)
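One possible way to trade a little memory for speed with thousands of types, sketched under assumptions the thread does not state: invert the table into a token-to-type lookup and compile all tokens into a single alternation, so each sentence is scanned once instead of once per type. The names `%type_of`, `$matcher` and `tag_types` are made up for illustration, and sorting the tokens longest-first is an assumption so that a longer token wins over a shorter one that is its prefix.

```perl
use strict;
use warnings;

# Sample 'types' as in the example above (illustrative data only).
my %types = (
    name    => [ 'George Bush', 'Osama Bin Laden' ],
    country => [ 'America', 'Your Pants' ],
);

# Invert to token => type so a matched token can be looked up directly.
my %type_of;
for my $type ( keys %types ) {
    $type_of{$_} = $type for @{ $types{$type} };
}

# One alternation over every token, longest first, compiled once with qr//.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } keys %type_of;
my $matcher = qr/($alternation)/;

# Hypothetical helper: replace every known token with <type> in one pass.
sub tag_types {
    my ($sentence) = @_;
    $sentence =~ s/$matcher/<$type_of{$1}>/g;
    return $sentence;
}

print tag_types('my name is George Bush and I live in America'), "\n";
```

Whether this is actually faster than the per-type loop would need to be measured on the real data; it also assumes no token belongs to two types.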
Re: Replacing Types by shmem (Chancellor) on Sep 21, 2006 at 18:18 UTC
Go read How do I post a question effectively? first. Have you? Homework? Hm.

Next, more accurately the sentence should read "my name is <name> and I live in <continent>", 'cause that Bush bloke doesn't live in e.g. Venezuela.

Read up on s/// for `sub/sti/tute/`. You have a list of patterns, and a list of replacements. To tie them together in a convenient way there are hashes (see perldata). You could write

    %hash = (
        America       => 'country',
        France        => 'country',
        'George Bush' => 'name',
    );

and so on. Writing 'country' over and over is tiresome. It seems easier to key the hash with the replacements and have anonymous arrays of patterns as values (see perlref):

    %hash = (
        country => [ 'USA', 'France', 'Austria', 'Israel' ],
        name    => [ 'George Bush', 'Marie LePen', 'Jörg Haider', 'Ehud Olmert' ],
        ...
    );

Having set up that structure, let's go. Suppose we're reading from a file, and have it open with the filehandle FH ready for reading:

    while (defined(my $line = <FH>)) {            # read one line into $line
        while (my ($repl, $ary) = each %hash) {   # iterate over %hash
            foreach my $token (@$ary) {           # @$ary: dereference the array in $ary
                $line =~ s/$token/<$repl>/g;      # substitute each token with hash key
            }
        }
        print $line;                              # done
    }

This is an example to start with; there are more and probably better ways to do it. Read up on the operators s///, m//, y/// and tr/// in perlop, and see perlretut and perlre for more about regular expressions.

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
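The loop above interpolates each token into the pattern as-is, so a token such as 'U.S.A.' would have its dots treated as wildcards, and 'USA' would also match inside a longer word. A minimal sketch of one way to guard against both, assuming every token begins and ends with a word character (otherwise `\b` will not behave as intended); `replace_tokens` is a hypothetical helper name, and the sample data is illustrative only:

```perl
use strict;
use warnings;

my %hash = (
    country => [ 'USA', 'France' ],
    name    => [ 'George Bush' ],
);

# Hypothetical helper: the same nested loops, but with \Q...\E to quote
# regex metacharacters and \b to require word boundaries around the token.
sub replace_tokens {
    my ($line) = @_;
    for my $repl ( sort keys %hash ) {            # sorted for a stable order
        for my $token ( @{ $hash{$repl} } ) {
            $line =~ s/\b\Q$token\E\b/<$repl>/g;
        }
    }
    return $line;
}

print replace_tokens("George Bush flew from France on a USAF plane\n");
```

With the boundaries in place, "USAF" is left alone even though it contains "USA".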
Re^2: Replacing Types by Anonymous Monk on Sep 21, 2006 at 19:21 UTC
I think that your approach is better than the regexp one. Is it the best approach, or can we do anything better? I have millions of sentences and want to find common patterns of "informative" sentences this way.
Re^3: Replacing Types by shmem (Chancellor) on Sep 21, 2006 at 20:38 UTC
Any solution is quite fine until it has to scale. But that depends on the dataset and on the goals. You were talking of simple search and replace operations; now it's about finding interesting patterns via search operations through a huge dataset. This usually requires indexing of tokens / database-like operations / vectorizing terms.

I begin to suspect an XY Problem... maybe you should use a search engine like Swish-E or Lucene. What are you really trying to do?

--shmem
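If the goal really is to see which generalized sentence patterns are common, one simple starting point is to count how often each tagged sentence occurs once the tokens have been replaced. This is only a guess at what is being asked, and it is nowhere near a search engine or a token index. The sketch assumes the sentences have already been tagged; `count_patterns` is a hypothetical name and the input is illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: tally identical tagged sentences, return a hash ref.
sub count_patterns {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Already-tagged sample input (illustrative only).
my @tagged = (
    'my name is <name> and I live in <country>',
    'my name is <name> and I want to visit <country>',
    'my name is <name> and I live in <country>',
);

my $count = count_patterns(@tagged);

# Most frequent patterns first; ties broken alphabetically.
for my $pattern ( sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count ) {
    printf "%d\t%s\n", $count->{$pattern}, $pattern;
}
```

The hash grows with the number of distinct patterns rather than the number of sentences, so the same tally could be kept while streaming lines from a file instead of holding them all in an array.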
Re: Replacing Types by ayrnieu (Beadle) on Sep 21, 2006 at 18:11 UTC
First, you need to become a computational linguist. Articles on arXiv might help; or maybe you could go to college.

I am not a Computational Linguist, but I would start out on this problem by supposing that I am only interested in nouns including famous personages, as in your two examples, and that I can therefore guess that adjacent capitalized words might be proper names to collect and check against (say) news.google, and check other words against a dictionary. Or maybe all of your sentences are that simple, and you can more easily consider them as Prolog-style predicates. Or maybe I have a small but difficult corpus and can better spend my time on making an interface nice enough that I can farm the markup out to bored humans, à la Amazon's Mechanical Turk. Or maybe you'll find a nice module under Lingua::EN.
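The "adjacent capitalized words" guess can be sketched with a regex. It is only a rough heuristic, assuming plain ASCII English text: it requires at least two capitalized words in a row, so it misses single-word names such as "America" and will also pick up capitalized phrases that are not names. `proper_name_candidates` is a hypothetical name; checking the candidates against a news source or a dictionary is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: return runs of two or more adjacent capitalized words.
sub proper_name_candidates {
    my ($sentence) = @_;
    return $sentence =~ /\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b/g;
}

my @found = proper_name_candidates(
    'my name is Osama Bin Laden and I want to visit Your Pants'
);
print "$_\n" for @found;    # prints "Osama Bin Laden" and "Your Pants"
```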
Re: Replacing Types by cephas (Pilgrim) on Sep 21, 2006 at 17:35 UTC
I would look into one of the myriad templating systems. My preference is generally for Text::Template.
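For the simple `<name>` / `<country>` placeholders used in this thread, going the other way (filling values back in) can also be sketched without a module. The following is a hand-rolled stand-in and not Text::Template's API; `fill_template` and the `<word>` placeholder convention are assumptions for illustration, and placeholders with no matching value are left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: replace <key> with $values->{key} when the key exists.
sub fill_template {
    my ( $template, $values ) = @_;
    $template =~ s/<(\w+)>/exists $values->{$1} ? $values->{$1} : "<$1>"/ge;
    return $template;
}

print fill_template(
    'my name is <name> and I live in <country>',
    { name => 'George Bush', country => 'America' },
), "\n";
```

A real templating module earns its keep once templates need escaping, loops or conditionals, which this sketch does not attempt.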