revdiablo has asked for the wisdom of the Perl Monks concerning the following question:

Hi there. I have a toy project I'm working on to do some basic stats on plain text. Obviously I need to tokenize the text so I can count words and whatnot. I have come up with two approaches: one uses s/// most of the time, the other uses map, split, foreach, and friends. The version using s/// benchmarks about 78% faster, but that's not really my question. I just wonder if anyone has ideas on how to make either of these cleaner, simpler, or otherwise better.

Without further ado, here are the subroutines in question:

First, the version that uses list ops:

sub tokenize_msg_w_lists {
    my ($msg) = @_;

    # define constituent characters.
    my $con = 'A-Za-z0-9\'\$!,.-';

    # first pass: split tokens on any non-constituent characters.
    my @words = map {s/[,.]$//; $_} split /[^$con]+/, $msg;

    # second pass: split on , or . that aren't surrounded by digits.
    my %words;
    foreach (@words) {
        $words{$_}++ foreach split /(?<=\D)[,.](?=\D)/, $_;
    }

    return keys %words;
}

And here is the version that uses string ops:

sub tokenize_msg_w_strings {
    my ($msg) = @_;

    # define constituent characters.
    my $con = 'A-Za-z0-9\'\$!,.-';

    # separate tokens with % for later splittage
    $msg =~ s/[^$con]+/%/g;             # replace any non-cons with %
    $msg =~ s/%[,.]/%/g;                # delete [,.] at the front of a token
    $msg =~ s/[,.]%/%/g;                # delete [,.] at the back of a token
    $msg =~ s/^[,.]//g;                 # delete [,.] at the start of the line
    $msg =~ s/[,.]$//g;                 # delete [,.] at the end of the line
    $msg =~ s/(?<=\D)[,.](?=\D)/%/g;    # split [,.] not surrounded by digits

    my %words = map {$_=>1} split /%/, $msg;
    return keys %words;
}

In case you haven't gathered the requirements by looking at the code, I basically want to split on any characters other than alphanumeric and ['$!,.-]. And I only want to keep , and . if they're surrounded by digits. So if my text message is "This, is, an, example. Keep $2.50, 1,500, and 192.168.1.1." I want it to be tokenized as follows: This | is | an | example | Keep | $2.50 | 1,500 | and | 192.168.1.1
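Since keys %words come back in no particular order, a quick sanity check (my own sketch, calling the _w_strings sub above; the sort is only there to make the output stable) might look like:

```perl
# Hypothetical driver for the tokenizers above.
my $msg = 'This, is, an, example. Keep $2.50, 1,500, and 192.168.1.1.';
print join ' | ', sort { lc($a) cmp lc($b) } tokenize_msg_w_strings($msg);
```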

Re: tokenize plain text messages
by BrowserUk (Patriarch) on May 10, 2003 at 00:15 UTC

    If you don't find the Regexp::Common::Revdiablo :) module, this might give you a starting place. It's not tested much beyond what you see, and I think it could be simplified.

    $s = 'This, is, an, example. Keep $2.50, 1,500, and 192.168.1.1.';

    $re_revdiablo = qr[(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];

    print join ' | ', split $re_revdiablo, $s;

    This | is | an | example | Keep | $2.50 | 1,500 | and | 192.168.1.1

    I tried to use the /x modifier to break up the density of the regex, but that doesn't seem to work with split?

    Update: I'm talking crap. /x does work with split provided you don't put spaces between the \ and the character it is escaping. D'oh!

    $re_revdiablo = qr[
        (?:                        # group, no capture
            [^\w\'\$!,.-]          # on anything not in your list
            | (?: (?<= \D ) [.,] ) # or . or , if preceded by a non-numeric
            | (?: [.,] (?= \D | $) # or . or , if followed by a non-numeric or EOL
              )
        )+                         # 1 or more
    ]x;

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      That's a very nice... glad to see a way I can Do It All (tm) with one regex. The only problem is it benchmarks about 53% slower than my _w_strings sub. (Also, each one gives very slightly different results with a more complicated test message. ugh.)

      Here's how I used your regex in a subroutine:

      sub tokenize_msg_w_oneregex {
          my ($msg) = @_;

          my $re = qr{(?:[^\w\'\$!,.-]+|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+};

          my %words = map {$_=>1} split $re, $msg;
          return keys %words;
      }

      And here's the cmpthese output:

      Benchmark: timing 10000 iterations of Lists, One Regex, Strings...
           Lists:  4 wallclock secs ( 4.15 usr +  0.00 sys =  4.15 CPU) @ 2409.64/s (n=10000)
       One Regex:  4 wallclock secs ( 3.56 usr +  0.00 sys =  3.56 CPU) @ 2808.99/s (n=10000)
         Strings:  2 wallclock secs ( 2.33 usr +  0.00 sys =  2.33 CPU) @ 4291.85/s (n=10000)

                    Rate     Lists One Regex   Strings
      Lists       2410/s        --      -14%      -44%
      One Regex   2809/s       17%        --      -35%
      Strings     4292/s       78%       53%        --

        Try this version of your ....oneregex sub. (Phew! Long names:).

        D:\Perl\test>257026
        192.168.1.1 | $2.50 | Keep | example | and | is | This | 1,500 | an

                   Rate   lists strings   regex
        lists     757/s      --    -56%    -61%
        strings  1736/s    129%      --    -10%
        regex    1930/s    155%     11%      --

        As the regex never changes, there is no need to recompile it every time you call the sub, so I made it a constant. I would usually put the use constant inside the sub where it is used, to keep things tidy, but I got beaten up for that because it implies that the constant is lexically scoped, which it isn't. It doesn't fool me, but it is your choice.

        use constant RE_WORDS => qr[(?:[^\w\'\$!,.-]|(?:(?<=\D)[.,])|(?:[.,](?=\D|$)))+];

        sub tokenize_msg_w_oneregex {
            my %words;
            @words{ split RE_WORDS, shift } = ();
            return keys %words;
        }

        To be fair, a large part of the savings is avoiding the map and initialising every hash element to 1 when you will never use the value. Doing the split inside a hash slice avoids this. You can easily feed this saving back into your other subs which would probably make your strings sub quickest again. But I thought I'd leave that AAEFTR:)
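        For what it's worth, feeding that hash-slice trick back into the strings sub might look something like this (an untested sketch; the substitutions are unchanged, only the final map is replaced):

```perl
sub tokenize_msg_w_strings_slice {
    my ($msg) = @_;
    my $con = 'A-Za-z0-9\'\$!,.-';

    $msg =~ s/[^$con]+/%/g;             # replace any non-cons with %
    $msg =~ s/%[,.]/%/g;                # delete [,.] at the front of a token
    $msg =~ s/[,.]%/%/g;                # delete [,.] at the back of a token
    $msg =~ s/^[,.]//g;                 # delete [,.] at the start of the line
    $msg =~ s/[,.]$//g;                 # delete [,.] at the end of the line
    $msg =~ s/(?<=\D)[,.](?=\D)/%/g;    # split [,.] not surrounded by digits

    my %words;
    @words{ split /%/, $msg } = ();     # hash slice: no throwaway values
    return keys %words;
}
```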

Re: tokenize plain text messages
by Enlil (Parson) on May 10, 2003 at 01:41 UTC
    Here is my take on your problem.
    use strict;
    use warnings;

    $_ = 'This,is, an, example. Keep $2.50, 1,500, and 192.168.1.1.';
    my @words = tokenize($_);
    print join "|", @words;

    ######################################
    sub tokenize
    ######################################
    {
        my $msg = shift;
        my $ntd = qr/(?<=\D)[,.]/;
        my $dtn = qr/[,.](?=\D|$)/;
        my $nv  = qr/[^A-Za-z0-9\'\$!-.,]+/;
        my %words;
        my @words = grep { !/^$/ and !$words{lc($_)}++ } split /$ntd|$dtn|$nv/, $msg;
        return @words;
    } ## tokenize
    Brief explanation: split treats anything matching the supplied pattern as a delimiter, and we have supplied three alternatives (three kinds of delimiter).
    • The first pattern splits at any comma or period that is preceded by a non-digit.
    • The second pattern splits at any comma or period that is followed by a non-digit or by the end of the line.
    • The third pattern splits on any run of characters that you did not deem valid.
    split then passes the resulting list to grep, which evaluates every member of the list (in this case checking that the item is non-empty and that its lowercase form has not already been seen) and returns only those values that pass.

    Update #1: Another version based on the same regex as above. (I think it looks cleaner):

    ###############################
    sub token_2
    ###############################
    {
        my $msg = $string;
        my $ntd = qr/(?<=\D)[,.]|[,.](?=\D|$)|[^\w'\$!,.-]+/;
        our %words;
        @words{ (split /$ntd/, $msg) } = ();
        return keys %words;
    } ## token_2

    update #2:*sigh* I just noticed my second solution looks much like BrowserUK's solution above.

    update #3: One more try using m/()/g this time:

    ################################
    sub take_3
    ################################
    {
        my $msg = $string;
        my %words;
        @words{ $string =~ m/(
                    (?:
                        (?: [\w'\$!-] | (?<=\d)[.,](?=\d) )
                    )+
                )/gx } = ();
        return keys %words;
    } ## take_3
    -enlil
Re: tokenize plain text messages
by rinceWind (Monsignor) on May 10, 2003 at 12:07 UTC
    When I want to tokenize something these days, I tend not to use split, but use positive matching as in /.../g. Try this for size:
    sub tokenize_msg_w_lists {
        my ($msg) = @_;

        # define constituent characters.
        my $con = 'A-Za-z0-9\'\$!,.-';

        # in list context, the /g match returns every captured token.
        $msg =~ /([$con]+?)[,.]?(?![$con])/g;
    }

      Thanks for pointing out an alternate way to do this. I was not previously aware of the match-with-/g-in-list-context behavior, but it looks very useful.

      Note: BrowserUk did show an example of this in our thread above, but I didn't really notice it until now. I guess it got lost in the noise (of my brain). :)
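      For anyone else who hasn't met it, here is a minimal standalone illustration (my own example, not from the thread) of that list-context /g behaviour: every capture from every match comes back as one flat list.

```perl
my $text = 'alpha, beta; gamma';
my @tokens = $text =~ /(\w+)/g;   # list context: collect all captures
print "@tokens\n";                # prints: alpha beta gamma
```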

        Actually, the most interesting behaviour of /PAT/g is not in list context but in scalar context. In that case, the next (!) regexp can continue where the previous one left off, using the \G anchor. Conceptually it is like the ^ anchor, which anchors at the beginning of the string, except that \G anchors at the current value of pos() for the string, which is the end of the previous match. Also check out the /c modifier, which prevents the pos pointer from being reset when a match fails, as happens by default. So typically, such a lexer could look like this:
        $_ = 'And the man said: "Let there be music!"';
        while (1) {
            /\G\s*(?=\S)/gc or last;
            if    (/\G(\w+)/gc)      { print "Found a word: $1\n"; }
            elsif (/\G(['"])/gc)     { print "Found a quote: $1\n"; }
            elsif (/\G([.,;:!?])/gc) { print "Found punctuation: $1\n"; }
            else {
                /\G(?=(\S+))/gc;
                die sprintf "Don't know what to do with what I found next: %s (position %d)",
                    $1, pos;
            }
        }
        print "Parsing completed successfully.\n";
        Result:
        Found a word: And
        Found a word: the
        Found a word: man
        Found a word: said
        Found punctuation: :
        Found a quote: "
        Found a word: Let
        Found a word: there
        Found a word: be
        Found a word: music
        Found punctuation: !
        Found a quote: "
        Parsing completed successfully.
        
        Try inserting something unrecognizable into your string, like an "=" character, for example.

        In addition, I'd like to point out that there is at least one Lex module on CPAN: Parse::Lex. However, I am not familiar with how well it works.

Re: tokenize plain text messages
by artist (Parson) on May 10, 2003 at 01:31 UTC