
Here's an example of counting defined sets of "words" (which can be tricky to define) based on the technique described in the Building Regex Alternations Dynamically article by haukex. If you can figure out how to get the contents of your positive and negative word data files into the corresponding arrays (and if my notion of what you want is anywhere near what you actually want), you may be on your way.
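
For the file-reading step, here is a minimal sketch. It assumes one word per line, with blank lines and lines starting with # ignored; that file format and the helper name read_wordlist() are my assumptions, not something from your post. An in-memory filehandle stands in for a real file so the snippet runs as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words found on an open filehandle.
# Assumes one word per line; skips blank lines and '#' comment lines.
sub read_wordlist {
    my ($fh) = @_;
    my @words;
    while (my $line = <$fh>) {
        $line =~ s/\A\s+|\s+\z//g;    # trim leading/trailing whitespace, incl. newline
        next if $line eq '' or $line =~ /\A#/;
        push @words, $line;
    }
    return @words;
}

# Demo data in place of a real file. With a real (hypothetical) file:
#   open my $fh, '<', 'positive.txt' or die "Cannot open positive.txt: $!";
my $data = "nation\nconceived\n\n# a comment\n  liberty \n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @positive = read_wordlist($fh);
close $fh;

print join(',', @positive), "\n";    # prints nation,conceived,liberty
```

The same helper would be called a second time for the negative list.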

Note that the code is set up for case-insensitive matching and counting: the negative word "fourscore" matches "FoUrScOrE" in the example sentence, and so on. Note, again, that the concept of a "word" can be slippery, so the use of the \b boundary assertion, among other details, may not be appropriate.
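
To illustrate the \b caveat, here is a small sketch comparing \b with a stricter delimiter that also treats apostrophes and hyphens as part of a word. The stricter rule is only one possible definition of a "word", chosen for illustration; the sample text is made up.

```perl
use strict;
use warnings;

my $text = q{A nation, the nation's flag, and an anti-nation group.};

# \b sees a boundary next to "'" and "-", so all three occurrences match.
my @loose = $text =~ m{ \b nation \b }xmsgi;

# Assumed stricter rule: no word character, apostrophe or hyphen may
# touch the word on either side.
my @strict = $text =~ m{ (?<! [\w'-] ) nation (?! [\w'-] ) }xmsgi;

print scalar(@loose),  "\n";    # prints 3
print scalar(@strict), "\n";    # prints 1
```

Which behaviour is right depends entirely on whether "nation's" and "anti-nation" should count as occurrences of "nation" in your data.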

c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my @positive = qw(nation conceived liberty created equal foo);
 my @negative = qw(fourscore SEVEN fOrTh fathers continent bar);
 ;;
 my $sentence = 'FoUrScOrE and seven years ago '
              . 'our fathers brought forth, on this continent, '
              . 'a new nation, conceived in liberty, and dedicated '
              . 'to the proposition that all men are created equal. '
              . 'Repeat seven nation fathers nation.'
              ;
 ;;
 my %pos = map { lc($_) => 0 } @positive;
 my $rx_pos = make_regex(\%pos);
 print 'for debug: positive rx: ', $rx_pos;
 ;;
 my %neg = map { lc($_) => 0 } @negative;
 my $rx_neg = make_regex(\%neg);
 print 'for debug: negative rx: ', $rx_neg;
 ;;
 my %other;
 my $rx_undefined = qr{ (?! $rx_pos | $rx_neg) }xms;
 my $rx_word      = qr{ \b [[:alpha:]]+ \b }xms;
 ;;
 ++$pos  { lc $_ } for $sentence =~ m{ $rx_pos }xmsg;
 ++$neg  { lc $_ } for $sentence =~ m{ $rx_neg }xmsg;
 ++$other{ lc $_ } for $sentence =~ m{ $rx_undefined $rx_word }xmsg;
 ;;
 dd \%pos;
 dd \%neg;
 dd \%other;
 ;;
 ;;
 sub make_regex {
   my ($hr_wordlist) = @_;
   ;;
   my ($rx) =
     map  qr{ (?i) \b (?: $_) \b }xms,
     join '|',
     map  quotemeta,
     reverse sort
     keys %$hr_wordlist
     ;
   ;;
   return $rx;
   }
"
for debug: positive rx: (?msx-i: (?i) \b (?: nation|liberty|foo|equal|created|conceived) \b )
for debug: negative rx: (?msx-i: (?i) \b (?: seven|fourscore|forth|fathers|continent|bar) \b )
{ conceived => 1, created => 1, equal => 1, foo => 0, liberty => 1, nation => 3 }
{ bar => 0, continent => 1, fathers => 2, forth => 1, fourscore => 1, seven => 2 }
{
  a           => 1,
  ago         => 1,
  all         => 1,
  "and"       => 2,
  are         => 1,
  brought     => 1,
  dedicated   => 1,
  in          => 1,
  men         => 1,
  new         => 1,
  on          => 1,
  our         => 1,
  proposition => 1,
  repeat      => 1,
  that        => 1,
  the         => 1,
  this        => 1,
  to          => 1,
  years       => 1,
}

Update: In the make_regex() function, the lines
     reverse sort
     map  quotemeta,
were swapped (now fixed); they should be
     map  quotemeta,
     reverse sort
The chain is evaluated bottom-up, so with this order the sort runs first, i.e., sorting, either lexically or by length, should be done on the raw strings before the quotemeta step.
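
As a sketch of the by-length variant mentioned above: the helper below sorts the raw words longest-first, then applies quotemeta, then joins. The name make_alternation() and the sample words are mine; the \b anchors are left out on purpose so that the effect of the ordering is visible.

```perl
use strict;
use warnings;

# Hypothetical variant of make_regex(): sort raw words longest-first
# (ties broken alphabetically), then escape, then join.
sub make_alternation {
    my @words = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b }
        @words;
    return qr{ (?: $alt ) }xmsi;
}

my $rx = make_alternation(qw(foo foobar a.b));

# 'foobar' is tried before 'foo', and the '.' in 'a.b' is literal,
# so 'axb' does not match.
my @found = 'foobar axb a.b' =~ m{ ($rx) }xmsg;
print join(',', @found), "\n";    # prints foobar,a.b
```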

Give a man a fish: <%-{-{-{-<

In reply to Re^3: problem count the number of words (updated) by AnomalousMonk
in thread problem count the number of words by GHMON
