
I have code that searches for words from a list in a large corpus of tokenised sentences and then assigns a unique ID to those words if it finds them. I would like to upgrade this code to also match multi-word units in the corpus.

My tag set is a simple two-column, tab-separated file. The first column holds the word (or multi-word unit) to find and the second column holds the tag to assign to it:

udebe    <ZUL-SIL-0016-n>
ulimi    <ZUL-SIL-0017-n>
izinyo    <ZUL-SIL-0018-n>
izinyo lomhlathi    <ZUL-SIL-0019-n>
ingemuva lomqala    <ZUL-SIL-0024-n>
umphimbo    <ZUL-SIL-0025-n>

The output I require is also a text file and looks like this (produced with the current code below):

Lokho akusoze <ZUL-SIL-1364-b> kukwenze isilomo .
Ukuzihlola amabele <ZUL-SIL-1234-n> kungahlenga impilo <ZUL-SIL-0238-n> yakho .
Amakhala agxiza amafinyila <ZUL-SIL-0095-n> .
Gcoba <ZUL-SIL-1484-v> amafutha <ZUL-SIL-0572-n> kuwo wonke amabheringi .
Sebenzisa amafutha <ZUL-SIL-0572-n> afanelekile .
Zama <ZUL-SIL-0296-n> ukugwema ukudla <ZUL-SIL-0569-n> okuncinca amafutha <ZUL-SIL-0572-n> .

My code currently looks like this:


use strict;
use warnings;
use feature 'say';   # needed for say() below

my $corpusname = "GoldStandardCorpus.Original.MG.2022-11-10";

my %words2ids; 

open my $lemmas, "<", $corpusname.".tagset.txt" or die $!;
while (my $line = <$lemmas>) {
  chomp($line);
  my ($word, $id) = split "\t", $line;
  $words2ids{ lc($word) } = $id;
}

my %freq;
open my $output, ">", $corpusname.".possible-annotation.txt" or die $!;
open my $corpus, "<", $corpusname.".txt" or die $!;

while (my $line = <$corpus>) {
  chomp($line);
  my @tokens = split ' ', $line;
  foreach my $token (@tokens) {
    my $lct = lc $token;
    if (my $id = $words2ids{ $lct }) { 
      $freq{$lct}++;     
      $token .= " $id";    
    }
  }

  say { $output } "@tokens"; 
}

open my $notfound, ">", $corpusname.".tags-not-found.txt" or die $!;
foreach my $word (sort keys(%words2ids)) {
  next if exists $freq{$word}; 
  say { $notfound } "$word\t$words2ids{$word}";
}

Any suggestions would be greatly appreciated! I am thinking some sort of sliding window to search for strings of words, but have no idea how to implement this. Thank you!
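
To make the sliding-window idea concrete, here is a minimal sketch of a longest-match-first window, not a tested solution for the real corpus. The sub name tag_tokens, the $max_len variable and the demo token list are hypothetical; the tag set entries are taken from the sample above. It assumes the keys of %words2ids are lower-cased units whose words are separated by single spaces.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: tag the tokens of one sentence, preferring the
# longest multi-word unit that starts at each position.
#   $tokens    : arrayref of tokens for one sentence
#   $words2ids : hashref, lower-cased unit (single-space separated) => tag
#   $max_len   : number of words in the longest unit of the tag set
#   $freq      : hashref, incremented for every unit that is found
sub tag_tokens {
    my ($tokens, $words2ids, $max_len, $freq) = @_;
    my @out;
    my $i = 0;
    while ($i < @$tokens) {
        my $matched = 0;
        # Largest window that still fits in the rest of the sentence.
        my $longest = @$tokens - $i;
        $longest = $max_len if $longest > $max_len;
        # Try the longest window first, then shrink it word by word.
        for (my $len = $longest; $len >= 1; $len--) {
            my @window = @{$tokens}[ $i .. $i + $len - 1 ];
            my $key    = lc join ' ', @window;
            next unless exists $words2ids->{$key};
            $freq->{$key}++;
            push @out, @window, $words2ids->{$key};
            $i += $len;        # skip past the whole matched unit
            $matched = 1;
            last;
        }
        # No unit starts here: copy the token through unchanged.
        push @out, $tokens->[ $i++ ] unless $matched;
    }
    return @out;
}

# Entries copied from the sample tag set.
my %words2ids = (
    'izinyo'           => '<ZUL-SIL-0018-n>',
    'izinyo lomhlathi' => '<ZUL-SIL-0019-n>',
    'ulimi'            => '<ZUL-SIL-0017-n>',
);

# Length (in words) of the longest unit, so the window never grows past it.
my $max_len = 1;
for my $unit (keys %words2ids) {
    my @parts = split ' ', $unit;
    $max_len = @parts if @parts > $max_len;
}

# Artificial token list (not a real corpus sentence), for illustration only.
my %freq;
my @tokens = split ' ', 'Izinyo lomhlathi , izinyo , ulimi .';
say join ' ', tag_tokens(\@tokens, \%words2ids, $max_len, \%freq);
# prints: Izinyo lomhlathi <ZUL-SIL-0019-n> , izinyo <ZUL-SIL-0018-n> , ulimi <ZUL-SIL-0017-n> .
```

In the existing script, the inner foreach over @tokens could be replaced by a call like this, with $max_len computed once while reading the tag set. Because the longest window is tried first, "izinyo lomhlathi" receives only its own tag and the embedded "izinyo" is not tagged separately; whether that is the desired behaviour for overlapping units is a design choice. If the tag set may contain irregular spacing inside a unit, normalising the key when loading (for example with join ' ', split ' ', lc $word) would keep the lookup consistent.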

In reply to Finding multiword units in a corpus by veg_running
