I have code that searches for words from a list in a large corpus of tokenised sentences and then assigns a unique ID to those words if it finds them. I would like to upgrade this code to also match multi-word units in the corpus.

My tag set is a simple two-column, tab-separated file. The first column holds the word (or multi-word unit) to find, and the second column holds the tag to assign to it:

udebe	<ZUL-SIL-0016-n>
ulimi	<ZUL-SIL-0017-n>
izinyo	<ZUL-SIL-0018-n>
izinyo lomhlathi	<ZUL-SIL-0019-n>
ingemuva lomqala	<ZUL-SIL-0024-n>
umphimbo	<ZUL-SIL-0025-n>

The output I require is also a text file and looks like this (produced with the current code below):

Lokho akusoze <ZUL-SIL-1364-b> kukwenze isilomo .
Ukuzihlola amabele <ZUL-SIL-1234-n> kungahlenga impilo <ZUL-SIL-0238-n> yakho .
Amakhala agxiza amafinyila <ZUL-SIL-0095-n> .
Gcoba <ZUL-SIL-1484-v> amafutha <ZUL-SIL-0572-n> kuwo wonke amabheringi .
Sebenzisa amafutha <ZUL-SIL-0572-n> afanelekile .
Zama <ZUL-SIL-0296-n> ukugwema ukudla <ZUL-SIL-0569-n> okuncinca amafutha <ZUL-SIL-0572-n> .

My code currently looks like this:

use strict;
use warnings;
use feature 'say';    # required for say(); it is not enabled by default

my $corpusname = "GoldStandardCorpus.Original.MG.2022-11-10";

# Load the tag set: lower-cased word => tag ID
my %words2ids;
open my $lemmas, "<", $corpusname.".tagset.txt" or die $!;
while (my $line = <$lemmas>) {
    chomp($line);
    my ($word, $id) = split "\t", $line;
    $words2ids{ lc($word) } = $id;
}

# Annotate the corpus token by token, counting each tag-set word found
my %freq;
open my $output, ">", $corpusname.".possible-annotation.txt" or die $!;
open my $corpus, "<", $corpusname.".txt" or die $!;
while (my $line = <$corpus>) {
    chomp($line);
    my @tokens = split ' ', $line;
    foreach my $token (@tokens) {
        my $lct = lc $token;
        if (my $id = $words2ids{ $lct }) {
            $freq{$lct}++;
            $token .= " $id";    # $token aliases the array element
        }
    }
    say { $output } "@tokens";
}

# Report tag-set entries that never occurred in the corpus
open my $notfound, ">", $corpusname.".tags-not-found.txt" or die $!;
foreach my $word (sort keys(%words2ids)) {
    next if exists $freq{$word};
    say { $notfound } "$word\t$words2ids{$word}";
}

Any suggestions would be greatly appreciated! I am thinking of some sort of sliding window that searches for sequences of words, but I have no idea how to implement this. Thank you!
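For illustration, the sliding-window idea could be sketched roughly as below. This is only a minimal sketch under several assumptions, not a tested solution for the real corpus: the sub names build_lexicon and annotate are hypothetical, the longest matching unit is assumed to win over any shorter one that starts at the same token, and the tag is assumed to go after the last token of the matched unit, by analogy with the single-word output shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: turn tab-separated tag set lines into a lookup hash
# and record the length (in tokens) of the longest entry.
sub build_lexicon {
    my @lines = @_;
    my %ids;
    my $max_len = 1;
    for my $line (@lines) {
        my ($unit, $id) = split /\t/, $line;
        next unless defined $id;
        my @parts = split ' ', lc $unit;
        next unless @parts;
        $ids{"@parts"} = $id;    # key: lower-cased tokens joined by one space
        $max_len = @parts if @parts > $max_len;
    }
    return (\%ids, $max_len);
}

# Hypothetical helper: annotate one tokenised sentence. At each position the
# widest window is tried first and shrunk until a tag set entry matches.
sub annotate {
    my ($ids, $max_len, $line, $freq) = @_;
    my @tokens = split ' ', $line;
    my @out;
    my $i = 0;
    while ($i < @tokens) {
        my $remaining = @tokens - $i;
        my $longest   = $max_len < $remaining ? $max_len : $remaining;
        my $matched   = 0;
        for (my $n = $longest; $n >= 1; $n--) {
            my @window = @tokens[ $i .. $i + $n - 1 ];
            my $key    = lc(join ' ', @window);
            if (my $id = $ids->{$key}) {
                $freq->{$key}++ if $freq;    # optional frequency count
                push @out, @window, $id;     # assumption: tag follows the unit
                $i += $n;
                $matched = 1;
                last;
            }
        }
        push @out, $tokens[ $i++ ] unless $matched;
    }
    return "@out";
}
```

In the existing script, the inner foreach over @tokens would presumably be replaced by a single call such as say { $output } annotate($ids, $max_len, $line, \%freq); with the tag set lines passed to build_lexicon once at start-up. Because the window shrinks from the longest entry downwards, "izinyo lomhlathi" would receive <ZUL-SIL-0019-n> rather than "izinyo" alone receiving <ZUL-SIL-0018-n>.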


In reply to Finding multiword units in a corpus by veg_running
