Welcome to the Monastery.

[Aside: Between first reading your post, and subsequently replying, I see you've changed the original. Putting data within <code> tags is good; however, you should indicate the update when doing so after posting. "How do I change/delete my post?" has more about that.]

I used the same tagset as you (term and tag are separated by a tab):

$ cat pm_11148202.tagset.txt
udebe   <ZUL-SIL-0016-n>
ulimi   <ZUL-SIL-0017-n>
izinyo  <ZUL-SIL-0018-n>
izinyo lomhlathi        <ZUL-SIL-0019-n>
ingemuva lomqala        <ZUL-SIL-0024-n>
umphimbo        <ZUL-SIL-0025-n>

You've shown some sample output — this is good; however, you've not shown the source from which that output is derived — this is less good. Also, I see no correlation between the "taglist" tags and the "output" tags. I made up my own sample input data:

$ cat pm_11148202.txt
Lokho udebe kukwenze isilomo.
Ukuzihlola izinyo kungahlenga izinyo lomhlathi yakho.
Amakhala agxiza amafinyila.
Ulimi amafutha ulimi wonke ULIMI amabheringi.
Sebenzisa amafutha ulimi.
Zama ukugwema ukudla okuncinca udebe.

I then ran this code:

#!/usr/bin/env perl

use 5.016;
use warnings;
use autodie;

my $corpusname = 'pm_11148202';
my %words2ids;

{
    open my $fh, '<', "$corpusname.tagset.txt";

    while (<$fh>) {
        chomp;
        my ($text, $token) = split /\t/;
        $words2ids{fc $text} = $token;
    }
}

my $alt = join '|', sort {
    length($b) <=> length($a)
} map fc, keys %words2ids;
my $re = qr{(?i:($alt))};
my %found;

{
    open my $in_fh, '<', "$corpusname.txt";
    open my $out_fh, '>', "$corpusname.possible-annotation.txt";

    while (<$in_fh>) {
        s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
        print $out_fh $_;
    }
}

delete @words2ids{keys %found};

{
    open my $fh, '>', "$corpusname.tags-not-found.txt";

    for (sort keys %words2ids) {
        say $fh "$_\t$words2ids{$_}";
    }
}

This produces

$ cat pm_11148202.possible-annotation.txt
Lokho udebe <ZUL-SIL-0016-n> kukwenze isilomo.
Ukuzihlola izinyo <ZUL-SIL-0018-n> kungahlenga izinyo lomhlathi <ZUL-SIL-0019-n> yakho.
Amakhala agxiza amafinyila.
Ulimi <ZUL-SIL-0017-n> amafutha ulimi <ZUL-SIL-0017-n> wonke ULIMI <ZUL-SIL-0017-n> amabheringi.
Sebenzisa amafutha ulimi <ZUL-SIL-0017-n>.
Zama ukugwema ukudla okuncinca udebe <ZUL-SIL-0016-n>.

and

$ cat pm_11148202.tags-not-found.txt
ingemuva lomqala        <ZUL-SIL-0024-n>
umphimbo        <ZUL-SIL-0025-n>
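
The order of the alternation in the script matters: terms are sorted longest-first so that "izinyo lomhlathi" is tried before "izinyo". Here is a minimal sketch of that matching step on its own. The helper names build_re() and annotate() are hypothetical, and the quotemeta() and \b additions are a possible hardening that is not in the script above. They assume every term begins and ends with a word character and that the hash keys are already case-folded.

```perl
#!/usr/bin/env perl
use 5.016;
use warnings;

# Build a case-insensitive regex from a list of terms, longest first.
# build_re() is a hypothetical helper, not part of the script above.
sub build_re {
    my @terms = @_;
    my $alt = join '|',
        map  { quotemeta($_) }                      # escape regex metacharacters
        sort { length($b) <=> length($a) } @terms;  # longest term wins
    return qr{\b($alt)\b}i;                         # \b: whole words only (assumption)
}

# Append the matching tag after every term found in $line.
# Keys of %$tags are assumed to be case-folded already.
sub annotate {
    my ($line, $re, $tags) = @_;
    $line =~ s/$re/"$1 $tags->{fc $1}"/eg;
    return $line;
}

my %tags = (
    'izinyo'           => '<ZUL-SIL-0018-n>',
    'izinyo lomhlathi' => '<ZUL-SIL-0019-n>',
);
my $re = build_re(keys %tags);

say annotate('Ukuzihlola izinyo kungahlenga izinyo lomhlathi yakho.', $re, \%tags);
# Ukuzihlola izinyo <ZUL-SIL-0018-n> kungahlenga izinyo lomhlathi <ZUL-SIL-0019-n> yakho.
```

Without the longest-first sort, the shorter "izinyo" could match first and the multiword unit would never be tagged. With the \b anchors, a term embedded in a longer word is left alone; whether that is what you want depends on how your corpus handles prefixes and suffixes.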

Notes:

I used the fc() function. This is preferred over uc() and lc() for case-insensitive canonicalisation; however, it does require Perl v5.16 or later (the "use 5.016;" line both enforces that version and enables the fc feature).
Manually coding I/O exception handling is tedious and error-prone. The autodie pragma does this work for you: I recommend its use.
Note how I've used anonymous blocks. Filehandles are only left open for as long as they are needed; Perl will close them for you at the end of their blocks.
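
The three notes above can be illustrated with one short, self-contained sketch. All sub names here (same_fc, same_lc, write_lines, try_open) are hypothetical and exist only for this demonstration, and the German example string is merely a convenient case where fc() and lc() differ; for plain ASCII text such as the sample data, both give the same result.

```perl
#!/usr/bin/env perl
use 5.016;      # enables fc() and Unicode string semantics
use warnings;
use autodie;    # open() now throws instead of returning false
use File::Temp qw(tempdir);

# Case-insensitive comparison via full case folding, and via lc() for contrast.
sub same_fc { return fc($_[0]) eq fc($_[1]) }
sub same_lc { return lc($_[0]) eq lc($_[1]) }

# Write lines to $path. The lexical handle is closed when its scope ends,
# just as with the bare blocks in the script above.
sub write_lines {
    my ($path, @lines) = @_;
    open my $fh, '>', $path;    # no "or die": autodie handles failure
    say {$fh} $_ for @lines;
    return;
}

# Return 'ok' if $path opens for reading, else the class of the exception.
sub try_open {
    my ($path) = @_;
    my $ok = eval { open my $fh, '<', $path; 1 };
    return $ok ? 'ok' : ref $@;
}

my $sharp_s = "Stra\x{DF}e";    # "Straße"
say 'fc: ', same_fc($sharp_s, 'STRASSE') ? 'equal' : 'different';    # fc: equal
say 'lc: ', same_lc($sharp_s, 'STRASSE') ? 'equal' : 'different';    # lc: different

my $dir = tempdir(CLEANUP => 1);
write_lines("$dir/demo.txt", 'udebe', 'ulimi');
say try_open("$dir/demo.txt");       # ok
say try_open("$dir/missing.txt");    # autodie::exception
```

The file written by write_lines() can be read back immediately because the handle was closed, and its buffer flushed, when the sub returned.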

— Ken

In reply to Re: Finding multiword units in a corpus by kcott
in thread Finding multiword units in a corpus by veg_running