G'day veg_running,
Welcome to the Monastery.
[Aside:
Between first reading your post, and subsequently replying, I see you've changed the original.
Putting data within <code> tags is good; however, you should indicate the update when doing so after posting.
"How do I change/delete my post?" has more about that.]
I used the same tagset as you:
$ cat pm_11148202.tagset.txt
udebe <ZUL-SIL-0016-n>
ulimi <ZUL-SIL-0017-n>
izinyo <ZUL-SIL-0018-n>
izinyo lomhlathi <ZUL-SIL-0019-n>
ingemuva lomqala <ZUL-SIL-0024-n>
umphimbo <ZUL-SIL-0025-n>
You've shown some sample output — this is good; however,
you've not shown the source from which that output is derived — this is less good.
Also, I see no correlation between the "taglist" tags and the "output" tags.
I made up my own sample input data:
$ cat pm_11148202.txt
Lokho udebe kukwenze isilomo.
Ukuzihlola izinyo kungahlenga izinyo lomhlathi yakho.
Amakhala agxiza amafinyila.
Ulimi amafutha ulimi wonke ULIMI amabheringi.
Sebenzisa amafutha ulimi.
Zama ukugwema ukudla okuncinca udebe.
I then ran this code:
#!/usr/bin/env perl
use 5.016;
use warnings;
use autodie;

my $corpusname = 'pm_11148202';
my %words2ids;

{
    open my $fh, '<', "$corpusname.tagset.txt";
    while (<$fh>) {
        chomp;
        my ($text, $token) = split /\t/;
        $words2ids{fc $text} = $token;
    }
}

my $alt = join '|', sort {
    length($b) <=> length($a)
} map fc, keys %words2ids;
my $re = qr{(?i:($alt))};

my %found;

{
    open my $in_fh, '<', "$corpusname.txt";
    open my $out_fh, '>', "$corpusname.possible-annotation.txt";
    while (<$in_fh>) {
        s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
        print $out_fh $_;
    }
}

delete @words2ids{keys %found};

{
    open my $fh, '>', "$corpusname.tags-not-found.txt";
    for (sort keys %words2ids) {
        say $fh "$_\t$words2ids{$_}";
    }
}
This produces
$ cat pm_11148202.possible-annotation.txt
Lokho udebe <ZUL-SIL-0016-n> kukwenze isilomo.
Ukuzihlola izinyo <ZUL-SIL-0018-n> kungahlenga izinyo lomhlathi <ZUL-SIL-0019-n> yakho.
Amakhala agxiza amafinyila.
Ulimi <ZUL-SIL-0017-n> amafutha ulimi <ZUL-SIL-0017-n> wonke ULIMI <ZUL-SIL-0017-n> amabheringi.
Sebenzisa amafutha ulimi <ZUL-SIL-0017-n>.
Zama ukugwema ukudla okuncinca udebe <ZUL-SIL-0016-n>.
and
$ cat pm_11148202.tags-not-found.txt
ingemuva lomqala <ZUL-SIL-0024-n>
umphimbo <ZUL-SIL-0025-n>
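The length sort in the script above is what lets "izinyo lomhlathi" win over its prefix "izinyo": Perl tries the branches of an alternation from left to right and takes the first one that matches at a given position. Here is a minimal sketch of the difference. The build_re() helper is hypothetical and not part of the script; it also applies quotemeta, which the script does not.

```perl
#!/usr/bin/env perl
use 5.016;
use warnings;

# Hypothetical helper (not in the script above): builds a case-insensitive
# alternation from a list of phrases, optionally sorted longest first.
sub build_re {
    my ($longest_first, @phrases) = @_;
    @phrases = sort { length($b) <=> length($a) } @phrases
        if $longest_first;
    # quotemeta is an addition here; it escapes any regex metacharacters.
    my $alt = join '|', map quotemeta, @phrases;
    return qr{(?i:($alt))};
}

# Shorter phrase deliberately listed first to show the problem.
my @phrases = ('izinyo', 'izinyo lomhlathi');
my $line    = 'kungahlenga izinyo lomhlathi yakho';

for my $longest_first (0, 1) {
    my $re      = build_re($longest_first, @phrases);
    my ($match) = $line =~ $re;    # list context returns the capture
    my $label   = $longest_first ? 'sorted' : 'unsorted';
    say "$label: $match";
}
```

Without the sort, the shorter branch should match first and "lomhlathi" would be left untagged; with it, the whole phrase is captured.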
Notes:
- I used the fc() function.
  This is preferred over uc() and lc() for case-insensitive canonicalisation;
  however, it does require Perl v5.16 or later.
- Manually coding I/O exception handling is tedious and error-prone.
  The autodie pragma does this work for you:
  I recommend its use.
- Note how I've used anonymous blocks.
  Filehandles are only left open for as long as they are needed;
  Perl will close them for you at the end of their blocks.
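
To illustrate the first note, here is a small sketch of a case where lc() and fc() disagree. The two helper subs are hypothetical names used only for this example, and the German test strings are my own choice rather than anything from your data.

```perl
#!/usr/bin/env perl
use 5.016;    # enables fc(), say, and Unicode string semantics
use warnings;

# Hypothetical helpers for illustration only: each returns 1 if the two
# strings compare equal after canonicalisation, otherwise 0.
sub same_via_lc { my ($x, $y) = @_; return lc($x) eq lc($y) ? 1 : 0 }
sub same_via_fc { my ($x, $y) = @_; return fc($x) eq fc($y) ? 1 : 0 }

# "\x{DF}" is LATIN SMALL LETTER SHARP S; its full case fold is "ss",
# but lc() leaves it unchanged.
my ($lower, $upper) = ("stra\x{DF}e", 'STRASSE');

say 'lc comparison: ', same_via_lc($lower, $upper);
say 'fc comparison: ', same_via_fc($lower, $upper);
```

The lc() comparison should report 0 and the fc() comparison 1, because only full case folding maps the sharp s to "ss".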