Tagging a corpus with multiple tags

veg_running has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need help adapting this code to assign multiple tags to tokens in a corpus. The code currently prints only the last tag assigned to a token. When I took a closer look at my tagset, I realised that the same token can have multiple tags, and I would like to assign all of them so that a human can check them later.

An example of the input:

Abanye abahlanu balimale kanzima .
Abanye abanazo izinkomo noma izimbuzi .
Abaziphathi ngendlela ekhombisa ukufuna ukudlala .
Abazithamundayo bathi uzimisele ngokuliphika icala .
Abazi ukuthi bakhuluma ngani .
abazonikwa amalungelo okudoba ezinga elincane .
Ahluleke ngisho ukukuphulula umhlane njengasemihleni .
Ahluleke ukuzibamba, azidedele inkaba .
Ahluleke ukuzibamba uMaMthembu .

An example of the tagset:

icala    <ZUL-SIL-0035-n>
ukhalo    <ZUL-SIL-0036-n>
inkaba    <ZUL-SIL-0037-n>
inkaba    <ZUL-SIL-0038-n>
isisu    <ZUL-SIL-0039-n>
isisu    <ZUL-SIL-0040-n>
isibeletho    <ZUL-SIL-0041-n>
umhlane    <ZUL-SIL-0042-n>
iqolo    <ZUL-SIL-0043-n>
izinqe    <ZUL-SIL-0044-n>
umdidi    <ZUL-SIL-0045-n>
umphambili    <ZUL-SIL-0046-n>
amasende    <ZUL-SIL-0047-n>
inkomo    <ZUL-SIL-0048-n>
ubhontshisi    <ZUL-SIL-0049-n>
ingalo    <ZUL-SIL-0050-n>
ukuthi bakhuluma    <ZUL-SIL-1800-n>

An example of the output I hope to get (with a tab separating tags where two were assigned):

Abanye abahlanu balimale kanzima .
Abanye abanazo izinkomo <ZUL-SIL-0048-n> noma izimbuzi .
Abaziphathi ngendlela ekhombisa ukufuna ukudlala .
Abazithamundayo bathi uzimisele ngokuliphika icala <ZUL-SIL-0035-n> .
Abazi ukuthi bakhuluma <ZUL-SIL-1800-n> ngani .
abazonikwa amalungelo isisu <ZUL-SIL-0039-n>\t<ZUL-SIL-0040-n> ezinga elincane .
Ahluleke ngisho ukukuphulula umhlane <ZUL-SIL-0042-n> njengasemihleni .
Ahluleke ukuzibamba, azidedele inkaba <ZUL-SIL-0037-n>\t<ZUL-SIL-0038-n> .
Ahluleke ukuzibamba uMaMthembu .

The code I currently have:

#!/usr/bin/env perl

use 5.016;
use warnings;
use autodie;

my $corpusname = 'GFSEBcorpus.zul_selected-sentences_original';
my %words2ids;

{
    open my $fh, '<', "$corpusname.example.tagset.txt";

    while (<$fh>) {
        chomp;
        my ($text, $token) = split /\t/;
        $words2ids{fc $text} = $token;
    }
}

my $alt = join '|', sort {
    length($b) <=> length($a)
} map fc, keys %words2ids;
my $re = qr{(?i:($alt))};
my %found;

{
    open my $in_fh, '<', "$corpusname.txt";
    open my $out_fh, '>', "$corpusname.possible-annotation_example.txt";

    while (<$in_fh>) {
        s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
        print $out_fh $_;
    }
}

delete @words2ids{keys %found};

{
    open my $fh, '>', "$corpusname.tags-not-found_example.txt";

    for (sort keys %words2ids) {
        say $fh "$_\t$words2ids{$_}";
    }
}

Thank you for the help!


Re: Tagging a corpus with multiple tags by Corion (Patriarch) on Nov 24, 2022 at 12:18 UTC
kcott suggested a good approach in their reply to Re^2: Finding multiword units in a corpus - how does that approach fail for you?
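
Independently of the approach linked above, the reason the posted script keeps only the last tag is that `$words2ids{fc $text} = $token` overwrites any earlier tag stored for the same token. One possible fix is to store a list of tags per token (a hash of arrays) and join the list with tabs at substitution time. The following is only a minimal sketch under that assumption: the sub names `build_tag_map`, `build_regex` and `annotate_line` are hypothetical and not part of the original script, the tagset is passed in as an array of lines instead of being read from a file, and the tokens are run through `quotemeta`, which the original does not do.

```perl
#!/usr/bin/env perl
use 5.016;      # enables fc and say
use warnings;

# Map each case-folded token to a list of ALL its tags.
# Takes an array ref of "token<TAB>tag" lines (assumed input shape).
sub build_tag_map {
    my ($tagset_lines) = @_;
    my %tags_for;
    for my $line (@$tagset_lines) {
        chomp(my $copy = $line);
        my ($text, $tag) = split /\t/, $copy, 2;
        next unless defined $tag;               # skip malformed lines
        push @{ $tags_for{fc $text} }, $tag;    # append instead of overwrite
    }
    return \%tags_for;
}

# One alternation of all tokens, longest first, so that multiword
# entries win over shorter tokens they contain.
# (Assumes the tag map is not empty.)
sub build_regex {
    my ($tags_for) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) }
        keys %$tags_for;
    return qr{($alt)}i;
}

# Follow every matched token with all of its tags, tab-separated,
# and count the match in %$found.
sub annotate_line {
    my ($line, $tags_for, $re, $found) = @_;
    $line =~ s{$re}{
        my $key = fc $1;
        $found->{$key}++;
        "$1 " . join("\t", @{ $tags_for->{$key} // [] });
    }eg;
    return $line;
}

# Small demonstration using entries from the example tagset.
my $tags_for = build_tag_map([
    "icala\t<ZUL-SIL-0035-n>\n",
    "inkaba\t<ZUL-SIL-0037-n>\n",
    "inkaba\t<ZUL-SIL-0038-n>\n",
]);
my $re = build_regex($tags_for);
my %found;
print annotate_line("Ahluleke ukuzibamba, azidedele inkaba .\n",
                    $tags_for, $re, \%found);
```

With this layout, the "tags not found" report would also need to join each list, for example `say $fh "$_\t", join("\t", @{ $tags_for->{$_} })` for every key that is not in `%found`. Like the original, the sketch matches tokens anywhere in the line, including inside longer words.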