veg_running has asked for the wisdom of the Perl Monks concerning the following question:
Hi all, I need help adapting this code to assign multiple tags to tokens in a corpus. The code currently prints only the last tag assigned, but when I took a closer look at my tagset, I realised that the same token can have multiple tags. I would like to assign all of them so that a human can check them later.
An example of the input:
Abanye abahlanu balimale kanzima . Abanye abanazo izinkomo noma izimbuzi . Abaziphathi ngendlela ekhombisa ukufuna ukudlala . Abazithamundayo bathi uzimisele ngokuliphika icala . Abazi ukuthi bakhuluma ngani . abazonikwa amalungelo okudoba ezinga elincane . Ahluleke ngisho ukukuphulula umhlane njengasemihleni . Ahluleke ukuzibamba, azidedele inkaba . Ahluleke ukuzibamba uMaMthembu .
An example of the tagset:
icala	<ZUL-SIL-0035-n>
ukhalo	<ZUL-SIL-0036-n>
inkaba	<ZUL-SIL-0037-n>
inkaba	<ZUL-SIL-0038-n>
isisu	<ZUL-SIL-0039-n>
isisu	<ZUL-SIL-0040-n>
isibeletho	<ZUL-SIL-0041-n>
umhlane	<ZUL-SIL-0042-n>
iqolo	<ZUL-SIL-0043-n>
izinqe	<ZUL-SIL-0044-n>
umdidi	<ZUL-SIL-0045-n>
umphambili	<ZUL-SIL-0046-n>
amasende	<ZUL-SIL-0047-n>
inkomo	<ZUL-SIL-0048-n>
ubhontshisi	<ZUL-SIL-0049-n>
ingalo	<ZUL-SIL-0050-n>
ukuthi bakhuluma	<ZUL-SIL-1800-n>
An example of the output I hope to get (with a tab separating tags where two were assigned):
Abanye abahlanu balimale kanzima . Abanye abanazo izinkomo <ZUL-SIL-0048-n> noma izimbuzi . Abaziphathi ngendlela ekhombisa ukufuna ukudlala . Abazithamundayo bathi uzimisele ngokuliphika icala <ZUL-SIL-0035-n> . Abazi ukuthi bakhuluma <ZUL-SIL-1800-n> ngani . abazonikwa amalungelo isisu <ZUL-SIL-0039-n>\t<ZUL-SIL-0040-n> ezinga elincane . Ahluleke ngisho ukukuphulula umhlane <ZUL-SIL-0042-n> njengasemihleni . Ahluleke ukuzibamba, azidedele inkaba <ZUL-SIL-0037-n>\t<ZUL-SIL-0038-n> . Ahluleke ukuzibamba uMaMthembu .
The code I currently have:
#!/usr/bin/env perl
use 5.016;
use warnings;
use autodie;

my $corpusname = 'GFSEBcorpus.zul_selected-sentences_original';

my %words2ids;
{
    open my $fh, '<', "$corpusname.example.tagset.txt";
    while (<$fh>) {
        chomp;
        my ($text, $token) = split /\t/;
        $words2ids{fc $text} = $token;
    }
}

my $alt = join '|', sort { length($b) <=> length($a) } map fc, keys %words2ids;
my $re = qr{(?i:($alt))};

my %found;
{
    open my $in_fh, '<', "$corpusname.txt";
    open my $out_fh, '>', "$corpusname.possible-annotation_example.txt";
    while (<$in_fh>) {
        s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
        print $out_fh $_;
    }
}

delete @words2ids{keys %found};
{
    open my $fh, '>', "$corpusname.tags-not-found_example.txt";
    for (sort keys %words2ids) {
        say $fh "$_\t$words2ids{$_}";
    }
}
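For context, only the last tag survives because `$words2ids{fc $text} = $token` overwrites the previous value each time a word repeats in the tagset. Below is a minimal sketch of one possible way to keep every tag, by storing an array of tags per word and joining them with a tab at substitution time. The helper names `load_tagset` and `annotate` are hypothetical, the tagset lines are passed in as strings rather than read from the files above, and the `%found` bookkeeping is left out for brevity. The `quotemeta` call is an extra precaution that is not in the code above.

```perl
#!/usr/bin/env perl
use 5.016;
use warnings;

# Hypothetical helper: build word => [tag, tag, ...] so that a word
# listed several times in the tagset keeps all of its tags.
sub load_tagset {
    my (@lines) = @_;
    my %words2ids;
    for my $line (@lines) {
        chomp $line;
        my ($text, $tag) = split /\t/, $line;
        next unless defined $tag;    # skip lines without a tab-separated tag
        push @{ $words2ids{fc $text} }, $tag;
    }
    return \%words2ids;
}

# Hypothetical helper: append all tags for each matched word,
# separated by tabs, as in the desired output.
sub annotate {
    my ($line, $words2ids) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }                  # escape regex metacharacters
        sort { length($b) <=> length($a) }      # prefer the longest entry
        keys %$words2ids;
    return $line unless length $alt;
    my $re = qr{(?i:($alt))};
    $line =~ s/$re/"$1 " . join("\t", @{ $words2ids->{fc $1} })/eg;
    return $line;
}

# Small in-memory tagset taken from the example above.
my $tags = load_tagset(
    "inkaba\t<ZUL-SIL-0037-n>",
    "inkaba\t<ZUL-SIL-0038-n>",
    "umhlane\t<ZUL-SIL-0042-n>",
);

# Prints the sentence with both inkaba tags, separated by a tab.
say annotate('Ahluleke ukuzibamba, azidedele inkaba .', $tags);
```

In the original script, the same idea would mean changing the assignment in the tagset loop to a `push` and the replacement expression to a `join`; the not-found report would then need to print each array of tags.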
Thank you for the help!
Re: Tagging a corpus with multiple tags
by Corion (Patriarch) on Nov 24, 2022 at 12:18 UTC