Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Finding multiword units in a corpus

by veg_running (Initiate)
on Nov 16, 2022 at 12:43 UTC ( [id://11148202]=perlquestion: print w/replies, xml ) Need Help??

veg_running has asked for the wisdom of the Perl Monks concerning the following question:

I have code that searches for words from a list in a large corpus of tokenised sentences and then assigns a unique ID to those words if it finds them. I would like to upgrade this code to also match multi-word units in the corpus.

My tag set is a simple 2 column file, tab separated. The first column includes the word (or multi-word unit) to find and the second column the tag to assign to it:

udebe <ZUL-SIL-0016-n> ulimi <ZUL-SIL-0017-n> izinyo <ZUL-SIL-0018-n> izinyo lomhlathi <ZUL-SIL-0019-n> ingemuva lomqala <ZUL-SIL-0024-n> umphimbo <ZUL-SIL-0025-n>

The output I require is also a text file and looks like this (produced with the current code below):

Lokho akusoze <ZUL-SIL-1364-b> kukwenze isilomo . Ukuzihlola amabele <ZUL-SIL-1234-n> kungahlenga impilo <ZUL-SIL-0238-n +> yakho . Amakhala agxiza amafinyila <ZUL-SIL-0095-n> . Gcoba <ZUL-SIL-1484-v> amafutha <ZUL-SIL-0572-n> kuwo wonke amabhering +i . Sebenzisa amafutha <ZUL-SIL-0572-n> afanelekile . Zama <ZUL-SIL-0296-n> ukugwema ukudla <ZUL-SIL-0569-n> okuncinca amafu +tha <ZUL-SIL-0572-n> .

My code currently looks like this:

use strict; use warnings; my $corpusname = "GoldStandardCorpus.Original.MG.2022-11-10"; my %words2ids; open my $lemmas, "<", $corpusname.".tagset.txt" or die $!; while (my $line = <$lemmas>) { chomp($line); my ($word, $id) = split "\t", $line; $words2ids{ lc($word) } = $id; } my %freq; open my $output, ">", $corpusname.".possible-annotation.txt" or die $! +; open my $corpus, "<", $corpusname.".txt" or die $!; while (my $line = <$corpus>) { chomp($line); my @tokens = split ' ', $line; foreach my $token (@tokens) { my $lct = lc $token; if (my $id = $words2ids{ $lct }) { $freq{$lct}++; $token .= " $id"; } } say { $output } "@tokens"; } open my $notfound, ">", $corpusname.".tags-not-found.txt" or die $!; foreach my $word (sort keys(%words2ids)) { next if exists $freq{$word}; say { $notfound } "$word\t$words2ids{$word}"; }

Any suggestions would be greatly appreciated! I am thinking some sort of sliding window to search for strings of words, but have no idea how to implement this. Thank you!

Replies are listed 'Best First'.
Re: Finding multiword units in a corpus
by Corion (Patriarch) on Nov 16, 2022 at 12:52 UTC

    Currently, you have a single-level data structure in a hash that maps a "word" (potentially two or more, space-separated) to a token.

    Your code only processes "words" without a space in them.

    Hashes support lookup by single entries very well. They don't do well for substring lookups.

    I would therefore restructure your token-finding data structure to a multi-level data structure that is organized as a tree of hashes, with the keys being the words:

    my $%words2ids = ( ulimi => '<ZUL-SIL-0017-n>', izinyo => { '' => '<ZUL-SIL-0018-n>', # no known word follows lomhlathi => '<ZUL-SIL-0019-n>', }, ingemuva => { lomqala => '<ZUL-SIL-0024-n>', }, );

    I use the empty string if the word is found and is not followed by any word associated with it.

    Then, when looking up words in the structure, if you find a token, you output it. Otherwise, if you don't find the word at all, output the token associated with the empty string at the previous position. If you find the token, you descend further in that tree.

    Consider playing through the approach with pen and paper first to get an understanding of the data structure.

Re: Finding multiword units in a corpus
by kcott (Archbishop) on Nov 16, 2022 at 15:41 UTC

    G'day veg_running,

    Welcome to the Monastery.

    [Aside: Between first reading your post, and subsequently replying, I see you've changed the original. Putting data within <code> tags is good; however, you should indicate the update when doing so after posting. "How do I change/delete my post?" has more about that.]

    I used the same tagset as you:

    $ cat pm_11148202.tagset.txt udebe <ZUL-SIL-0016-n> ulimi <ZUL-SIL-0017-n> izinyo <ZUL-SIL-0018-n> izinyo lomhlathi <ZUL-SIL-0019-n> ingemuva lomqala <ZUL-SIL-0024-n> umphimbo <ZUL-SIL-0025-n>

    You've shown some sample output — this is good; however, you've not shown the source from which that output is derived — this is less good. Also, I see no correlation between the "taglist" tags and the "output" tags. I made up my own sample input data:

    $ cat pm_11148202.txt Lokho udebe kukwenze isilomo. Ukuzihlola izinyo kungahlenga izinyo lomhlathi yakho. Amakhala agxiza amafinyila. Ulimi amafutha ulimi wonke ULIMI amabheringi. Sebenzisa amafutha ulimi. Zama ukugwema ukudla okuncinca udebe.

    I then ran this code:

    #!/usr/bin/env perl use 5.016; use warnings; use autodie; my $corpusname = 'pm_11148202'; my %words2ids; { open my $fh, '<', "$corpusname.tagset.txt"; while (<$fh>) { chomp; my ($text, $token) = split /\t/; $words2ids{fc $text} = $token; } } my $alt = join '|', sort { length($b) <=> length($a) } map fc, keys %words2ids; my $re = qr{(?i:($alt))}; my %found; { open my $in_fh, '<', "$corpusname.txt"; open my $out_fh, '>', "$corpusname.possible-annotation.txt"; while (<$in_fh>) { s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg; print $out_fh $_; } } delete @words2ids{keys %found}; { open my $fh, '>', "$corpusname.tags-not-found.txt"; for (sort keys %words2ids) { say $fh "$_\t$words2ids{$_}"; } }

    This produces

    $ cat pm_11148202.possible-annotation.txt Lokho udebe <ZUL-SIL-0016-n> kukwenze isilomo. Ukuzihlola izinyo <ZUL-SIL-0018-n> kungahlenga izinyo lomhlathi <ZUL-S +IL-0019-n> yakho. Amakhala agxiza amafinyila. Ulimi <ZUL-SIL-0017-n> amafutha ulimi <ZUL-SIL-0017-n> wonke ULIMI <ZU +L-SIL-0017-n> amabheringi. Sebenzisa amafutha ulimi <ZUL-SIL-0017-n>. Zama ukugwema ukudla okuncinca udebe <ZUL-SIL-0016-n>.

    and

    $ cat pm_11148202.tags-not-found.txt ingemuva lomqala <ZUL-SIL-0024-n> umphimbo <ZUL-SIL-0025-n>

    Notes:

    • I used the fc() function. This is preferred over uc() and lc() for case-insensitive canonicalisation; however, it does require Perl v5.16.
    • Manually coding I/O exception handling is tedious and error-prone. The autodie pragma does this work for you: I recommend its use.
    • Note how I've used anonymous blocks. Filehandles are only left open for as long as they are needed; Perl will close them for you at the end of their blocks.

    — Ken

      Hi Ken, thank you very much for the suggested improvements. I am going to have a look at all of them.

      I since realised I have another problem to solve though: My code only printed the last tag found for a token and not all of them, so if the same token were to appear more than once with a different ID for each occurence in the tagset, I would only get the last one in my output and not a list of all possible IDs assigned to that token.

      Any suggestions on how to handle this would be most appreciated.

        Instead of %words2ids having key-value pairs like

        word => 'id'

        they could perhaps be more like

        word => [qw{id1 id2}]

        I'm not really across the specifics of what you want. You should supply a sample tagset file, a sample input file, and the expected output using those two files.

        My best guess would be changing these lines:

        ... $words2ids{fc $text} = $token; ... s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg; ...

        to

        ... push @{$words2ids{fc $text}}, $token; ... s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg; ...

        You should try this for yourself. If you run into difficulties, put together an SSCCE and post it: along with the two sample files and the expected output, this will give us a much better chance of quickly resolving whatever problems you're encountering.

        — Ken

        TIMTOWTDI

        #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11148202 use warnings; use List::AllUtils qw( rev_nsort_by ); my $corpusfile = '/tmp/d.11148202.corpus'; # FIXME filename my $wordfile = '/tmp/d.11148202.words'; # FIXME filename my %words2ids; { local @ARGV = $wordfile; while( <> ) { my ($key, $value) = split /[\t\n]/; $words2ids{lc $key} .= " $value"; } } my $pat = do { local $" = '|'; qr/(@{[ map quotemeta, rev_nsort_by { length } keys %words2ids ]})/i}; my %found; { local @ARGV = $corpusfile; print s/\b$pat\K/ $found{lc $1}++; $words2ids{lc $1} /ger while <>; } delete @words2ids{ keys %found }; # not found local $, = "\n"; print '',"---------------- Not Found:", sort(keys %words2ids), '';

        Outputs:

        Lokho udebe <ZUL-SIL-0016-n> kukwenze isilomo. Ukuzihlola izinyo <ZUL-SIL-0018-n> <ZUL-SIL-0018-n-other> kungahlenga +izinyo lomhlathi <ZUL-SIL-0019-n> yakho. Amakhala agxiza amafinyila. Ulimi <ZUL-SIL-0017-n> amafutha ulimi <ZUL-SIL-0017-n> wonke ULIMI <ZU +L-SIL-0017-n> amabheringi. Sebenzisa amafutha ulimi <ZUL-SIL-0017-n>. Zama ukugwema ukudla okuncinca udebe <ZUL-SIL-0016-n>. ---------------- Not Found: ingemuva lomqala umphimbo

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11148202]
Approved by 1nickt
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2024-03-28 21:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found