Comparing / Searching through Hashes

SayWhat?! has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I need some help, please. What I would like to do is the following: I have two text files, BilingualWordList.txt and FalseFriendsList.txt; the columns are separated by tabs. They look like this:

# BilingualWordList.txt (Let's call it FileA)

vriendelik    aardig
irriterend    vervelend
losieshuis    pension
eksamen    examen
goed    braaf
damwal    dam
water    water
rekenaar    computer
outoritêr    outoritaire
wêreld    wereld
alle    alle
word    worden
angesien    overwegende
erkenning    erkenning
afrigter    trainer

# FalseFriendsList.txt (Let's call it FileB)

vriendelik    aardig
goed    braaf
damwal    dam
bruinmens    kleurling
kamera    fototoestel
jammer    sneu//spijten
japon    ochtendjas
losieshuis    pension
buffer    bumper
bruinmens    kleurling
brulpadda    brulkikker
jammerlik    zielig
buffer    bumper
irriterend    irritant//vervelend
kameelperd    giraf//giraffe

I want to take FileB and use it to search through FileA, looking for matches. The matches don't need to be 100% identical, though, because, as you can see, these two entries are almost exactly the same:

#FileA
irriterend    vervelend
#FileB
irriterend    irritant//vervelend

Thus far, my code looks like this:

#!/usr/bin/perl -w
use strict;
#use warnings;
use open ':utf8';

#open files
open (FALSEF, "<FalseFriendsList.txt");
open (BILWL, "<BilingualWordList.txt");

#declare hashes
my %falsef;
my %existingfalsefriend;

#while the FF input exists
while (<FALSEF>)
{
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    #increment $line
    $falsef{$line}++;
}

#declare variables
my $token;
my %hash;

#open output files
open (OUTPUT1, ">OutputFalseFriends.txt");
open (OUTPUT2, ">OutputUnsortedWordList.txt");

#while input is received
while (<BILWL>)
{
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    
    #assign $line to the array
    my @wordlist = split/\t/,$line;
    
    #a for-loop to 'clean up' the words, to get rid of all the commas, full stops, etc, except the apostrophes and hyphens
    for (my $x = 0; $x <= $#wordlist; $x++)
    {
        my $token = $wordlist[$x];
        
        if ($token =~ /(['\-\w]+)/)
        {
            #$word is now clean
            my $searchword = $1;
            
            #checks to see whether the word exists in the false friends list
            if (exists $hash{$searchword} || exists $falsef{$searchword})
            {
                $existingfalsefriend{$searchword}++;
                
            }    
            else
            {    
                #print to unsorted.txt
                print OUTPUT2 "$searchword\n";
            }
        }
    }
}

my $searchword;

foreach my $searchword(sort keys %existingfalsefriend)
{
    #sorts the matched words alphabetically
    my $value = $existingfalsefriend{$searchword};
    print OUTPUT1 "$searchword\t $value\n";
}

However, my output does not look the way I want it to. I want the matching lines to be written to OutputFalseFriends.txt, and the non-matching lines to be written to OutputUnsortedWordList.txt, like this:

#OutputFalseFriends.txt
vriendelik    aardig
losieshuis    pension
goed    braaf
damwal    dam
irriterend    irritant//vervelend

#OutputUnsortedWordList.txt
eksamen    examen
water    water
rekenaar    computer
outoritêr    outoritaire
wêreld    wereld
alle    alle
word    worden
angesien    overwegende
erkenning    erkenning
afrigter    trainer

But OutputFalseFriends.txt is empty every time, and OutputUnsortedWordList.txt contains my whole input file BilingualWordList.txt, just with every word on its own line. A sample is shown here:

goed
braaf
naak
bloot
damwal
dam
kombers
deken
homoseksueel
flikker
bronstig
geil
munisipaliteit
gemeente

Does anyone have any advice on how I can correct this, please?

!!!!!!!!!!!!!! UPDATE !!!!!!!!!!!!!!

I finally got my program to do what I wanted it to do! (Well, part 1 of the whole program I'm trying to code, that is.. :p) Here is my code (same input as before, obviously)

#!/usr/bin/perl -w
use strict;
use warnings;
use open ':utf8';
use autodie;

#open FILE B
open (FALSEFRIENDINPUT, "<SNonCognatesAndFF.txt");

#declare hash
my %fileb;

#get a line from FILE B
while (my $line = <FALSEFRIENDINPUT>)
{
    #chomp off the new line
    chomp $line;
    
    # split the line on tab
    my ($filebkeys, $filebvalues) = split /\t/, $line;
    $fileb{$filebkeys} = $filebvalues;
    
        #open output files
        open (OUTPUT1, ">OutputMatchedFalseFriends.txt");
        open (OUTPUT2, ">OutputNonMatchedWords.txt");

        #open FILE A
        open (BILINGUALWL, "<BilingualWordList.1.0.0.IW.2012-06-20.txt");

        my %filea;    

        #get a line from FILE A
        while (my $line = <BILINGUALWL> )
        {
            chomp $line;
            #split the line on tab
            my ($fileakeys, $fileavalues) = split /\t/, $line;   
        
            #do first columns match?
            if ($fileb{$fileakeys})                         
            {
                #does the second column value contain the other as a substring?
                if ($fileb{$fileakeys} =~ /$fileavalues/ or $fileavalues =~ /$fileb{$fileakeys}/)
                {    
                    #if yes, print it to OutputMatchedFalseFriends.txt
                    print OUTPUT1 "$line\n";
                    #loop to the next line
                    next;                                 
                }
            }
            else
            {
                #if not, print it to OutputNonMatchedWords.txt
                print OUTPUT2 "$line\n";
            }
        }
}

And here is my output:

#OutputMatchedFalseFriends.txt

damwal    dam
bitsig    vinnig
bot    been
dikwels    vaak
aantreklik    knap
bees    rund
baas    chef
bestuur    directie
alles    alles
afrigter    trainer

#OutputNonMatchedWords.txt (only a sample of a 73 line output)

vriendelik    aardig
polisieman    agent
net-net    amper
gedierte    beest
goed    braaf
naak    bloot
kombers    deken
homoseksueel    flikker
bronstig    geil
munisipaliteit    gemeente
menskop    hoofd
toedraai    inpakken
kiestand    kies
dierekop    kop

I have only one question now, though.. Sometimes (and quite randomly) when I run my program, I get the following messages (only one at a time, on a rotating basis):

Can't open '>MatchedFalseFriends.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl  line 25

#and
Can't open '>OutputNonMatchedWords.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl  line 25

Why would this be? Does anyone know, maybe? But it doesn't hamper the output in any way..

Oh, and a big thank you to everyone for their input, code examples and opinions - especially Athanasius and aaron_baugher. I appreciate it. :)
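
A note on the structure of the updated script, offered as an observation rather than a confirmed diagnosis: both output `open` calls and the whole pass over FILE A sit inside the `while` loop that reads FILE B, so the output files are reopened and truncated once for every FILE B line. Reopening the same files in quick succession is a plausible, but unverified, source of the intermittent 'Invalid argument' message. Separately, the `else` belongs to the outer `if`, so a line whose first column matches but whose second column does not is written to neither file. The sketch below shows only the two-phase layout (load FILE B completely, then open the outputs once and make one pass over FILE A). The in-memory strings and the variable names are made up for this illustration, and the second-column test is left out to keep the focus on the structure.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two input files and two outputs.
my $file_b_text = "damwal\tdam\ngoed\tbraaf\n";
my $file_a_text = "damwal\tdam\nwater\twater\ngoed\tbraaf\n";
my ( $matched_text, $unmatched_text ) = ( '', '' );

# Phase 1: read FILE B completely before doing anything else.
my %fileb;
open( my $in_b, '<', \$file_b_text ) or die "Cannot open FILE B: $!";
while ( my $line = <$in_b> ) {
    chomp $line;
    my ( $key, $value ) = split /\t/, $line;
    $fileb{$key} = $value;
}
close $in_b;

# Phase 2: open each output once, then make a single pass over FILE A.
open( my $out_matched,   '>', \$matched_text )   or die "Cannot open output: $!";
open( my $out_unmatched, '>', \$unmatched_text ) or die "Cannot open output: $!";
open( my $in_a,          '<', \$file_a_text )    or die "Cannot open FILE A: $!";
while ( my $line = <$in_a> ) {
    chomp $line;
    my ($key) = split /\t/, $line;
    # Every line goes to exactly one of the two outputs.
    my $out = exists $fileb{$key} ? $out_matched : $out_unmatched;
    print {$out} "$line\n";
}
close $_ for $in_a, $out_matched, $out_unmatched;

print "matched:\n$matched_text", "unmatched:\n$unmatched_text";
```

With real files, the scalar references would simply be replaced by the file names.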


Replies are listed 'Best First'.
Re: Comparing / Searching through Hashes by flexvault (Monsignor) on Jun 28, 2012 at 13:12 UTC
Welcome SayWhat?!,

Rather than debug your script for you, I hope I can make some suggestions that help you learn to debug your own scripts. Note: these are suggestions, and after some years of doing your own debugging, your suggestions could be very different from mine.

If your Perl supports it, use the 3-parameter version of open:

open (my $FALSEF, "<", "FalseFriendsList.txt");

I like how you name your files, and you can do the same with Perl variables, i.e.

my %ExistingFalseFriend;
# or
my %existing_false_friend;
# or
my %Existing_False_Friend;

This will help your eyes see exactly what you wanted to emphasize when you look at the code sometime in the future. Also, I define a $Debug variable that I set to the level of debugging:

my $Debug = 3;   # 0 - production, 1 - minor debugging, 2 .. 9 for different levels

This leads to what is needed most in your script: self-help debugging information. For example, why not use a 'foreach' on the hash (or array) to see exactly what you just created?

#while the FF input exists
while (my $line = <FALSEF> )
{
    # chomp off the new line
    chomp $line;      # chomp could be part of while statement
    # increment $line
    $falsef{$line}++;
}

### Now during debugging print to open log file
if ( $Debug )     # this could be 'if ( 1==1 )' for testing, your call
{
    foreach my $key ( sort keys %falsef )
    {
        print $LOG "$key\t$falsef{$key}\n";
    }
}

Now in your debug log file, you may see that you didn't initialize the hash the way you wanted. Again, my suggestion of 'foreach' could have been replaced by a print statement inside the 'while' loop.

In general, I like your style and I'm sure it will get better and better.

Good luck... You're on your way!

"Well done is better than well said." - Benjamin Franklin
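
The suggestions above can be tried without touching any files. This is a minimal sketch, assuming a few sample lines held in a string and STDERR standing in for the log file; `$sample` and `load_counts` are names made up for this illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for FalseFriendsList.txt (tab-separated).
my $sample = "vriendelik\taardig\ngoed\tbraaf\nbuffer\tbumper\nbuffer\tbumper\n";

my $Debug = 1;    # 0 - production, 1 and up - debugging output

# Read every line from a filehandle and count how often each one occurs.
sub load_counts {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;
        $count{$line}++;
    }
    return %count;
}

# Three-argument open; a reference to a scalar gives an in-memory file.
open( my $FALSEF, '<', \$sample ) or die "Cannot open sample: $!";
my %falsef = load_counts($FALSEF);
close $FALSEF;

# Dump the hash so you can see exactly what was stored.
if ($Debug) {
    foreach my $key ( sort keys %falsef ) {
        print STDERR "$key\t$falsef{$key}\n";
    }
}
```

Setting `$Debug` to 0 silences the dump without removing the code.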
Re: Comparing / Searching through Hashes by ww (Archbishop) on Jun 28, 2012 at 12:34 UTC
Compare to the answers to your question in Comparing arrays
Re^2: Comparing / Searching through Hashes by SayWhat?! (Novice) on Jun 28, 2012 at 12:48 UTC
I have already done that and added the changes as suggested, but I'm still stuck. Someone called Athanasius suggested that I should start a new question in which I clarify what I would like my program to do. So that's what I did. :) I personally think this question explains things in more detail than my previous one, though..
Re^3: Comparing / Searching through Hashes by Athanasius (Archbishop) on Jun 28, 2012 at 14:00 UTC
Hello again, SayWhat?!,

Good job on clarifying the question. I’m getting some idea of what you want to achieve. However, from your statement, "I want the matching lines to be written to OutputFalseFriends.txt, and the non-matching lines to be written to OutputUnsortedWordList.txt", I would not expect the second output file to contain the line `goed braaf`, since this appears in both the input files. Is this a mistake, or am I missing something? (Incidentally, a large part of ‘programming’ is really sorting out requirements, independently of the actual code. This is just something we all need to get used to.)

I think that, if you re-examine your code in light of your sample input, you will see that the requirements have evolved. For example, as RichardK observes, there are no commas, etc., to be cleaned up. Also, `%hash` is not being used for anything. Rather than fix your main loop (the loop beginning `while (<BILWL>)`), it will probably be easier if you re-think the logic of what you are doing and re-write this part from scratch.

flexvault has given some excellent advice. In addition, it will help you if you include the line `use autodie;` near the top of your script, as this will tell you when files cannot be opened, etc. Also, if you `use Data::Dumper;` this will make the debugging task easier. (`Data::Dumper` is a core module, so it will already be in your Perl installation. See http://perldoc.perl.org/Data/Dumper.html for details.) For example, you can print the contents of `%falsef` with just `print Dumper(\%falsef);`

You’re making progress!

Athanasius <°(((>< contra mundum
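
To make the `Data::Dumper` suggestion concrete, here is a small sketch with two sample lines instead of the real file. It fills `%falsef` the same way the original script does, with the whole line as the key, and then looks up a single word. The two `$Data::Dumper::` settings are optional conveniences, and `$found_word` and `$found_line` are names made up for this example.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
$Data::Dumper::Useqq    = 1;    # show tabs as \t instead of raw whitespace

# Built the same way as in the original script: the whole line is the key.
my %falsef;
for my $line ( "goed\tbraaf", "damwal\tdam" ) {
    $falsef{$line}++;
}

print Dumper( \%falsef );

# A lookup by a single word therefore never succeeds.
my $found_word = exists $falsef{'goed'}        ? 1 : 0;
my $found_line = exists $falsef{"goed\tbraaf"} ? 1 : 0;
print "word found: $found_word, line found: $found_line\n";
```

The dump shows keys made of both columns joined by a tab, and the last line prints `word found: 0, line found: 1`. That is consistent with OutputFalseFriends.txt staying empty: the main loop looks up single words, while the hash is keyed by entire lines.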
Re: Comparing / Searching through Hashes by aaron_baugher (Curate) on Jun 28, 2012 at 13:46 UTC
Your requirements aren't quite clear (at least to me). You say lines don't have to match 100%, but then, how exactly do they have to match? Is it enough for the first columns to match, or does the second column have to match partly in some way? In the example you give (repeated below), the first column values are identical, but in the second column, one is a substring of the other, and your attempt to tokenize makes me think you might be trying to match more than just the first column. Can you clarify?

# these should match, but why exactly?
#FileA
irriterend    vervelend
#FileB
irriterend    irritant//vervelend

Aaron B.
Available for small or large Perl jobs; see my home node.
Re: Comparing / Searching through Hashes by RichardK (Parson) on Jun 28, 2012 at 13:01 UTC
You could always step through your code with the debugger ('perl -d') to see what it's really doing.

But at first glance, you don't actually clean up the word list as your comment suggests. What do you think this does?

for (my $x = 0; $x <= $#wordlist; $x++)
{
    my $token = $wordlist[$x];

    if ($token =~ /(['\-\w]+)/)
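
One way to answer that question is to run the pattern on a few values in isolation. The sketch below wraps the original regex in a helper (`first_token` is a made-up name) and prints what it captures. The pattern is unanchored, so it captures only the first run of word characters, apostrophes and hyphens, and nothing is removed from `@wordlist` itself.

```perl
use strict;
use warnings;

# Return what the original pattern captures from one column value,
# or undef when the value contains no matching character at all.
sub first_token {
    my ($token) = @_;
    return $token =~ /(['\-\w]+)/ ? $1 : undef;
}

for my $value ( 'vervelend', 'irritant//vervelend', 'net-net', '...' ) {
    my $captured = first_token($value);
    printf "%-22s => %s\n", $value, defined $captured ? $captured : '(no match)';
}
```

For `irritant//vervelend` the capture is just `irritant`; everything after the first `/` is dropped, so the second alternative never takes part in any comparison.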
Re: Comparing / Searching through Hashes by SayWhat?! (Novice) on Jun 28, 2012 at 16:43 UTC
Hello to you too, Athanasius! :) I'm afraid that the "goed braaf" duplicate is a mistake that slipped in through copying and pasting... I have already corrected it. I have to thank you though, you're giving me such helpful advice, in a language I can understand! I hope you will continue doing so, if you don't mind.. :)

aaron_baugher: The first columns definitely have to match, and the second column has to match partly, yes. The reason for this is that they are false friends of one another. In the first column we have the Afrikaans word 'irriterend'. In Dutch, the word for 'irriterend' can be either 'irritant' or 'vervelend'. It's like English and German: 'Chef' in English means ‘someone who cooks for a living’, but a 'Chef' in German is the 'director or boss of a company'.. Does that make better sense now? I don’t really know how else to explain it, I'm afraid..

Thanks for all the advice, though! You're really helping me to make progress! :) And I would also like to apologise to anyone that I might have annoyed by posting a new question.. I'm still new at this whole posting thing. Hopefully it will improve, just like my coding hopefully will!
Re^2: Comparing / Searching through Hashes by aaron_baugher (Curate) on Jun 28, 2012 at 21:15 UTC
Thanks for the clarification. In that case, I think you're on the right track: put fileB in a hash, then go through fileA checking each key from fileA for existence as a key in fileB. That's the standard idiom for this kind of thing, but in your case there will be the extra step that once you find a match on the keys from the first column, you'll also need to check for a match on the second column. That might look something like the code below (untested). The tricky part may be that inner `if` comparison. In mine, I'm just testing to see if either value is found as a substring in the other. If you need something more sophisticated, you'll have to adjust that there.

# %b is a hash already containing the values from fileB, with the
# first column as keys and the second column as values.
# $file_of_matches is a file descriptor opened to one output file
# $file_of_misses is a file descriptor for the other output file

open my $fileA, '<', 'fileA' or die $!;
while( my $line = <$fileA> ){                   # get a line from fileA
    chomp $line;
    my( $k, $v ) = split /\t/, $line;           # split the line on tab
    if( $b{$k} ){                               # do first columns match?
        if( $b{$k} =~ /$v/ or $v =~ /$b{$k}/ ){ # does one second column value contain
                                                # the other as a substring?
            print $file_of_matches "$line\n";   # yes, so print it to the match file
            next;                               # and loop to the next line
        }
    }
    print $file_of_misses "$line\n";            # no, so print it to the non-match file
}

By the way, note that this:

while( my $line = <$fileA> ){
    # do stuff with $line

replaces this:

while( <$fileA> ){
    my $line = $_;
    # do stuff with $line

It saves a line and avoids potential bugs that may be caused by using $_ sort of halfway.

Aaron B.
Available for small or large Perl jobs; see my home node.
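
As a runnable variation on the loop above, the following sketch keeps the same logic but makes two substitutions that are this example's own choices, not part of the reply: `exists` instead of a truth test on the hash value, and `index` instead of regex interpolation, so that characters such as `/` or `.` in the data are always treated literally. The hash `%file_b`, the array `@file_a`, the helper `partly_matches` and the row `jammer helaas` are hypothetical stand-ins for the real files.

```perl
use strict;
use warnings;

# True when either string contains the other as a literal substring.
# Empty or missing values never count as a match.
sub partly_matches {
    my ( $x, $y ) = @_;
    return 0 unless defined $x && defined $y && length $x && length $y;
    my $hit = index( $x, $y ) >= 0 || index( $y, $x ) >= 0;
    return $hit ? 1 : 0;
}

# Hypothetical in-memory stand-ins for fileB (hash) and fileA (lines).
my %file_b = (
    irriterend => 'irritant//vervelend',
    goed       => 'braaf',
    jammer     => 'sneu//spijten',
);
my @file_a = (
    "irriterend\tvervelend",
    "goed\tbraaf",
    "water\twater",
    "jammer\thelaas",    # key present in fileB, second column differs
);

my ( @matches, @misses );
for my $line (@file_a) {
    my ( $k, $v ) = split /\t/, $line;
    if ( exists $file_b{$k} && partly_matches( $file_b{$k}, $v ) ) {
        push @matches, $line;
    }
    else {
        push @misses, $line;
    }
}

print "match: $_\n" for @matches;
print "miss:  $_\n" for @misses;
```

A plain substring test is deliberately loose: a short value such as `dam` would also count as matching `damwal`. If that is too permissive for the false-friends data, splitting the fileB value on `//` and comparing each alternative exactly would be a stricter option.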