
Hello! I need some help, please. I have two text files, BilingualWordList.txt and FalseFriendsList.txt, whose columns are separated by tabs. They look like this:

# BilingualWordList.txt (Let's call it FileA)

vriendelik    aardig
irriterend    vervelend
losieshuis    pension
eksamen    examen
goed    braaf
damwal    dam
water    water
rekenaar    computer
outoritêr    outoritaire
wêreld    wereld
alle    alle
word    worden
angesien    overwegende
erkenning    erkenning
afrigter    trainer

FalseFriendsList.txt (Let's call it FileB)

vriendelik    aardig
goed    braaf
damwal    dam
bruinmens    kleurling
kamera    fototoestel
jammer    sneu//spijten
japon    ochtendjas
losieshuis    pension
buffer    bumper
bruinmens    kleurling
brulpadda    brulkikker
jammerlik    zielig
buffer    bumper
irriterend    irritant//vervelend
kameelperd    giraf//giraffe

I want to take each entry in FileB and search through FileA for matches. The matches don't need to be 100% identical, though: as you can see, these two entries are almost exactly the same, because FileB lists several alternative translations separated by "//":

#FileA
irriterend    vervelend
#FileB
irriterend    irritant//vervelend
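
A minimal sketch of that kind of "almost identical" comparison, assuming the only difference between the files is that FileB may hold several alternatives separated by "//". The helper name `translations_match` is hypothetical and is not part of the code in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $translation_a is exactly one of the
# "//"-separated alternatives in $translations_b, otherwise 0.
sub translations_match {
    my ($translation_a, $translations_b) = @_;

    # Split FileB's second column into its alternatives and index them.
    my %alternatives = map { $_ => 1 } split m{//}, $translations_b;

    return exists $alternatives{$translation_a} ? 1 : 0;
}

# FileA has "vervelend", FileB has "irritant//vervelend".
print translations_match('vervelend', 'irritant//vervelend')
    ? "match\n" : "no match\n";
print translations_match('examen', 'irritant//vervelend')
    ? "match\n" : "no match\n";
```

Comparing whole alternatives with `eq`-style lookups avoids accidental hits that a substring test would give, for example "dam" inside "damwal".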

Thus far, my code looks like this:

#!/usr/bin/perl -w
use strict;
#use warnings;
use open ':utf8';

#open files
open (FALSEF, "<FalseFriendsList.txt");
open (BILWL, "<BilingualWordList.txt");

#declare hashes
my %falsef;
my %existingfalsefriend;

#while the FF input exists
while (<FALSEF>)
{
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    #increment $line
    $falsef{$line}++;
}

#declare variables
my $token;
my %hash;

#open output files
open (OUTPUT1, ">OutputFalseFriends.txt");
open (OUTPUT2, ">OutputUnsortedWordList.txt");

#while input is received
while (<BILWL>)
{
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    
    #assign $line to the array
    my @wordlist = split/\t/,$line;
    
    #a for-loop to 'clean up' the words, to get rid of all the commas, full stops, etc, except the apostrophes and hyphens
    for (my $x = 0; $x <= $#wordlist; $x++)
    {
        my $token = $wordlist[$x];
        
        if ($token =~ /(['\-\w]+)/)
        {
            #$word is now clean
            my $searchword = $1;
            
            #checks to see whether the word exists in the false friends list
            if (exists $hash{$searchword} || exists $falsef{$searchword})
            {
                $existingfalsefriend{$searchword}++;
                
            }    
            else
            {    
                #print to unsorted.txt
                print OUTPUT2 "$searchword\n";
            }
        }
    }
}

my $searchword;

foreach my $searchword(sort keys %existingfalsefriend)
{
    #sorts the matched words alphabetically
    my $value = $existingfalsefriend{$searchword};
    print OUTPUT1 "$searchword\t $value\n";
}

However, my output does not look the way I want it to. I want the matching lines to be written to OutputFalseFriends.txt and the non-matching lines to be written to OutputUnsortedWordList.txt, like this:

#OutputFalseFriends.txt
vriendelik    aardig
losieshuis    pension
goed    braaf
damwal    dam
irriterend    irritant//vervelend

#OutputUnsortedWordList.txt
eksamen    examen
water    water
rekenaar    computer
outoritêr    outoritaire
wêreld    wereld
alle    alle
word    worden
angesien    overwegende
erkenning    erkenning
afrigter    trainer

But OutputFalseFriends.txt is empty every time, and OutputUnsortedWordList.txt contains my whole input file BilingualWordList.txt, just with every word on its own line. A sample is shown here:

goed
braaf
naak
bloot
damwal
dam
kombers
deken
homoseksueel
flikker
bronstig
geil
munisipaliteit
gemeente

Does anyone have any advice on how I can correct this, please?
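
A minimal sketch of what seems to be going on in the code above, assuming the FileB lines really are tab-separated: the first loop stores each whole line (both columns, tab included) as a key of %falsef, while the second loop looks up single words, so `exists` never finds anything. The sample line is taken from FileB.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %falsef;

# One chomped line of FileB, stored the way the first loop stores it:
# the key is the whole line, tab included.
my $line = "goed\tbraaf";
$falsef{$line}++;

# The second loop looks up one word at a time, which is a different key.
print exists $falsef{'goed'}        ? "found\n" : "not found\n";

# Only the complete line is a key of the hash.
print exists $falsef{"goed\tbraaf"} ? "found\n" : "not found\n";
```

Every word therefore falls into the `else` branch and is printed to the second output file on its own line, which matches the sample above.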

!!!!!!!!!!!!!! UPDATE !!!!!!!!!!!!!!

I finally got my program to do what I wanted it to do! (Well, part 1 of the whole program I'm trying to code, that is.. :p) Here is my code (same input as before, obviously)

#!/usr/bin/perl -w
use strict;
use warnings;
use open ':utf8';
use autodie;

#open FILE B
open (FALSEFRIENDINPUT, "<SNonCognatesAndFF.txt");

#declare hash
my %fileb;

#get a line from   #FILEB
while (my $line = <FALSEFRIENDINPUT>)
{
    #chomp off the new line
    chomp $line;
    
    # split the line on tab
    my ($filebkeys, $filebvalues) = split /\t/, $line;
    $fileb{$filebkeys} = $filebvalues;
    
        #open output files
        open (OUTPUT1, ">OutputMatchedFalseFriends.txt");
        open (OUTPUT2, ">OutputNonMatchedWords.txt");

        #open FILE A
        open (BILINGUALWL, "<BilingualWordList.1.0.0.IW.2012-06-20.txt");

        my %filea;    

        #get a line from   #FILEA
        while (my $line = <BILINGUALWL> )
        {
            chomp $line;
            #split the line on tab
            my ($fileakeys, $fileavalues) = split /\t/, $line;   
        
            #do first columns match?
            if ($fileb{$fileakeys})                         
            {
                #does the second column value contain the other as a substring?
                if ($fileb{$fileakeys} =~ /$fileavalues/ or $fileavalues =~ /$fileb{$fileakeys}/)
                {    
                    #if yes, print it to OutputMatchedFalseFriends.txt
                    print OUTPUT1 "$line\n";
                    #loop to the next line
                    next;                                 
                }
            }
            else
            {
                #if not, print it to OutputNonMatchedWords.txt
                print OUTPUT2 "$line\n";
            }
        }
}

And here is my output:

#OutputMatchedFalseFriends.txt

damwal    dam
bitsig    vinnig
bot    been
dikwels    vaak
aantreklik    knap
bees    rund
baas    chef
bestuur    directie
alles    alles
afrigter    trainer

#OutputNonMatchedWords.txt (only a sample of a 73 line output)

vriendelik    aardig
polisieman    agent
net-net    amper
gedierte    beest
goed    braaf
naak    bloot
kombers    deken
homoseksueel    flikker
bronstig    geil
munisipaliteit    gemeente
menskop    hoofd
toedraai    inpakken
kiestand    kies
dierekop    kop

I have only one question now, though.. Sometimes (and quite randomly) when I run my program, I get the following messages (only one at a time, on a rotating basis):

Can't open '>MatchedFalseFriends.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl  line 25

#and
Can't open '>OutputNonMatchedWords.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl  line 25

Why would this be? Does anyone know? It doesn't seem to hamper the output in any way, though.
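
One thing that stands out in the code above is that both output files and FileA are re-opened inside the outer loop, once for every line of FileB. Whether that is what triggers the intermittent 'Invalid argument' error is only a guess, but opening each file once removes the repeated truncating. The sketch below is a minimal version under some assumptions: the helper name `split_word_list` is hypothetical, the second column is compared by exact equality against the "//"-separated alternatives instead of the regex substring test, and a line whose first column matches but whose translation differs is sent to the unmatched file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: every file is opened exactly once.
sub split_word_list {
    my ($fileb_path, $filea_path, $matched_path, $unmatched_path) = @_;

    # Pass 1: load all of FileB into a hash before touching FileA.
    my %fileb;
    open my $fileb_fh, '<:encoding(UTF-8)', $fileb_path;
    while (my $line = <$fileb_fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        next unless defined $value;    # skip blank or malformed lines
        $fileb{$key} = $value;
    }
    close $fileb_fh;

    # Open the remaining files once, outside any loop.
    open my $filea_fh,     '<:encoding(UTF-8)', $filea_path;
    open my $matched_fh,   '>:encoding(UTF-8)', $matched_path;
    open my $unmatched_fh, '>:encoding(UTF-8)', $unmatched_path;

    # Pass 2: route each FileA line to one of the two outputs.
    while (my $line = <$filea_fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        next unless defined $value;

        # Match when the key is known and the FileA translation is one
        # of the "//"-separated alternatives listed in FileB.
        my $is_match = exists $fileb{$key}
            && grep { $_ eq $value } split m{//}, $fileb{$key};

        print { $is_match ? $matched_fh : $unmatched_fh } "$line\n";
    }

    close $filea_fh;
    close $matched_fh;
    close $unmatched_fh;
    return;
}
```

It could be called as `split_word_list('FalseFriendsList.txt', 'BilingualWordList.txt', 'OutputMatchedFalseFriends.txt', 'OutputNonMatchedWords.txt');` with whatever file names apply.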

Oh, and a big thank you to everyone for their input, code examples and opinions - especially Athanasius and aaron_baugher. I appreciate it. :)

In reply to Comparing / Searching through Hashes by SayWhat?!
