Comparing arrays

SayWhat?! has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I asked this question in the Chatterbox, and although someone gave me a very good answer, I’m still stuck. I have two text files, each containing a tab-delimited Afrikaans and Dutch bilingual word list. E.g.:

BilingualWordList.txt

net-net	amper
gedierte	beest
wêreld	wereld
alle	alle
regte	rechten

and:

FalseFriends.txt

'n	een
Augustus	augustus
Christelik	christelijk
afdraaipad	afrit//afslag
afdraende	helling

What I would like to do is the following: enter both files into separate arrays.

#!/usr/bin/perl -w
use strict;
use warnings;
use open ':utf8';

open (INPUT1, "<FalseFriends.txt");
open (INPUT2, "<BilingualWordList.txt");

while (<INPUT1>)
{
    my $line = $_;
    chomp $line;
    my @words = $line;
}

while (<INPUT2>)
{
    my $line2 = $_;
    chomp $line2;
    my @words2 = $line2;
}

Now I want to loop through both arrays and see if there are any matches between the two arrays. However, the match does not need to be 100%. Eg.: In BilingualWordList.txt, there could be something like: “beddinkie bedjie”, and in FalseFriends.txt, there could be: “beddinkie bedjie//perkje”. Thus, they would be a match (or a partial match, if you’d like). Or you could get “kombers deken” in both files, and they would also be a match. I tried the loop like this, but the only results I get are a bunch of 0’s. Why is this?

open (OUTPUT1, ">FF.txt");
open (OUTPUT2, ">Unmatched.txt");

    for (my $falsefriend = 0; $falsefriend <= $#words; $falsefriend++)
    {
        for (my $bilingualword = 0; $bilingualword <= $#words; $bilingualword++)
        {
            if ($falsefriend eq $bilingualword)
            {
                print OUTPUT1 "$falsefriend\n";
            }
            else
            {
                print OUTPUT2 "$bilingualword\n";
            }
        }
    }

Now I want to take Unmatched.txt and read it into a hash, so that the Afrikaans words (column 1) are the keys and the Dutch words (column 2) are the values. I then want to compare the keys to the values. If there is a 100% match, both the key and the value need to be written to IdenticalCognates.txt. How would I go about doing this?

UPDATE!!!

Hello again! Thank you so much for your responses. I tried a few things during the day and decided to use hashes instead. I wrote this piece of code, which is supposed to compare the two input files. It executes without errors, but the output is not what I hoped for. The output is supposed to be a FalseFriends.txt file and an Unsorted.txt file. However, when the code is executed, I only get data in Unsorted.txt, and that data is exactly the first column of my BilingualWordList input file. What am I doing wrong? Could anyone help me out, please?

#!/usr/bin/perl -w
use strict;
#use warnings;
use open ':utf8';

#open files
open (FALSEF, "<SNonCognatesAndFF.txt");
open (BILWL, "<BilingualWordList.1.0.0.IW.2012-06-20.txt");

#declare hashes
my %falsef;
my %existingfalsefriend;

#while the FF input exists
while (<FALSEF>)
{
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    #increment $line
    $falsef{$line}++;
}

#declare variables
my $token;
my %hash;

#open output files
open (OUTPUT1, ">YayOutputFalseFriends.txt");
open (OUTPUT2, ">AhhUnsortedWordList.txt");

#while input is received
while (<BILWL>)
{
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    
    #assign $line to the array
    my @wordlist = $line; #split/\t/, $line;
    
    #a for-loop to 'clean up' the words, to get rid of all the commas, full stops, etc, except the apostrophes and hyphens
    for (my $x = 0; $x <= $#wordlist; $x++)
    {
        my $token = $wordlist[$x];
        
        if ($token =~ /('?\w+)/)
        {
            #$word is now clean
            my $searchword = $1;
            
            #checks to see whether the word exists in the false friends list
            if (exists $hash{$searchword} || exists $falsef{$searchword})
            {
                my $existingfalsefriend;
                $existingfalsefriend{$searchword}++;
                
            }    
            else
            {    
                #print to unsorted.txt
                print OUTPUT2 "$searchword\n";
            }
        }
    }
}

my $searchword;

foreach my $searchword(sort keys %existingfalsefriend)
{
    #sorts the matched words alphabetically
    my $value = $existingfalsefriend{$searchword};
    print OUTPUT1 "$searchword\t $value\n";
}


Replies are listed 'Best First'.

Re: Comparing arrays
by McA (Priest) on Jun 27, 2012 at 08:39 UTC

To your first question, "Why is this?":

It's because you compare the indexes of the arrays and not the values of the arrays. Your code:

if ($falsefriend eq $bilingualword)
{
    print OUTPUT1 "$falsefriend\n";
}

But there are some more errors:
a) The upper bound of both loops is the last index of the array words, whereas one of them should use words2.
b) You declare the arrays words and words2 with my inside a block, so they are lexically scoped to that block. If you want to access them afterwards, you have to declare them outside the block that fills them.
c) You do an exact string comparison with 'eq' rather than the substring match you want to achieve. Compare the strings with the match operator instead.
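
The three points above could be put together roughly like this. It is only a sketch under a few assumptions: the sample data is inlined instead of being read from the files, the helper name partial_match is made up for illustration, and "partial match" is taken to mean that the Afrikaans word is identical and at least one of the //-separated Dutch alternatives is shared. Instead of a raw regex match it splits the columns and compares the pieces, which avoids accidental substring hits.

```perl
use strict;
use warnings;

# (b) Declare the arrays outside any loop so they are still visible later.
# Sample data is inlined; in the real script each element would be one
# chomped line from the corresponding file.
my @words  = ("beddinkie\tbedjie//perkje", "kombers\tdeken");          # FalseFriends.txt
my @words2 = ("beddinkie\tbedjie", "kombers\tdeken", "regte\trechten"); # BilingualWordList.txt

# Hypothetical helper: returns a true value when both entries have the same
# Afrikaans word and share at least one Dutch alternative.
sub partial_match {
    my ($entry_a, $entry_b) = @_;
    my ($afr_a, $dutch_a) = split /\t/, $entry_a;
    my ($afr_b, $dutch_b) = split /\t/, $entry_b;
    return 0 unless defined $dutch_a && defined $dutch_b;
    return 0 unless $afr_a eq $afr_b;
    my %alternatives = map { $_ => 1 } split m{//}, $dutch_a;
    my @shared = grep { $alternatives{$_} } split m{//}, $dutch_b;
    return scalar @shared;
}

my (@matched, @unmatched);

# (a) Each loop runs over its own array, and the loop variables hold the
# values of the arrays, not their indexes.
for my $bilingualword (@words2) {
    # (c) A partial comparison instead of a plain 'eq' on the whole line.
    if (grep { partial_match($_, $bilingualword) } @words) {
        push @matched, $bilingualword;
    }
    else {
        push @unmatched, $bilingualword;
    }
}

print "matched:   $_\n" for @matched;
print "unmatched: $_\n" for @unmatched;
```

Collecting the results in @matched and @unmatched first also means that each bilingual entry is written once, rather than once per comparison as in the nested loop.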


Re: Comparing arrays
by zeni (Beadle) on Jun 27, 2012 at 09:27 UTC

while (<INPUT1>)
{
    my $line = $_;
    chomp $line;
    my @words = $line;
}

while (<INPUT2>)
{
    my $line2 = $_;
    chomp $line2;
    my @words2 = $line2;
}

can be replaced with:

my @words1 = <INPUT1>
my @words2 = <INPUT2>

As for matching strings, you can anchor a regular expression with ^ (match at the beginning of the string) and $ (match at the end of the string) as required, so that partial matching is possible.
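
To illustrate what the anchors do, here is a small sketch with made-up sample strings. The \Q...\E pair is an addition: it quotes the interpolated text so that characters such as / or . are matched literally.

```perl
use strict;
use warnings;

my $entry  = "bedjie//perkje";   # sample Dutch column from FalseFriends.txt
my $needle = "bedjie";           # sample Dutch column from the bilingual list

# ^ anchors the match at the beginning of the string.
my $starts_with = $entry =~ /^\Q$needle\E/ ? 1 : 0;

# $ anchors the match at the end of the string.
my $ends_with = $entry =~ /\Q$needle\E$/ ? 1 : 0;

# No anchor: the text may appear anywhere in the string.
my $contains = $entry =~ /\Q$needle\E/ ? 1 : 0;

# Both anchors: the whole string has to match.
my $exact = $entry =~ /^\Q$needle\E$/ ? 1 : 0;

print "starts_with=$starts_with ends_with=$ends_with contains=$contains exact=$exact\n";
# prints: starts_with=1 ends_with=0 contains=1 exact=0
```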

Life is a box of chocolates.. Perhaps you get to eat very few best ones!!


Re^2: Comparing arrays

by zeni (Beadle) on Jun 27, 2012 at 09:34 UTC

Btw where is your regular expression(code) to match strings?


Re^2: Comparing arrays

by muba (Priest) on Jun 27, 2012 at 12:28 UTC

Can it?

use strict;
use warnings;

my @words1 = <INPUT1>
my @words2 = <INPUT2>

__END__
syntax error at G:\x.pl line 5, near "my "
Global symbol "@words2" requires explicit package name at G:\x.pl line 5.
Execution of G:\x.pl aborted due to compilation errors.

It's not even complaining about trying to read from unopened filehandles, because the error occurs at compile time - even before perl realizes that the read operation would fail. The solution is obvious - you forgot a semicolon. But even then, will it hold?

use strict;
use warnings;

open INPUT1, "<", "x.pl";
my @words1 = <INPUT1>;
print "<$_>\n" for @words1;

__END__
<use strict;
>
<use warnings;
>
<
>
<open INPUT1, "<", "x.pl";
>
<my @words1 = <INPUT1>;
>
<print "<$_>\n" for @words1;>

That doesn't look right. And no wonder - your replacement doesn't do the chomping as it happens in OP's code.

Surely you meant

my @words1 = map {chomp; $_} <INPUT1>;
my @words2 = map {chomp; $_} <INPUT2>;
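
An equivalent idiom, shown here only as a sketch, is to chomp the whole array in one go. The example reads from an in-memory string (the sample lines come from the original post) so that it runs without the real input files. The three-argument open with a lexical filehandle and an error check is a suggestion of mine, not something from the code above.

```perl
use strict;
use warnings;

# Stand-in for FalseFriends.txt so the sketch is self-contained.
my $sample = "kombers\tdeken\nbeddinkie\tbedjie//perkje\n";

# Three-argument open with a lexical filehandle. A reference to a scalar
# opens an in-memory file; for a real file, pass its name instead.
open my $in, '<', \$sample or die "Cannot open input: $!";

# Read every line, then strip the trailing newline from all of them at once.
chomp(my @words1 = <$in>);
close $in;

print "<$_>\n" for @words1;
```

Each line now prints on one row between the angle brackets, without the stray line break seen in the output above.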


Re: Comparing arrays
by Athanasius (Archbishop) on Jun 28, 2012 at 07:40 UTC

Note: This is a reply to the UPDATED code added to the original post.

A few quick observations:

my @wordlist = $line; #split/\t/, $line;

The split is needed here. As it is, this just assigns the whole line to $wordlist[0].
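
A quick way to see the difference, using a sample line from the original post (the variable @unsplit is made up for the comparison):

```perl
use strict;
use warnings;

my $line = "afdraaipad\tafrit//afslag";   # sample line, already chomped

my @unsplit  = $line;               # one element: the whole line
my @wordlist = split /\t/, $line;   # two elements: one per column

printf "without split: %d element(s)\n", scalar @unsplit;
printf "with split:    %d element(s): %s\n", scalar @wordlist, join(' | ', @wordlist);
# prints:
# without split: 1 element(s)
# with split:    2 element(s): afdraaipad | afrit//afslag
```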

if ($token =~ /('?\w+)/)

This regular expression needs to be written as a character class, and suitably quantified:

if ($token =~ /(['\-\w]+)/)

See http://perldoc.perl.org/perlre.html and http://perldoc.perl.org/perlretut.html.
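
As a rough illustration of why the character class matters, the sketch below runs both patterns over a few sample tokens (the tokens and the %old and %new hashes are made up for the comparison). The original pattern stops at the first hyphen, while the character class keeps hyphenated words together.

```perl
use strict;
use warnings;

my @tokens = ("net-net,", "'n", "afrit.");   # sample tokens with punctuation
my (%old, %new);

for my $token (@tokens) {
    # Original pattern: an optional apostrophe followed by word characters.
    ($old{$token}) = $token =~ /('?\w+)/;
    # Character class: any run of apostrophes, hyphens and word characters.
    ($new{$token}) = $token =~ /(['\-\w]+)/;
    print "$token\told=$old{$token}\tnew=$new{$token}\n";
}
# For "net-net," the old pattern captures "net" and the new one "net-net".
```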

my $existingfalsefriend;
$existingfalsefriend{$searchword}++;

The second statement assigns to the hash %existingfalsefriend which was declared above. The first statement declares a scalar variable which happens to have the same name but is otherwise unrelated to the hash variable. So, in this context, the first statement does nothing and is unnecessary.
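
A minimal sketch of this point, with made-up sample words: the scalar and the hash live in separate namespaces, so the counts end up in the hash whether or not the scalar is declared.

```perl
use strict;
use warnings;

my %existingfalsefriend;   # the hash, declared once outside the loop

for my $searchword (qw(kombers deken kombers)) {
    my $existingfalsefriend;               # unrelated scalar, never used
    $existingfalsefriend{$searchword}++;   # updates the hash, not the scalar
}

print "$_\t$existingfalsefriend{$_}\n" for sort keys %existingfalsefriend;
# prints "deken\t1" and then "kombers\t2"
```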

If, after fixing your code, you need further help, I suggest you:

re-post in a new node (or ask a new SoPW question);
give an excerpt of each input file (formatted between <code> tags) showing its format and contents; and
show a sample of the output files (similarly formatted) that you wish to generate from your given input.

This will make it easier for the monks to help you.

HTH,

Athanasius <°(((>< contra mundum
