SayWhat?! has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I need some help, please. What I would like to do is the following: I have two text files, BilingualWordList.txt and FalseFriendsList.txt, whose columns are separated by tabs. They look like this:

# BilingualWordList.txt (Let's call it FileA)
vriendelik    aardig
irriterend    vervelend
losieshuis    pension
eksamen       examen
goed          braaf
damwal        dam
water         water
rekenaar      computer
outoritêr     outoritaire
wêreld        wereld
alle          alle
word          worden
angesien      overwegende
erkenning     erkenning
afrigter      trainer

# FalseFriendsList.txt (Let's call it FileB)
vriendelik    aardig
goed          braaf
damwal        dam
bruinmens     kleurling
kamera        fototoestel
jammer        sneu//spijten
japon         ochtendjas
losieshuis    pension
buffer        bumper
bruinmens     kleurling
brulpadda     brulkikker
jammerlik     zielig
buffer        bumper
irriterend    irritant//vervelend
kameelperd    giraf//giraffe

I want to take FileB and use it to search through FileA, looking for matches. The matches don't need to be 100% identical, though, because, as you can see, these two entries are almost exactly the same:

#FileA
irriterend    vervelend

#FileB
irriterend    irritant//vervelend

Thus far, my code looks like this:

#!/usr/bin/perl -w
use strict;
#use warnings;
use open ':utf8';

#open files
open (FALSEF, "<FalseFriendsList.txt");
open (BILWL, "<BilingualWordList.txt");

#declare hashes
my %falsef;
my %existingfalsefriend;

#while the FF input exists
while (<FALSEF>) {
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    #increment $line
    $falsef{$line}++;
}

#declare variables
my $token;
my %hash;

#open output files
open (OUTPUT1, ">OutputFalseFriends.txt");
open (OUTPUT2, ">OutputUnsortedWordList.txt");

#while input is received
while (<BILWL>) {
    #assign each line to $line
    my $line = $_;
    #chomp off the new line
    chomp $line;
    #assign $line to the array
    my @wordlist = split/\t/,$line;
    #a for-loop to 'clean up' the words, to get rid of all the commas,
    #full stops, etc, except the apostrophes and hyphens
    for (my $x = 0; $x <= $#wordlist; $x++) {
        my $token = $wordlist[$x];
        if ($token =~ /(['\-\w]+)/) {
            #$word is now clean
            my $searchword = $1;
            #checks to see whether the word exists in the false friends list
            if (exists $hash{$searchword} || exists $falsef{$searchword}) {
                $existingfalsefriend{$searchword}++;
            }
            else {
                #print to unsorted.txt
                print OUTPUT2 "$searchword\n";
            }
        }
    }
}

my $searchword;
foreach my $searchword(sort keys %existingfalsefriend) {
    #sorts the matched words alphabetically
    my $value = $existingfalsefriend{$searchword};
    print OUTPUT1 "$searchword\t $value\n";
}

However, my output does not look the way I want it to. I want the matching lines to be written to OutputFalseFriends.txt, and the non-matching lines to be written to OutputUnsortedWordList.txt, like this:

#OutputFalseFriends.txt
vriendelik    aardig
losieshuis    pension
goed          braaf
damwal        dam
irriterend    irritant//vervelend

#OutputUnsortedWordList.txt
eksamen       examen
water         water
rekenaar      computer
outoritêr     outoritaire
wêreld        wereld
alle          alle
word          worden
angesien      overwegende
erkenning     erkenning
afrigter      trainer

But OutputFalseFriends.txt is empty every time, and OutputUnsortedWordList.txt contains my whole input file BilingualWordList.txt, just with every word on its own line. A sample is shown here:

goed braaf naak bloot damwal dam kombers deken homoseksueel flikker bronstig geil munisipaliteit gemeente

Does anyone have any advice on how I can correct this, please?

!!!!!!!!!!!!!! UPDATE !!!!!!!!!!!!!!

I finally got my program to do what I wanted it to do! (Well, part 1 of the whole program I'm trying to code, that is.. :p) Here is my code (same input as before, obviously)

#!/usr/bin/perl -w
use strict;
use warnings;
use open ':utf8';
use autodie;

#open FILE B
open (FALSEFRIENDINPUT, "<SNonCognatesAndFF.txt");

#declare hash
my %fileb;

#get a line from #FILEB
while (my $line = <FALSEFRIENDINPUT>) {
    #chomp off the new line
    chomp $line;
    # split the line on tab
    my ($filebkeys, $filebvalues) = split /\t/, $line;
    $fileb{$filebkeys} = $filebvalues;

    #open output files
    open (OUTPUT1, ">OutputMatchedFalseFriends.txt");
    open (OUTPUT2, ">OutputNonMatchedWords.txt");

    #open FILE A
    open (BILINGUALWL, "<BilingualWordList.1.0.0.IW.2012-06-20.txt");
    my %filea;

    #get a line from #FILEA
    while (my $line = <BILINGUALWL> ) {
        chomp $line;
        #split the line on tab
        my ($fileakeys, $fileavalues) = split /\t/, $line;
        #do first columns match?
        if ($fileb{$fileakeys}) {
            #does the second column value contain the other as a substring?
            if ($fileb{$fileakeys} =~ /$fileavalues/ or $fileavalues =~ /$fileb{$fileakeys}/) {
                #if yes, print it to OutputMatchedFalseFriends.txt
                print OUTPUT1 "$line\n";
                #loop to the next line
                next;
            }
        }
        else {
            #if not, print it to OutputNonMatchedWords.txt
            print OUTPUT2 "$line\n";
        }
    }
}

And here is my output:

#OutputMatchedFalseFriends.txt
damwal            dam
bitsig            vinnig
bot               been
dikwels           vaak
aantreklik        knap
bees              rund
baas              chef
bestuur           directie
alles             alles
afrigter          trainer

#OutputNonMatchedWords.txt (only a sample of a 73 line output)
vriendelik        aardig
polisieman        agent
net-net           amper
gedierte          beest
goed              braaf
naak              bloot
kombers           deken
homoseksueel      flikker
bronstig          geil
munisipaliteit    gemeente
menskop           hoofd
toedraai          inpakken
kiestand          kies
dierekop          kop

I have only one question now, though.. Sometimes (and quite randomly) when I run my program, I get the following messages (only one at a time, on a rotating basis):

Can't open '>MatchedFalseFriends.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl line 25
#and
Can't open '>OutputNonMatchedWords.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl line 25

Why would this be? Does anyone know, maybe? But it doesn't hamper the output in any way..
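One thing visible in the updated script is that both output files, and FileA itself, are opened inside the loop that reads FileB, so they are reopened and truncated once for every FileB line. Only the last pass survives, which happens to be the pass where %fileb is complete. Whether this repeated reopening is what triggers the intermittent 'Invalid argument' message is an assumption; nobody in the thread confirms it. Also note that in the script above, a FileA line whose first column matches but whose second column does not is written to neither file. The sketch below loads FileB completely first and opens each handle exactly once. The helper names load_pairs and split_matches are hypothetical, the sample data is held in memory so the sketch runs without any files, and sending every non-matching line to the "miss" output is a design choice, not the original behaviour.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a two-column, tab-separated list from an
# already-open filehandle into a hash (first column => second column).
sub load_pairs {
    my ($fh) = @_;
    my %pairs;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $key, $value ) = split /\t/, $line, 2;
        next unless defined $key && defined $value;    # skip blank or malformed lines
        $pairs{$key} = $value;
    }
    return \%pairs;
}

# Hypothetical helper: send each FileA line to the match or the miss handle.
sub split_matches {
    my ( $fileb, $in, $out_match, $out_miss ) = @_;
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $key, $value ) = split /\t/, $line, 2;
        next unless defined $key && defined $value;
        my $other = $fileb->{$key};

        # index() does a literal substring test, so no regex escaping is needed.
        if ( defined $other
            && ( index( $other, $value ) >= 0 || index( $value, $other ) >= 0 ) )
        {
            print {$out_match} "$line\n";
        }
        else {
            print {$out_miss} "$line\n";
        }
    }
    return;
}

# Demonstration on in-memory data taken from the sample lists.
my $fileb_text = "irriterend\tirritant//vervelend\ndamwal\tdam\nkamera\tfototoestel\n";
my $filea_text = "irriterend\tvervelend\ndamwal\tdam\nwater\twater\n";
my ( $matched, $missed ) = ( '', '' );

open my $b_in, '<', \$fileb_text or die "Cannot read FileB sample: $!";
my $fileb = load_pairs($b_in);
close $b_in;

# Each handle is opened exactly once, before the FileA loop starts.
open my $a_in,      '<', \$filea_text or die "Cannot read FileA sample: $!";
open my $out_match, '>', \$matched    or die "Cannot open match output: $!";
open my $out_miss,  '>', \$missed     or die "Cannot open miss output: $!";
split_matches( $fileb, $a_in, $out_match, $out_miss );
close $_ for $a_in, $out_match, $out_miss;

print "Matched:\n$matched\nNot matched:\n$missed";
```

With real files, the references to in-memory strings would simply be replaced by file names such as "OutputMatchedFalseFriends.txt".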

Oh, and a big thank you to everyone for their input, code examples and opinions - especially Athanasius and aaron_baugher. I appreciate it. :)

Replies are listed 'Best First'.
Re: Comparing / Searching through Hashes
by flexvault (Monsignor) on Jun 28, 2012 at 13:12 UTC

    Welcome SayWhat?!,

    Rather than debug your script for you, I hope I can make some suggestions for you to learn to better debug your scripts. Note: these are suggestions and after some years of doing your own debugging your suggestions could be very different than mine.

    If your Perl supports it, use the 3 parameter version of open:

    open (my $FALSEF, "<", "FalseFriendsList.txt");
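As a minimal sketch of the three-argument form with a lexical filehandle and an explicit error check: the helper name count_lines is hypothetical, and the UTF-8 layer is an assumption based on the `use open ':utf8'` line in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each line occurs in a file.
sub count_lines {
    my ($path) = @_;

    # Mode and file name are separate arguments, so odd characters in the
    # name can never be mistaken for part of the mode.
    open( my $fh, '<:encoding(UTF-8)', $path )
        or die "Cannot open '$path' for reading: $!";

    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        $seen{$line}++;
    }
    close $fh or die "Cannot close '$path': $!";
    return \%seen;
}

# Usage with the file name from the question; the eval only keeps this
# demonstration from dying when the file is not present.
my $falsef = eval { count_lines('FalseFriendsList.txt') };
print defined $falsef
    ? scalar( keys %$falsef ) . " distinct lines read\n"
    : "Open failed: $@";
```

The lexical handle $fh is closed automatically when it goes out of scope, which bareword handles such as FALSEF are not.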
    I like how you name your files, and you can do the same with Perl variables, i.e.

    my %ExistingFalseFriend;
    # or
    my %existing_false_friend;
    # or
    my %Existing_False_Friend;

    This will help your eyes see exactly what you wanted to emphasize when you look at the code sometime in the future. Also, I define a $Debug variable that I set to the level of debugging:

    my $Debug = 3;    # 0 - production, 1 - minor debugging, 2 .. 9 for different levels

    This leads to what is needed most in your script -- self-help debugging information. For example why not use a 'foreach' on the hash(or array) to see exactly what you just created.

    #while the FF input exists
    while (my $line = <FALSEF> ) {
        # chomp off the new line
        chomp $line;    # chomp could be part of while statement
        # increment $line
        $falsef{$line}++;
    }

    ### Now during debugging print to open log file
    ### ($LOG is assumed to be a log filehandle you have already opened)
    if ( $Debug )    # this could be 'if ( 1==1 )' for testing, your call
    {
        foreach my $key ( sort keys %falsef ) {
            print $LOG "$key\t$falsef{$key}\n";
        }
    }
    Now in your debug log file, you may see that you didn't initialize the hash the way you wanted. Again, my suggestion of 'foreach' could have been replaced by a print statement inside the 'while' loop.

    In general, I like your style and I'm sure it will get better and better.

    Good Luck...You're on your way!

    "Well done is better than well said." - Benjamin Franklin

Re: Comparing / Searching through Hashes
by ww (Archbishop) on Jun 28, 2012 at 12:34 UTC

      I have already done that and added the changes like suggested, but I'm still stuck. Someone called Athanasius suggested that I should start a new question in which I clarify what I would like my program to do. So that's what I did. :) I personally think this question explains in more detail than my previous one, though..

        Hello again, SayWhat?!,

        Good job on clarifying the question. I’m getting some idea of what you want to achieve. However, from your statement:

        I want the matching lines to be written to OutputFalseFriends.txt, and the non-matching lines to be written to OutputUnsortedWordList.txt

        I would not expect the second output file to contain the line goed braaf, since this appears in both the input files. Is this a mistake, or am I missing something?

        (Incidentally, a large part of ‘programming’ is really sorting out requirements, independently of the actual code. This is just something we all need to get used to.)

        I think that, if you re-examine your code in light of your sample input, you will see that the requirements have evolved. For example, as RichardK observes, there are no commas, etc., to be cleaned up. Also, %hash is not being used for anything. Rather than fix your main loop — the loop beginning while (<BILWL>) — it will probably be easier if you re-think the logic of what you are doing and re-write this part from scratch.

        flexvault has given some excellent advice. In addition, it will help you if you include the line:

        use autodie;

        near the top of your script, as this will tell you when files cannot be opened, etc. Also, if you:

        use Data::Dumper;

        this will make the debugging task easier. (Data::Dumper is a core module, so it will already be in your Perl installation. See http://perldoc.perl.org/Data/Dumper.html for details.) For example, you can print the contents of %falsef with just:

        print Dumper(\%falsef);
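A short sketch of what such a dump shows, using a few sample lines from FalseFriendsList.txt and the same `$falsef{$line}++` step as the original script. The Sortkeys setting and the choice of STDERR are only conveniences assumed here.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort keys so repeated runs print the hash in the same order.
$Data::Dumper::Sortkeys = 1;

# Sample lines taken from FalseFriendsList.txt in the question.
my @lines = ( "goed\tbraaf", "damwal\tdam", "buffer\tbumper", "buffer\tbumper" );

# Same bookkeeping as the original script: the whole line is the key.
my %falsef;
$falsef{$_}++ for @lines;

# Dump to STDERR so debugging output stays out of the real output.
my $dump = Dumper( \%falsef );
print STDERR $dump;
```

The dump makes it visible that every key is a complete line (both columns, tab included), while the main loop of the original script looks up single words such as 'goed'. Those lookups can never succeed, which is consistent with OutputFalseFriends.txt staying empty.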

        You’re making progress!

        Athanasius <°(((><contra mundum

Re: Comparing / Searching through Hashes
by aaron_baugher (Curate) on Jun 28, 2012 at 13:46 UTC

    Your requirements aren't quite clear (at least to me). You say lines don't have to match 100%, but then, how exactly do they have to match? Is it enough for the first columns to match, or does the second column have to match partly in some way? In the example you give (repeated below), the first column values are identical, but in the second column, one is a substring of the other, and your attempt to tokenize makes me think you might be trying to match more than just the first column. Can you clarify?

    # these should match, but why exactly?
    #FileA
    irriterend    vervelend
    #FileB
    irriterend    irritant//vervelend

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Comparing / Searching through Hashes
by RichardK (Parson) on Jun 28, 2012 at 13:01 UTC

    You could always step through your code with the debugger 'perl -d' to see what it's really doing.

    But at first glance, you don't actually clean up the word list as your comment suggests.

    What do you think this does?

    for (my $x = 0; $x <= $#wordlist; $x++) {
        my $token = $wordlist[$x];
        if ($token =~ /(['\-\w]+)/)
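To see what that pattern does, it can be tried on a few fields in isolation. The helper name first_token is hypothetical, and the sample fields are made up for illustration apart from 'irritant//vervelend', which comes from FileB.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the question's pattern captures,
# or undef when the pattern does not match at all.
sub first_token {
    my ($text) = @_;
    return $text =~ /(['\-\w]+)/ ? $1 : undef;
}

for my $field ( 'vriendelik', 'irritant//vervelend', '  net-net,', '//' ) {
    my $token = first_token($field);
    printf "%-22s -> %s\n", "'$field'", defined $token ? "'$token'" : 'no match';
}
```

The match only captures the first run of word characters, apostrophes and hyphens. It removes nothing from the field, and anything after the first separator (here 'vervelend') is silently ignored.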
Re: Comparing / Searching through Hashes
by SayWhat?! (Novice) on Jun 28, 2012 at 16:43 UTC

    Hello to you too, Athanasius! :)

    I'm afraid that the "goed braaf" duplicate is a mistake that slipped in by means of copying and pasting... I have already corrected it.

    I have to thank you though, you're giving me such helpful advice - in a language I can understand! I hope you will continue doing so, if you don't mind.. :)

    aaron_baugher The first columns definitely have to match and the second column has to match partly, yes. The reason for this is because they are false friends of one another.

    In the first column we have the Afrikaans word 'irriterend'. In Dutch, the word for 'irriterend' can be either 'irritant' or 'vervelend'.

    It is like English and German: a 'chef' in English is ‘someone who cooks for a living’, but a 'Chef' in German is the 'director or boss of a company'. Does that make better sense now? I don’t really know how else to explain it, I'm afraid..

    Thanks for all the advice, though! You're really helping me to make progress! :)

    And I would also like to apologise to anyone that I might have annoyed by posting a new question.. I'm still new at this whole posting thing. Hopefully it will improve, just like my coding hopefully will!

      Thanks for the clarification. In that case, I think you're on the right track: put fileB in a hash, then go through fileA checking each key from fileA for existence as a key in fileB. That's the standard idiom for this kind of thing, but in your case there will be the extra step that once you find a match on the keys from the first column, you'll also need to check for a match on the second column. That might look something like the code below (untested). The tricky part may be that inner if comparison. In mine, I'm just testing to see if either value is found as a substring in the other. If you need something more sophisticated, you'll have to adjust that there.

      # %b is a hash already containing the values from fileB, with the
      # first column as keys and the second column as values.
      # $file_of_matches is a file descriptor opened to one output file
      # $file_of_misses is a file descriptor for the other output file

      open my $fileA, '<', 'fileA' or die $!;
      while( my $line = <$fileA> ){                # get a line from fileA
        chomp $line;
        my( $k, $v ) = split /\t/, $line;          # split the line on tab
        if( $b{$k} ){                              # do first columns match?
          if( $b{$k} =~ /$v/ or $v =~ /$b{$k}/ ){  # does one second column value contain
                                                   # the other as a substring?
            print $file_of_matches "$line\n";      # yes, so print it to the match file
            next;                                  # and loop to the next line
          }
        }
        print $file_of_misses "$line\n";           # no, so print it to the non-match file
      }
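If the inner comparison needs to be more sophisticated, one possible variation is to compare the second columns alternative by alternative instead of as substrings. This is a sketch under the assumption that '//' separates alternatives, as in 'irritant//vervelend'; the helper name share_alternative is hypothetical, and the test is deliberately stricter than the substring check above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the two second-column values share at
# least one alternative, assuming '//' separates alternatives.
sub share_alternative {
    my ( $left, $right ) = @_;

    # Index the alternatives of the left value for exact lookups.
    my %in_left = map { $_ => 1 } grep { length } split m{//}, $left;

    for my $alt ( grep { length } split m{//}, $right ) {
        return 1 if $in_left{$alt};
    }
    return 0;
}

for my $pair ( [ 'vervelend', 'irritant//vervelend' ], [ 'dam', 'damwal' ] ) {
    my $verdict = share_alternative(@$pair) ? 'match' : 'no match';
    print "$pair->[0] vs $pair->[1]: $verdict\n";
}
```

Because the alternatives are compared with exact hash lookups, nothing is interpolated into a regex, so characters such as '.' or '(' in the data cannot change the meaning of the test. The trade-off is that 'dam' no longer matches 'damwal', which the substring version would accept.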

      By the way, note that this:

      while( my $line = <$fileA> ){
        # do stuff with $line

      # replaces this:

      while( <$fileA> ){
        my $line = $_;
        # do stuff with $line

      It saves a line and avoids potential bugs that may be caused by using $_ sort of halfway.

      Aaron B.
      Available for small or large Perl jobs; see my home node.