SayWhat?! has asked for the wisdom of the Perl Monks concerning the following question:
Hello! I need some help, please. What I would like to do is the following: I have two text files, BilingualWordList.txt and FalseFriendsList.txt, whose columns are separated by tabs. They look like this:

    # BilingualWordList.txt (Let's call it FileA)
    vriendelik	aardig
    irriterend	vervelend
    losieshuis	pension
    eksamen	examen
    goed	braaf
    damwal	dam
    water	water
    rekenaar	computer
    outoritêr	outoritaire
    wêreld	wereld
    alle	alle
    word	worden
    angesien	overwegende
    erkenning	erkenning
    afrigter	trainer

    # FalseFriendsList.txt (Let's call it FileB)
    vriendelik	aardig
    goed	braaf
    damwal	dam
    bruinmens	kleurling
    kamera	fototoestel
    jammer	sneu//spijten
    japon	ochtendjas
    losieshuis	pension
    buffer	bumper
    bruinmens	kleurling
    brulpadda	brulkikker
    jammerlik	zielig
    buffer	bumper
    irriterend	irritant//vervelend
    kameelperd	giraf//giraffe

I want to take FileB and use it to search through FileA, looking for matches. The matches don't need to be 100% identical, though, because, as you can see, these two entries are almost exactly the same:

    #FileA
    irriterend	vervelend
    #FileB
    irriterend	irritant//vervelend
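One way to express "almost the same" is to treat `//` as a separator between alternative translations and call two entries a match when they share at least one alternative. The following is only a minimal sketch under that assumption; the helper name `translations_overlap` is hypothetical and does not appear anywhere in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of alternatives the two
# translation fields have in common, where alternatives are separated
# by "//". A non-zero result counts as a (loose) match.
sub translations_overlap {
    my ($left, $right) = @_;
    my %seen   = map { $_ => 1 } split m{//}, $left;
    my @common = grep { $seen{$_} } split m{//}, $right;
    return scalar @common;
}

# FileA has "vervelend", FileB has "irritant//vervelend".
print translations_overlap('vervelend', 'irritant//vervelend')
    ? "match\n" : "no match\n";    # prints "match"
print translations_overlap('examen', 'irritant//vervelend')
    ? "match\n" : "no match\n";    # prints "no match"
```

Comparing whole alternatives, rather than testing one field as a substring of the other, avoids accidental hits such as a short word matching inside a longer, unrelated one.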
Thus far, my code looks like this:
    #!/usr/bin/perl -w
    use strict;
    #use warnings;
    use open ':utf8';

    #open files
    open (FALSEF, "<FalseFriendsList.txt");
    open (BILWL, "<BilingualWordList.txt");

    #declare hashes
    my %falsef;
    my %existingfalsefriend;

    #while the FF input exists
    while (<FALSEF>) {
        #assign each line to $line
        my $line = $_;
        #chomp off the new line
        chomp $line;
        #increment $line
        $falsef{$line}++;
    }

    #declare variables
    my $token;
    my %hash;

    #open output files
    open (OUTPUT1, ">OutputFalseFriends.txt");
    open (OUTPUT2, ">OutputUnsortedWordList.txt");

    #while input is received
    while (<BILWL>) {
        #assign each line to $line
        my $line = $_;
        #chomp off the new line
        chomp $line;
        #assign $line to the array
        my @wordlist = split/\t/,$line;
        #a for-loop to 'clean up' the words, to get rid of all the commas,
        #full stops, etc, except the apostrophes and hyphens
        for (my $x = 0; $x <= $#wordlist; $x++) {
            my $token = $wordlist[$x];
            if ($token =~ /(['\-\w]+)/) {
                #$word is now clean
                my $searchword = $1;
                #checks to see whether the word exists in the false friends list
                if (exists $hash{$searchword} || exists $falsef{$searchword}) {
                    $existingfalsefriend{$searchword}++;
                }
                else {
                    #print to unsorted.txt
                    print OUTPUT2 "$searchword\n";
                }
            }
        }
    }

    my $searchword;
    foreach my $searchword(sort keys %existingfalsefriend) {
        #sorts the matched words alphabetically
        my $value = $existingfalsefriend{$searchword};
        print OUTPUT1 "$searchword\t $value\n";
    }
However, my output does not look the way I want it to. I want the matching lines written to OutputFalseFriends.txt and the non-matching lines written to OutputUnsortedWordList.txt, like this:

    #OutputFalseFriends.txt
    vriendelik	aardig
    losieshuis	pension
    goed	braaf
    damwal	dam
    irriterend	irritant//vervelend

    #OutputUnsortedWordList.txt
    eksamen	examen
    water	water
    rekenaar	computer
    outoritêr	outoritaire
    wêreld	wereld
    alle	alle
    word	worden
    angesien	overwegende
    erkenning	erkenning
    afrigter	trainer

But OutputFalseFriends.txt is empty every time, and OutputUnsortedWordList.txt contains my whole input file BilingualWordList.txt, just with every word on its own line. A sample is shown here:

    goed
    braaf
    naak
    bloot
    damwal
    dam
    kombers
    deken
    homoseksueel
    flikker
    bronstig
    geil
    munisipaliteit
    gemeente
Does anyone have any advice on how I can correct this, please?
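A likely reason for the empty file, judging only from the script as posted: the false-friends hash is keyed on each whole line (both columns plus the tab), while the lookup uses a single cleaned-up word, so `exists` can never succeed. The sketch below illustrates the difference with two sample lines from the post; keying on the first column is an assumed alternative, not the original code, and the hash names `%by_line` and `%by_word` are made up for the illustration.

```perl
use strict;
use warnings;

# Two sample lines from FalseFriendsList.txt; columns are tab-separated.
my @fileb_lines = ("vriendelik\taardig", "goed\tbraaf");

my (%by_line, %by_word);
for my $line (@fileb_lines) {
    # Keyed on the whole line, as in the posted script: key is "goed\tbraaf".
    $by_line{$line} = 1;

    # Keyed on the first column only (assumed alternative).
    my ($first_column) = split /\t/, $line;
    $by_word{$first_column} = 1;
}

# Looking up the single word "goed" in each hash.
printf "whole-line keys:   %s\n", exists $by_line{goed} ? 'found' : 'not found';
printf "first-column keys: %s\n", exists $by_word{goed} ? 'found' : 'not found';
```

With whole-line keys the word is reported as not found, which would send every word to the unsorted file, as described above.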
!!!!!!!!!!!!!! UPDATE !!!!!!!!!!!!!!
I finally got my program to do what I wanted it to do! (Well, part 1 of the whole program I'm trying to code, that is.. :p) Here is my code (same input as before, obviously)
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use open ':utf8';
    use autodie;

    #open FILE B
    open (FALSEFRIENDINPUT, "<SNonCognatesAndFF.txt");

    #declare hash
    my %fileb;

    #get a line from #FILEB
    while (my $line = <FALSEFRIENDINPUT>) {
        #chomp off the new line
        chomp $line;
        # split the line on tab
        my ($filebkeys, $filebvalues) = split /\t/, $line;
        $fileb{$filebkeys} = $filebvalues;

        #open output files
        open (OUTPUT1, ">OutputMatchedFalseFriends.txt");
        open (OUTPUT2, ">OutputNonMatchedWords.txt");

        #open FILE A
        open (BILINGUALWL, "<BilingualWordList.1.0.0.IW.2012-06-20.txt");
        my %filea;

        #get a line from #FILEA
        while (my $line = <BILINGUALWL> ) {
            chomp $line;
            #split the line on tab
            my ($fileakeys, $fileavalues) = split /\t/, $line;
            #do first columns match?
            if ($fileb{$fileakeys}) {
                #does the second column value contain the other as a substring?
                if ($fileb{$fileakeys} =~ /$fileavalues/ or $fileavalues =~ /$fileb{$fileakeys}/) {
                    #if yes, print it to OutputMatchedFalseFriends.txt
                    print OUTPUT1 "$line\n";
                    #loop to the next line
                    next;
                }
            }
            else {
                #if not, print it to OutputNonMatchedWords.txt
                print OUTPUT2 "$line\n";
            }
        }
    }
And here is my output:
    #OutputMatchedFalseFriends.txt
    damwal	dam
    bitsig	vinnig
    bot	been
    dikwels	vaak
    aantreklik	knap
    bees	rund
    baas	chef
    bestuur	directie
    alles	alles
    afrigter	trainer

    #OutputNonMatchedWords.txt (only a sample of a 73 line output)
    vriendelik	aardig
    polisieman	agent
    net-net	amper
    gedierte	beest
    goed	braaf
    naak	bloot
    kombers	deken
    homoseksueel	flikker
    bronstig	geil
    munisipaliteit	gemeente
    menskop	hoofd
    toedraai	inpakken
    kiestand	kies
    dierekop	kop
I have only one question now, though.. Sometimes (and quite randomly) when I run my program, I get the following messages (only one at a time, on a rotating basis):
    Can't open '>MatchedFalseFriends.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl line 25
    #and
    Can't open '>OutputNonMatchedWords.txt' for writing: 'Invalid argument' at Script.ExtractionofCognates.1.0.5.2012.06.28.pl line 25

Why would this be? Does anyone know, maybe? But it doesn't hamper the output in any way.
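Nothing shown here confirms the cause of the intermittent error. One thing that stands out in the updated script, though, is that the first `while` loop is not closed until the very end, so both output files are reopened (and FileA is reread) for every line of FileB. Whether that triggers the 'Invalid argument' message is an assumption, but opening each file exactly once removes the repeated opens either way. The sketch below shows that structure; it uses in-memory strings as stand-ins for the real files and matches on the first column only, so the sample data, variable names and simplified test are all illustrative rather than taken from the original program.

```perl
use strict;
use warnings;

# Stand-ins for the two input files (assumed sample data, tab-separated).
my $fileb_text = "damwal\tdam\nirriterend\tirritant//vervelend\n";
my $filea_text = "damwal\tdam\neksamen\texamen\n";

# Pass 1: read all of FileB into the hash, and close the loop right here.
my %fileb;
open my $fileb_fh, '<', \$fileb_text or die "Cannot read FileB: $!";
while (my $line = <$fileb_fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line;
    $fileb{$key} = $value;
}
close $fileb_fh;

# Open each output once, outside any loop.
# (In-memory here; a real script would pass file names instead.)
my ($matched, $unmatched) = ('', '');
open my $matched_fh,   '>', \$matched   or die "Cannot open output: $!";
open my $unmatched_fh, '>', \$unmatched or die "Cannot open output: $!";

# Pass 2: a single pass over FileA.
open my $filea_fh, '<', \$filea_text or die "Cannot read FileA: $!";
while (my $line = <$filea_fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line;
    if (exists $fileb{$key}) {
        print {$matched_fh} "$line\n";
    }
    else {
        print {$unmatched_fh} "$line\n";
    }
}
close $_ for $filea_fh, $matched_fh, $unmatched_fh;

print "matched:\n$matched", "unmatched:\n$unmatched";
```

The three-argument `open` with lexical filehandles also keeps the mode separate from the file name, which makes messages about a failed open easier to read.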
Oh, and a big thank you to everyone for their input, code examples and opinions - especially Athanasius and aaron_baugher. I appreciate it. :)
Replies are listed 'Best First'.

Re: Comparing / Searching through Hashes
by flexvault (Monsignor) on Jun 28, 2012 at 13:12 UTC

Re: Comparing / Searching through Hashes
by ww (Archbishop) on Jun 28, 2012 at 12:34 UTC
    by SayWhat?! (Novice) on Jun 28, 2012 at 12:48 UTC
    by Athanasius (Archbishop) on Jun 28, 2012 at 14:00 UTC

Re: Comparing / Searching through Hashes
by aaron_baugher (Curate) on Jun 28, 2012 at 13:46 UTC

Re: Comparing / Searching through Hashes
by RichardK (Parson) on Jun 28, 2012 at 13:01 UTC

Re: Comparing / Searching through Hashes
by SayWhat?! (Novice) on Jun 28, 2012 at 16:43 UTC
    by aaron_baugher (Curate) on Jun 28, 2012 at 21:15 UTC