in reply to comparing columns and printing a result

I am having problems understanding the expected results. For example, how can you get one hit from the last line?
KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
So far as I can see, none of the "words" in the first column appear in the second. But it does depend on what you mean by a "word". Can you please explain the match criteria?

Update: This is what I came up with:
#!/usr/bin/perl
use strict;
use warnings;

my $readfile  = 'blah.csv';
my $writefile = 'bleh.csv';

# Note that you were opening $writefile for READ
open my $fh,  "<", $readfile  or die "Unable to open $readfile: $!";
open my $wfh, ">", $writefile or die "Unable to open $writefile: $!";

foreach (<$fh>) {
    $_ = uc $_;
    chomp;
    my ($col1, $col2) = split /;/;
    my @col1_words = split /\s+/, $col1;
    my @col2_words = split /\s+/, $col2;

    # Hash keys give a fast exact-word lookup for column 1
    my %hash;
    @hash{@col1_words} = undef;

    my $found = 0;
    for my $word (@col2_words) {
        $found++ if exists $hash{$word}
    }
    # An array in string concatenation yields its element count
    print $wfh "$_;".@col1_words.";$found\n";
}

close ($fh);
close ($wfh);
Which produces:
SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1 +);3;2
SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;3
O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;2
VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;0
VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;0
KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;0
Further update: corrected a typo.

Replies are listed 'Best First'.
Re^2: comparing columns and printing a result
by slartsa (Initiate) on Jan 26, 2009 at 14:15 UTC
    I'm sorry, I didn't explain this well. Each word should be searched for in the other column as a substring, not as an exact word. So here
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
    you can see "KAAVAIN" is part of the word "TASOKAAVAIN", so it would count as an occurrence. I tested your code and it appears to find occurrences only if the exact word is found.
      OK, except I don't see 2 hits on the second line, I see 3:
      SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS
      TP, 9501, and E-SS.

      This is my version 2:
      #!/usr/bin/perl
      use strict;
      use warnings;

      my $readfile  = 'blah.csv';
      my $writefile = 'bleh.csv';

      open my $fh,  "<", $readfile  or die "Unable to open $readfile: $!";
      open my $wfh, ">", $writefile or die "Unable to open $writefile: $!";

      foreach (<$fh>) {
          $_ = uc $_;
          chomp;
          my ($col1, $col2) = split /;/;
          my @col1_words = split /\s+/, $col1;
          my @col2_words = split /\s+/, $col2;

          my $found = 0;
          # Alternation of all column 1 words, used as a regex
          my $pattern = join ('|', @col1_words);

          for my $col2_word (@col2_words) {
              $found++ if $col2_word =~ /$pattern/;
          }
          print $wfh "$_;".@col1_words.";$found\n";
      }

      close ($fh);
      close ($wfh);
        Yes, there were three of them, my mistake, sorry! Oh no, I have pipes in my list, so running your program gives me an error, but I can always substitute them. Thank you very much! I didn't really ask for a complete program, but apparently my code was so screwed up that this was the easiest approach... I need to look at your code a bit further so that I might understand it some day. Thank you!
        Oops, a problem: earlier I said there would be pipes in the text, but the program also seems to have problems when it finds unclosed brackets or a "+", and those occur rather often :( I understand your code now, but I don't know how to fix this issue. I could easily just replace those characters, but I need to keep the original file as is.
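One possible way around the pipes, brackets and "+" is to escape each column 1 word with quotemeta before joining the words into the pattern, so that every character is matched literally and the data itself is never modified. The following is only a sketch under assumptions: count_hits is a hypothetical helper name that does not appear in the thread, the second sample line is made up, and the counting rule is the same as in version 2 (a column 2 word counts once if it contains any column 1 word).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the number of words
# in column 1 and the number of column 2 words that contain any column 1
# word as a literal substring.
sub count_hits {
    my ($col1, $col2) = @_;
    my @col1_words = split ' ', $col1;   # split ' ' skips leading whitespace
    my @col2_words = split ' ', $col2;

    # An empty pattern would match everything, so bail out early.
    return (0, 0) unless @col1_words;

    # quotemeta backslash-escapes |, +, (, ) and other regex
    # metacharacters, so each word is matched literally.
    my $pattern = join '|', map { quotemeta($_) } @col1_words;

    # grep in scalar context returns the number of matching elements.
    my $found = grep { /$pattern/ } @col2_words;
    return (scalar @col1_words, $found);
}

# Sample lines: the first is from the thread, the second is made up
# to include the troublesome characters.
for my $line ('KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25',
              'A|B (C +;XA|BX (C + D') {
    my ($col1, $col2) = split /;/, uc $line;
    my ($words, $found) = count_hits($col1, $col2);
    print "$line;$words;$found\n";       # original line is printed unchanged
}
```

If no regex features are needed at all, testing each pair of words with index($col2_word, $col1_word) >= 0 should give the same literal substring behaviour without any escaping.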