comparing columns and printing a result

slartsa has asked for the wisdom of the Perl Monks concerning the following question:

Re: comparing columns and printing a result by johngg (Canon) on Jan 26, 2009 at 12:08 UTC
It would be helpful if you showed us a sample of your data (not all 5000+ lines :-) and the results you expect from it. I'm afraid I'm not quite sure what you are trying to do from your description.

A couple of points about your code.

You try to read from your `$readfile` before you have opened it.

You `open` your `$writefile` for reading; use `">"` for writing.

Your `my @words = split(//,@col1);` will split into individual characters, not words (and because `split` takes its second argument in scalar context, what gets split is the element count of `@col1`, not its contents). You could use `/\s+/` to split on whitespace to get words (depending on what your data looks like).

I hope these initial impressions are of use to you.

Cheers,

JohnGG

Update: Two more things I just noticed.

You seem to be trying to re-`open` your `$wfh` inside the read loop of `$fh`.

You read your `$fh` in a foreach loop. This has the effect of reading the whole file into memory and then iterating over it. Use a while loop instead, which really does read the file a line at a time.
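A minimal sketch of the points in this reply: open before reading, check that the open worked, read with `while`, and split on whitespace. The helper name `first_column_word_counts` is hypothetical, and an in-memory filehandle stands in for a real file so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the number of
# whitespace-separated words in the first column of each line.
sub first_column_word_counts {
    my ($text) = @_;

    # Open first, and check that the open succeeded.
    # The in-memory filehandle is a stand-in for a real input file.
    open my $fh, '<', \$text or die "Cannot open in-memory data: $!";

    my @counts;

    # while reads one line per iteration; foreach (<$fh>) would read
    # every line into a list before the loop starts.
    while ( my $line = <$fh> ) {
        chomp $line;
        my ($col1) = split /;/, $line;     # text before the first ';'
        my @words  = split /\s+/, $col1;   # words, not characters
        push @counts, scalar @words;
    }

    close $fh or die "Cannot close in-memory data: $!";
    return @counts;
}

my $sample = "O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56\n"
           . "KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25\n";

print join( ',', first_column_word_counts($sample) ), "\n";   # prints 3,4
```

With a real file, the first argument to `open` after the mode would be the file name instead of `\$text`.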
Re^2: comparing columns and printing a result by slartsa (Initiate) on Jan 26, 2009 at 12:55 UTC
This is an example of the source file:

    SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1)
    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS
    O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56
    O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI
    KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK
    VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E
    VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25

This is supposed to be the goal:

    SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1);3;2
    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;2
    O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
    O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
    KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;3
    VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;2
    VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;2
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;1

Structure is: "word1 word2 word3;bla hword1bla word2h blah;3(words in 1st column);2(words found a match from 2nd column)"

Ok, I made changes in the code, which now looks like this:

    #!perl
    use strict;
    use warnings;

    my $readfile = 'blah.csv';
    my $writefile = 'bleh.csv';
    open my $fh, "<", $readfile;
    my $row = <$fh>;
    my $found = 0;
    my @cols;
    my (@col1,@col2);
    my @words = split(/\s+/,@col1);
    open my $wfh, ">", $writefile or die "yikes: $!";

    while (<$fh>) {
        tr /a-z/A-Z/;
        chomp;
        my @cols = split /\;/;
        push @col1, $cols[0];
        push @col2, $cols[1];
        my @words = @words + 1;
        if ( @col2 =~ m/$words[$_](\d+)/ ) {
            $found++;
        }
        print $wfh "$row';'@words';'$found";
        $found = 0;
        @words = 0;
    }

Now I get this report:

    Applying pattern match (m//) to @array will act on scalar(@array) at C:\blah\vertailu2.pl line 27.
    Argument "LSAIDHA 2FA SFF ;ASD 2FA AASDA" isn't numeric in array element at C:\blah\vertailu2.pl line 27, <$fh>
    Argument "3FASFL FAAL;DAOIADJAD" isn't numeric in array element at C:\blah\vertailu2.pl line 27, <$fh> line
    Argument "ASFD ADD AD7A ALUYAD;ADLIHADBA A DADASFD DADD" isn't numeric in array element at C:\blah\vertailu2.pl line 27, <$fh> line 4.

I'm guessing I should somehow define the program to handle both numbers and letters. Don't know how to do it though.
Re^3: comparing columns and printing a result by johngg (Canon) on Jan 26, 2009 at 16:16 UTC
I have taken a different approach from cdarke's hash-based one and have used regular expression matching instead. The regular expression is an alternation of the words found in `$col1`, and doing a global match against `$col2` will find all matches. We are not interested in the text of the matches, just their number, which is what the `my $matches = () = ...` construct achieves. Note that I'm only reading your data from a HEREDOC and writing to a variable to keep everything inside the script on my system. Just substitute normal files if you use some of this code.

    use strict;
    use warnings;

    open my $inFH, q{<}, \ <<EOF or die qq{open: << HEREDOC: $!\n};
    SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1)
    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS
    O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56
    O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI
    KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK
    VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E
    VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
    EOF

    my $outFile;
    open my $outFH, q{>}, \ $outFile or die qq{open: > \ $outFile: $!\n};

    while( <$inFH> )
    {
        chomp;
        my( $col1, $col2 ) = map uc, split m{;};
        # Join the quoted words of column 1 with | to build an alternation.
        my $rxCol1 = do {
            local $" = q{|};
            qr{@{ [ map quotemeta, split m{\s+}, $col1 ] }}
        };
        my @col1Words = split m{\s+}, $col1;
        my $matches = () = $col2 =~ m{$rxCol1}g;
        print $outFH
            join( q{;}, $col1, $col2, scalar @col1Words, $matches ),
            qq{\n};
    }

    close $inFH or die qq{close: << HEREDOC: $!\n};
    close $outFH or die qq{close: > \ $outFile: $!\n};

    print $outFile;

The output.

    SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1);3;2
    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;3
    O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
    O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
    KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;3
    VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;2
    VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;2
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;1

I hope this is helpful.

Cheers,

JohnGG

Update: I just noticed the bit about pipes in your data, so I added a quotemeta in the regex.
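To isolate the counting idiom described in this reply, here is a small sketch, not the code from the post. The helper name `count_matches` is hypothetical. Assigning the global match to an empty list and then to a scalar yields the number of matches rather than their text.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how often any word of $col1 occurs in $col2.
sub count_matches {
    my ( $col1, $col2 ) = @_;

    # Quote each word so regex metacharacters (such as | or .) match
    # literally, then join the words into one alternation.
    my $alternation = join '|', map { quotemeta($_) } split /\s+/, $col1;

    # The global match returns every match in list context; assigning
    # that list to () and then to a scalar gives the number of matches.
    my $count = () = $col2 =~ /$alternation/g;
    return $count;
}

print count_matches( 'KAAVAIN TIZIT 620-20 H10', 'TASOKAAVAIN SANDVIK 620-25' ), "\n";   # prints 1
print count_matches( 'FPM', 'FPM X FPM' ), "\n";                                         # prints 2
```

The second call shows that this counts occurrences in the second column, so a word that appears twice there is counted twice.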
Re: comparing columns and printing a result by JavaFan (Canon) on Jan 26, 2009 at 12:10 UTC
Well, one of your first statements is

    my $row = <$fh>;

Not only haven't you declared $fh yet, you haven't opened the file at this point. That's something you do later. But when you open the files, you're not checking whether this succeeds.

    my (@col1,@col2);
    my @words = split(//,@col1);

What's the point of this? @col1 doesn't contain anything, so what do you want to split? You're also declaring @col, but not using it.

    @col2 =~ m/$words[$_](\d+)/

No idea what you want to do, but on the LHS of a =~ you have to have a scalar, not a list or array. Furthermore, since you never put anything in @words, there will be nothing to match here. And even then, $_ is the line you're reading from the file; it probably doesn't contain a number.
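Since `=~` wants a scalar on its left-hand side, one way to test every element of an array is to let `grep` apply the match element by element. The following is a sketch under that assumption, with a hypothetical helper name `count_elements_containing`. It is not what the original code was necessarily trying to do.

```perl
use strict;
use warnings;

# Hypothetical helper: how many elements of @$items contain $word literally.
sub count_elements_containing {
    my ( $word, $items ) = @_;

    # grep runs the match against each element in turn (as $_);
    # in scalar context it returns the number of matching elements.
    # \Q...\E makes $word match literally.
    return scalar grep { /\Q$word\E/ } @$items;
}

my @col2 = ( 'TASOKAAVAIN SANDVIK 620-25', 'UPPOPUMPPU PUMPEX KV56' );
print count_elements_containing( 'KAAVAIN', \@col2 ), "\n";   # prints 1
```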
Re^2: comparing columns and printing a result by slartsa (Initiate) on Jan 26, 2009 at 13:03 UTC
Thank you for the reply. col1 should be everything before the column separator, and I was trying to split the words in col1 into an array. I don't know if handling the words one by one as search values is the best idea, but that's what I came up with in the first place.
Re: comparing columns and printing a result by cdarke (Prior) on Jan 26, 2009 at 13:50 UTC
I am having problems understanding the expected results. For example, how can you get one hit from the last line?

    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25

So far as I can see, none of the "words" in the first column appear in the second. But it does depend on what you mean by a "word". Can you please explain the match criteria?

Update: This is what I came up with:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $readfile = 'blah.csv';
    my $writefile = 'bleh.csv';

    # Note that you were opening $writefile for READ
    open my $fh, "<", $readfile or die "Unable to open $readfile: $!";
    open my $wfh, ">", $writefile or die "Unable to open $writefile: $!";

    foreach (<$fh>) {
        $_ = uc $_;
        chomp;
        my ($col1, $col2) = split /;/;
        my @col1_words = split /\s+/, $col1;
        my @col2_words = split /\s+/, $col2;

        my %hash;
        @hash{@col1_words} = undef;

        my $found = 0;
        for my $word (@col2_words) {
            $found++ if exists $hash{$word}
        }
        print $wfh "$_;".@col1_words.";$found\n";
    }

    close ($fh);
    close ($wfh);

Which produces:

    SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1);3;2
    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;3
    O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
    O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
    KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;2
    VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;0
    VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;0
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;0

Further update: corrected a typo.
Re^2: comparing columns and printing a result by slartsa (Initiate) on Jan 26, 2009 at 14:15 UTC
I'm sorry, I didn't explain this well. Each word is to be searched for within the other column as a string, not as an exact word. So here

    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25

you can see "KAAVAIN" is part of the word "TASOKAAVAIN", so it would count as an occurrence. I tested your code and it appears to find occurrences only if the exact word is found.
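The substring criterion described here can be sketched with `index`, which needs no regular expression and so no escaping. This is only an illustration of the stated rule, with a hypothetical helper name `count_words_found`. It counts how many words of the first column occur somewhere in the second.

```perl
use strict;
use warnings;

# Hypothetical helper: number of words from $col1 that occur anywhere
# in $col2 as a substring (not only as whole words).
sub count_words_found {
    my ( $col1, $col2 ) = @_;
    my $found = 0;

    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    for my $word ( split ' ', $col1 ) {
        # index returns -1 when $word does not occur in $col2.
        $found++ if index( $col2, $word ) >= 0;
    }
    return $found;
}

print count_words_found( 'KAAVAIN TIZIT 620-20 H10', 'TASOKAAVAIN SANDVIK 620-25' ), "\n";   # prints 1
print count_words_found( 'O-RENGAS 790X12 FPM', 'UPPOPUMPPU PUMPEX KV56' ), "\n";            # prints 0
```

Each first-column word is counted at most once here, however often it appears in the second column.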
Re^3: comparing columns and printing a result by cdarke (Prior) on Jan 26, 2009 at 14:27 UTC
OK, except I don't see 2 hits on the second line, I see 3:

    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS

TP, 9501, and E-SS. This is my version 2:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $readfile = 'blah.csv';
    my $writefile = 'bleh.csv';

    open my $fh, "<", $readfile or die "Unable to open $readfile: $!";
    open my $wfh, ">", $writefile or die "Unable to open $writefile: $!";

    foreach (<$fh>) {
        $_ = uc $_;
        chomp;
        my ($col1, $col2) = split /;/;
        my @col1_words = split /\s+/, $col1;
        my @col2_words = split /\s+/, $col2;

        my $found = 0;
        # Alternation of the column 1 words: word1|word2|...
        my $pattern = join ('|', @col1_words);

        for my $col2_word (@col2_words) {
            $found++ if $col2_word =~ /$pattern/;
        }
        print $wfh "$_;".@col1_words.";$found\n";
    }

    close ($fh);
    close ($wfh);
Re^4: comparing columns and printing a result by slartsa (Initiate) on Jan 26, 2009 at 14:42 UTC
Re^5: comparing columns and printing a result by cdarke (Prior) on Jan 26, 2009 at 15:57 UTC
Re^4: comparing columns and printing a result by slartsa (Initiate) on Jan 27, 2009 at 07:25 UTC