slartsa has asked for the wisdom of the Perl Monks concerning the following question:

My perl cherry has popped with this one so please be gentle. I have a long list (5000+ lines) with two columns separated by a semicolon. I am supposed to search for each word of the first column in the second column and print the number of hits into a new column, so I started writing this script.
#!perl

use strict;
use warnings;

my $readfile = 'blah.csv';
my $writefile = 'bleh.csv';
my $row = <$fh>;
my $found = 0;
my @cols;
my (@col1,@col2);
my @words = split(//,@col1);

open my $fh, "<", $readfile;
open my $wfh, "<", $writefile;

foreach (<$fh>) {
    tr /a-z/A-Z/;
    chomp;
    my @cols = split /\;/;
    push @col1, $cols[0];
    push @col2, $cols[1];

    if ( @col2 =~ m/$words[$_](\d+)/ ) {
        $found++;
        @words++;
    }
    open($wfh) or die "YARRG: $!";
    print $wfh "$row';'@words';'$found";
    $found = 0;
    @words = 0;
}
I have absolutely no idea of coding so please don't insult me if you find any brain misusage from within the code =) I'm getting this report when trying to run it:
Applying pattern match (m//) to @array will act on scalar(@array) at C:\blah\vertailu2.pl line 24.
Global symbol "$fh" requires explicit package name at C:\blah\vertailu2.pl line 8.
Execution of C:\blah\vertailu2.pl aborted due to compilation errors.
I don't know how I could fix those, help please? As many of you might already have noticed, my intention is not to print the results into the file that is being read; instead the whole thing is copied into a new file with the results added. All help is appreciated! edit: oops, "(a)words" is not supposed to be in here but it doesn't really matter concerning the result.
if ( @col2 =~ m/$words[$_](\d+)/ ) { $found++; @words++; }

Re: comparing columns and printing a result
by johngg (Canon) on Jan 26, 2009 at 12:08 UTC

    It would be helpful if you showed us a sample of your data (not all 5000+ lines :-) and the results you expect from it. I'm afraid I'm not quite sure what you are trying to do from your description.

    A couple of points with your code.

    • You try to read from your $readfile before you have opened it.
    • You open your $writefile for reading; use ">" to open it for writing.
    • Your my @words = split(//,@col1); will split into individual characters, not words. You could use /\s+/ to split on whitespace to get words (depending on what your data looks like.)
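
The difference between the two split patterns mentioned in the last point can be seen in a small sketch. The value of $col1 here is just one of the first-column entries from the thread, used as an assumed example:

```perl
use strict;
use warnings;

# Assumed sample value for the first column.
my $col1 = 'O-RENGAS 99X7 FPM';

# An empty pattern splits between every character.
my @chars = split //, $col1;

# Splitting on runs of whitespace yields the words.
my @words = split /\s+/, $col1;

print scalar(@chars), " characters\n";
print scalar(@words), " words: @words\n";
```

With the empty pattern every character, including each space, becomes its own list element, which is why the word count came out wrong.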

    I hope these initial impressions are of use to you.

    Cheers,

    JohnGG

    Update: Two more things I just noticed.

    • You seem to be trying to re-open your $wfh inside the read loop of $fh.
    • You read your $fh in a foreach loop. This will have the effect of reading all of the file into memory then iterating over it. Use a while loop instead which will really read the file a line at a time.
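
A minimal sketch of the while form, reading from an in-memory string so that it runs without any files. The two data lines are made up for illustration:

```perl
use strict;
use warnings;

# Made-up two-line input held in a string so the sketch is self-contained.
my $data = "A B;A C\nD E;F G\n";
open my $fh, '<', \$data or die "open: $!";

my $count = 0;
# while evaluates <$fh> in scalar context, so one line is read per
# iteration; foreach (<$fh>) evaluates it in list context and reads
# every line before the loop body runs for the first time.
while ( my $line = <$fh> ) {
    chomp $line;
    $count++;
    print "$count: $line\n";
}
close $fh or die "close: $!";
```

For a 5000-line file either form will work, but the while loop keeps only the current line in memory.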

      This is an example of the source file:
      SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1)
      SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS
      O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56
      O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI
      KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK
      VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E
      VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49
      KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
      This is supposed to be the goal:
      SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1);3;2
      SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;2
      O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
      O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
      KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;3
      VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;2
      VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;2
      KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;1
      Structure is: "word1 word2 word3;bla hword1bla word2h blah;3(words in 1st column);2(words found a match from 2nd column)"
      Ok, I made changes in the code, which now looks like this:
      #!perl

      use strict;
      use warnings;

      my $readfile = 'blah.csv';
      my $writefile = 'bleh.csv';
      open my $fh, "<", $readfile;
      my $row = <$fh>;
      my $found = 0;
      my @cols;
      my (@col1,@col2);
      my @words = split(/\s+/,@col1);

      open my $wfh, ">", $writefile or die "yikes: $!";

      while (<$fh>) {
          tr /a-z/A-Z/;
          chomp;
          my @cols = split /\;/;
          push @col1, $cols[0];
          push @col2, $cols[1];
          my @words = @words + 1;

          if ( @col2 =~ m/$words[$_](\d+)/ ) {
              $found++;
          }
          print $wfh "$row';'@words';'$found";
          $found = 0;
          @words = 0;
      }
      Now I get report:
      Applying pattern match (m//) to @array will act on scalar(@array) at C:\blah\vertailu2.pl line 27.
      Argument "LSAIDHA 2FA SFF ;ASD 2FA AASDA" isn't numeric in array element at C:\blah\vertailu2.pl line 27, <$fh>
      Argument "3FASFL FAAL;DAOIADJAD" isn't numeric in array element at C:\blah\vertailu2.pl line 27, <$fh> line
      Argument "ASFD ADD AD7A ALUYAD;ADLIHADBA A DADASFD DADD" isn't numeric in array element at C:\blah\vertailu2.pl line 27, <$fh> line 4.
      I'm guessing I should somehow define the program to handle both numbers and letters. Don't know how to do it though..

        I have taken a different approach to the hash-based one of cdarke and have used regular expression matching instead. The regular expression is an alternation of the words found in $col1, and doing a global match against $col2 will find all matches. We are not interested in the text of the matches, just their number, which is what the my $matches = () = ... construct achieves.
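
The counting construct can be tried in isolation. This is only a sketch with assumed values: the text and the alternation are modelled on the fifth data line, with Ä written as A to keep the example ASCII-only.

```perl
use strict;
use warnings;

# Assumed values modelled on the KÄSISAHA line (ASCII-only spelling).
my $col2    = 'KASISAHA 22 IN 2600-22-XT SANDVIK';
my $pattern = qr{KASISAHA|SANDV|2600-22-XT|L=22IN};

# The empty list () forces the global match into list context, so it
# returns every match. Assigning that list assignment to a scalar
# then yields the number of elements on its right-hand side.
my $matches = () = $col2 =~ m{$pattern}g;

print "$matches\n";    # prints 3
```

KASISAHA, 2600-22-XT and SANDV (inside SANDVIK) each match once; L=22IN does not occur in the text.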

        Note that I'm only reading your data from a HEREDOC and writing to a variable just to keep everything inside the script on my system. Just substitute normal files if you use some of this code.

        use strict;
        use warnings;

        open my $inFH, q{<}, \ <<EOF or die qq{open: << HEREDOC: $!\n};
        SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1)
        SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS
        O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56
        O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI
        KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK
        VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E
        VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49
        KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
        EOF

        my $outFile;
        open my $outFH, q{>}, \ $outFile or die qq{open: > \ $outFile: $!\n};

        while( <$inFH> )
        {
            chomp;
            my( $col1, $col2 ) = map uc, split m{;};
            my $rxCol1 = do {
                local $" = q{|};
                qr{@{ [ map quotemeta, split m{\s+}, $col1 ] }}
            };
            my @col1Words = split m{\s+}, $col1;
            my $matches = () = $col2 =~ m{$rxCol1}g;
            print $outFH
                join( q{;}, $col1, $col2, scalar @col1Words, $matches ),
                qq{\n};
        }

        close $inFH or die qq{close: << HEREDOC: $!\n};
        close $outFH or die qq{close: > \ $outFile: $!\n};

        print $outFile;

        The output.

        SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1);3;2
        SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;3
        O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
        O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
        KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;3
        VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;2
        VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;2
        KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;1

        I hope this is helpful.

        Cheers,

        JohnGG

        Update: I just noticed the bit about pipes in your data so added a quotemeta in the regex.
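
Why the quotemeta matters can be shown with a made-up word that contains a pipe. The word and text below are assumptions for illustration, not taken from the real data:

```perl
use strict;
use warnings;

# Made-up word containing a regex metacharacter, and a text to search.
my $word = 'X|Y';
my $text = 'ONLY Y HERE';

# Interpolated as-is, the pipe is an alternation: "X" or "Y".
my $raw = $text =~ /$word/ ? 1 : 0;

# quotemeta (or \Q...\E inside the pattern) makes the pipe literal.
my $quoted  = quotemeta $word;
my $escaped = $text =~ /$quoted/ ? 1 : 0;

print "unescaped: $raw, escaped: $escaped\n";    # unescaped: 1, escaped: 0
```

Without the escaping, a word like this would count hits for either half on its own.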

Re: comparing columns and printing a result
by JavaFan (Canon) on Jan 26, 2009 at 12:10 UTC
    Well, one of your first statements is
    my $row = <$fh>;
    Not only haven't you declared $fh yet, you haven't opened the file at this point. That's something you do later. But when you open the files, you're not checking whether this succeeds.
    my (@col1,@col2); my @words = split(//,@col1);
    What's the point of this? @col1 doesn't contain anything, so what do you want to split? You're also declaring @cols, but not using it.
    @col2 =~ m/$words[$_](\d+)/
    No idea what you want to do, but on the LHS of a =~ you have to have a scalar, not a list or array. Furthermore, since you never put anything in @words, there will be nothing to match here. And even then, $_ is the line you're reading from the file; it probably doesn't contain a number.
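
As a sketch of the scalar-versus-array point: an array used where a scalar is expected gives its element count, so the match has to be bound to one element at a time. The array contents and the search word are assumed sample values from the thread's data:

```perl
use strict;
use warnings;

# Assumed sample values for the second column and a search word.
my @col2 = ( 'TASOKAAVAIN SANDVIK 620-25', 'UPPOPUMPPU PUMPEX KV56' );
my $word = 'KAAVAIN';

# In scalar context an array evaluates to its number of elements,
# which is what "@col2 =~ ..." would end up matching against.
my $count = @col2;

# Bind the match to each element in turn; grep in scalar context
# returns how many elements matched.
my $hits = grep { $_ =~ /\Q$word\E/ } @col2;

print "elements: $count, matching elements: $hits\n";    # elements: 2, matching elements: 1
```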
      Thank you for the reply. col1 should be everything before the column separator, and I was trying to split the words in col1 into an array. I don't know if it's the best idea when trying to handle the words one by one as search values, but that's what I came up with in the first place.
Re: comparing columns and printing a result
by cdarke (Prior) on Jan 26, 2009 at 13:50 UTC
    I am having problems understanding the expected results. For example, how can you get one hit from the last line?
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
    So far as I can see, none of the "words" in the first column appear in the second. But it does depend on what you mean by a "word". Can you please explain the match criteria?

    Update: This is what I came up with:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $readfile = 'blah.csv';
    my $writefile = 'bleh.csv';

    # Note that you were opening $writefile for READ
    open my $fh, "<", $readfile or die "Unable to open $readfile: $!";
    open my $wfh, ">", $writefile or die "Unable to open $writefile: $!";

    foreach (<$fh>) {
        $_ = uc $_;
        chomp;
        my ($col1, $col2) = split /;/;
        my @col1_words = split /\s+/, $col1;
        my @col2_words = split /\s+/, $col2;

        my %hash;
        @hash{@col1_words} = undef;

        my $found = 0;
        for my $word (@col2_words) {
            $found++ if exists $hash{$word}
        }
        print $wfh "$_;".@col1_words.";$found\n";
    }
    close ($fh);
    close ($wfh);
    Which produces:
    SUUTIN STAMM 2/60AST;SUUTIN STAMM 2,0/60 AST. VIIRAOSA (PK-1);3;2
    SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS;5;3
    O-RENGAS 790X12 FPM;UPPOPUMPPU PUMPEX KV56;3;0
    O-RENGAS 99X7 FPM;RULLAUSPÄÄ B NORMAALI;3;0
    KÄSISAHA SANDV 2600-22-XT L=22IN;KÄSISAHA 22 IN 2600-22-XT SANDVIK;4;2
    VEITSI STANL 10-010;MATTOVEITSI STANLEY 2-10-099 99E;3;0
    VEITSENTERÄ STANL 11-916 L=62MM SUORA;MATTOVEITSENTERÄ STANLEY 0-11-921 (5KPL/PAK) PITT. 49;5;0
    KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;0
    Further update: corrected typo.
      I'm sorry, I didn't explain this well. Each word is to be searched for within the other column as a substring, not as an exact word. So here
      KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25
      you can see "KAAVAIN" is part of the word "TASOKAAVAIN", so it would count as an occurrence. I tested your code and it appears to find occurrences only if the exact word is found.
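
A minimal sketch of that substring criterion, using index instead of a regex so that no escaping is needed. This is only one possible reading of the rule (it counts how many first-column words occur somewhere in the second column), not the code from any of the replies:

```perl
use strict;
use warnings;

# The last line of the sample data, split into its two columns.
my ( $col1, $col2 ) =
    ( 'KAAVAIN TIZIT 620-20 H10', 'TASOKAAVAIN SANDVIK 620-25' );

my @words = split /\s+/, $col1;

# index returns -1 when the substring is absent, so >= 0 means
# the word occurs somewhere inside the second column.
my $found = grep { index( $col2, $_ ) >= 0 } @words;

print join( ';', $col1, $col2, scalar @words, $found ), "\n";
# prints KAAVAIN TIZIT 620-20 H10;TASOKAAVAIN SANDVIK 620-25;4;1
```

Only KAAVAIN is found (inside TASOKAAVAIN); 620-20 is not a substring of 620-25.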
        OK, except I don't see 2 hits on the second line, I see 3:
        SUUTINKÄRKI SPRAY TP 9501 E-SS;SUUTIN VALME HOL0125624 TP 9501 E-SS
        TP, 9501, and E-SS.

        This is my version 2:
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $readfile = 'blah.csv';
        my $writefile = 'bleh.csv';

        open my $fh, "<", $readfile or die "Unable to open $readfile: $!";
        open my $wfh, ">", $writefile or die "Unable to open $writefile: $!";

        foreach (<$fh>) {
            $_ = uc $_;
            chomp;
            my ($col1, $col2) = split /;/;
            my @col1_words = split /\s+/, $col1;
            my @col2_words = split /\s+/, $col2;

            my $found = 0;
            my $pattern = join ('|', @col1_words);

            for my $col2_word (@col2_words) {
                $found++ if $col2_word =~ /$pattern/;
            }
            print $wfh "$_;".@col1_words.";$found\n";
        }
        close ($fh);
        close ($wfh);