tangledupinperl has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks
I need to separate out lines from file2 based on the list in file1 and shove the ones not defined in file1 into file3, i.e.:

file1:
ab
gh
kl
mn
op

file2:
ab, 1234
cd, 2343
ef, 1253
gh, 4543
ij, 2343
kl, 2453
mn, 4753

This is what I've got so far, but it's just printing everything in file2, not selecting by file1 (btw this is not my code, I'm just trying to fix it):

#!/usr/bin/perl
#
#
$file1 = shift;
$file2 = shift;
$match = 0;
open(IN,"$file2") || die "Cannot open file $file2 $!\n";
while(<IN>){
    chomp($_);
    $s_line = $_;
    open(INPUT,"$file1") || die "Cannot open file $file1 $!\n";
    while(<INPUT>){
        chomp($_);
        $str = $_;
        if($s_line =~ /$str/){
            $match = 1;
        }
    }
    close(INPUT);
    if($match == 0){
        print "$s_line\n";
    }
    $match = 0;
}
close(IN);

any ideas?

Replies are listed 'Best First'.
Re: list lines not found in config (while+if)
by kennethk (Abbot) on Apr 07, 2009 at 16:40 UTC

    Executing your script with the provided files yields the following results for me:

    > perl script.pl file1 file2
    cd, 2343
    ef, 1253
    ij, 2343

    Is this not the expected result? I suspect your issue is on your command line.

    On a side note, it is very dangerous to pass a command line argument directly to a two-argument open as you have done - it allows execution of arbitrary code. You should also check out using the pragmas strict and warnings to save you some potential headaches. Might I suggest the following code?

    #!/usr/bin/perl
    #
    #
    use strict;
    use warnings;

    my $file1 = shift;
    my $file2 = shift;
    my $match = 0;
    open(IN, "<", $file2) or die "Cannot open file $file2 $!\n";
    while(<IN>){
        chomp($_);
        my $s_line = $_;
        open(INPUT, "<", $file1) or die "Cannot open file $file1 $!\n";
        while(<INPUT>){
            chomp($_);
            my $str = $_;
            if($s_line =~ /$str/){
                $match = 1;
            }
        }
        close(INPUT);
        if($match == 0){
            print "$s_line\n";
        }
        $match = 0;
    }
    close(IN);
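To illustrate the point about two-argument open, here is a minimal sketch (not from the thread). The helper name `open_for_read` and the sample file name are assumptions made up for the example. With the three-argument form the mode is fixed by the second argument, so a name that ends in a pipe character is treated as a literal file name instead of a command.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): open a file strictly for reading.
# The mode '<' is given separately, so characters such as '|' or '>' in the
# name are never interpreted as instructions to open().
sub open_for_read {
    my ($name) = @_;
    open(my $fh, '<', $name) or die "Cannot open file $name: $!";
    return $fh;
}

# A name that two-argument open would treat as a command to run.
# It is assumed that no file with this literal name exists.
my $suspicious = 'echo hello |';

my $ok = eval { open_for_read($suspicious); 1 };
print "three-argument open refused it: $@" unless $ok;
```

Because the name is looked up literally, the open simply fails and the helper dies with the usual "No such file" style message.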

Re: list lines not found in config (while+if)
by jethro (Monsignor) on Apr 07, 2009 at 16:50 UTC

    Yes. Use a hash. If file1 has more than 10000 or 100000 lines then you need a disk-based hash or database, but normally you just use something like this:

    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;
    my %seeninfile1;

    open FH, "<", $file1 or die "Can't open $file1: $!\n";
    while (<FH>) {
        chomp;                               # drop the newline so keys compare cleanly
        $seeninfile1{$_}++;
    }
    close(FH);

    open FG, "<", $file2 or die "Can't open $file2: $!\n";
    while (<FG>) {
        chomp;
        my ($key, $number) = split /,\s*/;   # file2 lines look like "ab, 1234"
        if ($seeninfile1{$key}) {
            # do whatever you want to do if key is in file1
        }
        else {
            # do whatever you want to do if key is not in file1
        }
    }
    close(FG);

    Using a hash means you will read both file1 and file2 only once. Your code reads in file1 once for every line(!) of file2. Your code will quickly slow down when your files get bigger.
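The single-pass idea can be sketched with the sample data from the question held in memory. This is only an illustration under stated assumptions: the helper name `lines_not_in_reference` is hypothetical, and the key is assumed to be everything before the first comma.

```perl
use strict;
use warnings;

# Sample data taken from the original question, held in memory for the sketch.
my @reference = qw(ab gh kl mn op);
my @data = ('ab, 1234', 'cd, 2343', 'ef, 1253', 'gh, 4543',
            'ij, 2343', 'kl, 2453', 'mn, 4753');

# Hypothetical helper: return the data lines whose first comma-separated
# field does not appear in the reference list.
sub lines_not_in_reference {
    my ($ref_keys, $data_lines) = @_;
    my %seen = map { $_ => 1 } @$ref_keys;    # one pass over the reference list
    return grep {
        my ($key) = split /,/, $_;            # first field is the key
        !$seen{$key};
    } @$data_lines;                           # one pass over the data lines
}

print "$_\n" for lines_not_in_reference(\@reference, \@data);
```

For the sample data this prints the three lines `cd, 2343`, `ef, 1253` and `ij, 2343`. Each hash lookup takes roughly constant time, so the work grows with the sum of the two file sizes instead of their product.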

Re: list lines not found in config (while+if)
by toolic (Bishop) on Apr 07, 2009 at 16:52 UTC
    My results agree with kennethk's.

    It sounds like you are describing egrep, if you have that available to you on your OS:

    $ egrep -v -f file1 file2 > file3

    Update: Since you have not given us enough to reproduce your problem, I recommend that you start sprinkling your code with more print's. Refer to Basic debugging checklist for more details.

Re: list lines not found in config (while+if)
by jeanluca (Deacon) on Apr 07, 2009 at 17:01 UTC
    Here is an example using map and grep:
    #!/usr/bin/perl
    use strict ;
    use warnings ;

    my $file1 = shift;
    my $file2 = shift;

    open(IN,$file1) || die "Cannot open file $file1 $!\n";
    my $content ;
    { local $/ ; $content = <IN> }
    close IN ;

    open(IN,$file2) || die "Cannot open file $file2 $!\n";
    my @list = <IN> ;
    close IN ;

    my @new = grep($content !~ /$_->[0]\n/, map( [split(/,/, $_)], @list )) ;
    foreach( @new ) {
        print $_->[0] . " " .$_->[1] ;
    }

    Cheers
    LuCa
Re: list lines not found in config (while+if)
by Marshall (Canon) on Apr 07, 2009 at 18:51 UTC
    Your code looks pretty good. I just recoded it a bit. This should be OK on either Windows or Unix (the chomp() gets rid of whatever line terminator is there, and the way I parsed the 2nd file throws that away too). Maybe you have some non-printing garbage in there? Or your "real thing" is a bit different from this example?
    #!/usr/bin/perl -w
    use strict;

    die "Usage fileReference File2Check >outfile" if @ARGV != 2;
    my ($fileRef, $file2Check) = @ARGV;

    open (REF, "<$fileRef") || die "unable to open $fileRef";
    open (CHK, "<$file2Check") || die "unable to open $file2Check";

    my %seen = map{chomp; $_ => 1 }(<REF>);

    print grep { my $token = (split /,/,$_)[0]; !$seen{$token} }(<CHK>);

    __END__
    fileref:
    ab
    gh
    kl
    mn
    op

    file2check:
    ab, 1234
    cd, 2343
    ef, 1253
    gh, 4543
    ij, 2343
    kl, 2453
    mn, 4753

    output:
    cd, 2343
    ef, 1253
    ij, 2343

      Note kennethk's comments regarding the two parameter open in his reply to the OP!

      What kennethk forgot to mention was that you should use lexical file handles too.

      open ... $filename || die makes for an unhappy life. || binds to $filename, not to the result of open as you may be hoping. $filename is true for all likely values so the die will never fire, regardless of what the result of the open is! Use open ... $filename or die instead. It's often helpful in the die to show the error message associated with the open failure using $OS_ERROR ($!).
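The precedence difference can be seen in a small sketch (not from the thread). The file name below is an assumption, chosen because it should not exist. Each open is wrapped in eval so the script can report whether die fired.

```perl
use strict;
use warnings;

my $missing = 'no_such_file_for_this_demo.txt';   # assumed not to exist

# Without parentheses, || binds to the file name, so this is parsed as
# open(my $fh1, '<', ($missing || die ...)) and die is never reached.
my $died_with_pipes = 1;
eval {
    open my $fh1, '<', $missing || die "unable to open $missing: $!";
    $died_with_pipes = 0;    # reached even though the open failed
};

# With the low-precedence "or", die sees the result of open itself.
my $died_with_or = 0;
eval {
    open my $fh2, '<', $missing or die "unable to open $missing: $!";
    1;
} or $died_with_or = 1;

print "|| form died: $died_with_pipes\n";
print "or form died: $died_with_or\n";
```

When the file is missing, the first line reports 0 and the second reports 1: only the `or` form notices the failed open.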

      Making those changes, removing extraneous () and minor adjusting of white space produces the following (untested) code:

      #!/usr/bin/perl
      use strict;
      use warnings;

      die "Usage fileReference File2Check >outfile" if @ARGV != 2;
      my ($fileRef, $file2Check) = @ARGV;

      open my $fileREFIn, '<', $fileRef or die "unable to open $fileRef: $!";
      open my $fileCHKIn, '<', $file2Check or die "unable to open $file2Check: $!";

      my %seen = map { chomp; $_ => 1 } <$fileREFIn>;

      print grep {! $seen{(split /,/, $_)[0]}} <$fileCHKIn>;

      True laziness is hard work
        Great points Grandfather!

        I think there can be some legitimate differences of opinions on these things.

        First on the subject of:
        open ... $filename || die makes for an unhappy life.

        Out of force of habit, I use more parens so that this sort of thing is not a problem. open (...) or die "..." is the same as open (...) || die "...". Which is NOT the same as open ... || die "...". So you are correct that there is a potential problem here! I recommend to always use parens to make things clear. Especially on calls to the O/S!

        Use of $! is a grey area here.
        Probably more important is one thing that we didn't talk about: the significance of \n in a "die". If there is no \n at the end of the "die text", Perl will report the text message and the program line number. If I get called on the phone by a user with a fatal error, that is very useful information to me! If there is a \n at the end of the die text, I won't get the line number! Whether or not there is a $! error is of much less importance.
        So if user types: C:\PROJECTS\PerlMonks>test.pl f3 f2.txt
        ERRORMSG: unable to open f3 at C:\PROJECTS\PerlMonks\test.pl line 6.
        I know what happened. If we have the $! also, then we get:
        ERRORMSG: unable to open f3 No such file or directory at C:\PROJECTS\PerlMonks\test.pl line 6.
        That in this case is pretty much the same thing.

        Most important is a good error message and leaving off the \n in the die statement. BUT, I would agree to stick that $! thing in there! I usually do it, but in this case sometimes we overburden new folks with the 2nd level of detail that isn't so important at the time.
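The effect of the trailing newline can be checked with a short sketch (not from the thread). It uses eval to capture what die would have printed, so the script keeps running. The message text is taken from the example above.

```perl
use strict;
use warnings;

# Capture the message die would print, instead of ending the program.
eval { die "unable to open f3" };
my $without_newline = $@;    # Perl appends " at <script> line <N>.\n"

eval { die "unable to open f3\n" };
my $with_newline = $@;       # the message is left exactly as written

print "without newline: $without_newline";
print "with newline:    $with_newline";
```

The first message ends with the script name and line number; the second is the bare text with nothing appended.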

        I would like to be educated re: security holes. For these very short 10 line things, I don't see a problem with the way that I opened the 2 read-only files. Stuff that comes from cgi scripts etc is way different. There isn't a problem here, but I suspect that the answer will be "hey, there could be a problem in another situation...".

Re: list lines not found in config (while+if)
by wol (Hermit) on Apr 07, 2009 at 17:02 UTC
    As to why your output is different from that of others who have tried, I'd suggest looking for different line endings (DOS vs Unix), extra spaces, and/or Unicode in your input files.
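One way to check for stray line endings or spaces is sketched below (not from the thread). Both helper names, `clean_line` and `show`, are hypothetical. The first strips surrounding whitespace including a DOS carriage return; the second makes invisible characters visible for debugging prints.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace, which covers
# both Unix (\n) and DOS (\r\n) line endings as well as stray spaces.
sub clean_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return $line;
}

# Hypothetical helper: show carriage returns and newlines as visible text.
sub show {
    my ($text) = @_;
    (my $visible = $text) =~ s/\r/\\r/g;
    $visible =~ s/\n/\\n/g;
    return "[$visible]";
}

my $dos_line = "ab\r\n";
chomp(my $chomped = $dos_line);    # chomp removes only the \n here, leaving the \r
print show($chomped), " vs ", show(clean_line($dos_line)), "\n";
```

A key that still carries a trailing carriage return will not compare equal to the same key without one, which is one way two machines can disagree on the same script.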

    If all else fails, upload all your data to perlmonks, and then download it again - this seems to work for everyone else. (Note - in case it's not obvious, this is not a serious suggestion!)

    --
    use JAPH;
    print JAPH::asString();

Re: list lines not found in config (while+if)
by tangledupinperl (Initiate) on Apr 07, 2009 at 22:56 UTC
    Cheers for the help guys, but I'm still not getting it. Truth be told, I hadn't tried the script with the examples I gave you; I just gave those for ease of explanation. But when a few of you said that it worked for you, I copied back the inputs I'd written, tried it, and I get no output at all. My actual file1 and file2 are 8000 queries and 12000 lines to select from
    so:
    my script + my (8000) inputs = print everything
    my script + made up (5) inputs = print nothing
    kennethk's script + my (8000) inputs = print everything
    kennethk's script + made up (5) inputs = print nothing

    Seeing as two computers are getting two different results, is there an overall problem? I know a bad workman blames his tools, but could it be?....
    What's the chance of me missing some module/update/package? (clutching at straws here)

    my actual files go to the tune of:

    File1
    GP_MASA_01F04_c
    GP_MASA_38C02_c
    GP_MASA_33B06_c
    GP_MASA_24D04_c
    GP_MASA_35A04_c
    ...to 9000 lines

    File2 (is a .csv file)
    'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......
    'GP_MASA_38C02_c',273,'ACATCCTTCACAGAAGTTTGT.............
    'GP_MASA_33B06_c',288,'ACATACTAACACGGTCTTT...............
    .....to 12400 lines

    Also, I intend to have a go with all the other scripts and tips you kind people have put up here, but it's the middle of the night and I'm falling asleep where I'm sitting!

    thank you again for all the help

      Taking your sample data and original code I've generated the following sample code. Note that I added strictures and cleaned up a few other aspects of your code. I also removed the first line of your reference file so that at least one "missing" line would be reported.

      use strict;
      use warnings;

      my $File1 = <<END_FILE1;
      GP_MASA_38C02_c
      GP_MASA_33B06_c
      GP_MASA_24D04_c
      GP_MASA_35A04_c
      END_FILE1

      my $File2 = <<END_FILE2;
      'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......
      'GP_MASA_38C02_c',273,'ACATCCTTCACAGAAGTTTGT.............
      'GP_MASA_33B06_c',288,'ACATACTAACACGGTCTTT...............
      END_FILE2

      my $match = 0;
      open my $dataIn, '<', \$File2;
      while (<$dataIn>) {
          chomp ($_);
          my $dataLine = $_;
          open my $refIn, '<', \$File1;
          while (<$refIn>) {
              chomp ($_);
              my $str = $_;
              if ($dataLine =~ /$str/) {
                  $match = 1;
              }
          }
          close ($refIn);
          if ($match == 0) {
              print "$dataLine\n";
          }
          $match = 0;
      }

      Prints:

      'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......

      Although reparsing the reference file for each line of the data file is exceedingly nasty, the code works. Maybe you can update the sample to demonstrate where you are seeing a problem?


      True laziness is hard work
        I've gone through and put prints throughout the script and I've finally found the problem! It's the bloody matching string. There is something in my reference file that matches each time it reads the data file.
        I did a:
        if($s_line =~ /$str/){
            print "$str -- match\n";
        }
        else {
            print "$str -- no match\n";
        }
        on my test data and it showed that it's matching an empty line, so there must be something in my reference file that's matching up (I've already checked and it's not empty lines). I'm going to look for it now so I won't keep you hanging on, but cheers for all the improvement suggestions. I'm going to go through them all when I have more time and no deadline to catch.
        Cheers for the help!
        In hindsight, I should have given better examples at the start. Sorry. Will do next time.
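The thread does not establish what in the reference file caused the spurious match, so the following is only a sketch of one defensive approach, not a diagnosis. It shows that an interpolated pattern which turns out to be empty can match any line, and it uses a hypothetical helper, `line_has_key`, that rejects empty keys and compares literally with index() instead of treating the key as a regular expression.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains $key as a literal substring.
# Empty keys are rejected, and index() avoids regex metacharacter surprises.
sub line_has_key {
    my ($line, $key) = @_;
    return 0 unless defined $key && length $key;
    return index($line, $key) >= 0;
}

my $data_line = "'GP_MASA_01F04_c',681,'ACCACACATC'";
my $empty_key = '';

# With no earlier successful match in the program, an empty interpolated
# pattern matches any line at all.
my $regex_says  = ($data_line =~ /$empty_key/) ? 1 : 0;
my $helper_says = line_has_key($data_line, $empty_key) ? 1 : 0;

print "regex with empty key matched:  $regex_says\n";
print "helper with empty key matched: $helper_says\n";
```

If the keys do need to stay inside a regex, wrapping them as `/\Q$str\E/` and skipping keys of zero length is a common alternative.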