Re: list lines not found in config (while+if)
by kennethk (Abbot) on Apr 07, 2009 at 16:40 UTC
> perl script.pl file1 file2
cd, 2343
ef, 1253
ij, 2343
Is this not the expected result? I suspect your issue is on your command line.
On a side note, it is very dangerous to pass a command line argument directly to a two-argument open as you have done - it allows execution of arbitrary code. You should also check out using the pragmas strict and warnings to save you some potential headaches. Might I suggest the following code?
#!/usr/bin/perl
#
#
use strict;
use warnings;
my $file1 = shift;
my $file2 = shift;
my $match = 0;
open(IN, "<", $file2) or die "Cannot open file $file2 $!\n";
while(<IN>){
    chomp($_);
    my $s_line = $_;
    open(INPUT, "<", $file1) or die "Cannot open file $file1 $!\n";
    while(<INPUT>){
        chomp($_);
        my $str = $_;
        if($s_line =~ /$str/){
            $match = 1;
        }
    }
    close(INPUT);
    if($match == 0){
        print "$s_line\n";
    }
    $match = 0;
}
close(IN);
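To make the point about two-argument open concrete, here is a minimal sketch. The helper name open_for_reading and the sample "filename" are invented for illustration. With the two-argument form, a name that ends in a pipe character is run as a command; with the three-argument form it is only ever treated as a path, so the open simply fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading with the three-argument
# form, so the name is always treated as a literal path.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open file $path: $!\n";
    return $fh;
}

# Under two-argument open this "filename" would be executed as a command.
# Here it is just a file that (we assume) does not exist, so open fails.
my $suspicious = 'echo oops |';
my $fh = eval { open_for_reading($suspicious) };
print defined $fh ? "opened\n" : "refused: $@";
```

The eval is only there so the demonstration can report the failure instead of exiting; a real script would normally let the die propagate.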
Re: list lines not found in config (while+if)
by jethro (Monsignor) on Apr 07, 2009 at 16:50 UTC
Yes. Use a hash. If file1 has more than 10000 or 100000 lines then you need a disk-based hash or database, but normally you just use something like this:
use strict;
use warnings;
my ($file1, $file2) = @ARGV;
my %seeninfile1;
open FH,"<",$file1 or die "Can't open $file1: $!\n";
while (<FH>) {
    chomp;   # strip the newline so the key compares equal to the split field
    $seeninfile1{$_}++;
}
close(FH);
open FG,"<",$file2 or die "Can't open $file2: $!\n";
while (<FG>) {
    chomp;
    my ($key,$number)= split /,\s*/;   # lines look like "ab, 1234"
    if ($seeninfile1{$key}) {
        #do whatever you want to do if key is in file1
    }
    else {
        #do whatever you want to do if key is not in file1
    }
}
close(FG);
Using a hash means you will read both file1 and file2 only once. Your code reads in file1 once for every line(!) of file2. Your code will quickly slow down when your files get bigger.
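As a self-contained illustration of the hash approach, here is a sketch that works on in-memory lines rather than files. The sub name lines_not_in_reference is made up for this example, and the sample data is the short "ab, 1234" style input used elsewhere in this thread. Each list is walked exactly once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines of @$data whose first comma-separated field does not
# appear in @$reference. Both arguments are array refs of raw lines.
sub lines_not_in_reference {
    my ($reference, $data) = @_;

    my %seen;
    for my $line (@$reference) {
        (my $key = $line) =~ s/\s+\z//;    # drop newline and trailing blanks
        $seen{$key} = 1 if length $key;    # ignore blank reference lines
    }

    my @missing;
    for my $line (@$data) {
        next unless $line =~ /\S/;         # skip blank data lines
        my ($key) = split /,/, $line;
        push @missing, $line unless $seen{$key};
    }
    return @missing;
}

my @file1 = map { "$_\n" } qw(ab gh kl mn op);
my @file2 = map { "$_\n" } ('ab, 1234', 'cd, 2343', 'ef, 1253', 'gh, 4543',
                            'ij, 2343', 'kl, 2453', 'mn, 4753');
print lines_not_in_reference(\@file1, \@file2);
```

With this sample data it prints the cd, ef and ij lines. Because the lookup is an exact hash key rather than a regex, blank lines and regex metacharacters in the reference file cannot cause accidental matches.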
Re: list lines not found in config (while+if)
by toolic (Bishop) on Apr 07, 2009 at 16:52 UTC
My results agree with kennethk's.
It sounds like you are describing egrep, if you have that available to you on your OS:
$ egrep -v -f file1 file2 > file3
Update: Since you have not given us enough to reproduce your problem, I recommend that you start sprinkling your code with more print's. Refer to Basic debugging checklist for more details.
Re: list lines not found in config (while+if)
by jeanluca (Deacon) on Apr 07, 2009 at 17:01 UTC
Here is an example using map and grep:
#!/usr/bin/perl
use strict ;
use warnings ;
my $file1 = shift;
my $file2 = shift;
open(IN,$file1) || die "Cannot open file $file1 $!\n";
my $content ;
{ local $/ ;
$content = <IN>
}
close IN ;
open(IN,$file2) || die "Cannot open file $file2 $!\n";
my @list = <IN> ;
close IN ;
my @new = grep($content !~ /$_->[0]\n/, map( [split(/,/, $_)], @list )) ;
foreach( @new ) {
print $_->[0] . " " .$_->[1] ;
}
Cheers
LuCa
Re: list lines not found in config (while+if)
by Marshall (Canon) on Apr 07, 2009 at 18:51 UTC
Your code looks pretty good. I just recoded it a bit. This should be ok on either Windows or Unix (the chomp() gets rid of whatever line terminator is there, and the way I parsed the 2nd file throws that away too). Maybe you have some non-printing garbage in there? Or your "real thing" is a bit different from this example?
#!/usr/bin/perl -w
use strict;
die "Usage fileReference File2Check >outfile" if @ARGV != 2;
my ($fileRef, $file2Check) = @ARGV;
open (REF, "<$fileRef") || die "unable to open $fileRef";
open (CHK, "<$file2Check") || die "unable to open $file2Check";
my %seen = map{chomp; $_ => 1 }(<REF>);
print grep { my $token = (split /,/,$_)[0];
!$seen{$token}
}(<CHK>);
__END__
fileref:
ab
gh
kl
mn
op
file2check:
ab, 1234
cd, 2343
ef, 1253
gh, 4543
ij, 2343
kl, 2453
mn, 4753
output:
cd, 2343
ef, 1253
ij, 2343
|
|
Note kennethk's comments regarding the two parameter open in his reply to the OP!
What kennethk forgot to mention was that you should use lexical file handles too.
open ... $filename || die makes for an unhappy life. || binds to $filename, not to the result of open as you may be hoping. $filename is true for all likely values so the die will never fire, regardless of what the result of the open is! Use open ... $filename or die instead. It's often helpful in the die to show the error message associated with the open failure using $OS_ERROR ($!).
Making those changes, removing extraneous () and minor adjusting of white space produces the following (untested) code:
#!/usr/bin/perl
use strict;
use warnings;
die "Usage fileReference File2Check >outfile" if @ARGV != 2;
my ($fileRef, $file2Check) = @ARGV;
open my $fileREFIn, '<', $fileRef or die "unable to open $fileRef: $!";
open my $fileCHKIn, '<', $file2Check or die "unable to open $file2Check: $!";
my %seen = map { chomp; $_ => 1 } <$fileREFIn>;
print grep {! $seen{(split /,/, $_)[0]}} <$fileCHKIn>;
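The precedence point is easy to check directly. The following sketch (the sub names and the file name are made up, and the file is assumed not to exist) tries the same failing open with || and with or, without parentheses around the arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $missing = 'no_such_file_for_this_demo.txt';   # assumed not to exist

# '||' binds tighter than the comma, so this parses as
#   open(my $fh, '<', ($name || die ...))
# and the die cannot fire for a non-empty file name.
sub open_with_pipes {
    my ($name) = @_;
    open my $fh, '<', $name || die "unable to open $name: $!";
    return 'survived';
}

# 'or' has the lowest precedence, so it tests the result of open itself.
sub open_with_or {
    my ($name) = @_;
    open my $fh, '<', $name or die "unable to open $name: $!";
    return 'survived';
}

print "with ||: ", (eval { open_with_pipes($missing) } || 'died'), "\n";
print "with or: ", (eval { open_with_or($missing) }    || 'died'), "\n";
```

The first line reports "survived" even though the open failed, and the second reports "died". Wrapping the arguments in parentheses, as in open(...) || die, also makes the || form behave.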
True laziness is hard work
|
|
Great points Grandfather!
I think there can be some legitimate differences of opinions
on these things.
First on the subject of:
open ... $filename || die makes for an unhappy life.
Out of force of habit, I use more parens so that this sort of thing is not
a problem. open (...) or die "..." is the same as open (...) || die "...".
Which is NOT the same as open ... || die "...". So you are correct that there
is a potential problem here! I recommend to always use parens to make things
clear. Especially on calls to the O/S!
Use of $! is a grey area here. Probably more important is one thing that
we didn't talk about: the significance of \n in a "die". If there is no
\n in the "die text" Perl will report the text message and the program line
number. If I get called on the phone by a user with a fatal error, that is
very useful information to me! If there is a \n in the die, I won't get the line number! Whether or not there is a $! error is of much
less importance. So if user types:
C:\PROJECTS\PerlMonks>test.pl f3 f2.txt
ERRORMSG: unable to open f3 at C:\PROJECTS\PerlMonks\test.pl line 6.
I know what happened. If we have the $! also, then we get:
ERRORMSG: unable to open f3 No such file or directory at C:\PROJECTS\PerlMonks\test.pl line 6.
That in this case is pretty much the same thing. Most important is a good error message and leaving off the \n in the die statement. BUT, I would agree to stick that $! thing in there! I usually do it, but in this case sometimes we overburden new folks with the 2nd level of detail that isn't so important at the time.
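The effect of a trailing \n on die can be seen without touching any files. This is a small sketch; the helper died_with is a made-up name that just captures the message via eval.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a code ref and return the message it died
# with, or '' if it returned normally.
sub died_with {
    my ($code) = @_;
    eval { $code->(); 1 } and return '';
    return $@;
}

my $without_newline = died_with(sub { die "unable to open f3" });
my $with_newline    = died_with(sub { die "unable to open f3\n" });

print "no newline:   $without_newline";   # Perl appends " at FILE line N.\n"
print "with newline: $with_newline";      # message exactly as written
```

Only the first message carries the script name and line number; the second is printed verbatim.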
I would like to be educated re: security holes. For these very short 10 line things, I don't see a problem with the way that I opened the 2 read-only files. Stuff that comes from cgi scripts etc is way different. There isn't a problem here, but I suspect that the answer will be "hey, there could be a problem in another situation...".
Re: list lines not found in config (while+if)
by wol (Hermit) on Apr 07, 2009 at 17:02 UTC
As to why your output is different from others who have tried, I'd suggest looking for different line endings (DOS vs Unix), extra spaces, and/or Unicode in your input files.
If all else fails, upload all your data to perlmonks, and then download it again - this seems to work for everyone else. (Note - in case it's not obvious, this is not a serious suggestion!)
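On the serious suggestion about line endings and stray spaces: one way to spot them is to print each line with every non-printable character made visible. A rough sketch follows; show_hidden is a made-up helper name, not an existing function.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the string wrapped in brackets, with every
# character outside printable ASCII shown as an escape such as \x{0d}.
sub show_hidden {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf('\\x{%02x}', ord $1)/ge;
    return "[$text]";
}

# A line from a DOS-format file keeps its "\r" after chomp removes "\n".
my $line = "GP_MASA_01F04_c\r\n";
chomp $line;
print show_hidden($line), "\n";   # the carriage return shows up as \x{0d}
print show_hidden("ab "), "\n";   # the brackets expose the trailing space
```

Calling this on both $str and $s_line inside the matching loop should make a stray carriage return or trailing blank obvious.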
--
use JAPH;
print JAPH::asString();
Re: list lines not found in config (while+if)
by tangledupinperl (Initiate) on Apr 07, 2009 at 22:56 UTC
Cheers for the help guys, but I'm still not getting it. Truth be told, I hadn't tried the script with the examples I gave you; I just gave those for ease of explanation. But when a few of you said that it worked for you, I copied back the inputs I'd written, tried it, and I get no output at all. My actual file1 and file2 are 8000 queries and 12000 lines to select from
so:
my script + my (8000) inputs = print everything
my script + made up (5) inputs = print nothing
kennethk's script + my (8000) inputs = print everything
kennethk's script + made up (5) inputs = print nothing
Seeing as two computers are getting two different results, is there an overall problem? I know a bad workman blames his tools, but could it be? What's the chance of me missing some module/update/package? (clutching at straws here)
my actual files go to the tune of:
File1
GP_MASA_01F04_c
GP_MASA_38C02_c
GP_MASA_33B06_c
GP_MASA_24D04_c
GP_MASA_35A04_c ...to 9000 lines
File2 (is a .csv file)
'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......
'GP_MASA_38C02_c',273,'ACATCCTTCACAGAAGTTTGT.............
'GP_MASA_33B06_c',288,'ACATACTAACACGGTCTTT...............
.....to 12400 lines
Also, I intend to have a go with all the other scripts and tips you kind people have put up here, but it's the middle of the night and I'm falling asleep where I'm sitting!
Thank you again for all the help.
|
|
use strict;
use warnings;
my $File1 = <<END_FILE1;
GP_MASA_38C02_c
GP_MASA_33B06_c
GP_MASA_24D04_c
GP_MASA_35A04_c
END_FILE1
my $File2 = <<END_FILE2;
'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......
'GP_MASA_38C02_c',273,'ACATCCTTCACAGAAGTTTGT.............
'GP_MASA_33B06_c',288,'ACATACTAACACGGTCTTT...............
END_FILE2
my $match = 0;
open my $dataIn, '<', \$File2;
while (<$dataIn>) {
    chomp ($_);
    my $dataLine = $_;
    open my $refIn, '<', \$File1;
    while (<$refIn>) {
        chomp ($_);
        my $str = $_;
        if ($dataLine =~ /$str/) {
            $match = 1;
        }
    }
    close ($refIn);
    if ($match == 0) {
        print "$dataLine\n";
    }
    $match = 0;
}
Prints:
'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......
Although reparsing the reference file for each line of the data file is exceedingly nasty, the code works. Maybe you can update the sample to demonstrate where you are seeing a problem?
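For comparison, here is a sketch of how the same sample could be handled with a hash, so the reference list is read only once. It assumes, going by the sample lines in this thread, that every data line starts with the identifier in single quotes; the sub names leading_id and unmatched_lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the identifier from the start of a data line such as
#   'GP_MASA_38C02_c',273,'ACATCC...
# Returns undef when the line does not start with a quoted field.
sub leading_id {
    my ($line) = @_;
    return $line =~ /^\s*'([^']*)'/ ? $1 : undef;
}

# Return the data lines whose identifier is absent from the reference list.
sub unmatched_lines {
    my ($reference, $data) = @_;
    my %wanted;
    for my $id (@$reference) {
        (my $clean = $id) =~ s/\s+\z//;     # strips "\n" and "\r\n" alike
        $wanted{$clean} = 1 if length $clean;
    }
    return grep {
        my $id = leading_id($_);
        defined $id && !$wanted{$id};
    } @$data;
}

my @reference = map { "$_\n" }
    qw(GP_MASA_38C02_c GP_MASA_33B06_c GP_MASA_24D04_c GP_MASA_35A04_c);
my @data = (
    "'GP_MASA_01F04_c',681,'ACCACACATCATCTGACTTACGTACGTACG......\n",
    "'GP_MASA_38C02_c',273,'ACATCCTTCACAGAAGTTTGT.............\n",
    "'GP_MASA_33B06_c',288,'ACATACTAACACGGTCTTT...............\n",
);
print unmatched_lines(\@reference, \@data);
```

On this sample it prints only the GP_MASA_01F04_c line, the same result as the nested-loop version. Lines without a leading quoted field are silently skipped here, which may or may not be what you want for the real file.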
True laziness is hard work
|
|
I've gone through and put prints throughout the script and I've finally found the problem! It's the bloody matching string. There is something in my reference file that matches each time it reads the data file.
I did a:
if($s_line =~ /$str/){
print "$str -- match\n";
} else {
print "$str -- no match\n";
}
on my test data and it showed that it's matching an empty line, so there must be something in my reference file that's matching up (I've already checked and it's not empty lines). I'm going to look for it now so I won't keep you hanging on, but cheers for all the improvement suggestions. I'm going to go through them all when I have more time and no deadline to catch.
Cheers for the help!
In hindsight, I should have given better examples at the start. Sorry. Will do next time.
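A note on the symptom described above: when $str is empty, /$str/ becomes an empty pattern, which in Perl either matches any string or reuses the last successfully matched pattern, so one blank (or whitespace-only) reference line can make every data line look like a match. The sketch below is one possible guard, not a diagnosis of the actual files. The helper names clean_key and line_has_key are made up, and it swaps the regex for a literal substring test with index.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Clean one reference line: remove the line ending (Unix or DOS) and any
# surrounding whitespace. Returns '' for a blank line.
sub clean_key {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return $line;
}

# True when the data line contains the key as a literal substring.
# An empty key is never treated as a match.
sub line_has_key {
    my ($data_line, $key) = @_;
    return 0 unless length $key;
    return index($data_line, $key) >= 0 ? 1 : 0;
}

my $data_line = "'GP_MASA_01F04_c',681,'ACCACACATC";
for my $raw ("GP_MASA_38C02_c\n", "\n", "GP_MASA_01F04_c\r\n") {
    my $key = clean_key($raw);
    printf "%-17s %s\n", "[$key]",
        line_has_key($data_line, $key) ? 'match' : 'no match';
}
```

Only the last key is reported as a match: the blank line is ignored, and the DOS line ending no longer gets in the way.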