2015_newbie has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to create a script that shows the lines of one file that also appear in another file. Like this:
use strict;
use warnings;

my %file2;
open my $file2, '<', '/tmp/dog.txt' or die "Couldn't open file2: $!";
while ( my $line = <$file2> ) {
    ++$file2{$line};
}

open my $file1, '<', '/tmp/cat.txt' or die "Couldn't open file1: $!";
while ( my $line = <$file1> ) {
    print $line if $file2{$line};
}
dog.txt:

ARF
police
set

cat.txt:

meow
arf
police
The script correctly returns "police." But the problem is that I have 50 files like this, and I need a way to assign the files to variables so that the script can compare one file to another. Does anyone know how to handle multiple files in a script like this? There may be horse.txt, lama.txt, camel.txt, and tiger.txt, and I want to check each of them against dog.txt and cat.txt. Is this possible? I would also like to know whether a partial pattern can be matched, such as grep /poli/ - so far I have only been able to match whole lines.

Replies are listed 'Best First'.
Re: comparing multiple files for patterns
by kennethk (Abbot) on Dec 31, 2015 at 01:29 UTC
    Much as Athanasius suggested in Re: how to save patterns found in array to separate files, you can solve this issue using a hash. In particular, you want to use HASHES OF HASHES (as described in perldsc, see also perlref, perlreftut, perllol). Your first index is probably your filename, and your second is the line content:
    use strict;
    use warnings;

    my %content;
    for my $animal ('dog', 'cat') {
        open my $fh, '<', "/tmp/$animal.txt" or die "Couldn't open $animal: $!";
        while ( my $line = <$fh> ) {
            ++$content{$animal}{$line};
        }
    }

    for my $animal ('horse', 'lama', 'camel', 'tiger') {
        open my $fh, '<', "/tmp/$animal.txt" or die "Couldn't open $animal: $!";
        while ( my $line = <$fh> ) {
            for my $other ('cat', 'dog') {
                print "$other: $line" if $content{$other}{$line};
            }
        }
    }
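    If the animal lists grow, nothing needs to be hard-coded: the same structure works with the file names taken from the command line. A sketch along those lines (untested; it assumes the first two arguments are the reference files and the rest are the files to check):

    use strict;
    use warnings;

    # Hypothetical usage:
    #   perl compare.pl dog.txt cat.txt horse.txt lama.txt camel.txt tiger.txt
    # The first two arguments are the reference files; the rest are checked
    # against them.
    my @references = splice @ARGV, 0, 2;

    my %content;
    for my $ref (@references) {
        open my $fh, '<', $ref or die "Couldn't open $ref: $!";
        while ( my $line = <$fh> ) {
            ++$content{$ref}{$line};
        }
    }

    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Couldn't open $file: $!";
        while ( my $line = <$fh> ) {
            for my $ref (@references) {
                print "$ref & $file: $line" if $content{$ref}{$line};
            }
        }
    }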

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: comparing multiple files for patterns
by GrandFather (Saint) on Dec 31, 2015 at 03:19 UTC

    Is this a toy exercise for learning, or have you some non-trivial application? If you have a non-trivial application you really ought to tell us about it because the implementation details depend a great deal on the nature of the application. Things like size of files and length of match strings make a huge difference to how to efficiently implement the search.
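    For instance, once partial matches like the grep /poli/ in the original question are wanted, the hash lookup of the original script no longer applies, and a naive implementation has to test every pattern against every line - a sketch (untested, reusing the dog.txt/cat.txt names from the question):

    use strict;
    use warnings;

    # Read the patterns once; each line of dog.txt becomes a substring to find.
    open my $pat_fh, '<', '/tmp/dog.txt' or die "Couldn't open dog.txt: $!";
    chomp( my @patterns = <$pat_fh> );

    # Every line is tested against every pattern, so the cost grows as
    # lines x patterns - which is why file sizes and match lengths matter.
    open my $in, '<', '/tmp/cat.txt' or die "Couldn't open cat.txt: $!";
    while ( my $line = <$in> ) {
        for my $pat (@patterns) {
            if ( index( $line, $pat ) >= 0 ) {    # substring test, case-sensitive
                print $line;
                last;                             # one hit per line is enough
            }
        }
    }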

    Premature optimization is the root of all job security
Re: comparing multiple files for patterns -- oneliner explained
by Discipulus (Canon) on Dec 31, 2015 at 09:45 UTC
    welcome 2015_newbie

    In case you are trying to learn new things, consider the somewhat long one-liner below: you'll find in it many things to learn about the power of the Perl command line (see perlrun). The one-liner searches for occurrences of the lines of the first file given as an argument in all the other files.

    Just as a matter of taste, I've changed 'police' to 'pretty woman' in your example files..

    perl -lne '%ln;BEGIN{open $f,shift;map{chomp;$ln{$_}++}<$f>}print qq($ARGV line\t$.\t[$_]) if exists $ln{$_};close ARGV if eof' dog.txt cat.txt other.txt

    cat.txt line    2       [pretty woman]
    other.txt line  1       [pretty woman]
    In detail: perl -lne executes the code because of -e; -l does auto-chomp on input lines when -n (or -p) is also present; and -n wraps the code in a while loop that reads every file passed as an argument. Again, see perlrun.

    Inside the one-liner we have: %ln, which simply mentions the lines hash so that it exists in the namespace; it is important to have it before the following BEGIN block. The BEGIN block executes as soon as possible: it shifts @ARGV (see shift to understand why), depriving the -n loop of its first argument. That shifted argument is opened, and then the list returned by <$f> (the diamond operator returns all lines in list context!) is processed by map. We are in a BEGIN block, so (I suppose) it is too early for the -l switch to do its auto-chomp; therefore in the map block we chomp and increment the value for the key $_ (the current line fed in by <$f>) in the hash %ln.

    Now we are in the main body of the one-liner, where -l and -n are in effect; we print the current filename ($ARGV when using the diamond operator <>, see perlvar), the line number $. (again, perlvar), and the current line $_, but only if the corresponding hash entry $ln{$_} exists.

    close ARGV if eof closes the special filehandle ARGV (see perlvar) when end-of-file is reached: this is important because $. does not reset on the implicit close of a filehandle, as happens in our case. Remove that part to see $. increase steadily across every file opened.
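    Written out as a plain script, the whole one-liner is roughly equivalent to this (a hand-expanded sketch, not real B::Deparse output):

    use strict;
    use warnings;

    # What the BEGIN block does: slurp the first file into a lookup hash.
    my %ln;
    my $first = shift @ARGV;    # deprives the <> loop of its first argument
    open my $f, '<', $first or die "Couldn't open $first: $!";
    while ( my $line = <$f> ) {
        chomp $line;
        $ln{$line}++;
    }

    # What -n and -l do: loop over the remaining files, chomping each line.
    while ( my $line = <> ) {
        chomp $line;
        print "$ARGV line\t$.\t[$line]\n" if exists $ln{$line};
        close ARGV if eof;      # so $. restarts at 1 for each file
    }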

    Have fun and happy new year (maybe reading Perl White Magic - Special Variables and Command Line Switches)
    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Hoping everyone had a great New Year, and thanks for the replies. I started trying some more tactics to find out how to compare columns as well as lines. I am doing this as an exercise. There are actually not 50 files; I meant that there could be more than two. Here are the actual files I created:

      first.txt
      /vol/cat,feline
      /vol/dog,canine
      /vol/cat,feline
      /vol/cat,feline
      /vol/amphibian,FROG
      /vol/amphibian,FROG
      second.txt
      9,/vol/elephant,fourfeet
      1999,/vol/dolphin,fish
      10,/vol/cat,feline
      1111,/vol/goldfish,fish
      2222,/vol/spider,arachnid
      5555,/vol/camel,dromedary
      3333,/vol/wolf,canine
      I am trying to do the following:
      1. Select /vol/cat,feline as the element common to both files - this requires that the first column of second.txt be excluded from the comparison.
      2. If /vol/cat,feline is found in both, print the ID number from second.txt - in this example, 10.
      Here is what I did so far:
      use strict;

      sub get_animal {
          open my $FILE, '<', shift or die $!;
          return map {chop; $_ => $_} <$FILE>;
      }

      my %a = get_animal '/tmp/first.txt';
      my %b = get_animal '/tmp/second.txt';

      {
          print "$_\n" for grep {$_} @a{keys %b};
      }
      It works if the column with the numbers in it is deleted from second.txt. What I don't know is how to compare the first and second columns of the first file with the second and third columns of the second file. After that, it needs to return the ID number when it finds a match. Any ideas?
        Hello, I cannot fully understand your requirements, given the two example files; can you rephrase?

        Some observations:
        • where is use warnings;?
        • do not use uppercase variable names
        • also remember to close your filehandles anyway: it is safer.
        • chop is not chomp
        • Avoid a or b as variable names: the scalar forms are special variables (used by sort), and even though the hashes are not special, avoid them anyway. Instead, choose meaningful variable names.
        • when learning or debugging, I think it is preferable to write things out in plain syntax: you have a superfluous bare block - why? The syntax inside it is not exactly a beginner's one. How can you inspect it without a place to insert the basic debugging tool, aka print?
        #{
        #    print "$_\n" for grep {$_} @a{keys %b};
        #}
        #
        # should be something like (untested..)
        foreach my $bkey (keys %b) {
            warn "key not defined" unless $bkey;  ## what is the purpose of your "grep {$_}"???
            if ( $a{$bkey} ) { print "FOUND: [$bkey] in the hash \%a\n" }
            else             { print "NOT found key [$bkey] in the hash \%a\n" }
        }
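        If I guess your goal right (find each first.txt line in the last two columns of second.txt and print the leading ID), maybe something like this (untested; the trick is split /,/, $line, 2, which splits only on the first comma so "/vol/cat,feline" stays intact as a hash key):

        use strict;
        use warnings;

        # Build a lookup from "path,class" to ID out of second.txt.
        my %id_for;
        open my $second, '<', '/tmp/second.txt' or die "second.txt: $!";
        while ( my $line = <$second> ) {
            chomp $line;
            # "10,/vol/cat,feline" -> ("10", "/vol/cat,feline")
            my ( $id, $rest ) = split /,/, $line, 2;
            $id_for{$rest} = $id;
        }
        close $second;

        # Each line of first.txt is already in "path,class" form.
        open my $first, '<', '/tmp/first.txt' or die "first.txt: $!";
        while ( my $line = <$first> ) {
            chomp $line;
            print "$line => ID $id_for{$line}\n" if exists $id_for{$line};
        }
        close $first;

        With your example files this should print /vol/cat,feline => ID 10 three times, once per duplicate line in first.txt.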

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: comparing multiple files for patterns
by Laurent_R (Canon) on Dec 31, 2015 at 18:46 UTC
    If you have 50 of these files and want to compare each file with each other, do you realize how many file comparisons you're going to run? That's 50 × 49 / 2 = 1,225 pairwise comparisons. Your computer might happily do it (if your files are not too large), but what are you going to do then with more than a thousand resulting comparisons? Compare each comparison with each other and get roughly three quarters of a million results?

    Perhaps you should rethink your actual needs.