2015_newbie has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to create a script that shows the lines of one file that also appear in another file. Like this:
use strict;
use warnings;

my %file2;
open my $file2, '<', '/tmp/dog.txt' or die "Couldn't open file2: $!";
while ( my $line = <$file2> ) {
    ++$file2{$line};
}

open my $file1, '<', '/tmp/cat.txt' or die "Couldn't open file1: $!";
while ( my $line = <$file1> ) {
    print $line if $file2{$line};
}
dog.txt:

ARF
police
set

cat.txt:

meow
arf
police
The script correctly returns "police." But the problem is that I have 50 files like this, and I need a way to assign the files to variables so that the script can compare one file to another. Does anyone know how to handle multiple files in a script like this? There may be horse.txt, lama.txt, camel.txt, and tiger.txt, and I want to check each of them against dog.txt and cat.txt. Is this possible? I would also like to know whether a partial pattern can be matched, such as grep /poli/ - so far I have only been able to match whole lines.

Replies are listed 'Best First'.
Re: comparing multiple files for patterns
by kennethk (Abbot) on Dec 31, 2015 at 01:29 UTC
    Much as Athanasius suggested in Re: how to save patterns found in array to separate files, you can solve this issue using a hash. In particular, you want to use HASHES OF HASHES (as described in perldsc, see also perlref, perlreftut, perllol). Your first index is probably your filename, and your second is the line content:
    use strict;
    use warnings;

    my %content;
    for my $animal ('dog', 'cat') {
        open my $fh, '<', "/tmp/$animal.txt" or die "Couldn't open $animal: $!";
        while ( my $line = <$fh> ) {
            ++$content{$animal}{$line};
        }
    }

    for my $animal ('horse', 'lama', 'camel', 'tiger') {
        open my $fh, '<', "/tmp/$animal.txt" or die "Couldn't open $animal: $!";
        while ( my $line = <$fh> ) {
            for my $other ('cat', 'dog') {
                print "$other: $line" if $content{$other}{$line};
            }
        }
    }
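    If the animal lists grow, nothing needs to be hard-coded: the same structure works with the file names taken from the command line. A sketch along those lines (untested; it assumes the first two arguments are the reference files and the rest are the files to check):

    use strict;
    use warnings;

    # Hypothetical usage:
    #   perl compare.pl dog.txt cat.txt horse.txt lama.txt camel.txt tiger.txt
    # The first two arguments are the reference files; the rest are checked
    # against them.
    my @references = splice @ARGV, 0, 2;

    my %content;
    for my $ref (@references) {
        open my $fh, '<', $ref or die "Couldn't open $ref: $!";
        while ( my $line = <$fh> ) {
            ++$content{$ref}{$line};
        }
    }

    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Couldn't open $file: $!";
        while ( my $line = <$fh> ) {
            for my $ref (@references) {
                print "$ref & $file: $line" if $content{$ref}{$line};
            }
        }
    }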

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: comparing multiple files for patterns
by GrandFather (Saint) on Dec 31, 2015 at 03:19 UTC

    Is this a toy exercise for learning, or have you some non-trivial application? If you have a non-trivial application you really ought to tell us about it because the implementation details depend a great deal on the nature of the application. Things like size of files and length of match strings make a huge difference to how to efficiently implement the search.
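    For instance, once partial matches like the grep /poli/ in the original question are wanted, the hash lookup of the original script no longer applies, and a naive implementation has to test every pattern against every line - a sketch (untested, reusing the dog.txt/cat.txt names from the question):

    use strict;
    use warnings;

    # Read the patterns once; each line of dog.txt becomes a substring to find.
    open my $pat_fh, '<', '/tmp/dog.txt' or die "Couldn't open dog.txt: $!";
    chomp( my @patterns = <$pat_fh> );

    # Every line is tested against every pattern, so the cost grows as
    # lines x patterns - which is why file sizes and match lengths matter.
    open my $in, '<', '/tmp/cat.txt' or die "Couldn't open cat.txt: $!";
    while ( my $line = <$in> ) {
        for my $pat (@patterns) {
            if ( index( $line, $pat ) >= 0 ) {    # substring test, case-sensitive
                print $line;
                last;                             # one hit per line is enough
            }
        }
    }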

    Premature optimization is the root of all job security
Re: comparing multiple files for patterns -- oneliner explained
by Discipulus (Canon) on Dec 31, 2015 at 09:45 UTC
    welcome 2015_newbie

    In case you are trying to learn new things, consider the somewhat long one-liner below: you'll find in it many things to learn about the power of the Perl command line (see perlrun). The one-liner searches for occurrences of the lines of the first file given as an argument in all the other files.

    Just as a matter of taste, I've changed 'police' to 'pretty woman' in your example files..

    perl -lne '%ln;BEGIN{open $f,shift;map{chomp;$ln{$_}++}<$f>}print qq($ARGV line\t$.\t[$_]) if exists $ln{$_};close ARGV if eof' dog.txt cat.txt other.txt

    cat.txt line    2       [pretty woman]
    other.txt line  1       [pretty woman]
    In detail: perl -lne executes the code because of -e; -l does auto-chomp on input lines when -n (or -p) is also present; and -n wraps the code in a while loop that reads every file passed as an argument. Again, see perlrun.

    Inside the one-liner we have: %ln, which simply mentions the lines hash so that it exists in the namespace; it is important to have it before the following BEGIN block. The BEGIN block executes as soon as possible: it shifts @ARGV (see shift to understand why), depriving the -n loop of its first argument. That shifted argument is opened, and then the list returned by <$f> (the diamond operator returns all lines in list context!) is processed by map. We are in a BEGIN block, so (I suppose) it is too early for the -l switch to do its auto-chomp; therefore in the map block we chomp and increment the value for the key $_ (the current line fed in by <$f>) in the hash %ln.

    Now we are in the main body of the one-liner, where -l and -n are in effect; we print the current filename ($ARGV when using the diamond operator <>, see perlvar), the line number $. (again, perlvar), and the current line $_, but only if the corresponding hash entry $ln{$_} exists.

    close ARGV if eof closes the special filehandle ARGV (see perlvar) when end-of-file is reached: this is important because $. does not reset on the implicit close of a filehandle, as happens in our case. Remove that part to see $. increase steadily across every file opened.
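    Written out as a plain script, the whole one-liner is roughly equivalent to this (a hand-expanded sketch, not real B::Deparse output):

    use strict;
    use warnings;

    # What the BEGIN block does: slurp the first file into a lookup hash.
    my %ln;
    my $first = shift @ARGV;    # deprives the <> loop of its first argument
    open my $f, '<', $first or die "Couldn't open $first: $!";
    while ( my $line = <$f> ) {
        chomp $line;
        $ln{$line}++;
    }

    # What -n and -l do: loop over the remaining files, chomping each line.
    while ( my $line = <> ) {
        chomp $line;
        print "$ARGV line\t$.\t[$line]\n" if exists $ln{$line};
        close ARGV if eof;      # so $. restarts at 1 for each file
    }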

    Have fun and happy new year (maybe reading Perl White Magic - Special Variables and Command Line Switches)
    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Hoping everyone had a great New Year, and thanks for the replies. I started trying some more tactics to find out how to compare columns as well as lines. I am doing this as an exercise. There are actually not 50 files; I meant that there could be more than two. Here are the actual files I created:

      first.txt
      /vol/cat,feline
      /vol/dog,canine
      /vol/cat,feline
      /vol/cat,feline
      /vol/amphibian,FROG
      /vol/amphibian,FROG
      second.txt
      9,/vol/elephant,fourfeet
      1999,/vol/dolphin,fish
      10,/vol/cat,feline
      1111,/vol/goldfish,fish
      2222,/vol/spider,arachnid
      5555,/vol/camel,dromedary
      3333,/vol/wolf,canine
      I am trying to do the following:
      1. Select /vol/cat,feline as the element common to both files - this requires that the first column of second.txt be excluded from the comparison.
      2. If /vol/cat,feline is found in both, print the ID number from second.txt - in this example, 10.
      Here is what I did so far:
      use strict;

      sub get_animal {
          open my $FILE, '<', shift or die $!;
          return map {chop; $_ => $_} <$FILE>;
      }

      my %a = get_animal '/tmp/first.txt';
      my %b = get_animal '/tmp/second.txt';

      {
          print "$_\n" for grep {$_} @a{keys %b};
      }
      It works if the column with the numbers in it is deleted from second.txt. What I don't know is how to compare the first and second columns of the first file with the second and third columns of the second file. After that, it needs to return the ID number when it finds a match. Any ideas?
        Hello, I cannot fully understand your requirements, given the two example files; can you rephrase?

        Some observations:
        • where is use warnings;?
        • do not use uppercase variable names
        • also remember to close your filehandles anyway: it is safer.
        • chop is not chomp
        • Avoid a or b as variable names: the scalar forms are special variables (used by sort), and even though the hashes are not special, avoid them anyway. Instead, choose meaningful variable names.
        • when learning or debugging, I think it is preferable to write things out in plain syntax: you have a superfluous bare block - why? The syntax inside it is not exactly a beginner's one. How can you inspect it without a place to insert the basic debugging tool, aka print?
        #{
        #    print "$_\n" for grep {$_} @a{keys %b};
        #}
        #
        # should be something like (untested..)
        foreach my $bkey (keys %b) {
            warn "key not defined" unless $bkey;  ## what is the purpose of your "grep {$_}"???
            if ( $a{$bkey} ) { print "FOUND: [$bkey] in the hash \%a\n" }
            else             { print "NOT found key [$bkey] in the hash \%a\n" }
        }
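        If I guess your goal right (find each first.txt line in the last two columns of second.txt and print the leading ID), maybe something like this (untested; the trick is split /,/, $line, 2, which splits only on the first comma so "/vol/cat,feline" stays intact as a hash key):

        use strict;
        use warnings;

        # Build a lookup from "path,class" to ID out of second.txt.
        my %id_for;
        open my $second, '<', '/tmp/second.txt' or die "second.txt: $!";
        while ( my $line = <$second> ) {
            chomp $line;
            # "10,/vol/cat,feline" -> ("10", "/vol/cat,feline")
            my ( $id, $rest ) = split /,/, $line, 2;
            $id_for{$rest} = $id;
        }
        close $second;

        # Each line of first.txt is already in "path,class" form.
        open my $first, '<', '/tmp/first.txt' or die "first.txt: $!";
        while ( my $line = <$first> ) {
            chomp $line;
            print "$line => ID $id_for{$line}\n" if exists $id_for{$line};
        }
        close $first;

        With your example files this should print /vol/cat,feline => ID 10 three times, once per duplicate line in first.txt.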

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: comparing multiple files for patterns
by Laurent_R (Canon) on Dec 31, 2015 at 18:46 UTC
    If you have 50 of these files and want to compare each file with each other, do you realize how many file comparisons you're going to run? That's 50 × 49 / 2 = 1,225 pairwise comparisons. Your computer might happily do it (if your files are not too large), but what are you going to do then with more than a thousand resulting comparisons? Compare each comparison with each other and get roughly three quarters of a million results?

    Perhaps you should rethink your actual needs.