Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks!
I have ~10 files and I want to find the COMMON lines in ALL of them.
I tried this one-liner that I found somewhere and had been using without problems, but it seems it only works with 2 files:
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' FILE1 FILE2
because when I used all my files as input, it gave me a subset of lines, but they were not all present in all of the files! So something must have gone wrong here...
I was wondering if there could be something to use along with bash commands, like:
for i in MY_FILES/*.txt; do perl ...... $i; done
so that it will be quick!
Thank you in advance!

Re: find common lines in many files
by Corion (Patriarch) on Jul 11, 2011 at 09:33 UTC

    If you have a solution that works for two files, all you have to do is apply that solution to a third file and the common lines of the first two, and so on until you have no more files.

    Once you understand what $seen{$_} .= @ARGV does, extending it to more than two files shouldn't be too hard. Look at the output of the following command:

    > perl -wnle "print qq(Files remaining ) . @ARGV; print $seen{$_} .= @ARGV;" file1 file2 file3

    If your lines are not unique within a file, you'll have to define what should happen.

    Personally, I think a better approach would be to keep a list of common lines and reduce that list for each file, or maybe just use the uniq tool. But depending on your needs, the approach might need to be different, for example if the order of lines is important.

    My approach would be (see perlfaq4 for finding the intersection of two arrays):

    #!perl -w
    use strict;

    my %seen;

    sub intersect {
        # see perlfaq4
    }

    my $first_file = shift @ARGV;
    my @common = read_file($first_file);
    for (@ARGV) {
        @common = intersect( \@common, [ read_file($_) ] );
    }
    print "$_\n" for @common;
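    The read_file and intersect helpers are left as stubs above; a minimal sketch of the two (my assumption being that the files fit in memory and that lines are unique within each file) could look like this:

    sub read_file {
        my ($file) = @_;
        open my $fh, '<', $file or die "$file: $!";
        chomp( my @lines = <$fh> );    # return the lines without their newlines
        return @lines;
    }

    sub intersect {
        # perlfaq4-style intersection of two array refs
        my ($aref, $bref) = @_;
        my %in_a = map { $_ => 1 } @$aref;
        return grep { $in_a{$_} } @$bref;
    }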
Re: find common lines in many files
by happy.barney (Friar) on Jul 11, 2011 at 09:27 UTC
    Is this what you want?
    my %common;
    my ($first, @files) = @ARGV;

    # seed %common with the lines of the first file
    @common{ do { local @ARGV = ($first); <> } } = ();

    for my $file (@files) {
        my %this;
        local @ARGV = ($file);
        # keep only the lines of this file that are already in %common
        @this{ grep exists $common{$_}, <> } = ();
        delete @common{ grep ! exists $this{$_}, keys %common };
    }

    print keys %common if keys %common;
Re: find common lines in many files
by BrowserUk (Patriarch) on Jul 11, 2011 at 09:23 UTC
    So something must have gone wrong here...

    If one file contains more than one instance of any given line, then your count will reach ten even if it is only seen in 9 files. (BTW: Why test for numeric equality with a regex?)

    For up to 32 files (or 64 if you have 64-bit ints) you could use something like this (untested):

    #! perl -slw
    use strict;

    # One bit per input file: a line is common once all the bits are set.
    my $n = 2 ** @ARGV - 1;
    my %hash;

    for my $i ( 0 .. $#ARGV ) {
        open my $in, '<', $ARGV[ $i ] or die "$ARGV[ $i ] : $!";
        while( <$in> ) {
            chomp;
            $hash{ $_ } = '' unless exists $hash{ $_ };
            ( vec( $hash{ $_ }, 0, 32 ) |= 1 << $i ) == $n and print;
        }
    }

    It's not easy to cast that as a one-liner because of the need to know which input file you are dealing with.
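    (For what it's worth, a hypothetical sketch of such a one-liner would be to bump a file counter at every eof; untested, and like the script above it will print a line again if it is duplicated in the last file:)

    perl -lne 'BEGIN{ $n = 2**@ARGV - 1 } ( $mask{$_} |= 1 << $i ) == $n and print; $i++ if eof' FILE1 FILE2 FILE3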


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Update: Corion points out that you aren't counting. I think your one-liner might work if you modify your regex to be:

      perl -ne 'print if ($seen{$_} .= @ARGV) =~ /^9876543210$/' FILE1 FILE2 ... FILE10

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        No, because if lines occur more than once in a file you get strings like "9987655433321"
Re: find common lines in many files
by jethro (Monsignor) on Jul 11, 2011 at 09:53 UTC

    Let me explain how it works: it builds a string for every distinct line, appending the number of files still left to process each time the line is seen (it doesn't really matter what exactly is appended; the only important thing is that it is different for every file).

    I.e. whenever the line "xyz" occurs in the first file, a '1' is appended to that line's string, because there is still one file left to process (in the case of two files). Whenever "xyz" occurs in the second file, a '0' is appended, because there is no file left in the queue.

    Whenever, after appending a character, the string for line "xyz" ends in "10", i.e. a '0' at the end of the string with a '1' directly before it, the line is printed. That happens exactly once, namely when "xyz" is found for the first time in the second file. Before that there is no '0' in the string; after that there will only be more '0's at the end, so the regex no longer matches.
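    To watch those strings being built you can dump %seen once all the input has been read; for example, with two small (hypothetical) files a.txt and b.txt:

    perl -lne '$seen{$_} .= @ARGV; END { print "$_ => $seen{$_}" for sort keys %seen }' a.txt b.txt

    A line common to both files ends up with a string ending in "10"; a line found only in a.txt gets nothing but '1's, a line found only in b.txt nothing but '0's.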

    This is an ingenious one-liner. But when you have more than two files to compare, the script will read through all of the files and still only print the lines common to the last two. To generalize it to more files you would have to check that every one of the numbers occurs in the string, and it would not work for more than 10 files anyway, because the appended counts would then have more than one digit.

    perl -ne 'BEGIN { $n = @ARGV } $seen{$_} = " " x $n if not exists $seen{$_}; substr($seen{$_}, @ARGV, 1) = "1"; if (@ARGV == 0 and $seen{$_} =~ /^1+$/) { print; $seen{$_} .= "x" }' FILES ...

    This one-liner works with any number of files. Instead of appending, it changes a ' ' to a '1' at the position in the string corresponding to the current file. It also has to initialize the string (to one blank per file) the first time a line is seen, and to invalidate the string (by appending an 'x') after printing it, so that lines occurring twice in the last file don't get printed twice. A line is printed while the last file is being processed, once its string is all '1's.

    PS: Doing things in the shell will almost always be slower than doing them in Perl.

Re: find common lines in many files
by planetscape (Chancellor) on Jul 11, 2011 at 10:38 UTC
Re: find common lines in many files
by JavaFan (Canon) on Jul 11, 2011 at 11:21 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    exit unless @ARGV;

    # seed the cache with the lines of the first file...
    my %cache = do {local @ARGV = shift @ARGV; map {($_, 1)} <>};

    # ...then intersect it with each remaining file in turn
    while (@ARGV) {
        local @ARGV = shift @ARGV;
        %cache = map {$cache{$_} ? ($_, 1) : ()} <>;
    }

    print for keys %cache;

    __END__
    Call it with the list of file names you want to find common lines in.
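    For example (the script name is just a placeholder):

    perl common_lines.pl FILE1 FILE2 FILE3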
Re: find common lines in many files
by amcglinchy (Novice) on Jul 11, 2011 at 13:51 UTC

    To keep track of which file you are reading, use the $ARGV variable. See the perlop manual page for more information on the powerful <> 'diamond' operator.

    Assuming you are dealing with files that can comfortably sit in memory, I suggest a simple hash of hashes to keep track of which line was seen in which file.

    use strict;
    use warnings;

    my $file_count = @ARGV;
    my %seen;

    while (<>) {
        # line $_ was seen at least once in file $ARGV
        $seen{$_}{$ARGV}++;
    }

    while ( my ($line, $by_whom) = each %seen ) {
        if ( $file_count == keys %$by_whom ) {
            # every file saw this line at least once
            print $line;
        }
    }

    If you want this as a one-liner, you can use the -n flag to wrap a diamond-operator loop around the central code. That leads to the following slightly obfuscated one-liner, with plenty of scope for golfing:

    perl -ne 'BEGIN{$count = @ARGV}; $seen{$_}{$ARGV}=1; END{while( ($line, $by_whom) = each %seen){ print $line if $count == keys %$by_whom}}'
Re: find common lines in many files
by i5513 (Pilgrim) on Jul 11, 2011 at 13:37 UTC

    I would do it with a bit of bash pipelining:

    - sort -u to sort each file and keep only one copy of each line per file
    - sort again on the combined output so that identical lines from different files end up adjacent
    - uniq -c to count in how many files each line appears
    - perl to print the interesting lines (those whose count equals the number of files)
    files=(MYFILES/*.txt)
    for f in "${files[@]}"; do sort -u "$f"; done | sort | uniq -c |
        perl -ne 'if (/^[[:space:]]+'"${#files[@]} "'(.*)/) { print $1."\n"; }'
    Regards,