So what do you want the output to be? Do you want all three lines from file1 and all six lines from file2? Do you want just the lines with distinct information (maybe counting how many times each distinct line occurs)? Do you want just the distinct values from the "join" column that match in the two files (just "2002" in this case)? Or maybe, for each distinct matching value, how many times it occurs in each file (e.g. "2002 3 6")?
If you want the full lines from each file that have matching values, how do you want to organize them? This is tricky, because it looks like there will be variable numbers of lines from each file for the values that match.
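To make the last of those options concrete, here is a minimal sketch of the "how many times does each matching value occur in each file" report. It is written in Python purely as an illustration; the function names `key_counts` and `shared_key_counts` are hypothetical, and the sample rows are only a subset of the data from the thread.

```python
import re
from collections import Counter

def key_counts(lines, column, delimiter=r"\|"):
    """Count how often each value appears in the given 1-based column."""
    counts = Counter()
    for line in lines:
        fields = re.split(delimiter, line.rstrip("\n"))
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts

def shared_key_counts(lines1, col1, lines2, col2, delimiter=r"\|"):
    """Return (key, count_in_first, count_in_second) for keys found in both inputs."""
    c1 = key_counts(lines1, col1, delimiter)
    c2 = key_counts(lines2, col2, delimiter)
    return [(key, c1[key], c2[key]) for key in sorted(c1.keys() & c2.keys())]

# Subset of the thread's sample data (assumed layout: key in column 2 and column 3).
file1 = ["1177|055876", "1189|2002", "1190|2002", "1191|2002"]
file2 = [
    "000|20044|055876|No Definida.",
    "210|72059|2002|SERGIO SUAREZ LLAMAS",
    "210|72059|2002|SERGIO SUAREZ LLAMAS",
    "210|72059|2002|SERGIO SUAREZ LLAMAS",
    "210|20023|2002|SERGIO SUAREZ LLAMAS",
    "210|72057|2002|SERGIO SUAREZ LLAMAS",
    "210|67013|2002|SERGIO SUAREZ LLAMAS",
]

for key, n1, n2 in shared_key_counts(file1, 2, file2, 3):
    print(key, n1, n2)
```

For the key "2002" this reports 3 occurrences in the first input and 6 in the second, which is the "2002 3 6" form mentioned above.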
I wrote a simple utility script to compare specific columns in two files, and print the intersection or union or difference of the column values -- I posted it here: cmpcol. Maybe it will give you some ideas on how to tackle your specific task (or maybe it will do the task you want -- I'm not sure...)
I put your sample data into files as indicated, and here are some outputs from cmpcol using those two files as input:
I wrote cmpcol to allow a lot of flexibility in column delimiters -- the string supplied with the "-d" option is passed directly as a regex to "split()", with all magic characters enabled (so in this case, I have to backslash the vertical bar character, to treat it as a literal, not magic).

    # first example: just print matching "key" values:
    $ cmpcol -d '\|' -i file1:2 file2:3
    0040
    052425
    052634
    053281
    055876
    2002

    # print full lines from file1 that match keys in file2
    $ cmpcol -d '\|' -i -l1 file1:2 file2:3
    1173|0040
    1174|052425
    1175|052634
    1176|053281
    1177|055876
    1189|2002
    1190|2002
    1191|2002

    # print full lines of file2 that match keys in file1:
    $ cmpcol -d '\|' -i -l2 file1:2 file2:3
    000|20019|0040|No Definida.
    000|20034|052425|No Definida.
    000|20014|052634|No Definida.
    000|20031|053281|No Definida.
    000|20044|055876|No Definida.
    210|72059|2002|SERGIO SUAREZ LLAMAS
    210|72059|2002|SERGIO SUAREZ LLAMAS
    210|72059|2002|SERGIO SUAREZ LLAMAS
    210|20023|2002|SERGIO SUAREZ LLAMAS
    210|72057|2002|SERGIO SUAREZ LLAMAS
    210|67013|2002|SERGIO SUAREZ LLAMAS

    # relate full matching lines from both files
    # (note extra lines from file2 at bottom, matching "2002"):
    $ cmpcol -d '\|' -i -lb file1:2 file2:3
    1173|0040:<>:000|20019|0040|No Definida.
    1174|052425:<>:000|20034|052425|No Definida.
    1175|052634:<>:000|20014|052634|No Definida.
    1176|053281:<>:000|20031|053281|No Definida.
    1177|055876:<>:000|20044|055876|No Definida.
    1189|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS
    1190|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS
    1191|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS
    :<>:210|20023|2002|SERGIO SUAREZ LLAMAS
    :<>:210|72057|2002|SERGIO SUAREZ LLAMAS
    :<>:210|67013|2002|SERGIO SUAREZ LLAMAS

    # same as previous, but only use unique lines from file2:
    $ sort -u file2 | cmpcol -d '\|' -i -lb file1:2 stdin:3
    1173|0040:<>:000|20019|0040|No Definida.
    1174|052425:<>:000|20034|052425|No Definida.
    1175|052634:<>:000|20014|052634|No Definida.
    1176|053281:<>:000|20031|053281|No Definida.
    1177|055876:<>:000|20044|055876|No Definida.
    1189|2002:<>:210|20023|2002|SERGIO SUAREZ LLAMAS
    1190|2002:<>:210|67013|2002|SERGIO SUAREZ LLAMAS
    1191|2002:<>:210|72057|2002|SERGIO SUAREZ LLAMAS
    :<>:210|72059|2002|SERGIO SUAREZ LLAMAS
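For anyone who wants to see the idea behind the paired output without reading cmpcol itself, here is a rough Python sketch that imitates the shape of the "-lb" results shown above. It is not cmpcol's code: the names `group_by_key` and `pair_matching_lines` are made up for illustration, and the output order (first appearance in the first file) is an assumption. The delimiter is used as a regex, so the vertical bar is escaped here too.

```python
import re
from itertools import zip_longest

SEPARATOR = ":<>:"  # same marker that appears in the sample output

def group_by_key(lines, column, delimiter):
    """Map each key (1-based column) to the full lines carrying it, in input order."""
    groups = {}
    for line in lines:
        line = line.rstrip("\n")
        fields = re.split(delimiter, line)
        if len(fields) >= column:
            groups.setdefault(fields[column - 1], []).append(line)
    return groups

def pair_matching_lines(lines1, col1, lines2, col2, delimiter=r"\|"):
    """Pair up full lines whose key columns match; pad the shorter side with ''."""
    g1 = group_by_key(lines1, col1, delimiter)
    g2 = group_by_key(lines2, col2, delimiter)
    paired = []
    for key in g1:  # dicts keep insertion order, so this follows the first input
        if key in g2:
            for left, right in zip_longest(g1[key], g2[key], fillvalue=""):
                paired.append(f"{left}{SEPARATOR}{right}")
    return paired

# Small subset of the thread's data, enough to show the uneven "2002" case.
first = ["1177|055876", "1189|2002", "1190|2002"]
second = [
    "000|20044|055876|No Definida.",
    "210|72059|2002|SERGIO SUAREZ LLAMAS",
    "210|20023|2002|SERGIO SUAREZ LLAMAS",
    "210|72057|2002|SERGIO SUAREZ LLAMAS",
]

for row in pair_matching_lines(first, 2, second, 3):
    print(row)
```

Because the second input has more "2002" lines than the first, the last row comes out with an empty left side, just like the trailing ":<>:210|..." lines in the real output.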
When full lines are output from both files, the string ":<>:" is used to mark the division between the two files, because this is generally bound to be distinctive and unmistakable. (Maybe I should add an option to control that, but you can just pipe the output through "sed" or a perl one-liner to make it whatever you want.)
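If sed or a Perl one-liner is not handy, the same post-processing is only a few lines in other languages as well. A small Python sketch follows; `replace_separator` is a hypothetical helper name, and using a tab as the new divider is just an example choice.

```python
def replace_separator(lines, old=":<>:", new="\t"):
    """Swap the first occurrence of the file-division marker on each line."""
    for line in lines:
        # Only the first occurrence is replaced, since the marker appears
        # once per output line as the division between the two files.
        yield line.replace(old, new, 1)

sample = [
    "1189|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS\n",
    ":<>:210|20023|2002|SERGIO SUAREZ LLAMAS\n",
]
for out in replace_separator(sample):
    print(out, end="")
```

To use it as a filter, pass `sys.stdin` as the `lines` argument and write each result to standard output.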
Hope that helps.
In reply to Re: How compare two files
by graff
in thread How compare two files
by Daredevil--