Part of the problem could be that you haven't specified the goal completely. You say you want to compare the second column of file1 with the third column of file2 and "get a new file" where those fields are the same. In your two data file examples, "2002" shows up in three rows of file1 (but they all have distinct values in the first column), and in six rows of file2 (but three of these rows are identical, and the other three have distinct values in the second column).

So what do you want the output to be? Do you want all three lines from file1 and all six lines from file2? Do you want just the lines with distinct information (maybe counting how many times each distinct line occurs)? Do you want just the distinct values from the "join" column that match in the two files (just "2002" in this case)? Or maybe, for each distinct matching value, how many times it occurs in each file (e.g. "2002 3 6")?
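
For the last of those options (per-key counts), here is a rough awk sketch, independent of cmpcol. The function name `key_counts` is made up for illustration, and it assumes the fields are separated by a literal "|" and that the first file is not empty:

```shell
# Hypothetical helper (not part of cmpcol): for every key found in both files,
# print the key, its row count in FILE1, and its row count in FILE2.
# Usage: key_counts FILE1 COL1 FILE2 COL2
# Assumes "|"-delimited fields and a non-empty FILE1.
key_counts() {
    awk -F'|' -v c1="$2" -v c2="$4" '
        FNR == NR { n1[$c1]++; next }   # still in the first file: count rows per key
        { n2[$c2]++ }                   # second file: count rows per key
        END {
            for (k in n1)               # report only keys seen in both files
                if (k in n2) print k, n1[k], n2[k]
        }
    ' "$1" "$3" | sort
}
```

Called as `key_counts file1 2 file2 3`, this should report "2002 3 6" for the sample data, given the three and six rows noted above.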

If you want the full lines from each file that have matching values, how do you want to organize them? This is tricky, because it looks like there will be variable numbers of lines from each file for the values that match.

I wrote a simple utility script to compare specific columns in two files, and print the intersection, union, or difference of the column values -- I posted it here: cmpcol. Maybe it will give you some ideas on how to tackle your specific task (or maybe it will do the task you want -- I'm not sure...)

I put your sample data into files as indicated, and here are some outputs from cmpcol using those two files as input:

# first example: just print matching "key" values:

$ cmpcol -d '\|' -i file1:2 file2:3
0040
052425
052634
053281
055876
2002

# print full lines from file1 that match keys in file2

$ cmpcol -d '\|' -i -l1 file1:2 file2:3
1173|0040
1174|052425
1175|052634
1176|053281
1177|055876
1189|2002
1190|2002
1191|2002

# print full lines of file2 that match keys in file1:

$ cmpcol -d '\|' -i -l2 file1:2 file2:3
000|20019|0040|No Definida.
000|20034|052425|No Definida.
000|20014|052634|No Definida.
000|20031|053281|No Definida.
000|20044|055876|No Definida.
210|72059|2002|SERGIO SUAREZ LLAMAS
210|72059|2002|SERGIO SUAREZ LLAMAS
210|72059|2002|SERGIO SUAREZ LLAMAS
210|20023|2002|SERGIO SUAREZ LLAMAS
210|72057|2002|SERGIO SUAREZ LLAMAS
210|67013|2002|SERGIO SUAREZ LLAMAS

# relate full matching lines from both files
# (note extra lines from file2 at bottom, matching "2002"):

$ cmpcol -d '\|' -i -lb file1:2 file2:3
1173|0040:<>:000|20019|0040|No Definida.
1174|052425:<>:000|20034|052425|No Definida.
1175|052634:<>:000|20014|052634|No Definida.
1176|053281:<>:000|20031|053281|No Definida.
1177|055876:<>:000|20044|055876|No Definida.
1189|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS
1190|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS
1191|2002:<>:210|72059|2002|SERGIO SUAREZ LLAMAS
:<>:210|20023|2002|SERGIO SUAREZ LLAMAS
:<>:210|72057|2002|SERGIO SUAREZ LLAMAS
:<>:210|67013|2002|SERGIO SUAREZ LLAMAS

# same as previous, but only use unique lines from file2:

$ sort -u file2 | cmpcol -d '\|' -i -lb file1:2 stdin:3
1173|0040:<>:000|20019|0040|No Definida.
1174|052425:<>:000|20034|052425|No Definida.
1175|052634:<>:000|20014|052634|No Definida.
1176|053281:<>:000|20031|053281|No Definida.
1177|055876:<>:000|20044|055876|No Definida.
1189|2002:<>:210|20023|2002|SERGIO SUAREZ LLAMAS
1190|2002:<>:210|67013|2002|SERGIO SUAREZ LLAMAS
1191|2002:<>:210|72057|2002|SERGIO SUAREZ LLAMAS
:<>:210|72059|2002|SERGIO SUAREZ LLAMAS

I wrote cmpcol to allow a lot of flexibility in column delimiters -- the string supplied with the "-d" option is passed directly as a regex to "split()", with all magic characters enabled (so in this case, I have to backslash the vertical bar character, to treat it as a literal, not magic).
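
To see why the backslash matters, here is a small sketch of how Perl's split() treats the two patterns. It is only an illustration of split() in general, not an excerpt from cmpcol; `nth_field` is a hypothetical name, and it needs perl on your PATH:

```shell
# Hypothetical helper: read lines on stdin, split each one on the Perl regex
# given as $1, and print field number $2 (counting from 1).
nth_field() {
    perl -lne '
        BEGIN { ($re, $n) = splice(@ARGV, 0, 2) }   # take both args, leave stdin as input
        my @f = split /$re/;                        # the delimiter is used as a regex
        print $f[$n - 1];
    ' "$1" "$2"
}

# Escaped bar: a literal "|" delimiter, so field 2 is the key column.
printf '%s\n' '1189|2002' | nth_field '\|' 2

# Bare bar: an alternation of two empty patterns, which matches between
# every character, so "field 2" is just the second character of the line.
printf '%s\n' '1189|2002' | nth_field '|' 2
```

The first call prints "2002"; the second prints a single character, which is not what you want from a column delimiter.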

When full lines are output from both files, the string ":<>:" marks the division between the two, because it is unlikely to occur in real data and so is easy to spot. (Maybe I should add an option to control that, but you can just pipe the output through "sed" or a perl one-liner to make it whatever you want.)
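
For instance, a sed filter along these lines would swap the divider for something else. The name `swap_divider` and the " => " replacement are arbitrary choices for illustration:

```shell
# Hypothetical filter: replace the first ":<>:" on each line with " => ".
# Only the first occurrence is touched, so any ":<>:" that happens to
# appear later in the line is left alone.
swap_divider() {
    sed 's/:<>:/ => /'
}

printf '%s\n' '1173|0040:<>:000|20019|0040|No Definida.' | swap_divider
```

In practice you would put it at the end of the pipeline, e.g. `cmpcol -d '\|' -i -lb file1:2 file2:3 | swap_divider`. Lines that have no partner from file1 start with the divider, so they will come out starting with " => ".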

Hope that helps.

In reply to Re: How compare two files by graff
in thread How compare two files by Daredevil--
