File Intersection problem

ashnator has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am facing a peculiar problem printing the text common to two files.
I tried shell commands such as sdiff, diff and gvimdiff, but with no success.

Please help me because it is urgent.
My files look like this:-

File 1:-
Harry 21
Jeff  45
Rob   78
Mett  34
Ann   17
Gilli 39
DOn   98
Ben   15
Harry 54
Rose 46
Ness 65

File 2:-

2133 32 45 45 CC Old Harry (D) (28%) 21 + -1 Rob (D) (31%) 78 + -1 Mett (D) (14%) 34 + -1
5789 78 66 32 DD Young Gilli (D) (10%) 39 + -1 Don (D) (66%) 98 + -1 Mett (D) (23%) 15 + +1
9027 56 77 29 GG Old Harry (D) (10%) 54 + -1 Rose (D) (%) 46 + -1 Ness (D) (67%) 65 + 1

I have to get my results like this:-
If Harry, Rob and Mett are present in File 1:
2133 32 45 45 CC Old Harry (D) (28%) 21 + -1 Rob (D) (31%) 78 + -1 Mett (D) (14%) 34 + -1    3
3, since all three are matching.

If only Gilli and Don are present in File 1:
5789 78 66 32 DD Young Gilli (D) (10%) 39 + -1 Don (D) (66%) 98 + -1    2
2, since only two are matching.

9027 56 77 29 GG Old Harry (D) (10%) 54 + -1     1
1, since only one is matching with File 1.

I am new to scripting, so I don't know how to approach the problem.
I would be obliged to get help from the Monks.
Bravo ... Here is my code, but I am not getting the output yet :(

#!/usr/bin/perl -w


open(FH, "<File1.txt") || die "Cannot open file";

%href;
while( <FH> )
{
    chomp($_); 
    $href{$1} = $2 if $_ =~ /(\S+)\s+(\S+)/;
}
while (my ($key, $value) = each(%href)) {

              #print $key. ", ". $value."\n";

      }
close FH;

open(FD, "<File2.txt") || die "Cannot open file";

while(<FD>)
   {
           chomp;

           next unless ( s{  \s+(\w+)\s+\([A-Z]\)\s+\(\d*%\)\s+\d\s+\+\s[+-]?\d+}{}xms and exists( $href{$1} ));
           my $name = $1;
           print "$name\t\t$href{$name}\n";
           #@_=split('\t',$_);
   }

Thanks


Replies are listed 'Best First'.
Re: File Intersection problem by johngg (Canon) on Nov 12, 2008 at 16:39 UTC
To get you started, have a look at open, readline and close for your file handling. The three-argument form of open, using lexical filehandles and testing for success (mentioning `$!`, the O/S error, on failure), is recommended practice. E.g.

    my $inputFile = q{xyz.txt};
    open my $inputFH, q{<}, $inputFile
        or die qq{open: < $inputFile: $!\n};
    my $outputFile = q{abc.txt};
    open my $outputFH, q{>}, $outputFile
        or die qq{open: > $outputFile: $!\n};

Read your files in a while loop line by line, either into the default pattern space, `$_` (see perlvar), or into a lexical scalar. Consider whether to remove line terminators using chomp. E.g.

    # Reading lines into $_ in a while loop and removing
    # line terminator
    while( <$inputFH> )
    {
        chomp;    # Acts on $_ by default
        ...       # Rest of your line-processing code here
    }

Extracting the data fields could be done using regular expressions; have a look at perlretut and perlre. Consider using global matching in File 2 to extract a series of name data into an array. This pattern (not tested) looks like it will match data for one name, with the whole data including the name captured in `$1` and the name alone in `$2`:

    my $rxExtractNameData = qr
        {(?x)
            (
                \s+ (\w+) \s+
                \([A-Z]\) \s+
                \(\d*%\) \s+
                \d+\s+\+\s+[+-]?\d+
            )
        };

Have a crack at putting some of these ideas together. Build your script gradually, first just getting it to open and close File 1 and, when that works, read the lines and extract the names (and numbers if necessary; do the numbers in File 1 have to match those in File 2?). If you encounter difficulties, come back to the Monastery with more questions.

I hope these thoughts are helpful.

Cheers,
JohnGG
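As a rough illustration of the global-matching idea in this reply, here is a minimal sketch. The sub name `names_in_line` and the sample line are assumptions made for illustration; the pattern is the one suggested in the reply, laid out with the `x` modifier rather than `(?x)`.

```perl
use strict;
use warnings;

# The whole name-data group is captured in group 1, the name in group 2.
my $rxExtractNameData = qr{
    (                   # group 1: the whole data group
        \s+ (\w+) \s+   # group 2: the name
        \([A-Z]\) \s+   # a single letter in parentheses
        \(\d*%\) \s+    # a percentage, digits optional
        \d+ \s+ \+ \s+  # a number followed by a plus sign
        [+-]?\d+        # an optionally signed number
    )
}x;

# Hypothetical helper: return every name found in one line of File 2.
sub names_in_line {
    my ($line) = @_;
    my @names;
    while ( $line =~ m{$rxExtractNameData}g ) {
        push @names, $2;
    }
    return @names;
}

my $line = '2133 32 45 45 CC Old Harry (D) (28%) 21 + -1 '
         . 'Rob (D) (31%) 78 + -1 Mett (D) (14%) 34 + -1';
print join( q{,}, names_in_line($line) ), "\n";
```

Matching in scalar context with the `g` modifier walks along the line one data group at a time, so `$2` holds a different name on each pass of the loop.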
Re^2: File Intersection problem by ashnator (Sexton) on Nov 13, 2008 at 08:26 UTC
Please check my updated code and suggest modifications, because I am not getting the output :(
Re^3: File Intersection problem by johngg (Canon) on Nov 13, 2008 at 22:41 UTC
A few points about your updated code:

Although you are checking for the success of your open statements, your error messages are pretty useless: they don't say which file you were trying to open when the failure occurred, and they don't give any indication of the O/S error (see `$!` or `$OS_ERROR` in perlvar) that might have caused the failure.

chomp defaults to operating on `$_`, so your `chomp($_);` can simply be written `chomp;` to save you some typing.

Having read your `File1.txt`, you have a look at the hash you have created by iterating over the key/value pairs using each in a while loop (your `print` statement is actually commented out, so I guess you just used this for debugging). That works for simple hashes but quickly becomes unwieldy when data structures are more complicated. The Data::Dumper module is part of the standard Perl distribution and is an invaluable tool if you want to examine your data.

You do a regex substitution along these lines, `s{...}{}xms`, but you used a pattern that is one contiguous string. The whole point of the `x` flag is to allow you to intersperse spaces (and comments if you wish) in the pattern to make it more readable and easily understood.
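To make the Data::Dumper and `x` flag points concrete, here is a small sketch. The sub name `parse_names` and the hash name `%numberFor` are assumptions for illustration; the sample lines are taken from File 1 in the original post.

```perl
use strict;
use warnings;
use Data::Dumper;

# A few sample lines from File 1; note that Harry appears twice.
my @file1Lines = ( 'Harry 21', 'Jeff  45', 'Rob   78', 'Harry 54' );

# Hypothetical helper: build a name => number hash from "name number" lines.
sub parse_names {
    my @lines = @_;
    my %numberFor;
    for my $line (@lines) {
        # The x flag lets the pattern be spaced out for readability.
        next unless $line =~ m{ (\S+) \s+ (\S+) }x;
        $numberFor{$1} = $2;
    }
    return %numberFor;
}

my %numberFor = parse_names(@file1Lines);

# Sort the keys so the dump is stable from run to run.
local $Data::Dumper::Sortkeys = 1;
print Dumper( \%numberFor );
```

Dumping the hash also exposes something that is easy to miss: because the hash is keyed on the name, the second Harry line overwrites the first, so only the number 54 survives. If the numbers matter, a different structure (for example, a hash of arrays) may be needed.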
Judging by the desired output in your OP, my understanding of your requirement is this:

1. Read and parse `File1.txt` to obtain a list of names (and, possibly, associated numbers) used to filter `File2.txt`.
2. Read and process `File2.txt` line by line.
3. For each line:
   a. Extract the preamble preceding the first set of name-associated data and print it without a newline.
   b. Extract each set of name-associated data, determine the name contained in that data and only print that data if the name occurs in the list, again with no newline.
   c. When all of the name-associated data groups have been processed, print a newline.

When you start processing lines in `File2.txt` you rather jump the gun by substituting the first name-associated data group by nothing before you know whether you actually need to keep it and also before you do anything with the preamble. The need to identify the preamble as well as extracting each data set leads me to slightly reconsider the compiled regular expression (see Regexp Quote-Like Operators in perlop) I gave in my earlier reply. I would remove the second capture around the name, so that it becomes

    my $rxExtractNameData = qr
        {(?x)
            (
                \s+ \w+ \s+
                \([A-Z]\) \s+
                \(\d*%\) \s+
                \d+\s+\+\s+[+-]?\d+
            )
        };

which would allow me to use it both for the preamble and the data groups:

    ...
    while( <$file2FH> )
    {
        # reject line unless it has name data
        next unless m{^(.*?)$rxExtractNameData};
        print $1;    # preamble captured in $1

        # global match to extract to an array
        my @dataGroups = m{$rxExtractNameData}g;
        ...
    }
    ...

Once you have the name-associated data groups you can loop over them pulling out each name, test whether it is in the hash parsed from `File1.txt` and, if so, print the data group.

I hope that I have correctly understood your requirements and that these thoughts will be useful.

Cheers,
JohnGG
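The steps above can be sketched end to end as follows. This is only one possible reading of the requirement: the sub name `filter_line` and the `%wanted` hash are assumptions, only names (not numbers) are compared, and the trailing count is appended because the desired output in the original post shows one.

```perl
use strict;
use warnings;

# One name-associated data group, captured whole.
my $rxExtractNameData = qr{
    ( \s+ \w+ \s+ \([A-Z]\) \s+ \(\d*%\) \s+ \d+ \s+ \+ \s+ [+-]?\d+ )
}x;

# Hypothetical helper: return the preamble, the data groups whose name is
# wanted, and a count of those groups. Returns nothing if the line has no
# name data at all.
sub filter_line {
    my ( $line, $wantedRef ) = @_;

    return unless $line =~ m{^(.*?)$rxExtractNameData};
    my $output = $1;    # the preamble
    my $count  = 0;

    for my $group ( $line =~ m{$rxExtractNameData}g ) {
        my ($name) = $group =~ m{(\w+)};    # first word in the group
        next unless exists $wantedRef->{$name};
        $output .= $group;
        $count++;
    }
    return "$output\t$count";
}

# Names assumed to have been parsed from File1.txt.
my %wanted = map { ( $_ => 1 ) } qw( Harry Rob );

my $line = '2133 32 45 45 CC Old Harry (D) (28%) 21 + -1 '
         . 'Rob (D) (31%) 78 + -1 Mett (D) (14%) 34 + -1';
print filter_line( $line, \%wanted ), "\n";
```

With only Harry and Rob in `%wanted`, the Mett group is dropped and the count is 2. Keeping the filtering in a sub makes it easy to try different name lists against the same line before wiring in the real file handles.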
Re^3: File Intersection problem by brsaravan (Scribe) on Nov 13, 2008 at 11:01 UTC
Try this code.

    open(FD, "File2.txt") || die "Cannot open file";
    my @input_array = <FD>;
    foreach my $line (@input_array) {
        chomp($line);
        my $cnt = 0;
        map {($line =~ /$_/)?++$cnt:$cnt}keys %href;
        print "$line $cnt\n";
    }
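A possible variation on the counting idea above, offered as a sketch rather than a correction: matching each name as a bare pattern also matches it inside longer words, so this version anchors the name with word boundaries and quotes it with `\Q...\E`. The sub name `count_names` is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many of the given names occur in the
# line as whole words. grep in scalar context returns the match count.
sub count_names {
    my ( $line, @names ) = @_;
    return scalar grep { $line =~ m{\b\Q$_\E\b} } @names;
}

my $line = '2133 32 45 45 CC Old Harry (D) (28%) 21 + -1 Rob (D) (31%) 78 + -1';
print count_names( $line, qw( Harry Rob Ann Ro ) ), "\n";
```

Here `Ro` is not counted even though it is a prefix of `Rob`, because the trailing `\b` requires a word boundary after the name.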
Re: File Intersection problem by gone2015 (Deacon) on Nov 12, 2008 at 16:18 UTC
The problem appears to boil down to:

1. opening the file containing the names you wish to match, and reading that into some data structure that you can use to match with. Perl hashes are good at exact matches.
2. opening the other file, and reading it line by line.
3. parsing the line in some way to identify the names it contains. You may be able to split it, or you may need to use a regular expression.
4. seeing how many (if any) of those names are in the collection of names you want to match.
5. deciding what to output, and doing that.

I imagine that matching names may be a small challenge. Does, for example, 'harry' match either 'Harry' or, indeed, 'Old Harry'? But otherwise, this looks straightforward enough. What have you tried?
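The name-matching question matters for the sample data, since File 1 spells one name `DOn` while File 2 has `Don`. If case is meant to be ignored (an assumption; the original post does not say), one minimal sketch is to lower-case the names on both sides of the hash lookup. The sub names `build_lookup` and `is_wanted` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup keyed on lower-cased names.
sub build_lookup {
    my @names = @_;
    my %seen = map { ( lc($_) => 1 ) } @names;
    return \%seen;
}

# Hypothetical helper: exact match on the whole name, ignoring case.
sub is_wanted {
    my ( $lookupRef, $name ) = @_;
    return exists $lookupRef->{ lc $name } ? 1 : 0;
}

my $lookup = build_lookup(qw( Harry DOn Mett ));
for my $candidate (qw( Don harry Met Old )) {
    printf "%-6s %s\n", $candidate,
        is_wanted( $lookup, $candidate ) ? 'match' : 'no match';
}
```

A hash lookup is still an exact match on the whole key, so `Met` does not match `Mett` and `Old` matches nothing; only the difference in case is smoothed over.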
Re: File Intersection problem by toolic (Bishop) on Nov 12, 2008 at 16:22 UTC
Here is one algorithm:

1. Read file 1 using open.
2. Read each line using `while`.
3. split each line on whitespace into Name and Number variables.
4. Store those variables in a hash with Name as the key and Number as the value.
5. close file 1.
6. Read file 2.
7. split each line into a new array.
8. Compare each item of the new array with each key of the hash, incrementing a counter each time there is a match.
9. Print the number of matches for the line.
10. close file 2.

This may not be the most efficient algorithm, but you can try it, then post your code; you are sure to get constructive criticism on code that you post.
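The counting part of this algorithm can be sketched as below. In-memory sample data stands in for the two files, and the sub name `count_matches` is an assumption. Rather than comparing each item with each key in a nested loop, the sketch looks each item up in the hash directly, which gives the same count for exact matches.

```perl
use strict;
use warnings;

# Name => number pairs as they would be read from file 1.
my %numberFor;
for my $line ( 'Harry 21', 'Rob   78', 'Mett  34' ) {
    my ( $name, $number ) = split ' ', $line;
    $numberFor{$name} = $number;
}

# Hypothetical helper: split a line on whitespace and count the items
# that are keys of the hash. grep in scalar context returns the count.
sub count_matches {
    my ( $line, $numberForRef ) = @_;
    my $count = grep { exists $numberForRef->{$_} } split ' ', $line;
    return $count;
}

my $line = '2133 32 45 45 CC Old Harry (D) (28%) 21 + -1 '
         . 'Rob (D) (31%) 78 + -1 Mett (D) (14%) 34 + -1';
print count_matches( $line, \%numberFor ), "\n";
```

Items such as `(D)` or `21` are simply never found in the hash, so they do not affect the count.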
Re: File Intersection problem by JavaFan (Canon) on Nov 12, 2008 at 16:18 UTC
I'd read file one, store the names in a hash, then read the second file and, for each line, split the line into four parts (a prefix, and three parts for each name). If any of the names in the three latter parts match, print the prefix and each of the matching name parts. If you have specific problems with any of the subtasks, feel free to write what you have tried, and why it isn't working for you!
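One way the split into a prefix and name parts might be done is with a lookahead, so that the line is cut at whitespace that comes just before a name followed by a parenthesised letter. This is a sketch under that assumption about the line layout; the sub names `split_line` and `keep_wanted` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a File 2 line into a prefix followed by one
# part per name. The lookahead keeps the name inside its own part.
sub split_line {
    my ($line) = @_;
    return split m{ \s+ (?= \w+ \s+ \( [A-Z] \) ) }x, $line;
}

# Hypothetical helper: keep the prefix plus the parts whose leading word
# is a wanted name, and append the number of parts kept.
sub keep_wanted {
    my ( $line, $wantedRef ) = @_;
    my ( $prefix, @parts ) = split_line($line);
    my @kept = grep { /^(\w+)/ && exists $wantedRef->{$1} } @parts;
    return join( q{ }, $prefix, @kept ) . "\t" . scalar(@kept);
}

my $line = '9027 56 77 29 GG Old Harry (D) (10%) 54 + -1 '
         . 'Rose (D) (%) 46 + -1 Ness (D) (67%) 65 + 1';
print keep_wanted( $line, { Harry => 1 } ), "\n";
```

Because the split is driven by the line's own structure, it copes with lines that carry fewer or more than three name parts.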