in reply to comparing columns using regular expression

File2 is a subset(w/ some repetitions) of the second column from file1. I need to print the entire line from file1 that does NOT contain the rows in file2

I also question why you would want to use regex's, even if the file doesn't have fixed width fields and only appears that way to make your post look pretty. If file 2 stores the whole value of column 2, why not just store the values of file 2 in a hash? Then when you read file 1, split the row into fields, and look up the value of field 2 in a hash to see if you want the line.

my @aKeep; while (my $line = <F1>) { # split line into fields - depends on how the line is formatted @aFields = .... if (!exists($hValuesInColumn2{$aFields[1]}) { push @aKeep, $line; } }

A few efficiency comments as well. There is no need to explicitly "unique" the second file if you use a hash to store the field values. Since a hash key is stored only once it automatically "uniques" the keys.

Also when reading in the lines of a file, it is better to use while than foreach. foreach slurps in all of the lines into an array and then visits each, whereas while reads the lines in one at a time and is much more memory efficient.

my %hValuesInColumn2; # use while, not foreach while (my $line = <F2>) { #get rid of trailing new line chomp $line; # strip remaining leading and trailing whitespace # (if that is an issue) $line =~ s/^\s+//; #leading $line =~ s/\s+$//; #trailing # record field value $hValuesInColumn2->{$line} = 1; } # extracting the keys may not even be necessary # see first part of this answer. my @unique_f = keys %hValuesInColumn2;

Update: reordered paragraphs to put focus on OP's original question rather than efficiency issues.

Replies are listed 'Best First'.
Re^2: comparing columns using regular expression
by rocky13 (Acolyte) on Jan 20, 2011 at 02:43 UTC
    Everything makes sense but how do I look up the values of field 2 in file1? As you can see, the number of fields in each line are different. How do you know for sure that you're getting the correct field? Here are the facts: The first column has only word(s)(because they are names/titles). The second and third column will have mix of numbers/letters. Using regex, I though there is a way to isolate the 2nd column by: (starts w/ any comb of word characters and \s)2nd col(ends with space and numbers/letters only) I will try it. If i get it i'll post it. If there is a another way, please let me know. Thanks!

      Put another way, you aren't sure about how to split the line into fields? Once you have the line split up into fields, you just extract the second element of the array, i.e. $aFields[1] and see if it is in the hash.

      How you split up the lines depends heavily on the syntax/grammar of your file. Do you know what that is? You seem a bit uncertain. Are these "fields" from a row of a database or a generated report? Does the file have a documented format? Or are these actually just words in a line?

      If these are just words, you could safely split the line on whitespace, like this

      $line =~ s/^\s+//; # strip leading whitespace from the line $line =~ s/\s+\z//; # strip trailing whitespace from the line my @aFields = split(/\s+/,$line); #extract words

      However, the above won't work if you have whitespace inside a field because it will split the field in two. If on some line of the file, the first column contains three words, then column 2 would end up in the fourth array element and you'd never know.

      What are the rules that actually govern the organization of this file into rows and fields? To know whether or not you need to use regex's you first need to know the file's rules. Regex's aren't always the best solution. thezip has pointed out that unpack would be a better way to split the line into fields if you are dealing with fixed width columns. (each column/field has known-in-advance number of characters.).

      On the other hand, if fields are separated from one another using a separator string or character, regex's are often good for splitting such lines into fields. However, you won't know what regex to use without knowing the format. Rules for separator delimited fields can be very simple or complex.

      A simple rule might be "a tab always means column separator". If that was your rule, you could just use split("\t",$line) to break up the line into fields.

      Or it could be more complicated - columns are separated by whitespace except where the whitespace is quoted or escaped. Or it could be even more complicated: the first character of each line determines the field separator for the rest of the line, plus there is an escaping/quoting mechanism. The possibilities are endless. It would be hard to advise you without knowing the intended rules of the file.

      Update: various clarifications and rewordings.

        Thanks! This simiplifies a lot of things for me. I will look into it.