in reply to Re: comparing columns using regular expression
in thread comparing columns using regular expression

Everything makes sense but how do I look up the values of field 2 in file1? As you can see, the number of fields in each line are different. How do you know for sure that you're getting the correct field? Here are the facts: The first column has only word(s)(because they are names/titles). The second and third column will have mix of numbers/letters. Using regex, I though there is a way to isolate the 2nd column by: (starts w/ any comb of word characters and \s)2nd col(ends with space and numbers/letters only) I will try it. If i get it i'll post it. If there is a another way, please let me know. Thanks!
  • Comment on Re^2: comparing columns using regular expression

Replies are listed 'Best First'.
Re^3: comparing columns using regular expression
by ELISHEVA (Prior) on Jan 20, 2011 at 05:48 UTC

    Put another way, you aren't sure about how to split the line into fields? Once you have the line split up into fields, you just extract the second element of the array, i.e. $aFields[1] and see if it is in the hash.

    How you split up the lines depends heavily on the syntax/grammar of your file. Do you know what that is? You seem a bit uncertain. Are these "fields" from a row of a database or a generated report? Does the file have a documented format? Or are these actually just words in a line?

    If these are just words, you could safely split the line on whitespace, like this

    $line =~ s/^\s+//; # strip leading whitespace from the line $line =~ s/\s+\z//; # strip trailing whitespace from the line my @aFields = split(/\s+/,$line); #extract words

    However, the above won't work if you have whitespace inside a field because it will split the field in two. If on some line of the file, the first column contains three words, then column 2 would end up in the fourth array element and you'd never know.

    What are the rules that actually govern the organization of this file into rows and fields? To know whether or not you need to use regex's you first need to know the file's rules. Regex's aren't always the best solution. thezip has pointed out that unpack would be a better way to split the line into fields if you are dealing with fixed width columns. (each column/field has known-in-advance number of characters.).

    On the other hand, if fields are separated from one another using a separator string or character, regex's are often good for splitting such lines into fields. However, you won't know what regex to use without knowing the format. Rules for separator delimited fields can be very simple or complex.

    A simple rule might be "a tab always means column separator". If that was your rule, you could just use split("\t",$line) to break up the line into fields.

    Or it could be more complicated - columns are separated by whitespace except where the whitespace is quoted or escaped. Or it could be even more complicated: the first character of each line determines the field separator for the rest of the line, plus there is an escaping/quoting mechanism. The possibilities are endless. It would be hard to advise you without knowing the intended rules of the file.

    Update: various clarifications and rewordings.

      Thanks! This simiplifies a lot of things for me. I will look into it.