Renyulb28 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need help coding a script that excludes an entire row of a text file if the absolute difference between the numerical values in two columns is greater than or equal to a certain number. For example:

    name 10 9

If the absolute value of column 2 (10) minus column 3 (9) is greater than or equal to 1, then remove the entire row. I am a Perl newbie. I know that to remove a row containing a certain string I can use

    grep -v "string" file > newfile

but I have no idea how to add in numerical functions such as this. Thanks for any help.

Replies are listed 'Best First'.
Re: Remove row if the absolute difference between two columns is greater than a threshold
by TomDLux (Vicar) on Feb 15, 2011 at 17:08 UTC

    grep -v is an acceptable Unix command line method to exclude a line, but it does not qualify as a Perl solution ... at least not as a GOOD Perl solution. Besides which, each run would drop only the lines matching one string. Imagine a worst case in which you wind up excluding every line in a 1000 line file ... You would have to copy the file, sans one line, 1000 times.

    What you want is to go through the file, line by line, using open(), while(), and close(), test each line, and if acceptable, copy it to the output file. That means only one copying of the file, whether you drop 0 lines or a million.
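The single pass described above might look something like this. It is only a sketch: filter_file and its parameter names are made up for illustration, and it assumes whitespace-separated columns with numbers in columns 2 and 3.

```perl
use strict;
use warnings;

# Copy $in_path to $out_path in one pass, skipping every row where
# |col2 - col3| >= $threshold. filter_file is a hypothetical name.
sub filter_file {
    my ($in_path, $out_path, $threshold) = @_;

    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";

    my $kept = 0;
    while (my $line = <$in>) {
        my @cols = split ' ', $line;      # whitespace-separated columns
        next if @cols < 3;                # assumption: drop short or blank rows
        next if abs($cols[1] - $cols[2]) >= $threshold;
        print {$out} $line;               # row passes: copy it unchanged
        $kept++;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return $kept;                         # number of rows written
}
```

Calling filter_file('infile', 'outfile', 1) would read the input once and write the surviving rows once, however many rows are dropped.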

    You say "the absolute value of column 2 minus column 2 ( I guess you mean column 3 ) is greater than or equal to 1". Except for the absolute value bit, I would test for $col2 > $col3. But it's significantly different whether you mean abs( $col2 ) > abs( $col3 ) or whether you mean abs( $col2 - $col3 ).
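To see why the two readings matter, here is a small sketch that applies both to the same values. The sub names and the sample pairs are invented for illustration.

```perl
use strict;
use warnings;

# Reading 1: compare the magnitudes of the two columns.
sub magnitude_greater {
    my ($col2, $col3) = @_;
    return abs($col2) > abs($col3) ? 1 : 0;
}

# Reading 2: test the magnitude of the difference against a threshold.
sub difference_at_least {
    my ($col2, $col3, $threshold) = @_;
    return abs($col2 - $col3) >= $threshold ? 1 : 0;
}

# Sample pairs invented for illustration.
for my $pair ([10, 9], [-5, 5], [3, 3.5]) {
    my ($col2, $col3) = @$pair;
    printf "%-4s %-4s  abs(c2) > abs(c3): %d   abs(c2 - c3) >= 1: %d\n",
        $col2, $col3,
        magnitude_greater($col2, $col3),
        difference_at_least($col2, $col3, 1);
}
```

For the pair -5 and 5 the two tests disagree: the magnitudes are equal, so the first is false, while the difference is 10, so the second is true.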

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Thank you for the reply. I do mean abs(column 2 - column 3). I would like the script to remove those rows in which that absolute value is greater than or equal to 1.
Re: Remove row if the absolute difference between two columns is greater than a threshold
by fidesachates (Monk) on Feb 15, 2011 at 17:30 UTC
    The poster above has given you a very good logic flow and design for your program. I'll provide a little more on the functions you might want to use.

    open();    # look up the proper syntax for using the open function to open the file
    while (<FILEHANDLE>) {
        my $line = $_;    # I always prefer to copy $_ into an actual named
                          # variable. Personal preference. Some other monk please
                          # correct me if there is a best practice for this.
    }
    With the variable $line, you will want to look at the split() function. This will help you separate out the columns in each line. Also take a look at chomp if one of the columns is at the end of the line. Once you have the columns, abs will give you the absolute value of their difference. Finally, if the line meets your criteria, just print the variable $line. Afterwards, just run your program and redirect the output to the text file of your choice.
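As a rough illustration of how chomp, split and abs fit together on a single line, consider the sketch below. keep_line is a hypothetical helper, not something from this thread, and it assumes whitespace-separated columns with numeric values in columns 2 and 3.

```perl
use strict;
use warnings;

# Return 1 if the row should be kept, i.e. |col2 - col3| < $threshold.
# keep_line is a hypothetical helper name.
sub keep_line {
    my ($line, $threshold) = @_;
    chomp $line;                      # drop the trailing newline, if any
    my @cols = split ' ', $line;      # ' ' splits on runs of whitespace
    return 0 if @cols < 3;            # assumption: rows with fewer than 3 columns are dropped
    return abs($cols[1] - $cols[2]) < $threshold ? 1 : 0;
}

# Sample rows invented for illustration.
for my $line ("name 10 9\n", "other 4.2 4.9\n", "same 7 7\n") {
    print $line if keep_line($line, 1);
}
```

Inside the while loop from the skeleton above, the same test would read: print $line if keep_line($line, 1);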

    Happy coding!


    N.B. the code I posted has not been tested and is thus prone to typos.
Re: Remove row if the absolute difference between two columns is greater than a threshold
by BrowserUk (Patriarch) on Feb 15, 2011 at 18:57 UTC

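    This should do it. See perlrun for the details of the switches: -n loops over the input lines, -a splits each line on whitespace into @F, and -l strips the trailing newline and adds it back on print. The double quotes suit a Windows shell; on a Unix shell use single quotes so that $F is not interpolated:

    perl -anle"abs($F[1]-$F[2]) < 1 and print" infile > outfile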

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Remove row if the absolute difference between two columns is greater than a threshold
by ack (Deacon) on Feb 15, 2011 at 17:48 UTC

    Here's a short script that I think does what you're after. I didn't spend any time optimizing it; it is only meant to show a quick and dirty strategy.

    It uses Perl references to create 2 dimensional matrices and the arrow notation to simplify and clarify what is going on. The subroutine, printMatrix(), is just for convenience so that you can better see what the 'before' and 'after' situation looks like.
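A minimal sketch of that kind of approach follows. It is not the script from this reply: the sample data, the filterMatrix helper and the body of printMatrix are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Rows held as a reference to an array of array references (a 2-D matrix).
# Sample data invented for illustration.
my $before = [
    [ 'alpha', 10, 9   ],
    [ 'beta',  5,  5.5 ],
    [ 'gamma', 3,  3   ],
];

# Return a new matrix holding only the rows whose columns 2 and 3
# differ by less than $threshold. filterMatrix is a hypothetical name.
sub filterMatrix {
    my ($matrix, $threshold) = @_;
    my @kept;
    for my $row (@$matrix) {
        push @kept, $row if abs($row->[1] - $row->[2]) < $threshold;
    }
    return \@kept;
}

# Print a label followed by one tab-separated matrix row per line.
sub printMatrix {
    my ($label, $matrix) = @_;
    print "$label\n";
    print join("\t", @$_), "\n" for @$matrix;
}

my $after = filterMatrix($before, 1);
printMatrix('Before:', $before);
printMatrix('After:',  $after);
```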

    The output from the little script is:

    Good luck; welcome to Perl.

    ack Albuquerque, NM
Re: Remove row if the absolute difference between two columns is greater than a threshold
by suhailck (Friar) on Feb 16, 2011 at 07:05 UTC
    perl -lane 'print unless abs($F[1] - $F[2]) >= 1' infile > outfile
Re: Remove row if the absolute difference between two columns is greater than a threshold
by locked_user sundialsvc4 (Abbot) on Feb 15, 2011 at 18:36 UTC

    There is, in fact, a grep function.

    See:

    • perldoc perlfunc
    • perldoc -f grep
    • perldoc -f map

    Incidentally, since lists usually contain references to the things that they “contain,” I often design filtering routines so that they scan through the input list, selecting what they want to keep and pushing those onto an output list, which is then returned. Since we’re only moving references around, we aren’t burning up memory. And the process is non-destructive: at the end of the day, we have the output list but the input list hasn’t actually been touched. We can now, if we choose, discard the one and keep the other, or we can keep both.
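Here is one way that could look, as a sketch only. It assumes the rows have already been read into a list of array references; the sample rows are invented for illustration.

```perl
use strict;
use warnings;

# Sample rows invented for illustration; each element is an array reference.
my @rows = ( [ 'alpha', 10, 9 ], [ 'beta', 5, 5.5 ], [ 'gamma', 3, 3 ] );

# grep returns the elements for which the block is true; @rows is not modified.
my @kept = grep { abs($_->[1] - $_->[2]) < 1 } @rows;

# The same selection written as an explicit scan-and-push loop.
my @kept_loop;
for my $row (@rows) {
    push @kept_loop, $row if abs($row->[1] - $row->[2]) < 1;
}

print join(' ', @$_), "\n" for @kept;
```

Both @kept and @kept_loop hold the same references that are still in @rows, so nothing is copied and the input list stays intact.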