Peter Keystrokes has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am trying to write a script that filters out irrelevant information.

My file contains information in this format:

>hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 FORWARD

-4.4.. 6 .. 17 xxxxxxxxxxGTGAC CAGT ATGC ACTG AAGATGAGGTTTGTG

-0.9.. 5 .. 18 xxxxxxxxxxxGTGA CCAGT ATGC ACTGA AGATGAGGTTTGTGG

None.. 1 .. 20 xxxxxxxxxxxxxxx GTGACCAGTATGCACTGAAG ATGAGGTTTGTGGAC

I am trying to filter out all the lines beginning with 'None'. The tricky part, which I don't know how to approach, is filtering out the lines that begin with a value greater than the one the user has entered on <STDIN>. Even if I capture the first 4 characters with a regex, the captured value does not directly correspond to a numerical value, so I can't compare it to the <STDIN> value for filtering purposes. Is there another way I can go about doing this?

Re: Easiest way to filter a file based on user input
by haukex (Archbishop) on Jul 07, 2017 at 10:51 UTC

    Sorry, but I don't understand your description. Could you show a few examples of what the user will be entering on the command line, and for each sample input show which lines should be filtered and which shouldn't?

      So for example, if the user enters -3, all the lines in the file that begin with a numerical value that is greater than 3 will be excluded.

      Or if the user enters the value 1, all lines beginning with a value greater than 1 will be excluded.

        So for example, if the user enters -3, all the lines in the file that begin with a numerical value that is greater than 3 will be excluded. Or if the user enters the value 1, all lines beginning with a value greater than 1 will be excluded.

        So based on that description, the user entering -3 is the same as entering 3?

        #!/usr/bin/env perl
        use strict;
        use warnings;

        print "Enter limit: ";
        chomp( my $limit = <STDIN> );
        $limit = abs($limit);
        open my $in, '<', "file.hairpin" or die $!;
        open my $sifted, '>', "new_file.hairpin" or die $!;
        while (<$in>){
            next if /^None/;
            next if /^(\d+)/ && $1 > $limit;
            print $sifted $_;
        }
        close $in;
        close $sifted;

        Or as a oneliner (where "123" is the limit):

        perl -ne 'print unless /^None/ || ( /^(\d+)/ && $1>123 )' file.hairpin >new_file.hairpin

        As for your code here, it looks like you don't need to collect your lines in arrays but can write them to the output file directly (or, at the very least you don't need to open your output file once per line of output).

        Update: I just noticed that the sample input in the OP includes decimals and negative numbers, so you'd have to adjust the regex in my example code above accordingly. But before you try to develop really complex regexes, have a look at Regexp::Common::number.
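
        Following up on the update, here is a minimal sketch of how the regex might be adjusted for the decimals and negative numbers in the OP's sample. It uses a hand-rolled pattern, not Regexp::Common::number. The sub name keep_line and the hard-coded limit are made up for illustration, and it assumes that the limit is compared as entered (no abs()) and that lines scoring above it are dropped.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns true if the line should be kept.
# Assumes scores look like -4.4, 0.9 or 12 at the very start of the line.
sub keep_line {
    my ($line, $limit) = @_;
    return 0 if $line =~ /^None/;                # always drop "None" lines
    if ( $line =~ /^([+-]?\d+(?:\.\d+)?)/ ) {    # optional sign, digits, optional fraction
        return $1 <= $limit ? 1 : 0;             # drop scores greater than the limit
    }
    return 1;                                    # headers and anything else pass through
}

my $limit  = -3;    # stand-in for the chomp()ed value read from STDIN
my @sample = (
    '>hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 FORWARD',
    '-4.4.. 6 .. 17 xxxxxxxxxxGTGAC CAGT ATGC ACTG AAGATGAGGTTTGTG',
    '-0.9.. 5 .. 18 xxxxxxxxxxxGTGA CCAGT ATGC ACTGA AGATGAGGTTTGTGG',
    'None.. 1 .. 20 xxxxxxxxxxxxxxx GTGACCAGTATGCACTGAAG ATGAGGTTTGTGGAC',
);
print "$_\n" for grep { keep_line($_, $limit) } @sample;
```

        With a limit of -3 this keeps the header line and the -4.4 line; the -0.9 and None lines are dropped. If the absolute-value reading of the OP's description is the intended one, the comparison would need to change accordingly.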

Re: Easiest way to filter a file based on user input
by 1nickt (Canon) on Jul 07, 2017 at 10:49 UTC

    Hi, please show the code you have tried, reduced to an SSCCE.


    The way forward always starts with a minimal test.
      My sincerest apologies for the amateurish code you're about to see...
      #!/usr/bin/perl
      use strict;
      use warnings;

      print "The lower the score the more stable the structure.", "\n",
            "Please set a limiting value e.g. -3: ", "\n";
      my $value = <STDIN>;
      open IN, "file.hairpin", or die $!;
      my @trash;
      my @treasure;
      while (<IN>){
          if ($_ =~ /^>+/){
              push @treasure, $_;
          }elsif($_ =~ /^None+/){
              push @trash, $_;
          }elsif($_ =~ /(^d+)/){
              ## Here I don't know how to incorporate the value I get from the user with the value
              ## in the file
          }else{
              push @treasure, $_;
          }
      }
      close IN;
      foreach my $stuff (@treasure){
          open SIFTED, '>>', "new_file.hairpin", or die $!;
          print SIFTED, $stuff."\n";
          close SIFTED;
      }

        Hi, thanks for posting your code.

        Here is a version that appears to do what you want. Note the following things:

        • You need to chomp() the user input to remove the newline so you can use the string in a comparison.
        • The sample data you provided doesn't contain anything that would match your first regexp.
        • The regexp for matching the start of the line with the user input crudely and *only* matches negative numbers with exactly one integer and one decimal place. You'll need to change it if the user could enter a positive number, or a negative integer, or anything else.
        • After capturing the match it is available in the special variable $1, which is used for the comparison.
        • I placed your sample data in the script in the __DATA__ section for this demo; it's fine to open and read a file as in your original. I also skipped the writing to an out file.
        • I placed multiple "debug statements" in the code, i.e. printing out things to show what's going on. Once the program is working correctly you can remove those, but it's a good technique for discovering problems in your data processing.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature 'say';

        print "The lower the score the more stable the structure.", "\n",
              "Please set a limiting value e.g. -3: ", "\n";
        chomp( my $value = <STDIN> );
        chomp( my @input = <DATA> );
        my @trash;
        my @treasure;

        for ( @input ){
            if ( /^>+/ ) {
                say "$_ matches '/^>+/'";
                push @treasure, $_;
            } elsif ( /^None/ ) {
                say "$_ matches '/^None/'";
                push @trash, $_;
            } elsif( /(^[\d\.-]{4})/ ) {
                say "$_ matches '/(^[\d\.-]{4})/'";
                if ( $1 <= $value ) {
                    say "$1 is <= $value";
                    push @treasure, $_;
                } else {
                    say "$1 is > $value";
                    push @trash, $_;
                }
            } else {
                say "$_ doesn't match anything!";
                push @trash, $_;
            }
        }

        say 'Treasure:';
        foreach my $stuff ( @treasure ) {
            say $stuff;
        }

        __END__
        hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 FORWARD
        -4.4.. 6 .. 17 xxxxxxxxxxGTGAC CAGT ATGC ACTG AAGATGAGGTTTGTG
        -0.9.. 5 .. 18 xxxxxxxxxxxGTGA CCAGT ATGC ACTGA AGATGAGGTTTGTGG
        None.. 1 .. 20 xxxxxxxxxxxxxxx GTGACCAGTATGCACTGAAG ATGAGGTTTGTGGAC
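
        Since the regexp above only covers a narrow shape of number, one possible alternative is to skip the fixed-width capture: take whatever precedes the first ".." on the line and let Scalar::Util's looks_like_number decide whether it is numeric. This is only a sketch. The sub name leading_score is invented here, the limit is hard-coded instead of read from STDIN, and it assumes every score is followed by ".." as in the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: return the numeric score at the start of a line,
# or undef when the first field is not a number (headers, "None" lines).
sub leading_score {
    my ($line) = @_;
    my ($field) = split /\.\./, $line, 2;    # text before the first ".."
    return undef unless defined $field;      # e.g. an empty line
    return looks_like_number($field) ? $field + 0 : undef;
}

my $limit  = -3;    # stand-in for the chomp()ed user input
my @sample = (
    '>hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 FORWARD',
    '-4.4.. 6 .. 17 xxxxxxxxxxGTGAC CAGT ATGC ACTG AAGATGAGGTTTGTG',
    '-0.9.. 5 .. 18 xxxxxxxxxxxGTGA CCAGT ATGC ACTGA AGATGAGGTTTGTGG',
    'None.. 1 .. 20 xxxxxxxxxxxxxxx GTGACCAGTATGCACTGAAG ATGAGGTTTGTGGAC',
);

for my $line (@sample) {
    next if $line =~ /^None/;                     # drop "None" lines
    my $score = leading_score($line);
    next if defined $score && $score > $limit;    # drop scores above the limit
    print "$line\n";                              # headers and low scores survive
}
```

        This handles positive numbers, integers and multi-digit scores without changing the pattern, at the cost of depending on the ".." separator being present.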

        Hope this helps!

