ps2931 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!

I'm trying to filter rows out of a large text file based on the following criterion:

The allowed characters are: aA-zA, underscore (_), colon (:), dot (.), forward slash (/), comma (,), hyphen (-), digits (0-9), and double quotes (" ").

Any character other than those listed above is invalid. I want to print every line that fails this check. A sample line from the test file looks something like this:

2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=", "1","ds_uid","CWER1Y1mHZIAQAA8diwRHfuwrM=","2012-10-14 18:41:44.429","2012-10-14 18:41:44.572","1975-10-10 00:00:00.000","7307 mg rd","","naasik","NK","44026","IN","4406359999","","","","DEFAULT","","","AABBCCXX","","Qqwwee<feff>","","qqwwee@yahoo.com","0","YOPANEL","","false","en","","","","","","","","","","","","","","","","",""

The above line is invalid because it contains the '<' symbol. Can anyone help me?

/ps2931

Re: Need regex to filter out unwanted rows
by Athanasius (Archbishop) on Sep 15, 2014 at 12:39 UTC

    Hello ps2931,

    In the spirit of TMTOWTDI, here’s a variation on AnomalousMonk’s solution, using tr/// instead of m//:

        #! perl
        use strict;
        use warnings;

        while (<DATA>) {
            chomp;
            my $invalid = tr{- _,:./"0-9a-zA-Z}{}c;
            print "Found $invalid invalid characters in:\n$_\n" if $invalid;
        }

        __DATA__
        This line is OK.
        2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=", "1","ds_uid","CWER1Y1mHZIAQAA8diwRHfuwrM=","2012-10-14 18:41:44.429","2012-10-14 18:41:44.572","1975-10-10 00:00:00.000","7307 mg rd","","naasik","NK","44026","IN","4406359999","","","","DEFAULT","","","AABBCCXX","","Qqwwee<feff>","","qqwwee@yahoo.com","0","YOPANEL","","false","en","","","","","","","","","","","","","","","","",""
        This line is OK too: abc123-_,:./"

    Output:

        22:35 >perl 1012_SoPW.pl
        Found 6 invalid characters in:
        2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=", "1","ds_uid","CWER1Y1mHZIAQAA8diwRHfuwrM=","2012-10-14 18:41:44.429","2012-10-14 18:41:44.572","1975-10-10 00:00:00.000","7307 mg rd","","naasik","NK","44026","IN","4406359999","","","","DEFAULT","","","AABBCCXX","","Qqwwee<feff>","","qqwwee@yahoo.com","0","YOPANEL","","false","en","","","","","","","","","","","","","","","","",""
        22:35 >
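
The count alone does not say which characters were rejected. One way to list them is to reuse the same character set in a negated regex class and match globally in list context. This is only a minimal sketch: the helper name invalid_chars and the shortened sample line are assumptions made for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): in list context, returns every
# character of $line that falls outside the allowed set used with tr{}{}c.
sub invalid_chars {
    my ($line) = @_;
    return $line =~ m{([^- _,:./"0-9a-zA-Z])}g;
}

# Shortened sample input, loosely based on the line in the question.
my @lines = (
    'This line is OK.',
    '2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=","Qqwwee<feff>","qqwwee@yahoo.com"',
);

for my $line (@lines) {
    my @bad = invalid_chars($line);
    next unless @bad;    # nothing to report for valid lines
    print "Found ", scalar(@bad), " invalid character(s) [@bad] in:\n$line\n";
}
```

For a plain yes/no filter the tr{}{}c count is enough; the list form is mainly useful while working out why a line was rejected.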

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      tr/// will be faster than m// for large amounts of input, and is hence IMHO preferable.
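
The speed difference is easy to measure on your own data with the core Benchmark module. The sketch below is an assumption-laden illustration: the input is synthetic (1000 generated lines, every tenth one containing a '<'), and the sub names count_bad_tr and count_bad_m are made up for the example. Actual numbers will depend on the perl version and on what the real file looks like.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (an assumption, not the poster's data): mostly valid
# lines, with one invalid character appended to every tenth line.
my @lines = map {
    my $line = join ',', ('"abc-123_x:/y.z"') x 20;
    $line .= '<' if $_ % 10 == 0;
    $line;
} 1 .. 1000;

# Count lines containing at least one invalid character, using tr///c.
sub count_bad_tr {
    my $bad = 0;
    for (@lines) {
        $bad++ if tr{- _,:./"0-9a-zA-Z}{}c;
    }
    return $bad;
}

# The same count, using a negated character class with m//.
sub count_bad_m {
    my $bad = 0;
    for (@lines) {
        $bad++ if m{[^- _,:./"0-9a-zA-Z]};
    }
    return $bad;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { 'tr///c' => \&count_bad_tr, 'm//' => \&count_bad_m });
```

Both subs should report the same number of bad lines; only the timing table differs.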

Re: Need regex to filter out unwanted rows
by AnomalousMonk (Archbishop) on Sep 15, 2014 at 11:44 UTC

    I'm assuming that "aA-zA" from your OP is intended to represent upper/lowercase alpha characters; these are combined with decimal digits in the [[:alnum:]] character class (update: see POSIX Character Classes in perlrecharclass). Try something like (untested):
        print $line if $line =~ m{ [^-_,:./"[:alnum:]] }xms;
    (Note that I'm assuming the newline has already been removed from $line. If not, adding \n to the character class should cover it.)

    Update: Upon further examination, your example input seems to include some kind of space character(s). If so, just add \s (any whitespace) or [ \t] (just space and tab), or whatever is appropriate, to the char class in my original reply. (Note that the [] square brackets in [ \t] should not be added: I use them only to highlight the presence of a space in that character set.)
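
Putting the original pattern and the update together, a minimal sketch might look like the following. The helper name is_invalid and the two sample lines are hypothetical, and the choice of space plus tab (rather than \s) is just one of the options mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $line contains any character outside
# the allowed set, 0 otherwise. The literal space inside the bracketed
# class still counts under /x, which only ignores whitespace outside
# bracketed character classes.
sub is_invalid {
    my ($line) = @_;
    return $line =~ m{ [^-_,:./" \t[:alnum:]] }xms ? 1 : 0;
}

# Made-up sample lines; the second is a shortened piece of the OP's data.
my @lines = (
    'abc123-_,:./" with spaces',
    '"Qqwwee<feff>","qqwwee@yahoo.com"',
);

for my $line (@lines) {
    print "$line\n" if is_invalid($line);    # print only the failing lines
}
```

If the input is decoded as Unicode text, [[:alnum:]] can also match non-ASCII letters and digits. Adding the /a modifier (available from Perl 5.14) restricts it to ASCII, which is probably closer to what the question asks for.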