Need regex to filter out unwanted rows

ps2931 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!

I'm trying to filter out rows from a large text file based on the following criteria:

The allowed characters are: aA-zA, underscore (_), colon (:), dot (.), forward slash (/), comma (,), hyphen (-), digits (0-9) and double quotes (" ").

Any character other than those listed above is invalid. I want to print each line that fails the criteria. A sample line from the test file looks like this:

2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=", "1","ds_uid","CWER1Y1mHZIAQAA8diwRHfuwrM=","2012-10-14 18:41:44.429","2012-10-14 18:41:44.572","1975-10-10 00:00:00.000","7307 mg rd","","naasik","NK","44026","IN","4406359999","","","","DEFAULT","","","AABBCCXX","","Qqwwee<feff>","","qqwwee@yahoo.com","0","YOPANEL","","false","en","","","","","","","","","","","","","","","","",""

The above line is invalid since it contains the '<' symbol. Can anyone help me?

/ps2931


Replies are listed 'Best First'.

Re: Need regex to filter out unwanted rows
by Athanasius (Archbishop) on Sep 15, 2014 at 12:39 UTC

Hello ps2931,

In the spirit of TMTOWTDI, here’s a variation on AnomalousMonk’s solution, using tr/// instead of m//:

#! perl
use strict;
use warnings;

while (<DATA>)
{
    chomp;
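    # Count the characters NOT in the allowed set (the /c flag complements it);
    # with an empty replacement list, tr/// only counts and leaves $_ unchanged.
    my $invalid = tr{- _,:./"0-9a-zA-Z}{}c;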
    print "Found $invalid invalid characters in:\n$_\n" if $invalid;
}

__DATA__
This line is OK.

2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=", "1","ds_uid","CWER1Y1mHZIAQAA8diwRHfuwrM=","2012-10-14 18:41:44.429","2012-10-14 18:41:44.572","1975-10-10 00:00:00.000","7307 mg rd","","naasik","NK","44026","IN","4406359999","","","","DEFAULT","","","AABBCCXX","","Qqwwee<feff>","","qqwwee@yahoo.com","0","YOPANEL","","false","en","","","","","","","","","","","","","","","","",""

This line is OK too: abc123-_,:./"

Output:

22:35 >perl 1012_SoPW.pl
Found 6 invalid characters in:
2749 "CQWERC20F+XZIAQAAAQjLiDI9sNc=", "1","ds_uid","CWER1Y1mHZIAQAA8diwRHfuwrM=","2012-10-14 18:41:44.429","2012-10-14 18:41:44.572","1975-10-10 00:00:00.000","7307 mg rd","","naasik","NK","44026","IN","4406359999","","","","DEFAULT","","","AABBCCXX","","Qqwwee<feff>","","qqwwee@yahoo.com","0","YOPANEL","","false","en","","","","","","","","","","","","","","","","",""

22:35 >

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Need regex to filter out unwanted rows

by AnomalousMonk (Archbishop) on Sep 15, 2014 at 12:59 UTC

tr/// will be faster than m// for large masses of input, hence IMHO preferable.
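
One way to check the speed claim on your own data is the core Benchmark module. The following is only a rough sketch: the sub names count_tr and match_re and the synthetic input line are made up for illustration, and the result will depend on the Perl version, the line length, and where the first invalid character sits (m// can stop at the first hit, while tr/// always scans the whole string).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (an assumption, not the OP's data): a long valid row
# with a single invalid character at the very end.
my $line = ('"2012-10-14 18:41:44.429","naasik",' x 200) . '<';

# Count the characters outside the allowed set with tr///.
sub count_tr {
    my ($s) = @_;
    return ($s =~ tr{- _,:./"0-9a-zA-Z}{}c);
}

# Report whether any character outside the allowed set exists, using m//.
sub match_re {
    my ($s) = @_;
    return $s =~ m{[^- _,:./"0-9a-zA-Z]} ? 1 : 0;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    tr_count => sub { count_tr($line) },
    m_match  => sub { match_re($line) },
});
```

Both subs carry the same call overhead, so the comparison table mostly reflects the difference between the two operators on this particular input.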


Re: Need regex to filter out unwanted rows
by AnomalousMonk (Archbishop) on Sep 15, 2014 at 11:44 UTC

I'm assuming that "aA-zA" from your OP is intended to represent upper/lowercase alpha characters; these are combined with decimal digits in the [[:alnum:]] character class (update: see POSIX Character Classes in perlrecharclass). Try something like (untested):
print $line if $line =~ m{ [^-_,:./"[:alnum:]] }xms;
(Note that I'm assuming the newline has already been removed from $line. If not, just adding \n to the character class should cover it.)

Update: Upon further examination, your example input seems to include some kind of space character(s). If so, just add \s (any whitespace) or [ \t] (just space and tab) or whatever is appropriate to the char class in my original reply. (Note that the [] square brackets in [ \t] should not be added: I just use these to highlight the presence of a space in that character set.)
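
Putting the original suggestion and the update together, a minimal sketch might look like the following. It assumes a plain space is the only whitespace to allow and that the newline has already been removed; the helper name has_invalid_char and the two sample rows are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $line contains any character outside
# the allowed set (letters, digits, space, and - _ , : . / "), else 0.
sub has_invalid_char {
    my ($line) = @_;
    return $line =~ m{[^-_,:./" [:alnum:]]} ? 1 : 0;
}

# Made-up sample rows: the first is valid, the second is not.
my @rows = (
    '2749 "abc","2012-10-14 18:41:44.429","a/b_c"',
    '"Qqwwee<feff>","qqwwee@yahoo.com"',
);

# Print only the rows that fail the criteria.
for my $row (@rows) {
    print "$row\n" if has_invalid_char($row);
}
```

Note that [:alnum:] can also match non-ASCII letters and digits when the input has been decoded as Unicode; if only ASCII should pass, adding the /a modifier (Perl 5.14 or later) or spelling out a-zA-Z0-9 restricts the class accordingly.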
