Regular Expression Problem

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expression Problem by Limbic~Region (Chancellor) on Dec 14, 2008 at 00:07 UTC
Anonymous Monk, parsing CSV (where C stands for character) using split or regexes can be fragile. Perhaps you don't have to worry about embedded delimiters or multi-line records, but it doesn't take much more effort to future-proof your code by handling them now. (untested) `#!/usr/bin/perl use strict; use warnings; use Text::CSV; my $file = $ARGV[0] or die "Usage: $0 <input_file>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!"; my $csv = Text::CSV->new({sep_char => "\t", binary => 1}); $csv->column_names($csv->getline($fh)); while (my $hr = $csv->getline_hr($fh)) { print "Amino acid is: $hr->{SEQ} and Prediction is: $hr->{PRD}\n"; }` Cheers - L~R
Re: Regular Expression Problem by gone2015 (Deacon) on Dec 14, 2008 at 00:19 UTC
So, `$csv_line =~ /(\d)\t(\d)\t(\d.d{3})\t(\d.d{3})\(\d.d{3})/` looks plausible. Except, '`\d`' matches a digit. It won't match an alphabetic character. '`\w`' will match alphanumeric (and `'_'`), or you can be more precise with `[A-Z]` type character classes. You don't anchor the start of the regex to the start of the string, so this will happily match to something deep inside the input string, if it doesn't match at the start. (If you do decide to anchor the regex, look out for the space or whatever it is before the first character.) You're matching the rest of the line quite carefully and precisely -- if the numbers are not of the exact form given, you will miss lines. You might want to think about some diagnostic messages for lines that are not matched, so you don't miss stuff without knowing about it. I couldn't see what the '`+`' in `/SEQ+/` was for. What it will do is match '`SEQ`' or '`SEQQ`' or '`SEQQQ`' etc. Again this is not anchored.
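The anchoring and diagnostics points can be sketched roughly as follows. The sample lines, the `classify` helper and the assumed layout (two single letters followed by three `d.ddd` numbers, tab separated) are illustrative guesses, not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the real file layout is an assumption.
my @lines = (
    "SEQ\tPRD\tH\tE\tC",
    "M\tC\t0.012\t0.034\t0.954",
    "K\tH\t0.9\t0.05\t0.05",    # numbers are not in d.ddd form
);

# Classify one line as 'header', 'data' or 'unmatched'.
# Both patterns are anchored, so they cannot match deep inside the line.
sub classify {
    my ($line) = @_;
    return 'header' if $line =~ /^SEQ\b/;
    return 'data'   if $line =~ /^\s*[A-Z]\t[A-Z](?:\t\d\.\d{3}){3}\s*$/;
    return 'unmatched';
}

for my $i (0 .. $#lines) {
    my $kind = classify($lines[$i]);
    # Report anything that fell through, so skipped lines are not silent.
    warn "line ", $i + 1, " not recognised: $lines[$i]\n"
        if $kind eq 'unmatched';
}
```

The `\s*` after `^` allows for the optional leading space mentioned above; drop it if the data never has one.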
Re: Regular Expression Problem by johngg (Canon) on Dec 14, 2008 at 00:22 UTC
Your first problem is that you are looking for two single letters but using `\d` in your pattern each time, which means a digit; use a character class of `[A-Z]` (or `[A-Za-z]` if you expect lower case as well). Secondly, you are using parentheses to capture your three floating point numbers but you don't seem to be using the captures afterwards. Thirdly, a dot is a regular expression metacharacter matching any character (with caveats, see perlretut and perlre) so you need to escape it to match a literal dot. Fourth, using `\t` is fine if you are absolutely certain that you will only ever have a single tab as a separator; `\s+` for one or more whitespace characters is more robust. Fifth, `/SEQ+/` means an 'S', an 'E' then one or more 'Q's and if the 'SEQ' should be at the beginning of the line the pattern should be anchored with a caret; so `/^SEQ.+/` might be better. You are likely to have far more data lines than header lines so it will save CPU ticks to put that condition first. `... if( $csv_line =~ /([A-Z])\s+([A-Z])(?:\s+\d\.\d{3}){3}/ ) { ... } elsif( $csv_line =~ /^SEQ.+/ ) { ... } else { # What do you do with a line that matches neither pattern? } ...` I hope these points are useful. Cheers, JohnGG
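A small sketch of the third and fourth points, using made-up test strings: the unescaped dot accepts any character in that position, and `\t` rejects a separator that `\s+` accepts. The variable names are hypothetical.

```perl
use strict;
use warnings;

# An unescaped dot matches any character except newline.
my $loose  = qr/^\d.\d{3}$/;
my $strict = qr/^\d\.\d{3}$/;

# \t matches exactly one tab; \s+ matches any run of whitespace.
my $one_tab    = qr/^([A-Z])\t([A-Z])$/;
my $whitespace = qr/^([A-Z])\s+([A-Z])$/;

for my $s ("0.123", "0x123") {
    printf "%s: loose=%d strict=%d\n",
        $s, ($s =~ $loose ? 1 : 0), ($s =~ $strict ? 1 : 0);
}

# Second string uses space, tab, space as the separator.
for my $s ("M\tC", "M \t C") {
    printf "one_tab=%d whitespace=%d\n",
        ($s =~ $one_tab ? 1 : 0), ($s =~ $whitespace ? 1 : 0);
}
```

Note that `\s+` would also accept a plain space as a separator, which may or may not be wanted for strictly tab-delimited data.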
Re^2: Regular Expression Problem by ikegami (Patriarch) on Dec 14, 2008 at 04:41 UTC
`/^SEQ.+/` is the same as `/^SEQ./`, and it's nearly equivalent to `/^SEQ/`. The last option is surely the best of the three in this case. (The equivalence holds except in the unlikely case that any of `@-`/`@+`/`$&`/`$'` are used.)
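To see the comparison concretely, here is a quick sketch over a few made-up strings. The helpers `verdicts` and `matched_text` are hypothetical names; `matched_text` uses `@-` and `@+` to show where the matched span differs.

```perl
use strict;
use warnings;

my @patterns = (qr/^SEQ.+/, qr/^SEQ./, qr/^SEQ/);

# One character per pattern: 1 if the string matches, 0 otherwise.
sub verdicts {
    my ($s) = @_;
    return join '', map { $s =~ $_ ? 1 : 0 } @patterns;
}

# The text actually consumed by the match, via @- and @+.
sub matched_text {
    my ($s, $re) = @_;
    return undef unless $s =~ $re;
    return substr($s, $-[0], $+[0] - $-[0]);
}

for my $s ("SEQ\tPRD", "SEQ", "xSEQ") {
    print verdicts($s), "\n";
}
```

The first two patterns always agree on success or failure; they differ from the third only when nothing follows `SEQ` on the line, and from each other only in how much text the match covers.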
Re^2: Regular Expression Problem by Anonymous Monk on Dec 14, 2008 at 00:46 UTC
THANK YOU ALL VERY MUCH FOR YOUR COMMENTS! In my ignorance I assumed the '+' could function as a wildcard (i.e., read SEQ and everything following it!). Much obliged, InfoSeeker
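For anyone with the same assumption, a minimal sketch of the difference: `+` only repeats the item immediately before it, while "everything following" is usually written `.*`. The input string is made up for illustration.

```perl
use strict;
use warnings;

my $line = "SEQQQ\tPRD\tH";    # hypothetical input

# '+' repeats only the preceding atom, here the letter Q.
my ($plus) = $line =~ /(SEQ+)/;

# '.*' means any character except newline, zero or more times.
my ($dot_star) = $line =~ /(SEQ.*)/;

print "SEQ+  captured: $plus\n";
print "SEQ.* captured: $dot_star\n";
```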
Re: Regular Expression Problem by graff (Chancellor) on Dec 14, 2008 at 04:20 UTC
For tab-delimited table data (which I would call "TSV"), it is most often the case that you don't really need to worry about the separator character (tab) being embedded as data within one of the fields (requiring that the field be "quoted" in some way to protect the field-data-internal tabs from being misconstrued as separator characters). Your data seems to fall easily into the common case. And in that case, I prefer using split: `my @fields = split /\t/, $tsv_line; if ( $fields[0] eq 'SEQ' ) { print "Header of Predictions\n"; } elsif ( join("", @fields[0,1]) =~ /^\w\w$/) { push @ProteinSeq, $fields[0]; push @Prediction, $fields[1]; printf "Amino acid is: %s and Prediction is: %s\n", @fields[0,1]; }`
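One thing to keep in mind with the split approach: by default `split` drops trailing empty fields, which matters if the last columns of a record can be blank. A short sketch with a made-up record; passing `-1` as the limit keeps them.

```perl
use strict;
use warnings;

# Hypothetical record whose last two fields are empty.
my $line = "M\tC\t\t";

my @default  = split /\t/, $line;        # trailing empty fields removed
my @keep_all = split /\t/, $line, -1;    # negative limit keeps them

print scalar(@default), " fields by default, ",
      scalar(@keep_all), " with a limit of -1\n";
```

If every record is expected to have the same number of columns, comparing the field count against that number is a cheap sanity check.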
Re: Regular Expression Problem by Lawliet (Curate) on Dec 14, 2008 at 00:20 UTC
First of all, your first regex, `/SEQ+/`, matches: SEQ, SEQQ, SEQQQ, SEQQQQ etc. Is that what you want? In your second regex, `\d` matches a digit, i.e. one of the characters 0-9. I think you want `\w` in order to match the word characters. `if ($csv_line =~ /SEQ/) { # If the line contains 'SEQ' anywhere in it this will be true print "header\n"; } elsif ($csv_line =~ /^(\w)\t(\w)\t/) { # This will match a word character at the beginning of a line followed by a tab, another word character, and another tab print "Amino Acid:$1\nPrediction:$2\n"; }` I'm so adjective, I verb nouns! chomp; # nom nom nom
Re: Regular Expression Problem by CountZero (Bishop) on Dec 14, 2008 at 16:42 UTC
Only in the simplest and most regular cases should one try to deal with CSV-type files using a simple regex or `split`. Contrary to what it looks like at first sight, this file format can be quite difficult to handle because of embedded field separators, escaped field or record separators, quoted strings, ... Therefore your first reaction should always be to use a well tried and tested module such as Text::CSV to split your data records into fields. Not only will it save you from making obvious and not so obvious errors, once you are used to its interface, it will make for fast and easy programming. CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
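A tiny illustration of the embedded-separator problem, using a made-up record in which the first field is quoted and contains a tab. A plain `split` knows nothing about the quotes and cuts the field in two.

```perl
use strict;
use warnings;

# Hypothetical record: logically two fields, "Smith<TAB>John" and 42.
my $record = qq{"Smith\tJohn"\t42};

# Naive split treats the tab inside the quotes as a separator too.
my @fields = split /\t/, $record;

print scalar(@fields), " pieces: ", join(' | ', @fields), "\n";
```

A quote-aware parser such as the Text::CSV module recommended above is designed to return the quoted field intact.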
Re: Regular Expression Problem by Anonymous Monk on Dec 15, 2008 at 06:08 UTC
Your regex should look more like: `elsif ($csv_line =~ /(\w)\t(\w)\t(\d\.\d{3})\t(\d\.\d{3})\t(\d\.\d{3})/)` The `\d` is for digits, not letters.
Re: Regular Expression Problem by Anonymous Monk on Dec 15, 2008 at 04:26 UTC
hi friend, try out this one `open(FILE,"perlmonks.txt") or die $!; while(<FILE>){ if(/^\s(\w)\s+(\w)/){ print "$1 ::: $2\n"; } }`