REGEX Frustration

RobertJ has asked for the wisdom of the Perl Monks concerning the following question:

I have a small database of the form

date TAB description TAB amount RETURN

The description can contain letters, numbers and various characters including . & - # etc.

date is of format m/d/y and amount is of the format $dd.dd

I constructed the REGEX to strip out all but the amounts

[\t\/\d\w\ \-\#\*\&]*(\$[\d]*\.[\d]*)\r

and then save in a column

\1\r

to do this. It all works fine leaving only the amounts as a column. The problem occurs if there is a negative amount of the form -$dd.dd. I have been trying to use conditionals and alternations for about two hours with no luck. Would really like some enlightenment.

Thank you

I thought I posted this; however, can't see it. Excuse if I double posted.

For some reason I couldn't post this in Firefox, had to use Safari?

PROBLEM SOLVED

[\t\/\d\w\ \-\#\*\&]*\t(\-?)(\$[\d]*\.[\d]*)\r

\1\2\r

Still can't figure out why I can't post with Firefox V9

Re: REGEX Frustration by GrandFather (Saint) on Dec 22, 2011 at 21:14 UTC
What you describe is a CSV (character separated value) file, and there is a plethora of modules available to handle that format for you. A good starting point is Text::CSV. Using such a module your problem turns into:

```perl
use strict;
use warnings;
use Text::CSV;

my $dataStr = <<DATA;
1/1/1923\tFirst entry\t-23.45
2/2/1924\t"Second entry with a long multi-line description.
This is parsed fine using Text::CSV, but I bet your regex chokes."\t23.23
DATA

my $csv = Text::CSV->new({sep_char => "\t", binary => 1});
open my $dataIn, '<', \$dataStr;
while (my $row = $csv->getline($dataIn)) {
    my ($date, $description, $value) = @$row;
    print "Value is $value\n";
}
close $dataIn;
```

Prints:

```
Value is -23.45
Value is 23.23
```

As an aside, the regex magic you want is the ? modifier:

```perl
use strict;
use warnings;

my $dataStr = <<DATA;
1/1/1923\tFirst entry\t-23.45
2/2/1924\t"Second entry with a long multi-line description.
This is parsed fine using Text::CSV, but I bet your regex chokes."\t23.23
DATA

open my $dataIn, '<', \$dataStr;
while (my $line = <$dataIn>) {
    next if $line !~ /(-?\d*\.\d*)/;
    print "Value is $1\n";
}
close $dataIn;
```

Prints:

```
Value is -23.45
Value is .
Value is .
```

which you will note still fails in nasty ways because the regex isn't nearly sufficient to extract the data you want. However, there are a huge number of errors in the regex you show, so you absolutely must go read some of the regex documentation: perlretut, perlre and perlreref for a start.

True laziness is hard work
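To make the `?` point concrete, here is a minimal sketch (an illustration, not code from the thread) of a pattern that is tied to the last tab-separated field and insists on at least one digit on each side of the decimal point, so a lone `.` in the description cannot match. The helper name `extract_amount` and the sample records are hypothetical, and it assumes single-line records with the amount as the final field.

```perl
use strict;
use warnings;

# Hypothetical helper: return the amount from one tab-separated record,
# or undef when the last field does not look like an amount.
sub extract_amount {
    my ($line) = @_;

    # \t          the tab that precedes the final field
    # -?          optional minus sign (the ? quantifier)
    # \$?         optional dollar sign
    # \d+\.\d+    at least one digit on each side of the point
    # \r?\n?\z    optional line ending, then end of string
    return $line =~ /\t(-?\$?\d+\.\d+)\r?\n?\z/ ? $1 : undef;
}

for my $line ("1/1/1923\tFirst entry\t-\$23.45\n",
              "2/2/1924\tNo. 2 & Co.\t\$23.23\n") {
    my $amount = extract_amount($line);
    print "Value is ", (defined $amount ? $amount : '(none)'), "\n";
}
```

This still does nothing for quoted multi-line descriptions; for those, a CSV parser remains the safer route.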
Re: REGEX Frustration by ikegami (Patriarch) on Dec 22, 2011 at 20:49 UTC
Regexes can be used for validation or data extraction. You want to perform the latter, so there is no need to examine the data that closely. You have tab-separated data, so all you need to match your data is:

```
       _____________________ date
      /       ______________ desc
     /       /       _______ amount
    /       /       /
  ------  ------  ------
/^[^\t]*\t[^\t]*\t[^\n]*\n\z/
```

You want to remove the amount, so

```perl
s/^[^\t]*\t[^\t]*\t\K[^\n]*//;        # 5.10+
s/^([^\t]*\t[^\t]*\t)[^\n]*/$1/;      # Any version
```
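The same three-field idea can also capture the amount instead of deleting it, which is closer to "strip out all but the amounts" in the original question. A rough sketch, assuming exactly two tabs precede the amount; the helper name `amount_of` and the sample record are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: skip the date and description fields,
# then capture whatever the third field holds (minus any CR/LF).
sub amount_of {
    my ($record) = @_;
    my ($amount) = $record =~ /^[^\t]*\t[^\t]*\t([^\r\n]*)/;
    return $amount;    # undef if the record has fewer than three fields
}

print amount_of("12/22/2011\tWidgets #4 & Co.\t-\$12.50\r\n"), "\n";
```

Because the pattern never inspects the amount itself, a leading minus sign needs no special handling.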
Re: REGEX Frustration by roboticus (Chancellor) on Dec 22, 2011 at 21:15 UTC
RobertJ:

You may simply be working too hard. If your format is just as described, and you don't have the TAB character anywhere in your data, you could probably just use something like (untested):

```perl
while (my $line = <INFILE>) {
    my ($date, $description, $amount) = split /\t/, $line;
    if (defined $amount) {
        print OUTFILE $amount;
    }
    else {
        # Amount not found...
    }
}
```

If you really want to use a regular expression, though, something like this might be more appropriate:

```perl
while (my $line = <INFILE>) {
    if ($line =~ /(-?\$\d*\.\d*)/) {
        print OUTFILE $1, "\n";
    }
    else {
        # Amount not found...
    }
}
```

In the regex, '-?' means "Maybe a minus sign". I left out the square brackets because they ~~didn't really add anything~~ just make a mess of things.

...roboticus

Update: Corrected square bracket comment. (Kudos to GrandFather for the catch.)

When your only tool is a hammer, all problems look like your thumb.
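One detail worth noting with the split approach: the amount is the last field, so it still carries the line ending (the original data uses RETURN). A small sketch that trims it before splitting; the helper name `third_field` is hypothetical, and it assumes no tabs inside the date or description:

```perl
use strict;
use warnings;

# Hypothetical helper: return the third tab-separated field with any
# trailing CR/LF removed, or undef if the record has no third field.
sub third_field {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;    # drop the line ending first
    my (undef, undef, $amount) = split /\t/, $line, 3;
    return $amount;
}

print third_field("3/4/2011\tCoffee & bagel\t-\$4.25\r"), "\n";
```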
Re^2: REGEX Frustration by GrandFather (Saint) on Dec 22, 2011 at 21:24 UTC
Actually, leaving out the square brackets is required! It's not so much that they didn't add anything as that they subtracted a lot! By now maybe the OP has read the regex docs and will realise that square brackets denote a set of characters, which matches just one character in the string being matched. It looks like the OP was thinking [] is equivalent to (). It's not.

True laziness is hard work
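The difference between `[]` and `()` can be seen in a few lines. This is only an illustrative sketch; the variable names and test strings are made up for the example:

```perl
use strict;
use warnings;

# A character class [...] matches exactly ONE character from the set;
# a group (...) matches the whole sequence inside it.
my $class = qr/^[abc]$/;    # one character: a, b or c
my $group = qr/^(abc)$/;    # the three-character string "abc"

for my $text ('a', 'abc') {
    printf "%-3s  class: %-3s  group: %s\n",
        $text,
        ($text =~ $class ? 'yes' : 'no'),
        ($text =~ $group ? 'yes' : 'no');
}
```

Here 'a' matches only the class and 'abc' matches only the group.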
Re: REGEX Frustration by mr.nick (Chaplain) on Dec 22, 2011 at 21:08 UTC
Since it's tab delimited, you could also use split to extract the amount field: `my $amt = ((split(/\t/,$data))[2]);`

mr.nick ...
Re: REGEX Frustration by JavaFan (Canon) on Dec 22, 2011 at 21:24 UTC
`my ($amount) = $line =~ /([^\t]+)$/;`
Re: REGEX Frustration by TJPride (Pilgrim) on Dec 22, 2011 at 22:48 UTC
This should match OK (assuming the tabs come through in my sample data, anyway). Note that you'll want to change the \n in the output to \r if you want \r.

```perl
use strict;
use warnings;

# The sample rows after __DATA__ are tab-separated.
while (<DATA>) {
    # \. matches a literal decimal point (a bare . would match any character)
    if (m/\t(-?\$\d+\.\d+)$/) {
        print "$1\n";
    }
}

__DATA__
12-31-2000	description	$44.44
12-31-2000	description	-$44.44
```