Conal has asked for the wisdom of the Perl Monks concerning the following question:

Hullo, I have some data that looks like this:

http://rafb.net/p/38TDqL58.html

(The data isn't always as clean as this.)

I am trying to write a regex that matches a line only when an exact pattern appears on it.

To summarise, the pattern I am trying to match is:

(possible whitespace)EUR/USD(possible whitespace),(possible whitespace)X.XXXXX(possible whitespace),(possible whitespace)XX:XX:XX(possible whitespace/random characters)(end of line)

where X is a digit.

I have played around with a few things but just can't work it out.

if ( m{^\s* EUR[/]USD \s*,s\* (\d\.\d{5}) \s*,\s* (\d\d:\d\d:\d\d) \s* + (.*) $ }x ) { // do stuff
doesn't seem to work.

Can anyone please offer some help? Thanks!

conal.

Re: a little REGEX help
by CountZero (Bishop) on Mar 21, 2009 at 23:34 UTC
    Are the line-numbers part of the data? If so, your regex will never match as you anchor your regex to the beginning of the line and don't take the line number into account.

    The following will work with or without line-numbers:

    use strict;
    use warnings;

    while (<DATA>) {
        next unless m/EUR\/USD/;
        chomp;
        my ($currency, $amount, $time) = split /\s*,\s*/;
        $currency =~ s/\d*\s*//g;
        print "$currency: $amount at $time\n";
    }
    __DATA__
    1 EUR/USD ,1.35590 ,13:09:31
    2 EUR/JPY , 129.872 ,13:09:29
    3 GBP/JPY , 138.009 ,13:09:32
    4 AUD/JPY , 65.939 ,13:09:30
    5 EUR/USD ,1.35592 ,13:09:35
    6 EUR/JPY , 129.866 ,13:09:35
    7 GBP/JPY , 137.999 ,13:09:35
    8 AUD/JPY , 65.938 ,13:09:35
    9 EUR/USD ,1.35592 ,13:09:35
    10 EUR/JPY , 129.866 ,13:09:35
    11 GBP/JPY , 137.999 ,13:09:35
    12 AUD/JPY , 65.938 ,13:09:35
    13 EUR/USD ,1.35592 ,13:09:35
    14 EUR/JPY , 129.866 ,13:09:35
    15 GBP/JPY , 137.999 ,13:09:35
    16 AUD/JPY , 65.938 ,13:09:35
    17 EUR/USD ,1.35592 ,13:09:35
    18 EUR/JPY , 129.866 ,13:09:35
    19 GBP/JPY , 137.999 ,13:09:35

    output:

    EUR/USD: 1.35590 at 13:09:31
    EUR/USD: 1.35592 at 13:09:35
    EUR/USD: 1.35592 at 13:09:35
    EUR/USD: 1.35592 at 13:09:35
    EUR/USD: 1.35592 at 13:09:35

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
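To illustrate the point about anchoring and line numbers made in CountZero's reply, here is a minimal sketch of an anchored regex that tolerates an optional leading line number. The helper name parse_eur_usd and the sample lines are assumptions made for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns ($rate, $time) for an
# EUR/USD line, or an empty list if the line does not match.
sub parse_eur_usd {
    my ($line) = @_;
    return unless $line =~ m{
        ^ \s* (?: \d+ \s+ )?       # optional line number, e.g. "5 "
        EUR/USD \s* , \s*          # the currency pair and first comma
        ( \d \. \d{5} ) \s* , \s*  # rate in the form X.XXXXX
        ( \d\d : \d\d : \d\d )     # timestamp in the form XX:XX:XX
    }x;
    return ( $1, $2 );
}

# Assumed sample lines: with a line number, without one, and a different pair.
for my $line (
    '1 EUR/USD ,1.35590 ,13:09:31',
    'EUR/USD ,1.35592 ,13:09:35',
    '2 EUR/JPY , 129.872 ,13:09:29',
) {
    my ( $rate, $time ) = parse_eur_usd($line) or next;
    print "EUR/USD: $rate at $time\n";
}
```

Keeping the ^ anchor while making the number optional still rejects lines where EUR/USD appears somewhere in the middle, which an unanchored m/EUR\/USD/ test would accept.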

Re: a little REGEX help
by linuxer (Curate) on Mar 21, 2009 at 23:28 UTC

    If your data sample only reflects "clean" entries, what do "unclean" entries look like?

    I think, for only separating the entries, a split can also do the job...

    #...
    LINE:
    while ( my $line = <DATA> ) {
        chomp $line;

        # skip all lines which don't contain "EUR/USD" at the beginning
        next LINE if $line !~ m{^\s*EUR/USD};

        # split $line at pattern '\s*,\s*' into at most 3 pieces
        # (name, value, and the rest of the line)
        # for details, see: perldoc -f split
        my ( $name, $value, $rest ) = split m{\s*,\s*}, $line, 3;

        # extract the timestamp from the third piece
        my $time = ( split m{ }, $rest, 2 )[0];

        print "$name : $value : $time\n";
    }

    Updated PS: your regex has a typo: \s*,s\* (it should be \s*,\s*)

    Update #2: added code comments

Re: a little REGEX help
by bichonfrise74 (Vicar) on Mar 22, 2009 at 02:41 UTC
    Try this...
    #!/usr/bin/perl
    use strict;

    while( <DATA> ) {
        my ($currency_a, $currency_b, $bogus, $val_a, $time)
            = $_ =~ /(\w+)\/(\w+)\s?,(\s?|\s+?)(\d+\.?\d+?)\s+?,\s?(.*)/;
        print "$currency_a, $currency_b, $val_a, $time\n";
    }
    __DATA__
    EUR/USD ,1.35590 ,13:09:31
    EUR/JPY , 129.872 ,13:09:29
    GBP/JPY , 138.009 ,13:09:32
    AUD/JPY , 65.939 ,13:09:30
    EUR/USD ,1.35592 ,13:09:35
    EUR/JPY , 129.866 ,13:09:35
    GBP/JPY , 137.999 ,13:09:35

Re: a little REGEX help
by targetsmart (Curate) on Mar 22, 2009 at 10:55 UTC
    .* is greedy; check with the non-greedy .*? instead.
    But don't forget to read the relevant text on 'greedy' in perlretut and in greedy.
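As a small sketch of the difference between greedy and non-greedy matching mentioned here, the snippet below captures everything up to a comma in both ways. The sample line and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Assumed sample line in the same shape as the data in this thread.
my $line = 'EUR/USD ,1.35590 ,13:09:31';

# Greedy: .* takes as much as it can, so the capture runs up to the LAST comma.
my ($greedy) = $line =~ m{^(.*),};

# Non-greedy: .*? takes as little as it can, so the capture stops at the FIRST comma.
my ($lazy) = $line =~ m{^(.*?),};

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

When the wildcard is followed directly by an end-of-line anchor, as in the regex from the question, both forms should capture the same text, so greediness is probably not what stops that particular pattern from matching.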

    Vivek
    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
Re: a little REGEX help
by juster (Friar) on Mar 22, 2009 at 21:45 UTC

    If your data is fixed width like your example and purely ASCII text you could also use unpack.

    while (<DATA>) {
        chomp;
        # fixed columns: currency pair at 0, amount at 9, time at 20
        my ( $currencies, $amount, $time )
            = unpack '@0 A8 @9 A10 @20 A9', $_;
        next unless ( $currencies eq 'EUR/USD' );
        print <<"END_OUTPUT";
    Amount: $amount
    Time:   $time
    END_OUTPUT
    }
    __DATA__
    EUR/USD ,1.35590   ,13:09:31
    EUR/JPY , 129.872  ,13:09:29
    GBP/JPY , 138.009  ,13:09:32
    AUD/JPY , 65.939   ,13:09:30
    EUR/USD ,1.35592   ,13:09:35
    EUR/JPY , 129.866  ,13:09:35
    GBP/JPY , 137.999  ,13:09:35
    AUD/JPY , 65.938   ,13:09:35
    EUR/USD ,1.35592   ,13:09:35

    I read this tip in the book Perl Best Practices.

      Consider using a database for time series data.
Re: a little REGEX help
by Anonymous Monk on Mar 23, 2009 at 16:53 UTC
    I think what you have will work if you remove all the <SPACE> characters within the m() expression. That is, m(^\s*EUR/USD\s+,\s*(\d.\d{5})... rather than m(^\s* EUR...

      Did you notice that the /x modifier is used? See perlre for details.

      The spaces shouldn't be the problem. The whitespace in the data string would be handled by the \s* parts if they were all written correctly; see: EUR[/]USD \s*,s\*

      As far as I remember, I tested the given regex and it worked after fixing that typo...

      update: fixed minor error in presentation
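
The two points above (the typo and the /x modifier) can be sketched together as follows. This is the regex from the question with s\* changed to \s*, and with the stray + dropped on the assumption that it is a line-wrap marker rather than part of the pattern. The variable name $re and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the question with the typo corrected. Under /x the
# literal spaces and newlines inside the pattern are ignored, so they only
# serve readability; whitespace in the data is matched by \s*.
my $re = qr{
    ^ \s* EUR[/]USD \s* , \s*
    ( \d \. \d{5} ) \s* , \s*
    ( \d\d : \d\d : \d\d ) \s*
    ( .* ) $
}x;

# Assumed sample lines: clean, with extra spacing and trailing text, and
# a different currency pair that should be skipped.
for my $line (
    'EUR/USD ,1.35590 ,13:09:31',
    '  EUR/USD,1.35592,13:09:35 junk',
    'EUR/JPY , 129.872 ,13:09:29',
) {
    if ( my ( $rate, $time, $rest ) = $line =~ $re ) {
        print "rate=$rate time=$time rest='$rest'\n";
    }
}
```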