Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all!

I'm trying to parse something *like* the following string...

joe(lots of spaces) 0.0000E(one space)000 (spaces) 9.0720E-001 (lots of spaces) d23(lots of spaces) 9.0208E-001(no space)

I would like to capture the following like so: "joe", "0.00E 000" "d23" "9.0720E-001". The reason I say *like* is that I have a script that needs to parse strings like the one above, but they are not always identical. These strings are generated, and I have a whole lot of them to parse :( I can't for the life of me figure out a regex that works... The strings have these characteristics:

1. Starts with some string, a name

2. Followed by a whole bunch of space (amount of spacing between each "group of characters" is irregular, but always more than just a single space)

3. Followed by a number written in scientific form. It will look either like this: #.####E(space)### or like this: #.####E-###. The number is always written to 4 decimal places ("decimal points"...is this the right term? I've been out of school for too long...)

4. More spacing

5. Followed by another number written in the same scientific form as described above

6. Followed by a whole bunch of spacing (again irregular, but always more than just a single space)

7. Followed by a text and/or integer "mashup word" (like d9s00 or e893 or 887.9 or irtw, etc) of unknown length

8. Followed again by a whole bunch spaces

9. Followed by another number in scientific form

10. End of string. No spaces follow the last "word"

I've been trying all sorts of things all day and the closest thing I've managed was this: @words = split(/\s\s\S/, $some_string). This gets me the first three "words" ("joe", "0.00E 000", and "d23"), but not the last. I'm stumped, any help? Eternally grateful


Replies are listed 'Best First'.
Re: How to extract these groups of characters?
by toolic (Bishop) on Aug 26, 2015 at 01:16 UTC
    split:
    use warnings;
    use strict;
    use Data::Dumper;

    my $str = 'joe 0.0000E 000 9.0720E-001 d23 9.0208E-001';
    my @words = split /\s+/, $str;
    print Dumper(\@words);

    __END__

    $VAR1 = [
              'joe',
              '0.0000E',
              '000',
              '9.0720E-001',
              'd23',
              '9.0208E-001'
            ];

      I think that to reflect the OP's requirement and keep the scientific notation numbers as single fields the split regex should be /\s\s+/ instead. In all other aspects the approach looks solid.
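A minimal sketch of that suggestion, for illustration only. The sample string and its spacing are assumptions (fields separated by two or more spaces, with a single space allowed inside a number), not the OP's actual data.

```perl
use strict;
use warnings;

# Assumed sample line: fields are separated by two or more spaces,
# while a number such as "0.0000E 000" contains exactly one space.
my $str = 'joe   0.0000E 000  9.0720E-001      d23     9.0208E-001';

# Split only on runs of at least two whitespace characters,
# so the single space inside a number is left alone.
my @words = split /\s\s+/, $str;

# Bracket each field so stray whitespace would be visible.
print "[$_]\n" for @words;
```

Under these assumptions the split yields five fields, and "0.0000E 000" stays in one piece.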

Re: How to extract these groups of characters?
by Athanasius (Archbishop) on Aug 26, 2015 at 01:20 UTC

    With a regex:

    #! perl
    use strict;
    use warnings;
    use Data::Dump;

    my $sci = qr{ \d \. \d{4} E [ -] \d{3} }x;

    while (<DATA>)
    {
        my @fields = / ^ (\w+) \s+ ($sci) \s+ ($sci) \s+ (\w+) \s+ ($sci) $ /x;
        dd \@fields;
    }

    __DATA__
    joe 0.0000E 000 9.0720E-001 d23 9.0208E-001
    fred 1.2345E-987 2.3456E 456 qrs76 3.4567E 001

    Output:

    11:17 >perl 1357_SoPW.pl
    ["joe", "0.0000E 000", "9.0720E-001", "d23", "9.0208E-001"]
    ["fred", "1.2345E-987", "2.3456E 456", "qrs76", "3.4567E 001"]

    11:19 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
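One possible variation on the regex approach, offered as a sketch rather than a correction: the question mentions "mashup words" such as 887.9, and `\w+` does not match a dot. Using `\S+` for the two free-form fields is one way to allow for that. The sample line below is invented for illustration.

```perl
use strict;
use warnings;

# Number pattern: one digit, a dot, four digits, "E",
# a space or a minus sign, then three digits.
my $sci = qr{ \d \. \d{4} E [ -] \d{3} }x;

# Assumed sample line; the fourth field contains a dot,
# which \w+ would not match but \S+ does.
my $line = 'amy   1.0000E 002  2.5000E-001    887.9    3.0000E 000';

# \S+ accepts any run of non-whitespace for the name and the mashup word.
my @fields = $line =~ / ^ (\S+) \s+ ($sci) \s+ ($sci) \s+ (\S+) \s+ ($sci) $ /x;

print join('|', @fields), "\n";
```

The trade-off is that `\S+` is more permissive than `\w+`, so it validates the free-form fields less strictly.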

Re: How to extract these groups of characters?
by kcott (Archbishop) on Aug 26, 2015 at 09:18 UTC

    Your description has a mismatch between the order of data specified and the order captured:

    DATA: "joe(lots of spaces) 0.0000E(one space)000 (spaces) 9.0720E-001 (lots of spaces) d23 ..."

    CAPTURE: "... the first three "words" ("joe", "0.00E 000", and "d23") ..."

    Your description of the spaces separating the data fields is inconsistent. The following assumes "(amount of spacing between each "group of characters" is irregular, but always more than just a single space)" is more accurate than, for instance, the highly vague "a whole bunch spaces".

    My best guess is that the spaces separating the data fields match /\s{2,}/. On this basis, you can simply use split:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        print "Line $.: $_";
        print for split /\s{2,}/;
    }

    __DATA__
    joe   0.00E 000  9.0720E-001   d23   9.0208E-001
    joe2   0.00E-000  9.0720E 001   d23   9.0208E 001
    joe3   0.00E 000  9.0720E-001   d23   9.0208E 001
    joe4   0.00E-000  9.0720E 001   d23   9.0208E-001

    Output:

    Line 1: joe   0.00E 000  9.0720E-001   d23   9.0208E-001
    joe
    0.00E 000
    9.0720E-001
    d23
    9.0208E-001
    Line 2: joe2   0.00E-000  9.0720E 001   d23   9.0208E 001
    joe2
    0.00E-000
    9.0720E 001
    d23
    9.0208E 001
    Line 3: joe3   0.00E 000  9.0720E-001   d23   9.0208E 001
    joe3
    0.00E 000
    9.0720E-001
    d23
    9.0208E 001
    Line 4: joe4   0.00E-000  9.0720E 001   d23   9.0208E-001
    joe4
    0.00E-000
    9.0720E 001
    d23
    9.0208E-001

    — Ken
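Since the strings are generated in bulk, it may be worth checking that each line really splits into the expected five fields. The following is only a sketch: `parse_line` is a hypothetical helper name, and the sample lines are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into its five fields, or return
# an empty list if the line does not have the expected shape.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\s{2,}/, $line;
    return @fields == 5 ? @fields : ();
}

# Assumed sample input; the second line uses single spaces only,
# so it does not split into five fields and is rejected.
my @lines = (
    "joe   0.0000E 000  9.0720E-001      d23     9.0208E-001\n",
    "bad 0.0000E 000 9.0720E-001 d23 9.0208E-001\n",
);

for my $line (@lines) {
    my @fields = parse_line($line);
    if (@fields) {
        print join(' | ', @fields), "\n";
    }
    else {
        print "skipped: $line";
    }
}
```

Rejecting lines with the wrong field count makes it easier to spot input that does not follow the two-or-more-spaces convention, instead of silently producing misaligned fields.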

Re: How to extract these groups of characters?
by Monk::Thomas (Friar) on Aug 26, 2015 at 09:50 UTC

    I've been trying all sorts of things all day and the closest thing I've managed was this: @words = split(/\s\s\S/, $some_string). This gets me the first three "words" ("joe", "0.00E 000", and "d23")

    I'm wondering about that. If I combine the script provided by toolic and your RegExp, then I get
      'joe   0.0000E 000  9.0720E-001      d23     9.0208E-001'
    ->
      'joe '
      '.0000E 000'
      '.0720E-001    '
      '23   '
      '.0208E-001'
    

    instead of 'joe', '0.00E 000', and 'd23'. (I was trying to understand how your RegExp would result in the output you provided, but it seems it actually doesn't.)

    The RegExp suggested by hippo is correct. The magic conditions in your description were 'scientific numbers are separated by exactly one space' and 'values are separated by at least 2 spaces'. (btw. another way to write the required regexp would be /\s{2,}/ )