How to extract these groups of characters?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all!

I'm trying to parse something *like* the following string...

joe(lots of spaces) 0.0000E(one space)000 (spaces) 9.0720E-001 (lots of spaces) d23(lots of spaces) 9.0208E-001(no space)

I would like to capture the following like so: "joe", "0.00E 000" "d23" "9.0720E-001". The reason I say *like* is because I have a script that needs to parse strings like the above, but they are not always the same. These strings are generated, and I have a whole lot of them to parse :( I can't for the life of me figure out a regex that works... The strings have these characteristics:

1. Starts with some string, a name

2. Followed by a whole bunch of space (amount of spacing between each "group of characters" is irregular, but always more than just a single space)

3. Followed by an integer written in scientific form. They will either look like this: #.####E(space)### or this: #.####E-###. The integer is always a decimal integer taken to 4 decimal points ("decimal points"...is this the right term? I've been out of school for too long...)

4. More spacing

5. Followed by another scientific integer written in the same scientific form as described above

6. Followed by a whole bunch of spacing (again irregular, but always more than just a single space)

7. Followed by a text and/or integer "mashup word" (like d9s00 or e893 or 887.9 or irtw, etc) of unknown length

8. Followed again by a whole bunch spaces

9. Followed by another scientific integer

10. End of string. No spaces follow the last "word"

I've been trying all sorts of things all day and the closest thing I've managed was this: @words = split(/\s\s\S/, $some_string). This gets me the first three "words" ("joe", "0.00E 000", and "d23"), but not the last. I'm stumped, any help? Eternally grateful

Re: How to extract these groups of characters?
by toolic (Bishop) on Aug 26, 2015 at 01:16 UTC

use warnings;
use strict;
use Data::Dumper;

my $str = 'joe   0.0000E 000  9.0720E-001      d23     9.0208E-001';
my @words = split /\s+/, $str;
print Dumper(\@words);

__END__

$VAR1 = [
          'joe',
          '0.0000E',
          '000',
          '9.0720E-001',
          'd23',
          '9.0208E-001'
        ];

Re^2: How to extract these groups of characters?
by hippo (Archbishop) on Aug 26, 2015 at 08:23 UTC

I think that to reflect the OP's requirement and keep the scientific notation numbers as single fields the split regex should be /\s\s+/ instead. In all other aspects the approach looks solid.
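A minimal sketch of that adjustment, reusing the sample string from the parent post. The helper name split_fields is made up for illustration, and the sketch assumes fields are always separated by at least two whitespace characters:

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of two or more whitespace characters,
# so the single space inside "0.0000E 000" is not treated as a separator.
sub split_fields {
    my ($line) = @_;
    return split /\s\s+/, $line;
}

my $str   = 'joe   0.0000E 000  9.0720E-001      d23     9.0208E-001';
my @words = split_fields($str);
print "$_\n" for @words;
```

With this string the split yields five fields, and '0.0000E 000' stays in one piece.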

Re: How to extract these groups of characters?
by Athanasius (Archbishop) on Aug 26, 2015 at 01:20 UTC

With a regex:

#! perl
use strict;
use warnings;
use Data::Dump;

my $sci = qr{ \d \. \d{4} E [ -] \d{3} }x;

while (<DATA>)
{
    my @fields = / ^ (\w+) \s+ ($sci) \s+ ($sci) \s+ (\w+) \s+ ($sci) $ /x;
    dd \@fields;
}

__DATA__
joe     0.0000E 000       9.0720E-001   d23        9.0208E-001
fred  1.2345E-987     2.3456E 456       qrs76   3.4567E 001

Output:

11:17 >perl 1357_SoPW.pl
["joe", "0.0000E 000", "9.0720E-001", "d23", "9.0208E-001"]
["fred", "1.2345E-987", "2.3456E 456", "qrs76", "3.4567E 001"]

11:19 >
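One possible variation, offered only as a sketch: the OP mentions "mashup words" such as 887.9, which \w+ would not match because of the dot. Using \S+ for the name and the mashup word covers that case. The helper name parse_line and the sample lines are made up for illustration; the number pattern is the same $sci as above.

```perl
use strict;
use warnings;

# Same number pattern as above: digit, dot, four digits, E,
# a space or a minus sign, then three digits.
my $sci = qr{ \d \. \d{4} E [ -] \d{3} }x;

# Hypothetical helper: returns the five fields in list context,
# or an empty list if the line does not have the expected shape.
sub parse_line {
    my ($line) = @_;
    # \S+ (rather than \w+) lets the name and the "mashup word"
    # contain characters such as '.', as in 887.9.
    return $line =~ / ^ (\S+) \s+ ($sci) \s+ ($sci) \s+ (\S+) \s+ ($sci) $ /x;
}

for my $line ('bob   1.0000E 002   3.5000E-004    887.9    6.0000E 000',
              'not a record') {
    my @fields = parse_line($line);
    if (@fields) { print join('|', @fields), "\n" }
    else         { print "no match: $line\n" }
}
```

Checking for an empty list makes it easy to flag generated lines that do not fit the described layout instead of silently producing wrong fields.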

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: How to extract these groups of characters?
by kcott (Archbishop) on Aug 26, 2015 at 09:18 UTC

Your description has a mismatch between the order of data specified and the order captured:

DATA: "joe(lots of spaces) 0.0000E(one space)000 (spaces) 9.0720E-001 (lots of spaces) d23 ..."

CAPTURE: "... the first three "words" ("joe", "0.00E 000", and "d23") ..."

Your description of the spaces separating the data fields is inconsistent. The following assumes "(amount of spacing between each "group of characters" is irregular, but always more than just a single space)" is more accurate than, for instance, the highly vague "a whole bunch spaces".

My best guess is that the spaces separating the data fields match /\s{2,}/. On this basis, you can simply use split:

#!/usr/bin/env perl -l

use strict;
use warnings;

while (<DATA>) {
    chomp;
    print "Line $.: $_";
    print for split /\s{2,}/;
}

__DATA__
joe     0.00E 000   9.0720E-001     d23     9.0208E-001
joe2    0.00E-000   9.0720E 001     d23     9.0208E 001
joe3    0.00E 000   9.0720E-001     d23     9.0208E 001
joe4    0.00E-000   9.0720E 001     d23     9.0208E-001

Output:

Line 1: joe     0.00E 000   9.0720E-001     d23     9.0208E-001
joe
0.00E 000
9.0720E-001
d23
9.0208E-001
Line 2: joe2    0.00E-000   9.0720E 001     d23     9.0208E 001
joe2
0.00E-000
9.0720E 001
d23
9.0208E 001
Line 3: joe3    0.00E 000   9.0720E-001     d23     9.0208E 001
joe3
0.00E 000
9.0720E-001
d23
9.0208E 001
Line 4: joe4    0.00E-000   9.0720E 001     d23     9.0208E-001
joe4
0.00E-000
9.0720E 001
d23
9.0208E-001

— Ken


Re: How to extract these groups of characters?
by Monk::Thomas (Friar) on Aug 26, 2015 at 09:50 UTC

I've been trying all sorts of things all day and the closest thing I've managed was this: @words = split(/\s\s\S/, $some_string). This gets me the first three "words" ("joe", "0.00E 000", and "d23")

  'joe   0.0000E 000  9.0720E-001      d23     9.0208E-001'
->
  'joe '
  '.0000E 000'
  '.0720E-001    '
  '23   '
  '.0208E-001'

instead of 'joe', '0.00E 000', and 'd23'. (I was trying to understand how your RegExp would result in the output you provided, but it seems it actually doesn't.)

The RegExp suggested by hippo is correct. The magic conditions in your description were 'scientific numbers are separated by exactly one space' and 'values are separated by at least 2 spaces'. (btw. another way to write the required regexp would be /\s{2,}/ )
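To see the two behaviours side by side, here is a small sketch using the same sample string. The variable names are made up for illustration:

```perl
use strict;
use warnings;

my $str = 'joe   0.0000E 000  9.0720E-001      d23     9.0208E-001';

# /\s\s\S/ matches two whitespace characters plus one non-whitespace
# character, so the first character of the next field is consumed as
# part of the separator, and any extra leading whitespace stays
# attached to the previous field.
my @eaten = split /\s\s\S/, $str;

# /\s{2,}/ consumes only the whitespace run between fields.
my @clean = split /\s{2,}/, $str;

print "[$_]\n" for @eaten;
print "--\n";
print "[$_]\n" for @clean;
```

The brackets in the output make the leftover trailing spaces and the missing leading characters of the first variant visible.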
