in reply to More Regular Expressions (text data handling)

Sorry, but you can't consistently parse data that is THIS inconsistent. Blank lines are easy to ignore, but how can you ignore lines with "noise" at the start without a solid way to denote the start of your data?

Try your best to clean up the incoming data. Until then, here are some parsing tricks that might help you keep some of your code maintainable (note I didn't say fast).

my %KEY_PARSER = (
    "number" => {
        START_COMMAND => qr{number:\s*}i,
        VALUE_MATCH   => qr{\d+},
    },
    "hair color" => {
        START_COMMAND => qr{hair colou?r:\s*}i,
        VALUE_MATCH   => qr{[\w\s]+},
    },
    "height" => {
        START_COMMAND => qr{height:\s*}i,
        VALUE_MATCH   => qr{\d+},
    },
    "weight" => {
        START_COMMAND => qr{weight:\s*}i,
        VALUE_MATCH   => qr{\d+},
    },
);

foreach my $line (grep { /\w/ } <DATA>) {
    foreach my $name (keys %KEY_PARSER) {
        my $start = $KEY_PARSER{$name}{START_COMMAND};
        my $match = $KEY_PARSER{$name}{VALUE_MATCH};

        # Strip each "key: value" pair as it is found. Looping on the
        # substitution itself ends the loop once nothing more matches,
        # and a value of 0 is no longer skipped.
        while ($line =~ s/($start)\s*($match)//) {
            my ($key, $value) = ($1, $2);
            chomp($key, $value);
            print "Found KEY: $key = $value\n";
        }
    }
}

Crap, even when munged by the magical Perl, still smells like crap.

Re: Re: More Regular Expressions (text data handling)
by graq (Curate) on Dec 04, 2001 at 20:25 UTC
    As I stated earlier, the noise is before and after and it is possible to identify an index and work from there.
    The number of sets of data is always 70 and the index /^Number:/ is always the third piece of data (after blank lines are removed).
    So you can, somewhat, ignore that for the question. I was including it for completeness.

    Graq (http://www.graq.co.uk)

      As I see it, you need lookaheads in your regex.

      Since the line before Number contains the name, and the person's details are terminated by the name again,
      you can write something that grabs the name and then the text between the two instances of it.

      You could then build a hash of names, with each value being a hash of details. Does that sound good?
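
A minimal sketch of that idea, assuming each record is a name on its own line, some "Key: value" lines, and then the same name again on its own line. The sub name parse_people and the sample data are made up for illustration; the real input may well need a stricter pattern for the name line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns { name => { key => value, ... }, ... }.
sub parse_people {
    my ($text) = @_;
    my %people;

    # A name line (no colon), then the shortest run of text that is
    # followed by the same name on a line of its own. The lookahead
    # leaves the closing name unconsumed.
    while ($text =~ /^([^:\n]+)\n(.*?)(?=^\1$)/msg) {
        my ($name, $body) = ($1, $2);

        # Each "Key: value" line in between becomes one hash entry,
        # with the key lowercased.
        while ($body =~ /^([^:\n]+):[ \t]*(.*?)[ \t]*$/mg) {
            $people{$name}{ lc $1 } = $2;
        }
    }
    return \%people;
}

# Made-up sample data, not taken from the original question.
my $sample = <<'END';
John Smith
Number: 12
Hair colour: brown
John Smith
Jane Doe
Number: 7
Height: 170
Jane Doe
END

my $people = parse_people($sample);
for my $name (sort keys %$people) {
    for my $key (sort keys %{ $people->{$name} }) {
        print "$name / $key = $people->{$name}{$key}\n";
    }
}
```

Because the closing name is only looked at, never consumed, the regex retries from it and fails before moving on to the next record. That is wasteful but harmless on a file this small.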

      --

      Brother Frankus.


        The Number is the unique key for the data.
        Having written this problem down, and examined it as I try to explain it :), I have decided to attempt this approach:
        1. Remove all blank lines.
        2. Find the index and grab 70 lines (-2..68).
        3. Split the data into three sections.
        4. Deal with section overlaps.
        The three sections are:
        1. All lines up to (but excluding) the first line with a colon.
        2. All lines with a colon.
        3. The rest.
        Count the lines in 'The rest' and move that many lines from section 2 into section 3, ahead of its existing lines.

        This should help sort the data.
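
Steps 1 to 3 above could be sketched roughly as follows. The sub names grab_window and split_sections, the sample lines, and the window size of 6 are all made up for illustration. Treating section 2 as the unbroken run of colon lines is an assumption, as is reading the range as "start two lines before the index and take a fixed count". Step 4 is left out because it depends on what the overlaps look like in the real data.

```perl
use strict;
use warnings;

# Steps 1-2: drop blank lines, find the first /^Number:/ line, and take
# $count lines starting $before lines ahead of it.
sub grab_window {
    my ($lines, $before, $count) = @_;
    my @kept = grep { /\S/ } @$lines;
    my ($index) = grep { $kept[$_] =~ /^Number:/ } 0 .. $#kept;
    die "no Number: line found\n" unless defined $index;

    my $start = $index - $before;
    $start = 0 if $start < 0;
    my $end = $start + $count - 1;
    $end = $#kept if $end > $#kept;
    return @kept[ $start .. $end ];
}

# Step 3: split into (0) lines before the first colon line, (1) the
# unbroken run of colon lines, (2) everything after that run.
sub split_sections {
    my @lines = @_;
    my @sections = ( [], [], [] );
    my $state = 0;
    for my $line (@lines) {
        my $has_colon = $line =~ /:/;
        $state = 1 if $state == 0 && $has_colon;
        $state = 2 if $state == 1 && !$has_colon;
        push @{ $sections[$state] }, $line;
    }
    return @sections;
}

# Made-up sample input; the real data would use a window of 70 lines.
my @raw = (
    "noise", "", "Title line", "John Smith", "Number: 12",
    "Height: 180", "Weight: 75", "trailing text", "", "more noise",
);
my @window = grab_window( \@raw, 2, 6 );
my ( $head, $colon, $rest ) = split_sections(@window);
printf "%d head, %d colon, %d rest\n",
    scalar @$head, scalar @$colon, scalar @$rest;
```

Keeping the window size and offset as parameters makes it easy to adjust them once the exact range is confirmed against the real file.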

        Graq (http://www.graq.co.uk)