I am writing a script that takes some raw data (text), extracts a set of information from it and inserts that into a database. Wow, big surprise, you all say. Yeah, well, I have a feeling that I am not doing this particularly well.

What makes it non-trivial is that the rules for the data are less simple than usual. I don't think I will explain the rules, but rather give an example.

# DATA: Start
Graq Agnostic
Number: 634321
age: 27
hair colour: black
height: 73
weight: 123
legs: 2
arms: 2
jameson bells guinness
favourite detests likes
# DATA: End
(Name is Graq, Number is 634321, age 27, jameson is my fave drink, detest bells etc)

Now this data is also surrounded by further noise, and may contain extra blank lines. But the 'Number' tag is unique, and there are always exactly 70 lines of (non-empty) relevant data, so I can index that and grab the lines I need.
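A rough, line-based sketch of that idea (anchor on the unique 'Number' line, then take a fixed count of non-empty lines). The helper name useful_lines and its parameters are made up for illustration, and it assumes each field sits on its own line and that the name line comes directly before 'Number':

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $count non-empty lines, starting
# $before lines ahead of the first line that begins with 'Number'.
sub useful_lines {
    my ($text, $before, $count) = @_;

    # Drop blank (or whitespace-only) lines first, so indexes only
    # count lines that carry data.
    my @lines = grep { /\S/ } split /\n/, $text;

    # Index of the first line starting with the 'Number' tag.
    my ($anchor) = grep { $lines[$_] =~ /^Number\b/ } 0 .. $#lines;
    return unless defined $anchor;

    # Clamp the window so it never runs off either end of the array.
    my $start = $anchor - $before;
    $start = 0 if $start < 0;
    my $end = $start + $count - 1;
    $end = $#lines if $end > $#lines;

    return @lines[ $start .. $end ];
}
```

Under those assumptions, something like useful_lines($raw, 1, 70) would return the name line, the 'Number' line and the 68 non-empty lines after it; the 1 is a guess at how far the record starts ahead of 'Number'.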

# CODE: Start
#!perl -w
use strict;

my $pasteRaw;
while(<STDIN>) {
    # Remove spaces from before ':' to avoid splitting on 'hair colour'.
    substr($_, 0, 1+index($_, ":")) =~ tr/ //d;
    $pasteRaw .= $_;
}

my @pasteSplitAll = split( "[^:] |\n", $pasteRaw );  # [1] See below

# Remove empty parts of the array.
my @pasteNotEmpty = grep { $_ ne "" } @pasteSplitAll;

# Defined an index for splicing later.
my $index = 0;

# Set the index (looking for 'Number')
$pasteNotEmpty[$_] =~ /^Number/ and $index = $_ and last for 0..$#pasteNotEmpty;

my @pasteUseful = splice( @pasteNotEmpty, $index-2, 70 );
$pasteUseful[$_] =~ s/^\w+:// for 0..$#pasteUseful;
# CODE: End
So this gives me an array of 'useful' data. A big thanks to the people in CB for some of the individual lines in there, but it is all starting to look a little clumsy (and I am stripping some unwanted values at [1]).

So.. to my point. I am looking for some help in handling this data, preferably into a hash, so that I can do stuff with it.

Anything ranging from help on the individual REs to new approaches on tackling the problem as a whole. Am I just kicking a dead horse and might as well write something a lot less generic?
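To make the "into a hash" part concrete, here is a minimal sketch of the sort of thing I have in mind, not a finished solution. The name parse_fields is hypothetical, and it assumes every tagged line looks like 'key: value'; lines without a colon (the name line, the drinks, the favourite/detests/likes line) are only collected, and pairing them up is left open:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "key: value" lines into a hash and
# collect lines that have no colon separately.
sub parse_fields {
    my @lines = @_;
    my (%field, @other);

    for my $line (@lines) {
        # Key is everything before the first colon (so 'hair colour'
        # stays intact); surrounding whitespace is trimmed on both sides.
        if ($line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/) {
            $field{$1} = $2;
        }
        else {
            push @other, $line;
        }
    }
    return (\%field, \@other);
}

my ($field, $other) = parse_fields(
    'Number: 634321', 'age: 27', 'hair colour: black',
    'jameson bells guinness',
);
print "$_ => $field->{$_}\n" for sort keys %$field;
```

Matching the key up to the first colon means the spaces in 'hair colour' never need removing, so nothing gets eaten by the split pattern.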

<a href="http://www.graq.co.uk">Graq</a>


In reply to More Regular Expressions (text data handling) by graq
