in reply to Grab everything not matched by regexp
Mm, this feels like an issue not appropriate for a quick fix. You may wake up any day and decide you need to extend it in an unexpected direction. If your data were better behaved, I'd look for the elegant solution. But here I'd incline toward a heavy-duty, ugly strategy. I'm advocating a procedural, rather than a functional or logical, approach. Given your sample code, you may, I dare say, prefer procedural anyway.
This reminds me of a task I worked on several years ago, parsing Shakespeare's works from a badly proofed text. I needed to throw out misspellings, accept the Bard's eccentric spellings, discard stage directions, and otherwise deal with unpredictable data. There were, if I recall, a couple of tens of thousands of lines, and I had to see results from one attempt before modifying my code for the next; a quick eyeball of the data failed to reveal some of the most annoying exceptions. I believe I made the task more difficult by tying myself early on to a functional approach.
Let's say you concentrate first on breaking each raw record into fields. These fields may or may not correspond to actual desired content; rather, you determine how to split up the raw record based on what you already know (without too much regexing). For instance, you might define a set of delimiters and split on them, bearing in mind that you probably want to append the delimiter to the field, in case it wasn't a throwaway after all.
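As a minimal sketch of that first pass (the record and delimiter set here are invented for illustration, not taken from your data): capturing parens in <code>split</code>'s pattern keep the delimiters as list elements, so you can glue each one back onto the field it followed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical raw record and delimiter set -- adjust to your data.
my $raw = 'Smith, John; 42 Main St.; scribbled note';

# Capturing parens in the split pattern return the delimiters
# as list elements interleaved with the fields.
my @parts = split /([;,])/, $raw;

my @fields;
while (@parts) {
    my $field = shift @parts;
    my $delim = @parts ? shift @parts : '';
    push @fields, $field . $delim;   # keep the delimiter, in case it matters
}

print "$_\n" for @fields;
```

The rough cut doesn't have to be clever; it just has to chop the record into pieces small enough for the second pass to reason about.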
I'd look for a rough cut here and be willing to entertain overlapping fields. Keep in mind that you can have the same data in more than one field: "cats", "paw", and "cats paw". Some mix of small tokens and medium chunks will probably work best.
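One cheap way to get that mix of overlapping small tokens and medium chunks (a sketch, not a recommendation of any particular tokenization):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep both the single tokens and each adjacent pair, so the same
# data shows up in more than one candidate field.
my @tokens = split ' ', 'the cats paw slipped';
my @chunks = @tokens;
push @chunks, "$tokens[$_] $tokens[$_+1]" for 0 .. $#tokens - 1;

# @chunks now contains "cats", "paw", and "cats paw", among others.
print "$_\n" for @chunks;
```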
Store all the fields associated with each record in some sort of structure, not necessarily a full-blown object. An array might be enough, or a hash.
In the second pass, you examine rough-cut fields in detail and decide whether to throw them away or munge them further. You may decide what to do with one field by testing another. Don't delete anything until you've created the sanitized record full of clean fields in a second structure. Then toss the entire rough record.
You might like to make the intermediate, rough-cut structure something like an array of hashes. The array index just tells the (possibly unimportant) sequence in which each chunk was pulled out; for each chunk, there are key/value pairs telling the chunk data itself, the split or other technique used to pull it, and whether the chunk has been incorporated into the final, clean record. Your final rule will sweep up any unused chunks and concatenate them into a 'notes' field in the clean record.
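Something like this, say (the chunk data, rule names, and clean-record fields are all made up for the sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Intermediate rough-cut structure: an array of hashes. Array order
# records the sequence the chunks were pulled in; each hash records
# the chunk, how it was pulled, and whether it's been used yet.
my @rough = (
    { chunk => 'Smith, John', rule => 'split_semicolon', used => 0 },
    { chunk => '42 Main St.', rule => 'split_semicolon', used => 0 },
    { chunk => 'scribble',    rule => 'leftover',        used => 0 },
);

my %clean;

# Second pass: each rule examines chunks and claims what it recognizes.
for my $c (@rough) {
    if ($c->{chunk} =~ /^(\w+), (\w+)$/) {
        @clean{qw(last first)} = ($1, $2);
        $c->{used} = 1;
    }
    elsif ($c->{chunk} =~ /^\d+ .* St\.$/) {
        $clean{address} = $c->{chunk};
        $c->{used} = 1;
    }
}

# Final rule: sweep any unused chunks into a 'notes' field.
$clean{notes} = join ' | ', map { $_->{chunk} } grep { !$_->{used} } @rough;

print "$_: $clean{$_}\n" for sort keys %clean;
```

Because each rule only flips its own chunks' <code>used</code> flags, you can add or reorder rules without damaging the rough record.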
This will seem rather a muddle when the code is reviewed but it's forgiving. You can add rules to the first pass until you're sure, by dumping the rough records, that you've chopped the raw records into small enough chunks.
Likewise, on the second pass, you can add rules one at a time until you've squeezed out everything you can identify. This is forgiving, because fooling with one rule will not break another or damage the rough record.
Your code will come out looking like an accumulation of unplanned kludges, which is exactly what it will be. It's a one-off fix. Don't build a superandroid to herd a few cats.
Many Monks will post much more elegant solutions but if your data is as jumbled as I fear, these may prove fragile and difficult to amend. On the other hand, the approach I offer will be difficult to maintain and reuse.
I do strongly suggest that you spend goodly time browsing CPAN for tools you like for this job. But there are over 4000 modules that "parse" something, somehow; don't get lost in the woods.
Replies are listed 'Best First'.

- Re^2: Grab everything not matched by regexp by Kerplunk (Acolyte) on Feb 24, 2010 at 14:06 UTC
- Re^2: Grab everything not matched by regexp by molecules (Monk) on Feb 23, 2010 at 19:52 UTC