parsing question

liamlrb has asked for the wisdom of the Perl Monks concerning the following question:

I have several message types from chat rooms that I have to parse and store in a database. Some fields have a specific format, some are free form, and some are optional, which leads to several message variations. Since humans type the messages, mistakes are inevitable. I am hoping for some guidance on the best approach to parsing these. I am currently using regexes to break the messages apart, but because some fields are optional, it is getting cumbersome to tell which ones were included. I also want to give feedback when a field is incorrectly formatted, and to have some flexibility for correcting common mistakes. The message variations look like this:

time / date / address / when / free form 1 / free form text 2
time / date / free form 1 / free form 2
time / date / when / free form 1 / what

I am not sure I can do it effectively with Parse::RecDescent or any of the other parsers. There seems to be a steep learning curve, and of course I need to have it done yesterday. I am just looking for some guidance from anyone who has tackled this type of problem. Any help appreciated.
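For reference, the regex-free baseline for this kind of input can be sketched in core Perl: split on the separator, pick a field layout by field count, then validate the formatted fields and collect feedback. Everything specific here is an assumption, since the post does not give the real formats: the `HH:MM` time, the `YYYY-MM-DD` date, the list of "when" keywords, and the names `parse_message`, `%check` and `%layout` are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical validators; the real field formats are not given in the post.
my %check = (
    time => qr/^\d{1,2}:\d{2}$/,
    date => qr/^\d{4}-\d{2}-\d{2}$/,
    when => qr/^(?:now|today|tonight|tomorrow)$/i,
);

# Field layouts keyed by field count. This works only because the three
# variants shown in the question happen to have different lengths.
my %layout = (
    6 => [qw(time date address when free1 free2)],
    5 => [qw(time date when free1 what)],
    4 => [qw(time date free1 free2)],
);

sub parse_message {
    my ($line) = @_;

    # Split on "/" and trim surrounding whitespace from every field.
    my @parts = split m{/}, $line, -1;
    s/^\s+//, s/\s+$// for @parts;

    my $names = $layout{ scalar @parts }
        or return { errors => [ "expected 4, 5 or 6 fields, got " . @parts ] };

    my ( %field, @errors );
    @field{@$names} = @parts;

    # Only fields with a known format are validated; free-form ones pass through.
    for my $name ( grep { exists $check{$_} } @$names ) {
        push @errors, "bad $name: '$field{$name}'"
            unless $field{$name} =~ $check{$name};
    }
    return { fields => \%field, errors => \@errors };
}
```

The returned `errors` list is what would be fed back to the person who typed the message. A known weakness of this sketch is that a "/" inside free-form text changes the field count, so it only holds up if the separator never appears in the text itself.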


Replies are listed 'Best First'.
Re: parsing question by Your Mother (Archbishop) on Dec 17, 2009 at 03:08 UTC
FWIW, this strikes me as a pretty good match for Parse::RecDescent. I'd suggest ordering the top-level named grammar parts from ideal-as-expected to worst case. It probably wouldn't take much work to catch 90% of the input (depending on how loose it really is). If you provide sample input and show what you've tried and what you're trying to do, you will likely get a lot of help here with it. Also, don't do anything illegal. Make sure you have the right to parse/save the chat data.
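A rough sketch of what that ordering could look like, using the three variants from the question. It is not a tested grammar for the real data: the time, date and "when" patterns are guesses, and the rule names (`message`, `time_field`, `date_field`, `when_field`, `text`, `eoi`) are made up for illustration. Parse::RecDescent tries the alternatives of a rule in order, so the most complete form comes first and a catch-all comes last.

```perl
use strict;
use warnings;
use Parse::RecDescent;    # CPAN module, not part of core Perl

my $grammar = q{
    # Alternatives run from the ideal, most complete form to the worst case.
    message :
        time_field '/' date_field '/' text '/' when_field '/' text '/' text eoi
            { $return = { type => 'full', time => $item[1], date => $item[3],
                          address => $item[5], when => $item[7],
                          free1 => $item[9], free2 => $item[11] } }
      | time_field '/' date_field '/' when_field '/' text '/' text eoi
            { $return = { type => 'when', time => $item[1], date => $item[3],
                          when => $item[5], free1 => $item[7], what => $item[9] } }
      | time_field '/' date_field '/' text '/' text eoi
            { $return = { type => 'short', time => $item[1], date => $item[3],
                          free1 => $item[5], free2 => $item[7] } }
      | /.+/s
            { $return = { type => 'unparsed', raw => $item[1] } }

    # Assumed formats; adjust to whatever the chat rooms really use.
    time_field : /\d{1,2}:\d{2}/
    date_field : /\d{4}-\d{2}-\d{2}/
    when_field : /(?:now|today|tonight|tomorrow)\b/i

    # Free-form text runs up to the next separator; trailing blanks are trimmed.
    text       : /[^\/]+/ { my $t = $item[1]; $t =~ s/\s+$//; $return = $t; }
    eoi        : /\z/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "grammar did not compile";

# Hypothetical sample line in the 6-field layout.
my $result = $parser->message(
    '09:15 / 2009-12-16 / 12 Main St / tomorrow / bring laptop / room B');
print "matched the '$result->{type}' form\n" if $result;
```

The last alternative never fails on non-empty input, so every message yields a hash ref, and the `unparsed` ones can be logged or bounced back to the sender for correction.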
Re: parsing question by desemondo (Hermit) on Dec 17, 2009 at 02:15 UTC
I'm no expert and I don't mean to discourage, but I don't think there is a robust way to achieve what you want to do unless your input data is in a key=value format or a positional format (i.e. bytes 1-8 are header, 9-16 are name, 17-24 are address, etc). Otherwise you're kinda stuck with building code that makes its best guess at what a particular string actually is... and there'll always be things it misses no matter how "smart" the code is... Update: Perhaps rather than trying to find a Perl solution to this problem, it would be better to change your chatroom system to implement one of the above suggestions, or to make all the fields mandatory so that your Perl app will have an expected order of elements to process...
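To illustrate why a key=value format removes the guesswork, here is a minimal sketch. The `;` pair separator, the key names and the function name `parse_kv` are assumptions made up for the example, not anything the chat system is known to produce.

```perl
use strict;
use warnings;

# Parse a line such as "time=09:15; date=2009-12-16; note=bring laptop".
# Optional fields are simply absent from the result, so there is no need
# to guess which ones were included.
sub parse_kv {
    my ($line) = @_;
    my %field;
    for my $pair ( split /\s*;\s*/, $line ) {
        # Skip anything that is not of the form key=value.
        my ( $key, $value ) = $pair =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/
            or next;
        $field{ lc $key } = $value;
    }
    return \%field;
}

my $msg = parse_kv('time=09:15; date=2009-12-16; note=bring laptop');
print "no address given\n" unless exists $msg->{address};
```

With named fields, `exists` answers the "which optional fields were included" question directly, and the order in which people type the fields stops mattering.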