Probably very simple (for those in the know)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I know that this has to be so simple that I am almost embarrassed to ask...but I have been unable to find the answer to this anywhere (maybe I just don't know what I am looking for).

My company receives orders via e-mail from our customer's system. Each message is one order. I need to pull the files apart and drop the data into a database. The part I am having trouble with is the pulling apart and defining fields (I can do the database work, I just can't get the data into nice chunks). These messages have a very standard format (this is just a snippet to show some of the different format issues I have...if you can show me how to deal with these I will be able to mix and match to do the whole message):

field1 name:
data1
data1
data1
field2 name:
data2
field3 name: data3
**a full line of asterisks**

unimportant text

***field4 name: data4
***field5 name: data5

**a full line of asterisks**
unimportant text
field6 name: data6

field7 name: data7
field8 name: data8

location 1
field9 name: data9
field10 name: data10

location 2
field11 name: data11
field12 name: data12
etc...

In case it does not display correctly here online, the field names are flush left and the data is aligned some distance out. The field names could be more than one word, but they all (the field names) end with a colon. Each field is on a separate line. But not every line is a field (there are blank lines and lines of * used to make the message more easily human readable). Not all fields are filled. Some of the field names are duplicated (in the above example field 9 would match field 11 (example state:) and field 10 would match field 12 (example city:)). The order is not always the same, and the specific fields change (some messages will have some fields and others will have other fields). The amount of data (number of lines) in field1 is variable (it is a list of addresses which I don't care about anyway). As you can see, field2's data is on the line below the field name (other than field 1, it is the only one like this).

I would like to be able to reference the data by its field name (if the order of the fields changes, I can still refer to the same name. Also, when new fields are added I am set)...

That's it....seems like it should be relatively simple. Anything to the left of a colon is a field name and, on that same line but some distance to the right, is the data. The only exception is field #2 which is on the line below. (the good news here is that the name of this field is always the same. So if I see the word Subject: I can pull data from the line below it).

Questions:
How do I do this?
How can I handle the duplicated field names?

The Perl Nubie

Replies are listed 'Best First'.
Re: Probably very simple (for those in the know) by atcroft (Abbot) on Feb 03, 2002 at 13:39 UTC
I don't know if this will work for what you want (and it may not be the best solution, so I pray other, wiser monks will comment, so we both may learn), but this is how I have handled it in the past (with __DATA__ for testing, and fields named as they are so I could quickly verify the results visually). Good luck in finding a solution.

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $multiline_seperator = "\n";
    my ($fieldname, $fieldvalue, $line, %pkg);

    while ($line = <DATA>) {
        # Lines following empty/space/asterisk-filled lines
        # assumed to be comments
        $fieldname = undef
            if (($line =~ m/^\s+$/) or ($line =~ m/^\*+$/));
        chomp($line);

        # Assumes colons do not appear except in lines with field names
        if ($line =~ m/:/) {
            ($fieldname, $fieldvalue) = split(/:/, $line, 2);
            # Remove trailing spaces from field name,
            # leading spaces from field value
            $fieldname  =~ s/\s+$//g;
            $fieldvalue =~ s/^\s+//g;
        } else {
            $fieldvalue = $line;
        }

        # Skip remaining steps if line was a comment
        next unless (defined($fieldname));

        if (exists($pkg{$fieldname})) {
            # Repeated name or continuation line: append to the stored value
            $pkg{$fieldname} .= $multiline_seperator . $fieldvalue;
        } else {
            $pkg{$fieldname} = $fieldvalue
                if ((length($fieldvalue)) and (defined($fieldname)));
        }
    }

    # For testing only
    foreach my $k (sort(keys(%pkg))) {
        print($k, "\t:\t", $pkg{$k}, "\n");
    }

    __DATA__
    a-sendee:
    data1
    data1
    data1
    b-sender:
    data2
    c-date: data3
    ************************************

    Copyright blah blah blah

    d-postage: data4
    e-deliverydate: data5

    ***********************************
    unimportant text
    f-name: data6

    g-paycode: data7
    g-paycode: data8

    location 1
    h-state: data9
    i-zip: data10

    location 2
    h-state: data11
    i-zip: data12

Update: I must admit that I read the question and answered before carefully reading all the responses, especially the response by jonjacobmoon, which basically spelled out what I coded.
Re: Probably very simple (for those in the know) by jonjacobmoon (Pilgrim) on Feb 03, 2002 at 12:16 UTC
Could you provide some sample code so we can see what you have tried, and how it did not work to your satisfaction? That being said, my first thought is that if you read the file into a hash, use the field names as the keys, and check for the existence of a key before filling its value, then you can easily avoid duplicates.

I admit it, I am Paco.
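
One possible reading of the hash suggestion above, as a minimal sketch rather than a definitive answer: each field name becomes a hash key, and the key is checked with exists before it is filled. The sketch assumes that a repeated name should keep every value in a list instead of dropping the repeat; that choice, the sample lines, and the name %order are illustrative assumptions, not something taken from the original messages.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the field names are made up for illustration.
my @lines = (
    'state:      Ohio',
    'city:       Dayton',
    'state:      Texas',
    'city:       Austin',
);

my %order;    # field name => array ref of every value seen for that name

for my $line (@lines) {
    # Anything before the first colon is the field name, the rest is data.
    next unless $line =~ /^([^:]+):\s*(.*?)\s*$/;
    my ($name, $value) = ($1, $2);

    # Check for the key before filling it, so a repeated name
    # extends the list instead of overwriting the earlier value.
    $order{$name} = [] unless exists $order{$name};
    push @{ $order{$name} }, $value;
}

# Show every value collected under each name.
for my $name (sort keys %order) {
    print "$name: ", join(' | ', @{ $order{$name} }), "\n";
}
```

With this layout, $order{state}[0] is the first state seen and $order{state}[1] the second, so the position in the list stands in for the order of appearance in the message.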
Re: Probably very simple (for those in the know) by Anonymous Monk on Feb 03, 2002 at 10:08 UTC
As soon as I hit submit I noticed that the duplicated fields part is probably not clear. The names are what is duplicated (i.e., the name for field 9 would match the name of field 11; the data probably will not match). Just wanted to clarify that.

The Perl Nubie
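
Since the repeated names in the sample sit under "location 1" and "location 2" headings, one way to keep them apart is to file each field under the most recent heading. The following is only a sketch under that assumption: the heading pattern, the field names, and the 'header' label for fields seen before any location line are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample; the "location N" headings and field names are
# assumed to look like the snippet in the original question.
my @lines = (
    'name:       Acme',
    '',
    'location 1',
    'state:      Ohio',
    'city:       Dayton',
    '',
    'location 2',
    'state:      Texas',
    'city:       Austin',
);

my %order;                 # section label => { field name => value }
my $section = 'header';    # made-up label for fields before any location

for my $line (@lines) {
    # A "location N" line starts a new group for the fields that follow.
    if ($line =~ /^(location\s+\d+)\s*$/i) {
        $section = lc $1;
        next;
    }

    # Skip blank lines, separators, and anything else without a colon.
    next unless $line =~ /^([^:]+):\s*(.*?)\s*$/;
    $order{$section}{$1} = $2;
}

print "$order{'location 2'}{state}\n";    # prints "Texas"
```

The same field name can then be looked up per section, for example $order{'location 1'}{city} and $order{'location 2'}{city}, without one overwriting the other.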