in reply to Extracting data from a flat file and inserting it into a databas
The term "flat file" refers to a plain text file that represents a simple two-dimensional table: each line of the file is one row of the table, and contains one or more fields (columns of the table); if there are two or more fields per line (row), these are separated by some sort of clear and unambiguous character or string (tab, comma, colon or whatever), provided that this cannot occur as part of a field value. This property makes it "machine readable".
A CSV file is a particular type of flat file, where the fields on a line are separated by commas, but a field may contain a comma character as part of its content if the field is surrounded by quotation marks. (And that means that if the field content includes quotation characters, these must be escaped, usually by prefixing each one with a backslash.) The use of quotes and escapes assures that the data are still machine readable.
What you have shown in your post is not machine readable -- it's a format intended for humans, who have the ability to "see" and understand structural boundaries, by "simply" understanding the meanings of labels, values and patterns of whitespace.
If you do not have access to some other form of this data that would be easier for a machine to read, you're going to have a tough job handling the data that you've shown. But it is doable.
If the spacing of columns is consistent, you can categorize the lines based on their type of content, as flagged by some distinctive word token common to all lines of a given type. For example, the particular lines that match the following regular expressions each contain a specific set of fields, and have a specific pattern of spacing betwen the fields (and we can only hope that the spacing is consistent across the 7000 records):
You'll probably end up using "substr()" or "unpack()", with different numeric parameters for each type of line, in order to pull out the field names and values for the line -- e.g (guessing at the column widths):/^Request Number:/ /^Status:/ /^Affected User:/ ... /\d+\s+minutes\s+\d+\s+seconds/ # (update: had forgotten 3rd "\s+")
When you get to the "fields" that involve multi-line content, you'll need to be more clever, possibly keeping a "state" variable to guide the parsing: as you see the start of a (potentially) multi-line field, set $state to the name of that field, and append subsequent lines to that field value until you see evidence of a new field starting.my %record; if ( /^Request Number:/ ) { my @flds = (); push @flds, substring( $_, 0, 40 ); push @flds, substring( $_, 40, 26 ); push @flds, substring( $_, 66 ); # rest of line for my $fldname ('Request Number', 'Parent', 'Priority') { (my $val = shift @flds) =~ s/^$fldname\s*:\s*//; $val =~ s/\s*$//; # update: remove leading/trailing whi +tespace... $record{$fldname} = $val; # which means this could be "" -- not + a problem } } # and likewise for each other type of line
Good luck -- especially with getting a more readable version of the data, so you won't have to do all this hard labor.
Update: Since you mentioned you are new at this, I should add one more point about the strategy in the code snippet above. The idea there is to fill a hash array (called "%record"), with a set of field values. Once you get to a point in your file where a new record begins, you need to save the current contents of %record somewhere -- i.e. your Access or other database -- then proceed with parsing the next record (reusing the same space/elements in the hash).
One thing that could simplify this for you is if you can determine a specific string that always occurs at record boundaries and nowhere else. Set "$/" (INPUT_RECORD_SEPARATOR) equal to this string, and you can structure your code as follows:
$/ = "end-of-record-string"; # whatever that may be while (<>) { # this now reads one whole record into $_ my @lines = split( /\n/ ); my %record while ( @lines ) { $_ = shift @lines; # proceed with parsing record fields as above... if ( /^Request Number:/ ) { ... } elsif ... } store_to_database( \%record ); }
|
|---|