scribefaith has asked for the wisdom of the Perl Monks concerning the following question:

I have a flat file exported from a database in what was supposed to be CSV, but I am not sure if it really is. I have posted an example of a help desk ticket that this file contains. There are about 7000 tickets in the one file, all in the same format, as one long list (a flat file).
I want a Perl script or something (I am just learning Perl, knowing that it will be my number one administration tool someday once I have learned it) that will look through the file and do this: take the field names listed below:

Request Number
Affected User
Reported By
Assignee
Request Area
Open Date
Description
History (this one needs to be some kind of format that allows the capture of all the data under it, like a Memo field does in Access).

And then either build a database or insert them into a table in an Access database, insert the data following each field name as the value of that field, and store each ticket as one record keyed by Request Number.

I imagine there are many ways to do this, but I cannot do it with something like Access's import wizard: it just does not have the ability to delimit on the : character, which I have already tried.

From what I hear about Perl, it can do this for me.

Can someone show me a script that I can use, or point me to the resource that most closely matches this type of example? I am being diligent about trying to learn this myself, but I have been reading books for the past two days and I am still not sure where to start.
Thanks for your help.
Here is the format of the file. If you cut and paste it into Notepad or some other text editor it might come out more exactly.
As Of: 01/22/2004                              Requests

Request Number: 262944        Parent :                  Priority:4-Med
Status: Closed                Active :No
Affected User: Smith, Jane    Impact :None
Reported By: Doe, John        Template:                 Urgency :
Assignee: Doe, Jane           SLA Violation:0           Severity:
Group Asgd. To: IT-Group      Service :Empty            Change:
Request Area: Acc-Grp         Root Cause:
Affected Asset:               ChargeBack:
Open Date: 12/31/2002         Close Date:1/03/2003      Resolve Date:
Time Spent:0 days 0 hours 1 minutes
Last Modified: 3/27/2003      End User ID:
Phone Number: 3363            Alt Phone :               Fax Number:
Timezone :
Summary: EMAIL I need authorization to XXX. That's a letter o not...
Description: EMAIL I need authorization to XXX. That's a letter o not a number zero.

    (email) Sorry, I need this in TST and/or DEV

History
Analyst            Date          Time Spent                Type
--------------------------------------------------------------------------------
Doe, John          12/31/2002    0 minutes 47 seconds      Initial
  Desc:create a new request/change order

Doe, John          12/31/2002    0 minutes 5 seconds       Log Comment
  Desc:(email) OK

Doe, John          12/31/2002    0 minutes 7 seconds       Close Request/Change Order
  Desc:(email) Jane, the change should take effect the next time you log on.

        You may close this ticket.

Doe, John          1/02/2003     0 minutes 4 seconds       Log Comment
  Desc:(email) It is not working

Doe, John          1/02/2003     0 minutes 0 seconds       Reopen
  Desc:Status changed from 'Closed' to 'Open'

Re: Extracting data from a flat file and inserting it into a database
by graff (Chancellor) on Jan 24, 2004 at 18:56 UTC
    If that sample of data really represents the contents of the file you need to use as input, then I'd have to say:
    • it is definitely not a CSV file
    • it is not anything like a "flat" file
    • it is not going to be easy to split this file into discrete records, or to split each record into discrete columns of data.

    The term "flat file" refers to a plain text file that represents a simple two-dimensional table: each line of the file is one row of the table, and contains one or more fields (columns of the table); if there are two or more fields per line (row), these are separated by some sort of clear and unambiguous character or string (tab, comma, colon or whatever), provided that this cannot occur as part of a field value. This property makes it "machine readable".

    A CSV file is a particular type of flat file, where the fields on a line are separated by commas, but a field may contain a comma character as part of its content if the field is surrounded by quotation marks. (And that means that if the field content includes quotation characters, these must be escaped, usually by doubling them, or in some dialects by prefixing each one with a backslash.) The use of quotes and escapes assures that the data are still machine readable.
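
    For instance, a minimal sketch of reading one such quoted line with the Text::CSV module from CPAN (the sample line here is invented, just to show the quoting behaviour):

    use strict;
    use Text::CSV;

    # A made-up line: the second field contains a comma, so it is quoted.
    my $line = '262944,"Smith, Jane",Closed';

    my $csv = Text::CSV->new();
    if ( $csv->parse($line) ) {
        my @fields = $csv->fields();    # ('262944', 'Smith, Jane', 'Closed')
        print join( ' | ', @fields ), "\n";
    }
    else {
        warn "Could not parse line\n";
    }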

    What you have shown in your post is not machine readable -- it's a format intended for humans, who have the ability to "see" and understand structural boundaries, by "simply" understanding the meanings of labels, values and patterns of whitespace.

    If you do not have access to some other form of this data that would be easier for a machine to read, you're going to have a tough job handling the data that you've shown. But it is doable.

    If the spacing of columns is consistent, you can categorize the lines based on their type of content, as flagged by some distinctive word token common to all lines of a given type. For example, the particular lines that match the following regular expressions each contain a specific set of fields, and have a specific pattern of spacing between the fields (and we can only hope that the spacing is consistent across the 7000 records):

    /^Request Number:/
    /^Status:/
    /^Affected User:/
    ...
    /\d+\s+minutes\s+\d+\s+seconds/    # (update: had forgotten 3rd "\s+")
    You'll probably end up using "substr()" or "unpack()", with different numeric parameters for each type of line, in order to pull out the field names and values for the line -- e.g. (guessing at the column widths):
    my %record;
    if ( /^Request Number:/ ) {
        my @flds = ();
        push @flds, substr( $_, 0, 40 );
        push @flds, substr( $_, 40, 26 );
        push @flds, substr( $_, 66 );        # rest of line
        for my $fldname ( 'Request Number', 'Parent', 'Priority' ) {
            (my $val = shift @flds) =~ s/^$fldname\s*:\s*//;
            $val =~ s/\s*$//;          # update: remove leading/trailing whitespace...
            $record{$fldname} = $val;  # which means this could be "" -- not a problem
        }
    }
    # and likewise for each other type of line
    When you get to the "fields" that involve multi-line content, you'll need to be more clever, possibly keeping a "state" variable to guide the parsing: as you see the start of a (potentially) multi-line field, set $state to the name of that field, and append subsequent lines to that field value until you see evidence of a new field starting.
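
    For example, a rough sketch of that state-variable idea, assuming you are looking at one ticket at a time (the trigger patterns below are guesses based on the sample posted above, so you will need to adjust them against the real file):

    my %record;
    my $state = '';                         # name of the multi-line field being collected

    while (<>) {
        chomp;
        if ( $state eq 'History' ) {        # everything after "History" belongs to it
            $record{History} .= "$_\n";
        }
        elsif ( /^History\b/ ) {            # start collecting the History block
            $state = 'History';
            $record{History} = '';
        }
        elsif ( /^Description:\s*(.*)/ ) {  # Description may continue on later lines
            $state = 'Description';
            $record{Description} = $1;
        }
        elsif ( /^[A-Z][\w. ]+:/ ) {        # looks like a line of single-line fields
            $state = '';
            # ... parse it with substr()/unpack() as above ...
        }
        elsif ( $state ) {                  # continuation line of the current field
            $record{$state} .= "\n$_";
        }
    }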

    Good luck -- especially with getting a more readable version of the data, so you won't have to do all this hard labor.

    Update: Since you mentioned you are new at this, I should add one more point about the strategy in the code snippet above. The idea there is to fill a hash (called "%record") with a set of field values. Once you get to a point in your file where a new record begins, you need to save the current contents of %record somewhere -- i.e. your Access or other database -- then proceed with parsing the next record (reusing the same space/elements in the hash).

    One thing that could simplify this for you is if you can determine a specific string that always occurs at record boundaries and nowhere else. Set "$/" (INPUT_RECORD_SEPARATOR) equal to this string, and you can structure your code as follows:

    $/ = "end-of-record-string";    # whatever that may be

    while (<>) {                    # this now reads one whole record into $_
        my @lines = split( /\n/ );
        my %record;
        while ( @lines ) {
            $_ = shift @lines;
            # proceed with parsing record fields as above...
            if ( /^Request Number:/ ) {
                ...
            }
            elsif ...
        }
        store_to_database( \%record );
    }
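
    And a rough sketch of what that store_to_database() routine might look like using DBI -- the DSN, table name and column names below are invented placeholders, and for Access you would typically go through an ODBC DSN (DBD::ODBC) pointing at your .mdb file:

    use DBI;

    # Connect once, before the record loop.  Substitute the ODBC DSN you
    # set up for your Access database.
    my $dbh = DBI->connect( 'dbi:ODBC:HelpDeskTickets', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    my $sth = $dbh->prepare(
        'INSERT INTO Tickets (RequestNumber, AffectedUser, ReportedBy, Assignee,
                              RequestArea, OpenDate, Description, History)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?)'
    );

    sub store_to_database {
        my ($rec) = @_;     # the hash reference passed in above
        $sth->execute( @{$rec}{ 'Request Number', 'Affected User', 'Reported By',
                                'Assignee', 'Request Area', 'Open Date',
                                'Description', 'History' } );
    }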
Re: Extracting data from a flat file and inserting it into a database
by CountZero (Bishop) on Jan 24, 2004 at 22:28 UTC
    It is possible to do what you want with Perl, but it will not be easy (believe me, I had to do it myself and it took me a lot of time and sweat and cursing and ...), so perhaps one should consider a different approach.

    I read "I have a flat file exported from a database" and I wonder whether you can directly access this database.

    As you know, Perl has the wonderful DBI/DBD family of modules, which allows you to work with most types of databases (including MS Access).

    So if you need to transfer data from one database to another, DBI/DBD is the obvious way to go: write some SQL to select the data you need, execute this SQL, save the results in an array or hash variable, then insert the data into the other database.
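
    A bare-bones sketch of that approach with DBI (both DSNs, the table and the column names here are only placeholders for whatever your real setup uses):

    use strict;
    use DBI;

    # Source: the help desk system's database; target: Access via an ODBC DSN.
    my $src = DBI->connect( 'dbi:ODBC:HelpDeskSystem', 'user', 'password', { RaiseError => 1 } );
    my $dst = DBI->connect( 'dbi:ODBC:TicketArchive',  '',     '',         { RaiseError => 1 } );

    my $select = $src->prepare(
        'SELECT request_number, affected_user, reported_by, assignee,
                request_area, open_date, description, history
         FROM   requests'
    );
    my $insert = $dst->prepare( 'INSERT INTO Tickets VALUES (?, ?, ?, ?, ?, ?, ?, ?)' );

    $select->execute();
    while ( my @row = $select->fetchrow_array() ) {
        $insert->execute(@row);
    }

    $src->disconnect;
    $dst->disconnect;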

    If you cannot speak to the original database, perhaps you can change the export format. A real CSV format does not seem possible (due to the repeating fields at the end), but an XML format would be ideal. XML is very well suited for data interchange.
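
    If you can get such an export, reading it back in Perl is straightforward; for example, with XML::Simple (the element names below are invented, since they would depend on how the export is structured):

    use strict;
    use XML::Simple;

    # Assumes an export shaped roughly like <tickets><ticket>...</ticket>...</tickets>
    my $data = XMLin( 'tickets.xml', ForceArray => ['ticket'] );

    for my $ticket ( @{ $data->{ticket} } ) {
        print "$ticket->{RequestNumber}: $ticket->{AffectedUser}\n";
    }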

    One last comment: Access is perhaps not the best choice when you have a lot of data to transfer. After a few thousand records the performance really goes down (if not worse, and it crashes on you). Before that happens you should look into using a real database server (MySQL, PostgreSQL, to name but a few).

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Extracting data from a flat file and inserting it into a database
by lilphil (Scribe) on Jan 24, 2004 at 18:59 UTC
    Here is a start for you:
    #!/usr/bin/perl -w
    use strict;

    my $file = 'scribefaith';

    open(FILE, "<$file") or die "Cannot open $file";
    while (<FILE>) {
        print $1 if (m/^Request Number:\s+([0-9]+)/);
        # insert various regular expressions here to match tags
    }
    close(FILE);
    This extracts the Request Number. I get the feeling the file should have more tabs than spaces, so I would suggest m/^Affected User:\s+(.*)\t+/ as the next regex. I'm sure more talented monks will come up with a better set of regexes, however... I'll leave you to work out the rest :)
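
    Building on that, a rough sketch that collects several of the header fields into a hash -- the capture patterns are only guesses from the sample posted above (they assume at least two spaces separate the columns), so test them against the real data:

    my %ticket;
    my %patterns = (
        'Request Number' => qr/^Request Number:\s*(\d+)/,
        'Affected User'  => qr/^Affected User:\s*(.+?)\s{2,}/,
        'Reported By'    => qr/^Reported By:\s*(.+?)\s{2,}/,
        'Assignee'       => qr/^Assignee:\s*(.+?)\s{2,}/,
        'Request Area'   => qr/^Request Area:\s*(.+?)\s{2,}/,
        'Open Date'      => qr/^Open Date:\s*(\S+)/,
    );

    open(FILE, "<$file") or die "Cannot open $file";   # re-using $file from the snippet above
    while (<FILE>) {
        for my $field ( keys %patterns ) {
            $ticket{$field} = $1 if $_ =~ $patterns{$field};
        }
    }
    close(FILE);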

    I know there's a CPAN module (Text::CSV) for handling CSV files, but this isn't a CSV (or at least not in my world...); perhaps it could be adapted to deal with this file format.

    Update: fix typo
Re: Extracting data from a flat file and inserting it into a database
by BrowserUk (Patriarch) on Jan 25, 2004 at 03:48 UTC

    Perl can definitely do what you want. As could most other languages, though it will be easier in Perl than many. Even so, the complexity, or rather, the non-uniformity of the record structure is such that it is definitely non-trivial.

    If you're not in a hurry for your results, then it would be an ideal project to get you up to speed on Perl; you will learn a great deal getting this right.

    If, however, you are in a hurry, you should seriously consider hiring the services of a programmer experienced in Perl to do this for you. You can then learn from seeing what he produces, and you will be better set for tackling the next task yourself.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!