mpaler has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I have what seems to me a pretty large request -- so I'm offering a $50 donation to Perl Monks for the solution. It's not a lot -- but it's the best I can do right now and I'm super short on time :(

I have an old Perl wiki (chiq chaq) with 500+ records that I'm trying to import into a database. I'd like to convert the chiq chaq data into a CSV file so I can then manipulate it, move it around, etc. Chiq chaq saves its data into separate text files in a directory, one for each record. The file names look like this:
wing.chiq
wing.chiq.20040417130658
wing.chiq.20040417130727
wing.chiq.20040418184654
woodies.chiq
words_with_question_marks.chiq
The contents of a record look like this:
{*JohnDoe* (Wednesday, September 08, 2004, 11:43): A wing is a chemical additive that hastens [[cure]] or chemical reaction. For example, cobalt is a wing for [[MEKP]]-catalyzed polyester_resin. See, promoted_resin. (_Category_A_) }
I'd like to convert this into a CSV row that looks like this:
wing, JohnDoe, 1094643780, "An accelerator is a chemical additive that hastens [[cure]] or chemical reaction. For example, cobalt is an accelerator for [[MEKP]]-catalyzed polyester_resin. See, promoted_resin. (_Category_A_)"
I'm leaving in the [word] strings because I can remove those once the text is in CSV. I'm also leaving in the "(_Category_A_)" because it's random... Please note, I forget how to handle large chunks of text that can contain commas when converting to CSV... do you wrap it in ""? What if there are quotes in the text block? uugh. I'd like the script to read in one file at a time and save it to a CSV file. Any help will be greatly appreciated...

Replies are listed 'Best First'.
Re: complex text file to csv
by CountZero (Bishop) on Jun 05, 2009 at 06:28 UTC
    Do yourself a favour and write the CSV through Text::CSV (it will use Text::CSV_XS for speed if available on your system). It will take care of all quoting issues.
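To make the quoting question concrete, here is a minimal core-Perl sketch of the usual CSV convention: wrap the field in double quotes and double any double quote inside it. The helper names `csv_quote` and `csv_line` are hypothetical and exist only for illustration; in a real script Text::CSV does this for you (for instance through its `combine`/`string` or `print` methods), and it also copes with edge cases this sketch ignores.

```perl
use strict;
use warnings;

# Quote one field the way CSV readers generally expect:
# surround it with double quotes and double any embedded double quote.
# (Hypothetical helper, not part of Text::CSV.)
sub csv_quote {
    my ($field) = @_;
    $field =~ s/"/""/g;
    return qq{"$field"};
}

# Build one CSV line. Quoting every field is always safe,
# even when a field has no commas or quotes in it.
sub csv_line {
    return join(',', map { csv_quote($_) } @_) . "\n";
}

print csv_line('wing', 'JohnDoe', 1094643780, 'He said "hi", then left.');
# prints: "wing","JohnDoe","1094643780","He said ""hi"", then left."
```

Commas and newlines inside a quoted field are then just ordinary characters, which is why the text block can go in unchanged.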

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: complex text file to csv
by GrandFather (Saint) on Jun 05, 2009 at 09:03 UTC

    How are records extracted from files? One record per file, or something else?

    Where does the first field ("wing" in the example) come from?

    Where does the third field (1094643780 in the example) come from?

    What heuristic do you suggest for going from "A wing is a chemical ..." to "An accelerator is a chemical ..."?


    True laziness is hard work

      Where does the first field ("wing" in the example) come from?

      File name?

      Where does the third field (1094643780 in the example) come from?

      Epoch representation of the datetime
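As a sketch of that conversion: the two timestamps in this thread (1094643780 and 1081593660) both come out exactly when the date string is read as UTC, so the example below uses `timegm` from the core Time::Local module. The function name `chiq_date_to_epoch` and the month table are hypothetical, and treating the wiki's dates as UTC is an assumption based only on those two examples; `timelocal` would be the call to use if the dates are really local time.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month name to the 0-based month number that Time::Local expects.
my %MONTH = (
    January => 0, February => 1, March     => 2, April    => 3,
    May     => 4, June     => 5, July      => 6, August   => 7,
    September => 8, October => 9, November => 10, December => 11,
);

# Turn "Wednesday, September 08, 2004, 11:43" into epoch seconds.
# The weekday is ignored. Returns nothing if the string does not parse.
sub chiq_date_to_epoch {
    my ($text) = @_;
    my ($month, $day, $year, $hour, $min) =
        $text =~ /(\w+)\s+(\d{1,2}),\s+(\d{4}),\s+(\d{1,2}):(\d{2})/
        or return;
    return unless exists $MONTH{$month};
    # Assumption: the date is UTC, which matches the examples in this thread.
    return timegm(0, $min, $hour, $day, $MONTH{$month}, $year);
}

print chiq_date_to_epoch('Wednesday, September 08, 2004, 11:43'), "\n";
# prints: 1094643780
```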

        The main point is that good information is more valuable than money in finding a good solution to most problems in a short period of time.

        Sure, we can infer from the context answers that are likely right to most of those questions, but the OP saves a few Q&A iterations and probably coding cycles by providing a good problem specification up front (to say nothing of money).


        True laziness is hard work
      uugh. Sorry for the missing/inconsistent info.

      Let me try again...

      I have a directory of 1000 text files with names as follows...

      ...
      barnyard.chiq
      beach_bum.chiq
      belly.chiq
      belly.chiq.20040413103234
      belly.chiq.20040417092739
      belly.chiq.20040417093935 # <-- Last created file
      berry.chiq
      bert.chiq
      ...
      If you opened any of the files (for example belly above), it would look like this:
      {*JohnDoe* (Saturday, April 10, 2004, 10:41): Belly is a term used to describe a rounded surfboard bottom, when viewed from the front or rear (not from the side). }
      What I'm looking to do is convert & import the last created file for each record into one big .csv file. Each record needs to map to the csv as follows (still using belly):
      Belly, JohnDoe, 1081593660, "Belly is a term used to describe a rounded surfboard bottom, when viewed from the front or rear (not from the side)."
      Notes: The first column in the CSV is the file name minus the extension. Ideally, file names like "beach_bum" would be converted to "Beach Bum". The third column is the date converted to a Unix timestamp. The fourth column is the content between the : and the last } -- this text block can have any type of character in it, including quotes, commas, etc.
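The mapping in these notes could be sketched roughly as follows, using core Perl only. All three function names are hypothetical. The sketch assumes that version files always end in a 14-digit timestamp, that the highest timestamp marks the last created file (as in the belly listing), and that each file holds exactly one `{*Author* (date): text }` record; none of that is confirmed beyond the samples shown here.

```perl
use strict;
use warnings;

# "beach_bum.chiq" or "belly.chiq.20040417093935" -> "Beach Bum" / "Belly"
sub title_from_filename {
    my ($name) = @_;
    $name =~ s/\.chiq(?:\.\d{14})?\z//;    # drop extension and version stamp
    return join ' ', map { ucfirst $_ } split /_/, $name;
}

# Keep one file per record: the one with the highest timestamp suffix,
# or the bare .chiq file when no versioned copies exist (assumption).
sub latest_versions {
    my (@files) = @_;
    my %best;
    for my $file (@files) {
        my ($base, $stamp) = $file =~ /\A(.+)\.chiq(?:\.(\d{14}))?\z/ or next;
        $stamp //= '';
        $best{$base} = [$stamp, $file]
            if !$best{$base} || $stamp gt $best{$base}[0];
    }
    return map { $best{$_}[1] } sort keys %best;
}

# Split one record into (author, date string, body).
# The body runs from after the first "):" to the last "}" in the file,
# so quotes, commas, newlines and braces inside it are kept as-is.
sub parse_record {
    my ($text) = @_;
    my ($author, $date, $body) =
        $text =~ /\A\s*\{\*(.+?)\*\s*\(([^)]*)\):\s*(.*?)\s*\}\s*\z/s
        or return;
    return ($author, $date, $body);
}

print title_from_filename('beach_bum.chiq'), "\n";    # prints: Beach Bum
```

Each selected file would then be slurped, passed through `parse_record`, and the resulting fields written out as one CSV row, with the date string converted to a timestamp along the way.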

      Again, sorry for the lame first post -- and my $50 contrib to Perl Monks stands...

        Will the content always start on the second line of the file and end at the first EOL character, or is it really one big stream with possibly multiple EOL characters inside the content?

        CountZero
