in reply to complex text file to csv

How are records extracted from files? One record per file, or something else?

Where does the first field ("wing" in the example) come from?

Where does the third field (1094643780 in the example) come from?

What heuristic do you suggest for going from "A wing is a chemical ..." to "An accelerator is a chemical ..."?


True laziness is hard work

Re^2: complex text file to csv
by ikegami (Patriarch) on Jun 05, 2009 at 09:10 UTC

    Where does the first field ("wing" in the example) come from?

    File name?

    Where does the third field (1094643780 in the example) come from?

    Epoch representation of the datetime

      The main point is that good information is more valuable than money in finding a good solution to most problems in a short period of time.

      Sure, we can infer from the context answers that are likely right to most of those questions, but the OP saves a few Q&A iterations and probably coding cycles by providing a good problem specification up front (to say nothing of money).


      True laziness is hard work
Re^2: complex text file to csv
by mpaler (Initiate) on Jun 05, 2009 at 17:39 UTC
    Ugh. Sorry for the missing/inconsistent info.

    Let me try again...

    I have a directory of 1000 text files with names as follows...

    ...
    barnyard.chiq
    beach_bum.chiq
    belly.chiq
    belly.chiq.20040413103234
    belly.chiq.20040417092739
    belly.chiq.20040417093935   # <-- Last created file
    berry.chiq
    bert.chiq
    ...
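Picking the last created file for each record can be sketched as below. This is a minimal Python sketch, not a definitive implementation. It assumes every file is named `<name>.chiq`, optionally followed by a 14-digit `YYYYMMDDhhmmss` suffix, and that the largest suffix marks the last created file (as the listing indicates for belly). The names `NAME_RE` and `latest_files` are hypothetical.

```python
import re

# Assumed naming scheme: "<name>.chiq", optionally followed by ".<YYYYMMDDhhmmss>".
NAME_RE = re.compile(r"^(?P<base>.+)\.chiq(?:\.(?P<stamp>\d{14}))?$")


def latest_files(filenames):
    """Return {record name: file name with the largest timestamp suffix}."""
    newest = {}
    for name in filenames:
        m = NAME_RE.match(name)
        if m is None:
            continue  # not a .chiq file, skip it
        base = m.group("base")
        # Assumption: a file without a suffix counts as the oldest version.
        stamp = m.group("stamp") or ""
        # Fixed-width digit strings compare correctly as plain strings.
        if base not in newest or stamp > newest[base][0]:
            newest[base] = (stamp, name)
    return {base: name for base, (stamp, name) in newest.items()}
```

Feeding it the result of `os.listdir()` on the directory would give one file name per record. If the suffix-less file is in fact the current version, the comparison needs to be inverted for that case.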
    If you opened any of the files (for example belly above), it would look like this:
    {*JohnDoe* (Saturday, April 10, 2004, 10:41):
    Belly is a term used to describe a rounded surfboard bottom, when viewed from the front or rear (not from the side).
    }
    What I'm looking to do is convert & import the last created file for each record into one big .csv file. Each record needs to map to the csv as follows (still using belly):
    Belly, JohnDoe, 1081593660, "Belly is a term used to describe a rounded surfboard bottom, when viewed from the front or rear (not from the side)."
    Notes:
    First column in the csv is the title of the file minus the extension. Ideally, filenames like "beach_bum" can be converted to "Beach Bum".
    Third column in the csv is the date converted to a unix timestamp.
    Fourth column is the content between the : and the last }. This text block can have any type of character in it, including quotes, commas, etc.
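The column mapping in these notes can be sketched as follows. This is a rough Python sketch under stated assumptions, not a definitive implementation. It assumes the header always has the form `{*Author* (Weekday, Month DD, YYYY, HH:MM):` and that the time is UTC, which reproduces 1081593660 for the belly example. The helper names `title_from_filename`, `header_fields` and `csv_line` are hypothetical.

```python
import csv
import io
import re
from datetime import datetime, timezone

# Assumed header layout: {*Author* (Weekday, Month DD, YYYY, HH:MM):
HEADER_RE = re.compile(r"\{\*(?P<author>[^*]+)\*\s*\((?P<when>[^)]+)\):")


def title_from_filename(filename):
    """'beach_bum.chiq' -> 'Beach Bum' (first csv column)."""
    base = filename.split(".chiq")[0]
    return base.replace("_", " ").title()


def header_fields(header):
    """Return (author, unix timestamp) from the header line."""
    m = HEADER_RE.match(header.strip())
    if m is None:
        raise ValueError("unrecognised header: %r" % header)
    when = datetime.strptime(m.group("when"), "%A, %B %d, %Y, %H:%M")
    # Assumption: the date is UTC; adjust tzinfo if the files use local time.
    epoch = int(when.replace(tzinfo=timezone.utc).timestamp())
    return m.group("author"), epoch


def csv_line(fields):
    """Format one csv row; the csv module quotes commas and doubles quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()
```

Leaving the quoting to a csv library rather than joining fields by hand is what keeps quotes and commas inside the fourth column from breaking the row.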

    Again, sorry for the lame first post -- and my $50 contrib to Perl Monks stands...

      Will the content always start on the second line of the file and end at the first EOL character, or is it really one big stream with possibly multiple EOL characters inside the content?

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        The latter is true. Here are two more examples with variable EOLs:
        {*JohnDoe* (Thursday, April 01, 2004, 16:04):
        CATALYST A substance that changes the rate of a chemical reaction without itself undergoing permanent change in composition or becoming a part of the molecular structure of the product. A catalyst markedly speeds up the cure of a compound when added in minor quantity as compared to the amounts of primary reactants. See, Methyl_Ethyl_Ketone_Peroxide
        }
        and
        {*KeithMelville* (Thursday, April 01, 2004, 16:23):
        CHEATER COAT 1) Resin applied to wood, especially porous ones such as balsa_wood, before laminating as a pre-treatment, to prevent a dry_spot in the lamination caused by wood soaking up the resin. 2) Basting of laps or low spots with laminating resin before a hot_coat is sometimes called a cheater_coat. See [[basting]].
        }
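Because the content is one stream that may contain several EOL characters, it is probably easier to read the whole file and match it in one go than to work line by line. A minimal Python sketch, assuming each file holds exactly one record and that everything between the header's colon and the last `}` is content; `RECORD_RE` and `parse_record` are hypothetical names.

```python
import re

# DOTALL lets "." match newlines; the greedy ".*" followed by "\}" runs up to
# the LAST closing brace, so braces inside the content are kept.
RECORD_RE = re.compile(
    r"\{\*(?P<author>[^*]+)\*\s*\((?P<when>[^)]+)\):(?P<body>.*)\}",
    re.DOTALL,
)


def parse_record(text):
    """Return (author, date string, content) for one file's full text."""
    m = RECORD_RE.search(text)
    if m is None:
        raise ValueError("no record found")
    # Only the surrounding whitespace is trimmed; inner EOLs are preserved.
    body = m.group("body").strip()
    return m.group("author"), m.group("when"), body
```

Whether inner line breaks should be kept or collapsed to spaces is not stated in the thread. This sketch keeps them, since a quoted csv field may legally contain newlines.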