
A very typical task and one for which Perl is particularly well suited. Congratulations on choosing the right tool for the job.

We do such tasks on a regular basis (extracting statistical and claims data from text files and writing them out as CSV before loading the data into a database).

So from dire experience I can tell you that there are numerous ways that such a seemingly simple task can turn around and bite you.

The way to solve it is to first analyse the text file and answer the question "How are the various records and fields separated from each other?" Are they delimited, i.e. is there a special (ideally unique) character that separates one field or record from the next, such as a tab, a new-line or a space? Are the fields laid out in columns of fixed length? Or are the fields surrounded by tags (XML-style)? And is this format applied consistently throughout the whole file, or is the file littered with non-data lines and items such as headers and page numbers?
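As a rough aid for that analysis, here is a minimal sketch that counts how often each candidate delimiter occurs on every line of a sample. A delimiter that shows up the same (non-zero) number of times on every line is a good candidate. The helper name delimiter_report and the sample lines are made up for illustration; with real data you would feed it the first few dozen lines of your file.

```perl
use strict;
use warnings;

# Hypothetical helper: for each candidate delimiter, record the distinct
# per-line occurrence counts and whether one non-zero count fits all lines.
sub delimiter_report {
    my ($lines, @candidates) = @_;
    my %report;
    for my $delim (@candidates) {
        my %seen;
        for my $line (@$lines) {
            # Count literal occurrences of the delimiter in this line.
            my $count = () = $line =~ /\Q$delim\E/g;
            $seen{$count}++;
        }
        my @counts = sort { $a <=> $b } keys %seen;
        $report{$delim} = {
            counts     => \@counts,
            consistent => (@counts == 1 && $counts[0] > 0) ? 1 : 0,
        };
    }
    return \%report;
}

# Made-up sample standing in for the first lines of the real file.
my @sample = (
    "1001\tSmith, John\t250.00",
    "1002\tDoe, Jane\t1300.50",
    "1003\tAcme Ltd\t75.25",
);

my @candidates = ("\t", ",", ";");
my $report = delimiter_report(\@sample, @candidates);
for my $delim (@candidates) {
    my $name = $delim eq "\t" ? 'TAB' : $delim;
    printf "%-3s counts per line: %-5s consistent: %s\n",
        $name,
        join('/', @{ $report->{$delim}{counts} }),
        $report->{$delim}{consistent} ? 'yes' : 'no';
}
```

This is only a heuristic: it will not notice quoted or escaped delimiters, and header or page-number lines will throw the counts off, so treat an inconsistent result as a prompt to look at the file more closely.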

Often there is a combination of methods, such as records separated by new-lines with fields of a fixed length.

Based upon the above analysis you then devise a parsing strategy. For example, if the records are new-line separated, read the file line by line and then split each record into its fields on the unique separating character (think of using the split function!). If the separating character is not unique but can also be found in the text of the fields themselves (commas and spaces!), check whether those occurrences are somehow escaped or quoted. If the fields are fixed-length, you will want to look into unpack to get the data.
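To make the two strategies concrete, here is a small sketch using only built-ins. The helper names parse_delimited and parse_fixed, the pipe separator, the column widths and the "Page N" filter are all assumptions for illustration; your own file will dictate the real ones.

```perl
use strict;
use warnings;

# Delimited records: split on the separator. The -1 limit keeps trailing
# empty fields, which split would otherwise drop.
sub parse_delimited {
    my ($line, $sep) = @_;
    chomp $line;
    return split /\Q$sep\E/, $line, -1;
}

# Fixed-length records: unpack with an "A" template. "A6" reads six
# characters and strips trailing spaces; "A*" takes the rest of the line.
sub parse_fixed {
    my ($line, $template) = @_;
    chomp $line;
    return unpack $template, $line;
}

# Made-up input: two data lines plus the kind of noise mentioned above.
my @lines = (
    "1001|Smith, John|250.00\n",
    "\n",
    "Page 1\n",
    "1002|Doe, Jane|1300.50\n",
);

for my $line (@lines) {
    next if $line =~ /^\s*$/;        # skip blank lines
    next if $line =~ /^Page \d+/;    # skip (hypothetical) page-number lines
    my @fields = parse_delimited($line, '|');
    print join(' / ', @fields), "\n";
}

# The same kind of record in a fixed-width layout (6 + 15 + rest).
my $fixed = sprintf "%-6s%-15s%s\n", '1001', 'Smith, John', '250.00';
my @fixed_fields = parse_fixed($fixed, 'A6 A15 A*');
print join(' / ', @fixed_fields), "\n";
```

Note that the plain split above does nothing clever about quoted or escaped separators; if your data has those, a proper parsing module is the safer route.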

Finally, once you have the fields extracted you have to write them to the CSV file, and for that no better solution exists than to use existing CPAN modules such as Text::CSV::Simple or Text::CSV_XS.
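As a minimal sketch of the output step, assuming Text::CSV_XS is installed from CPAN: the module takes care of quoting fields that contain commas and doubling embedded quotes, which is exactly what tends to go wrong with a hand-rolled join. The helper name csv_line and the records are made up for illustration.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 lets fields contain arbitrary characters.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag;

# Hypothetical helper: turn one list of fields into one CSV line
# (without a line ending).
sub csv_line {
    my @fields = @_;
    $csv->combine(@fields)
        or die "Cannot combine fields: " . $csv->error_diag;
    return $csv->string;
}

# Made-up records, as they might come out of the parsing step.
my @records = (
    [ 1001, 'Smith, John',   '250.00'  ],
    [ 1002, 'Jane "JD" Doe', '1300.50' ],
);

print csv_line(@$_), "\n" for @records;
```

When writing straight to a file it is usually more convenient to set the eol option and call the module's print method with a filehandle and an array reference; combine and string are used here only to keep the sketch easy to inspect.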

Something you might also want to look into is DBD::AnyData, which gives a standard database-like interface to various types of data files.

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

In reply to Re: convert txt file to csv file by CountZero
in thread convert txt file to csv file by aztec
