
Hello, new perl user here with a real-life problem I'm trying to solve.

I've been through the first Llama book by O'Reilly, so I am familiar with Perl in that regard; however, applying the concepts to real-life problems is where I get stuck. I never know where to start, so I just try something and then my code gets messy, complicated, and klugey. I think I need more experience, and then it will come more naturally and logically. So that's where I'm at.

For the record, I'm not looking to have the work done for me or to be spoon-fed. Working the problem is the only way I'll become more proficient with Perl, but at the same time I legitimately need some mentorship. Thank you in advance.

For reasons beyond my control, I'm using Perl v5.8.8 on RHEL 5.9.

In a nutshell, here's what I'm trying to do with Perl. I have multiple folders that contain thousands of small text files (all with unique file names), and I need to validate that there is data in certain fields. It is not important what the data is, just that the fields are not empty. The text files vary in length, from 20 to sometimes 100+ lines. The lines that contain the fields I'm interested in all start with the same keyword, which is unique to those lines. If there is data in the fields, I'm happy. However, if a field contains no data or contains only a dash, then I need to know. The fields in the lines are delimited by forward slashes. I cannot post the files because they are proprietary and doing so would violate my non-disclosure agreement, but if needed I can post a rough example of how the text files are generally formatted, containing bogus data.

My first solution was not with Perl but on the command line using bash and awk, and looked like this: cat *.TXT | awk '/uniqueKeywordAtBeginningOfLine/' | cut -f 6,7,8 -d '/' That worked, but as the number of files in the directories grew into the thousands, I piped the output to more. Paging through it one screen at a time and eyeballing the data is incredibly time-consuming and just plain inefficient. I'm certain Perl can do the job, but I'm having trouble getting started. What I'm trying to avoid is an unwieldy kluge.

My first thought was to incorporate the awk command into a Perl script that would analyze the files, but after some Google searching, it seems that using Perl exclusively is the better option.

My trouble now is that I'm not sure where to start. My current thinking is to slurp each file into an element of an array, then run a foreach loop to iterate over the files, with an if/elsif to check the fields of the lines I'm interested in. If all the fields are good, skip to the next element in the array (the next file); if any field is empty or contains just a dash, report the filename for further manual examination.
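One possible shape for that plan, as a rough sketch only. It reads each file line by line instead of slurping everything into an array, which gives the same result with less memory. The keyword INTERESTING, the .TXT suffix, and the sub name file_needs_review are placeholders assumed for illustration, and the directories to scan are assumed to be passed on the command line. It sticks to features available in Perl 5.8.8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder keyword (an assumption); use the real one that starts the lines of interest.
my $KEYWORD = 'INTERESTING';

# Return 1 if any keyword line in the file has an empty or dash-only
# value in fields 6, 7, or 8 (1-based, '/'-delimited); return 0 otherwise.
sub file_needs_review {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $bad = 0;
    LINE: while (my $line = <$fh>) {
        chomp $line;
        next LINE unless index($line, "$KEYWORD/") == 0;   # keyword must start the line
        my @fields = split m{/}, $line, -1;                # -1 keeps trailing empty fields
        for my $i (5 .. 7) {                               # fields 6..8 are indexes 5..7
            my $value = defined $fields[$i] ? $fields[$i] : '';
            if ($value eq '' or $value eq '-') {
                $bad = 1;
                last LINE;                                 # one bad field is enough
            }
        }
    }
    close $fh;
    return $bad;
}

# Directories to scan are given on the command line.
for my $dir (@ARGV) {
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @names = sort grep { /\.TXT$/ } readdir $dh;
    closedir $dh;
    for my $name (@names) {
        my $path = "$dir/$name";
        print "$path\n" if file_needs_review($path);
    }
}
```

Run as, for example, perl check_fields.pl /path/to/dir1 /path/to/dir2 (the script name is made up). Only the names of files needing a manual look are printed, so the output can be redirected to a file instead of paged through.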

If someone can give me their opinion on how I should start this out, I'd much appreciate it. Thanks.

UPDATED to give an example of the data files that are to be processed.

They are similar to the one below. Most have more than one INTERESTING line to check, but some have only one.

Fields 6, 7, and 8 of the INTERESTING line(s) are the fields I need to ensure are not blank and do not contain just a dash. The example below would be considered 'good' and would not need further examination.

ZZZZZ ZZZZZZ 1111111-BBBB--CCCCC.
DDD EEEEE
F 222222G HHH 33 III
JJ JJJJ LLLLL MMMMMMMM NNNNN//OO/PPPPP//
QQ RRRRRR/SSSSSSSS
TT
U U U U U U//VVV WW XXX/XXX/XXX/XXX
YYY
ZZZZ/AA-44/55555556/BBB//
CCCCCCC/D6//
EEE/7777/8888888/999999G/H000I/-/-/-//
JJJJJJ/11//
INTERESTING/22/M/NNN3333333333P/-/444.5Q/6.77RR/8.99RR//
EEE/7777/8888888F/999999G/H000/-/-/-//
JJJJJJ/11//
INTERESTING/33/T/UUU4444444444V/-/555.6R/7.88TT/9.11UU//
LLLLL/22//



MM
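
To make the field numbering concrete, here is a small sketch of the check applied to one INTERESTING line from the example and to a made-up bad variant of it. The sub name bad_fields is hypothetical. Fields are counted from 1, the same way cut -f does, so field 6 is element 5 of the array that split returns.

```perl
use strict;
use warnings;

# Return the 1-based numbers of fields 6-8 that are empty or just a dash.
sub bad_fields {
    my ($line) = @_;
    my @fields = split m{/}, $line, -1;   # -1 keeps trailing empty fields
    my @bad;
    for my $n (6, 7, 8) {
        my $value = defined $fields[$n - 1] ? $fields[$n - 1] : '';
        push @bad, $n if $value eq '' or $value eq '-';
    }
    return @bad;
}

my @samples = (
    'INTERESTING/22/M/NNN3333333333P/-/444.5Q/6.77RR/8.99RR//',   # from the example above
    'INTERESTING/22/M/NNN3333333333P/-/-//8.99RR//',              # made-up line: field 6 is a dash, field 7 is empty
);

for my $line (@samples) {
    my @bad = bad_fields($line);
    if (@bad) {
        print "needs review (fields @bad): $line\n";
    }
    else {
        print "ok: $line\n";
    }
}
```

The first sample should be reported as ok and the second as needing review for fields 6 and 7.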

In reply to Perl beginner here, needs a shove in the right direction. by rfromp
