Perl noob is lost

sensesfail has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys,

I've been put on a project to add page descriptions to thousands of documents and we're currently doing it manually and it is gruesome. I've been trying to learn Perl to write a script to do all of these files at once but I'm having a hard time. Any help would be appreciated!

Basically I have a directory with a ton of .ATT and .PDF files, e.g. 123.ATT and its corresponding PDF file, 123.PDF.

I was thinking that the script would read the 3 variables page_id, site_code and subject_id and check if it follows the rules before it would proceed.

Something along the lines of:
1) If any of the 3 variables is blank, it would skip the file.
2) If any of the 3 variables has a question mark in any position (e.g. subject_id=231?23422 or site_code=12?), it would skip the file.

Finally once it passes the criteria it would read the page_id=### and then enter the corresponding "some text" into the page_description="some text". Such as 27.### ( I'm assuming # is a wild card for any number) = information 1 28.### = information 2.
I would like to autofill the page description depending on the page_id.
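
One way to sketch the page_id-to-description lookup, assuming the first two digits of page_id select the text and the rest are arbitrary digits. The prefixes 27 and 28 and the strings "information 1" and "information 2" come from the example above; the names %description_for and describe_page are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: page_id prefix => description text.
# Only the 27/28 entries come from the post; extend as needed.
my %description_for = (
    '27' => 'information 1',
    '28' => 'information 2',
);

# Return the description for a page_id, or undef if the id is not
# all digits or its two-digit prefix is not in the table.
sub describe_page {
    my ($page_id) = @_;
    return unless defined $page_id && $page_id =~ /^(\d{2})\d+$/;
    return $description_for{$1};
}

for my $id (qw(27123 28456 99000)) {
    my $text = describe_page($id);
    print "$id => ", (defined $text ? $text : '(no match, skip)'), "\n";
}
```

Because the pattern only accepts digits, an id containing "?" or a blank id also comes back undefined, so the same check can double as the skip rule for page_id.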

The document is formatted this way.

OBJECT=(removed)
page_id=#### (usually 3-5 digits)
page_description=
product=(removed)
study_number=(removed)
content_provider=(removed)
site_code=###
subject_id=#########
CONTENT=test.pdf
SAVE

(not necessary but great if you can do it) If it can save the new file in a folder such as "resolved" and move a copy of the original in a folder called "original"
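
A minimal sketch of this optional step, assuming the updated text is already in a string and both folders live under one base directory. The folder names "resolved" and "original" come from the request above; the sub name save_resolved and its argument order are illustrative only. It uses the core modules File::Copy, File::Basename and File::Path.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Basename qw(basename);
use File::Path qw(make_path);

# Write the updated text to <base_dir>/resolved/<name> and copy the
# untouched source file to <base_dir>/original/<name>.
sub save_resolved {
    my ($file, $new_content, $base_dir) = @_;
    my $name = basename($file);

    # Create both folders if they do not exist yet.
    make_path("$base_dir/resolved", "$base_dir/original");

    copy($file, "$base_dir/original/$name")
        or die "copy $file: $!";

    open(my $out, '>', "$base_dir/resolved/$name")
        or die "$base_dir/resolved/$name: $!";
    print {$out} $new_content;
    close($out) or die "close: $!";

    return "$base_dir/resolved/$name";
}
```

Calling save_resolved($file, $content, '.') would place both folders in the current directory. The source file itself is left where it is; only a copy goes into "original".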


Replies are listed 'Best First'.
Re: Perl noob is lost by ig (Vicar) on Apr 05, 2009 at 08:52 UTC
Here is something that might get you started.

    use strict;
    use warnings;
    use Data::Dumper;

    foreach my $file (glob("*.ATT")) {
        open(my $fh, '<', $file) or die "$file: $!";
        # Build a hash of key => value pairs, one per "key=value" line.
        my %params = map {
            chomp($_);
            my ($key, $value) = split(/=/, $_, 2);
            $value = "" unless(defined($value));
            ($key, $value)
        } <$fh>;
        close($fh);

        # Skip the file if any of the three fields is blank or contains "?".
        next if(
            $params{'page_id'} =~ m/(^\s*$|\?)/ or
            $params{'site_code'} =~ m/(^\s*$|\?)/ or
            $params{'subject_id'} =~ m/(^\s*$|\?)/
        );

        $params{'page_description'} .= "some text - $params{'page_id'}";
        print Dumper(\%params);
    }

Given a file test.ATT with the content as in your example, this produces:

    $VAR1 = {
              'content_provider' => '(removed)',
              'page_id' => '#### (usually 3-5 digits)',
              'OBJECT' => '(removed)',
              'SAVE' => '',
              'study_number' => '(removed)',
              'CONTENT' => 'test.pdf',
              'page_description' => 'some text - #### (usually 3-5 digits)',
              'subject_id' => '#########',
              'site_code' => '###',
              'product' => '(removed)'
            };

To understand how this works you may find some of the perl manual pages helpful (perldata, perlop, perlfunc, perlsyn, perlre) and, for a gentler introduction, you might like to start from Where and how to start learning Perl. Good luck learning perl!

update: corrected the update of page_description.
Re^2: Perl noob is lost by sensesfail (Initiate) on Apr 05, 2009 at 15:48 UTC
Thank you so much for helping me out! I was thinking it would require the split function, and I had the "=~ m/(^\s*$|\?)/" part down. Is there a way to retain the original format of the document? These documents will be read by another program.

OBJECT=(removed)
page_id=#### (usually 3-5 digits)
page_description=GRAPHICS
product=(removed)
study_number=(removed)
content_provider=(removed)
site_code=###
subject_id=#########
CONTENT=test.pdf
SAVE
Re^3: Perl noob is lost by ig (Vicar) on Apr 05, 2009 at 18:29 UTC
There are many alternatives. Here is one way that keeps the overall layout of the file:

    use strict;
    use warnings;
    use Data::Dumper;

    foreach my $file (glob("*.ATT")) {
        open(my $fh, '<', $file) or die "$file: $!";
        # Slurp the whole file into one string.
        my $content = do { local $/; <$fh> };
        close($fh);

        # Pull out the fields of interest; /m lets ^ match at each line start.
        my ( $page_id ) = $content =~ m/^page_id=(.*)/m;
        my ( $site_code ) = $content =~ m/^site_code=(.*)/m;
        my ( $subject_id ) = $content =~ m/^subject_id=(.*)/m;
        my ( $page_description ) = $content =~ m/^page_description=(.*)/m;

        # Skip the file if any of the three fields is blank or contains "?".
        next if(
            $page_id =~ m/(^\s*$|\?)/ or
            $site_code =~ m/(^\s*$|\?)/ or
            $subject_id =~ m/(^\s*$|\?)/
        );

        $page_description .= "some text - $page_id";
        $content =~ s/^page_description=.*/page_description=$page_description/m;
        print "$content";
    }

In this case, the entire file content is read into a single string (slurped), then pattern matching and substitution are used to extract data and modify the description without changing the overall layout. The result is:

OBJECT=(removed)
page_id=#### (usually 3-5 digits)
page_description=some text - #### (usually 3-5 digits)
product=(removed)
study_number=(removed)
content_provider=(removed)
site_code=###
subject_id=#########
CONTENT=test.pdf
SAVE
Re^4: Perl noob is lost by sensesfail (Initiate) on Apr 06, 2009 at 01:24 UTC
Re^5: Perl noob is lost by ig (Vicar) on Apr 06, 2009 at 21:10 UTC
Re: Perl noob is lost by Anonymous Monk on Apr 05, 2009 at 08:33 UTC
perlintro