sensesfail has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys,

I've been put on a project to add page descriptions to thousands of documents and we're currently doing it manually and it is gruesome. I've been trying to learn Perl to write a script to do all of these files at once but I'm having a hard time. Any help would be appreciated!

Basically I have a directory with a ton of .ATT and .PDF files. For example, 123.ATT and its corresponding PDF file, 123.PDF.

I was thinking that the script would read the 3 variables page_id, site_code and subject_id and check if it follows the rules before it would proceed.

Something along the lines of:
1) If any of the 3 variables is blank, it would skip the file.
2) If any of the 3 variables has a question mark in any position (e.g. subject_id=231?23422 or site_code=12?), it would skip the file.

Finally, once a file passes these checks, the script would read page_id=### and enter the corresponding text into page_description="some text". For example, 27.### (I'm assuming # is a wildcard for any digit) = information 1, 28.### = information 2.
I would like to autofill the page description depending on the page_id.

The document is formatted this way.

OBJECT=(removed)
page_id=#### (usually 3-5 digits)
page_description=
product=(removed)
study_number=(removed)
content_provider=(removed)
site_code=###
subject_id=#########
CONTENT=test.pdf
SAVE

(Not necessary, but great if you can do it.) It could save the new file in a folder such as "resolved" and move a copy of the original into a folder called "original".

Re: Perl noob is lost
by ig (Vicar) on Apr 05, 2009 at 08:52 UTC

    Here is something that might get you started.

    use strict;
    use warnings;
    use Data::Dumper;

    foreach my $file (glob("*.ATT")) {
        open(my $fh, '<', $file) or die "$file: $!";

        # Build a hash of key => value from the "key=value" lines.
        my %params = map {
            chomp($_);
            # Limit of 2 so that a value containing "=" is not truncated.
            my ($key, $value) = split(/=/, $_, 2);
            $value = "" unless(defined($value));
            ($key, $value)
        } <$fh>;
        close($fh);

        # Skip the file if any of the three fields is blank or contains "?".
        next if(
            $params{'page_id'} =~ m/(^\s*$|\?)/
            or $params{'site_code'} =~ m/(^\s*$|\?)/
            or $params{'subject_id'} =~ m/(^\s*$|\?)/
        );

        $params{'page_description'} .= "some text - $params{'page_id'}";
        print Dumper(\%params);
    }

    Given a file test.ATT with the content as in your example, this produces:

    $VAR1 = {
              'content_provider' => '(removed)',
              'page_id' => '#### (usually 3-5 digits)',
              'OBJECT' => '(removed)',
              'SAVE' => '',
              'study_number' => '(removed)',
              'CONTENT' => 'test.pdf',
              'page_description' => 'some text - #### (usually 3-5 digits)',
              'subject_id' => '#########',
              'site_code' => '###',
              'product' => '(removed)'
            };

    To understand how this works, you may find some of the Perl manual pages helpful (perldata, perlop, perlfunc, perlsyn, perlre). For a gentler introduction, you might like to start with "Where and how to start learning Perl".

    Good luck learning perl!

    update: corrected the update of page_description.

      Thank you so much for helping me out! I was thinking it would require the split function, and I had the "=~ m/(^\s*$|\?)/" part down.

      Is there a way to retain the original format of the document? These documents will be read by another program.

      OBJECT=(removed)
      page_id=#### (usually 3-5 digits)
      page_description=GRAPHICS
      product=(removed)
      study_number=(removed)
      content_provider=(removed)
      site_code=###
      subject_id=#########
      CONTENT=test.pdf
      SAVE

        There are many alternatives. Here is one way that keeps the overall layout of the file:

        use strict;
        use warnings;

        foreach my $file (glob("*.ATT")) {
            open(my $fh, '<', $file) or die "$file: $!";
            # Slurp the whole file into one string.
            my $content = do { local $/; <$fh> };
            close($fh);

            # The /m modifier lets ^ match at the start of every line.
            my ( $page_id ) = $content =~ m/^page_id=(.*)/m;
            my ( $site_code ) = $content =~ m/^site_code=(.*)/m;
            my ( $subject_id ) = $content =~ m/^subject_id=(.*)/m;
            my ( $page_description ) = $content =~ m/^page_description=(.*)/m;

            # Skip the file if any of the three fields is blank or contains "?".
            next if(
                $page_id =~ m/(^\s*$|\?)/
                or $site_code =~ m/(^\s*$|\?)/
                or $subject_id =~ m/(^\s*$|\?)/
            );

            $page_description .= "some text - $page_id";

            # Rewrite only the page_description line; everything else is untouched.
            $content =~ s/^page_description=.*/page_description=$page_description/m;
            print $content;
        }

        In this case, the entire file content is read into a single string (slurped), then pattern matching and substitution are used to extract data and modify the description without changing the overall layout. The result is:

        OBJECT=(removed)
        page_id=#### (usually 3-5 digits)
        page_description=some text - #### (usually 3-5 digits)
        product=(removed)
        study_number=(removed)
        content_provider=(removed)
        site_code=###
        subject_id=#########
        CONTENT=test.pdf
        SAVE
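
        The same slurp-and-substitute approach could be extended to cover the rest of the original request: choosing the description from page_id, and writing the result to a "resolved" folder while keeping a copy of the untouched file in "original". What follows is only a rough sketch under some assumptions. The lookup table contents, the rule that the first two digits of page_id select the description, and the helper names describe_page and resolve_content are all made up for illustration. It also replaces the page_description value rather than appending to it.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical lookup table: first two digits of page_id => description.
my %description_for = (
    27 => 'information 1',
    28 => 'information 2',
);

# Return the description for a page_id, or undef if nothing matches.
# Assumes page_id is all digits and at least three digits long.
sub describe_page {
    my ($page_id) = @_;
    my ($prefix) = $page_id =~ m/^\s*(\d{2})\d+\s*$/ or return;
    return $description_for{$prefix};
}

# Return the rewritten file content, or undef if the file should be skipped.
sub resolve_content {
    my ($content) = @_;
    my %field;
    for my $name (qw(page_id site_code subject_id)) {
        my ($value) = $content =~ m/^\Q$name\E=([^\r\n]*)/m;
        # Skip when a field is missing, blank, or contains a question mark.
        return if !defined($value) || $value =~ m/(^\s*$|\?)/;
        $field{$name} = $value;
    }
    my $description = describe_page($field{page_id});
    return unless defined $description;

    # Replace only the value on the page_description line.
    $content =~ s/^(page_description=)[^\r\n]*/$1$description/m;
    return $content;
}

for my $file (glob('*.ATT')) {
    open(my $in, '<', $file) or die "$file: $!";
    my $content = do { local $/; <$in> };
    close($in);

    my $resolved = resolve_content($content);
    next unless defined $resolved;

    # Create the output folders on first use.
    for my $dir (qw(resolved original)) {
        -d $dir or mkdir($dir) or die "$dir: $!";
    }
    copy($file, "original/$file") or die "copy $file: $!";

    open(my $out, '>', "resolved/$file") or die "resolved/$file: $!";
    print {$out} $resolved;
    close($out) or die "resolved/$file: $!";
}
```

        Run from the directory holding the .ATT files, this leaves the source files where they are. Files that fail the checks, or whose page_id has no entry in the table, are skipped without being copied anywhere.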
Re: Perl noob is lost
by Anonymous Monk on Apr 05, 2009 at 08:33 UTC