allison has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing code that needs to satisfy the following conditions:

1. Change existing date formats (mm/dd/yyyy, yyyymmdd and mm-dd-yyyy) into yyyy/mm/dd format.
2. Change all numeric information from 1 - 10 to appear in word format (e.g. 1 = one). Ensure that date information is excluded from this change.
3. Count the number of occurrences of the word 'and' in the data (it may be found in any case).
4. Capitalize the first letter of each word found after the marker AU:.
5. Add 'Written by: ' at the beginning of the byline information found in the <author> tag.
6. Extract the numeric id information (found in parentheses) from the second occurrence of PP: and display it in ID:. This ID: should be placed before the AU: tag.
7. Use the first paragraph as the headline information and display its contents after HL:. Remove the first paragraph to avoid repetitive information.
8. Remove the occurrences of the word 'end' in the data (it may appear in any case).

Questions and problems:

- No. 1 mostly works so far; the only problem is that I can't convert JULY to 07.
- For no. 2, the code doesn't work because it also changes the dates.
- I can't think of a way to count the occurrences of 'and' or to add 'Written by'.
- Why can't I extract the number in () with the regular expression? How can I put the extracted number into ID:, and how do I add ID: to the file?
- I don't know how to do no. 7.
- The word 'end' or 'END' isn't removed.

This is my code:
#C:\strawberry\perl\bin\perl.exe
my $filename = "input.txt";
open my $file, '<', $filename;
@fileinput=<$file>;
close($file);
#while($file)
#{ my $line =$_;
# $line=~s/(\d{2}\/\d{2}\/\d{4})/($3\/$1\/$2)/g;
#print "@listinput";

# change date format
foreach $line(@fileinput)
{
    my $testdate=($line); #= "11/09/2009";
    if($testdate =~s/(\d{2})\/(\d{2})\/(\d{4})/$3\/$1\/$2/g)
    {
        print $testdate; #=~s/(\d{2}\/\d{2}\/\d{4})/($3\/$1\/$2)/;
    }
    if($testdate =~s/(\d{4})(\d{2})(\d{2})/$1\/$2\/$3/)
    {
        print $testdate;
    }
    if($testdate =~s/(\w{4})\s(\d{2})\,\s(\d{4})/$3\/$1\/$2/)
    {
        print $testdate;
    }
    if($testdate =~s/(\d{2})\-(\d{2})\-(\d{4})/$3\/$1\/$2/)
    {
        print $testdate;
    }
}

#Wrong,It also change the date
#foreach $line(@fileinput)
#{
#my $numToword=($line);
#$num = "9";
#$word = "nine";
#if($numToword =~s/$num/$word/g)
# {
# print $numToword;
# }
#$num = "19";
#$word = "nineteen";
#if($numToword =~s/$num/$word/g)
# {
# print $numToword;
# }
#$num = "10";
#$word = "ten";
#if($numToword =~s/$num/$word/g)
# {
# print $numToword;
# }
#$num = "5";
#$word = "five";
#if($numToword =~s/$num/$word/g)
# {
# print $numToword;
# }
#}

# Capilize the first character after AU:
foreach $line(@fileinput)
{
    my $Cword=($line);
    if($Cword=s/(^AU:\s[a-z])/(^AU:\s[A-Z])/)
    {
        print $Cword;
    }
}

#extract ID in ()
foreach $line(@fileinput)
{my $extractid=($line);
    if($extractid =~m/\( (\d+)\)/g)
    {
        print $extractid;
    }
}

#remove END, end word in file
foreach $line(@fileinput)
{
    my $removeend=($line);
    if($removeend=~s/(^END) | (^end) | (END$) |(end$)//g)
    {
        print $removeend;
    }
}
#$line=~s/(\d{2}\/\d{2}\/\d{4})/($3\/$1\/$2)/g;
#print $line;
#}
The content of 'input.txt'
DD: 11/09/2009
AU: jas dimaano
PP: Employee ID list
PP: (489459) Jas = DS16 -> with SPi since 2005/04/04 AND with ECO ever since
PP: Sam = FT35 -> resigned last 09-03-2008
PP: Evan = AT89 -> transferred last 20070605 to Journals
PP: there's more...
===
DD: july 11, 2009
AU: Jr s. Tolentino, -editor
PG: page 9
PP: Earn points now!
PP: Yes! You heard it right! (635436)
PP: Finish your exercise before the deadline and you'll receive additional points.
PP: Even if you're early by 10, 5 or even 1 minute, we'll give away the corresponding points!
PP: So hurry now and ensure that you'll be able to finish ASAP! END
PP:
===
DD: 10-09-2009
AU: mr. Henderson de la cruz (h.delacruz@yahoo.com)
VL: volume IV.
PG: page 19
PP: This is only a test document.
PP: Did you know that this is just a test doc? See if you're end output (if you'll be able to produce 1) will be the same as the required one.
PP: Don't build you're house on a sandy land. And not too near the shore as well.
PP: Perhaps you'll earn more than 10-50k of money! Why not?! This doesn't make any sense. Right? So I better end this now.
PP: This is the end of this test document.
===
The changes will be written to output.txt. Thank you for the help.

Replies are listed 'Best First'.
Re: Regular expressions extract, change num to word and capitalize each word after AU: tag and add 'written by'
by choroba (Cardinal) on Mar 21, 2011 at 20:26 UTC
    To convert JULY to 07, you can use s/\bJULY\b/07/ (but you have to check the context, too). For example, you can store all the month names in an array and then replace them in a loop:
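    # 'etc' is a placeholder for the remaining month names.
    my @months = qw/january february march april may june july etc/;
    my $x = 'january february march april may june july etc';
    # Replace each name with its zero-padded, 1-based position in the list.
    $x =~ s/\b\Q$months[$_]\E\b/sprintf '%02d', 1 + $_/e for 0 .. $#months;
    print "$x\n";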
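One way to "check the context" is to replace a month name only when it is followed by a day and a year, as in the "DD: july 11, 2009" line of the sample input. The following is a minimal sketch under that assumption, not a definitive solution; the helper name convert_named_dates and the hash %month_number are made up for illustration, and the /i flag is added because the sample data spells the month in lower case.

```perl
use strict;
use warnings;

# Map lowercase month names to two-digit numbers (january => '01', ...).
my @names = qw(january february march april may june july
               august september october november december);
my %month_number;
@month_number{@names} = map { sprintf '%02d', $_ } 1 .. 12;

# Alternation of all month names, used inside the date pattern.
my $month_re = join '|', @names;

# Hypothetical helper: rewrite "july 11, 2009" as "2009/07/11".
# A month name is only touched when "day, year" follows it,
# so the modal verb "may" in ordinary text is left alone.
sub convert_named_dates {
    my ($text) = @_;
    $text =~ s{\b($month_re)\s+(\d{1,2}),\s*(\d{4})\b}
              {sprintf '%s/%s/%02d', $3, $month_number{lc $1}, $2}gie;
    return $text;
}

print convert_named_dates("DD: july 11, 2009\n");
```

Looking the name up in a hash keeps the substitution to a single pass over the text, whereas the array loop runs one substitution per month.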