Regular expressions extract, change num to word and capitalize each word after AU: tag and add 'written by'

allison has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script that needs to satisfy the following conditions:

1. Change existing date formats (mm/dd/yyyy, yyyymmdd and mm-dd-yyyy) into yyyy/mm/dd format.
2. Change all numeric information from 1 to 10 to word format (e.g. 1 = one). Date information must be excluded from this change.
3. Count the number of occurrences of the word 'and' in the data (in any case).
4. Capitalize the first letter of each word found after the marker AU:.
5. Add 'Written by: ' at the beginning of the byline information found in the <author> tag.
6. Extract the numeric id (found in parentheses) from the second occurrence of PP: and display it in ID:. This ID: should be placed before the AU: tag.
7. Use the first paragraph as the headline and display its contents after HL:. Remove the first paragraph to avoid repeating information.
8. Remove the occurrences of the word 'end' in the data (in any case).

Questions and problems so far:

- No. 1 works, except that I can't convert JULY to 07.
- For no. 2, my code doesn't work because it also changes the dates.
- I can't think of a way to count the occurrences of 'and' or to add 'Written by'.
- Why can't I extract the number in () with my regular expression, and how can I put the extracted number in ID: and add ID: to the file?
- I don't know how to do no. 7.
- The word 'end' or 'END' doesn't get removed.

This is my code:

#!C:\strawberry\perl\bin\perl.exe

my $filename = "input.txt";
open my $file, '<', $filename;
@fileinput=<$file>;
close($file);

#while($file)
#{ my $line =$_;
 # $line=~s/(\d{2}\/\d{2}\/\d{4})/($3\/$1\/$2)/g;

#print "@listinput";
# change date format
foreach $line(@fileinput)
{
my $testdate=($line); #= "11/09/2009";
if($testdate =~s/(\d{2})\/(\d{2})\/(\d{4})/$3\/$1\/$2/g)
 { 
   print $testdate; #=~s/(\d{2}\/\d{2}\/\d{4})/($3\/$1\/$2)/;
 }
if($testdate =~s/(\d{4})(\d{2})(\d{2})/$1\/$2\/$3/)
  {
   print $testdate;
   }
if($testdate =~s/(\w{4})\s(\d{2})\,\s(\d{4})/$3\/$1\/$2/)
   {
    print $testdate;
   }
if($testdate =~s/(\d{2})\-(\d{2})\-(\d{4})/$3\/$1\/$2/)
  { 
    print $testdate;
  }
}
#Wrong,It also change the date
#foreach $line(@fileinput)
#{
#my $numToword=($line);
#$num = "9";
#$word = "nine";
#if($numToword =~s/$num/$word/g)
#  {
#    print $numToword;
#  }
#$num = "19";
#$word = "nineteen";
#if($numToword =~s/$num/$word/g)
#  {
#    print $numToword;
#  }
#$num = "10";
#$word = "ten";
#if($numToword =~s/$num/$word/g)
#  {
#    print $numToword;
#  }
#$num = "5";
#$word = "five";
#if($numToword =~s/$num/$word/g)
#  {
#    print $numToword;
#  }
#}
# Capitalize the first character after AU:
foreach $line(@fileinput)
{ 
my $Cword=($line);
if($Cword=s/(^AU:\s[a-z])/(^AU:\s[A-Z])/)
   { 
    print $Cword;
   }
}
#extract ID in ()
foreach $line(@fileinput)
{my $extractid=($line);
if($extractid =~m/\( (\d+)\)/g)
   {
    print $extractid;
   }
}
#remove END, end word in file
foreach $line(@fileinput)
{ my $removeend=($line);
  if($removeend=~s/(^END) | (^end) | (END$) |(end$)//g)
     {
      print $removeend;
     }
}
#$line=~s/(\d{2}\/\d{2}\/\d{4})/($3\/$1\/$2)/g;
#print $line;
#}

The content of 'input.txt'

DD: 11/09/2009
AU: jas dimaano
PP: Employee ID list
PP: (489459) Jas = DS16 -> with SPi since 2005/04/04 AND with ECO ever since
PP: Sam = FT35 -> resigned last 09-03-2008
PP: Evan = AT89 -> transferred last 20070605 to Journals
PP: there's more...
===
DD: july 11, 2009
AU: Jr s. Tolentino, -editor
PG: page 9
PP: Earn points now!
PP: Yes! You heard it right! (635436)
PP: Finish your exercise before the deadline and you'll receive additional points.
PP: Even if you're early by 10, 5 or even 1 minute, we'll give away the corresponding points!
PP: So hurry now and ensure that you'll be able to finish ASAP! END
PP:
===
DD: 10-09-2009
AU: mr. Henderson de la cruz (h.delacruz@yahoo.com)
VL: volume IV.
PG: page 19
PP: This is only a test document.
PP: Did you know that this is just a test doc? See if you're end output (if you'll be able to produce 1) will be the same as the required one.
PP: Don't build you're house on a sandy land. And not too near the shore as well.
PP: Perhaps you'll earn more than 10-50k of money! Why not?! This doesn't make any sense. Right? So I better end this now.
PP: This is the end of this test document.
===

The changes will be written to output.txt. Thank you for the help.

Re: Regular expressions extract, change num to word and capitalize each word after AU: tag and add 'written by' by choroba (Cardinal) on Mar 21, 2011 at 20:26 UTC
To convert JULY to 07, you can use `s/\bJULY\b/07/` (but you have to check the context, too). For example, you can store all the month names in an array and then replace them in a loop: `my @months = qw /january february march april may june july etc/; $x = 'january february march april may june july etc'; $x =~ s/\b\Q$months[$_]\E\b/sprintf '%02d',1+$_/e for 0 .. @months-1; print "$x\n";`
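
The array-and-loop idea above could also be written as a hash lookup inside a single substitution, which lets the pattern check the context (month name, day, comma, year) at the same time. The following is only a minimal sketch, not part of the original reply: the helper name `normalize_named_date` is hypothetical, and it assumes dates of the form `july 11, 2009` as in the sample `input.txt`.

```perl
use strict;
use warnings;

# Map lowercase month names to two-digit numbers (january => '01', ...).
my @names = qw(january february march april may june
               july august september october november december);
my %month_num;
@month_num{@names} = map { sprintf '%02d', $_ } 1 .. 12;

# Alternation of month names for use inside the regex.
my $month_re = join '|', @names;

# Hypothetical helper: rewrite "july 11, 2009" style dates as yyyy/mm/dd.
# /i matches JULY, July or july; /e evaluates the replacement as code.
sub normalize_named_date {
    my ($line) = @_;
    $line =~ s{\b($month_re)\s+(\d{1,2}),\s*(\d{4})\b}
              {sprintf '%s/%s/%02d', $3, $month_num{lc $1}, $2}gie;
    return $line;
}

print normalize_named_date("DD: july 11, 2009"), "\n";   # DD: 2009/07/11
```

Because the month name is only replaced when a day and a four-digit year follow it, a word such as "may" elsewhere in the text is left alone.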