The information I am trying to extract lies between the <notes > and </notes> tags (which may or may not fall across lines). Once extracted, I want to put the information into one file, with each set of notes on its own line (line breaks stripped). This is what I have created so far; being new to Perl, I have pieced it together from scripts found on the site.
I am having difficulty removing the line breaks from the final output file. Also, my current output includes text that is outside the notes tags, but only from the second record.
Can anyone help?
#!/usr/bin/perl -w
use strict;
use File::Find;
use HTML::TokeParser;

##Here define the directory to work across
my $root_dir = 'c:/test1';

##Search the directory, when a file is found run the sub.
find(\&wanted, $root_dir);

sub wanted {
    # if the extension fits...
    if ( /(LOG[^\n]*)|(REC[^\n]*)\.xml?/i ) {
        ##Grab the filename for error to screen if cannot open.
        my $input = $_;
        open (OUTPUT, ">>c:\\1-Actnte.txt");
        open INPUT, "$input" or die "Cannot open $input";
        select OUTPUT;
        $\ = "\n";
        my $foundstart;
        while (<INPUT>) {
            chomp;
            next unless ($foundstart || /<notes[^>]*>/i);
            while (/<notes[^>]*>/i && ! $foundstart) {
                $_ =~ s/^.*?<notes [^>]*>$/\n<notes $1\/i;
                $foundstart++;
                next unless($_);
            }
            while ($_ =~ m|<notes[^\r\n]*</notes>|i) {
                $_ =~ s|^(.*?)</notes>.*$|$1|i;
                print if($_);
                last;
            }
            print;
        }
        close INPUT;
    }
}
close OUTPUT;
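For comparison, here is a minimal sketch of the extraction described at the top of the post: read the whole text at once, match each notes element across line breaks, and flatten it to one line. The helper name extract_notes and the sample XML are hypothetical, made up for illustration; the sketch also assumes notes elements are never nested and does not use HTML::TokeParser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <notes ...>...</notes>
# element found in $xml, each flattened onto a single line.
sub extract_notes {
    my ($xml) = @_;
    my @notes;
    # /s lets "." match newlines, so a note may span several lines.
    # ".*?" is non-greedy, so each match stops at the first </notes>
    # and text between records is never captured.
    while ( $xml =~ m{<notes\b[^>]*>(.*?)</notes\s*>}gis ) {
        my $text = $1;
        $text =~ s/\s+/ /g;     # collapse line breaks and whitespace runs
        $text =~ s/^ //;        # trim the leading space, if any
        $text =~ s/ $//;        # trim the trailing space, if any
        push @notes, $text if length $text;
    }
    return @notes;
}

# Made-up sample input: the first note spans two lines, and both
# records carry text outside the notes element.
my $sample = <<'XML';
<record><id>1</id><notes >first note
continues here</notes><other>ignored</other></record>
<record><notes type="a">second note</notes>trailing text</record>
XML

print "$_\n" for extract_notes($sample);
```

Run on the sample, this prints "first note continues here" and "second note", one per line. Inside a File::Find callback, the same helper could be applied to a file read in one go (for example with `local $/;` before the read), so that tags split across lines are seen as a single string.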