Scenario:

The following text belongs to a .doc file

File: check1.asm
Function: Monks
Tag: No
Tag: 001
Tag: Yes
Tag: 002
File: check2.asm
Function: Perl Monks
Tag: Yes
Tag: 003
Tag: No
Tag: 004
File: check3.asm
Function: Experts
Tag: No
Tag: 005
Tag: No
Tag: 006
Function: Perl Experts
Tag: No
Tag: 007
Tag: Yes
Tag: 008

I have to extract the tags that are marked Yes, together with the corresponding function and file name, and write them to an Excel sheet.

The output has to look like this:

Tags  Function      File
002   Monks         check1.asm
003   Perl Monks    check2.asm
008   Perl Experts  check3.asm

I have written the following snippet to extract the tags that are categorized as Yes:

use strict;
use warnings;
use Win32::OLE qw(in with);
use Win32::OLE::Variant;
use Win32::OLE::Const 'Microsoft Excel';
use Win32::OLE::Const 'Microsoft Word';
use Win32::OLE::Enum;
use Cwd;
use File::Find;

$Win32::OLE::Warn = 3;    # die on errors

my $out_file = 'check.xls';
open my $out_fh, '>', $out_file or die "Could not open file $out_file: $!";
my $print_next = 0;

# Globals
our $Word;
our $reviewchklists;

my @scriptfiles = glob('*.doc');
foreach my $file (@scriptfiles) {
    my $filename = "D:\\";
    my $var      = $filename . $file;
    print $var;
    my $document = Win32::OLE->GetObject($var);
    print "Extracting Text ...\n";
    my @array;
    my $paragraphs = $document->Paragraphs();
    my $enumerate  = Win32::OLE::Enum->new($paragraphs);
    while (my $paragraph = $enumerate->Next()) {
        my $text = $paragraph->{Range}->{Text};
        $text =~ s/[\n\r\t]//g;
        $text =~ s/\x0B/\n/g;
        $text =~ s/\x07//g;
        chomp $text;
        my $Data .= $text;
        @array = split(/\.$/, $Data);
        foreach my $line (@array) {
            if ($print_next) {
                # No need to chomp; we add the "\n" ourselves
                print $out_fh $line . "\n";
                local $\ = "<br>\n";
                local $/ = "\n\n";
            }
            # Remember whether the next line follows a "Tag Yes" line
            $print_next = ($line =~ /^Tag\sYes/);
        }
    }
}

The above snippet is printing the output as follows:

ID : 002
ID : 003
ID : 008

I don't want the "ID :" prefix to be printed. Also, how can I extract the corresponding function and file name for each tag?

Help out monks!!!
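
One possible approach, sketched here in plain Perl without the Win32::OLE part: remember the most recent File and Function lines while walking the paragraphs, and when a Yes marker is seen, record the value on the next tag line. The helper name extract_yes_tags is hypothetical. The sketch assumes every paragraph looks like one of the sample lines above, and it also accepts an "ID : 002" style line, since that is what the current output shows. Both are assumptions about the real document.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the document's paragraphs (one string each)
# and returns [tag, function, file] rows for every tag flagged "Yes".
sub extract_yes_tags {
    my @paragraphs = @_;
    my ($file, $function, $want_next) = ('', '', 0);
    my @rows;
    for my $line (@paragraphs) {
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        if ($line =~ /^File\s*:?\s*(.+)$/i) {
            # A new file starts: forget the previous function
            ($file, $function, $want_next) = ($1, '', 0);
        }
        elsif ($line =~ /^Function\s*:?\s*(.+)$/i) {
            ($function, $want_next) = ($1, 0);
        }
        elsif ($line =~ /^(?:Tag|ID)\s*:?\s*(\S+)$/i) {
            my $value = $1;
            if ($value =~ /^(?:Yes|No)$/i) {
                # Marker line: decide whether the next value is wanted
                $want_next = lc($value) eq 'yes';
            }
            elsif ($want_next) {
                # Value line after a Yes marker; the label is already stripped
                push @rows, [$value, $function, $file];
                $want_next = 0;
            }
        }
    }
    return @rows;
}

# Sample paragraphs taken from the first file in the question
my @sample = (
    'File: check1.asm', 'Function: Monks',
    'Tag: No',  'Tag: 001',
    'Tag: Yes', 'Tag: 002',
);

# Tab-separated rows, which Excel can open as columns
print join("\t", qw(Tags Function File)), "\n";
print join("\t", @$_), "\n" for extract_yes_tags(@sample);
```

In the script above, the cleaned $text of each paragraph could be pushed onto an array and passed to this helper once per document. Note that tab-separated text saved under an .xls name is not a real workbook; producing one would need Excel via OLE or a CPAN module such as Spreadsheet::WriteExcel.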


In reply to Extract Multiple Tags by stallion
