PMReader has asked for the wisdom of the Perl Monks concerning the following question:

I had a quick-and-dirty script that worked with the substring function as long as the offset and length of text were the same. This is no longer true. Now the length of text to be extracted is variable (and sometimes contains spaces).

I can determine the text around the substring. Let's say start and end.

I thought I could more generically extract the substring with split /start($_)end/ but I can't get the syntax right and it won't compile.

Here's the original script. It works, finding one $transactID and a dozen or so item/quantity instances in each file, but it's inflexible:

#!/usr/bin/perl -w
use strict;
use warnings;

my $transactID = "item=";     #### Usually a number
my $itemname   = "itemname";  #### Usually alphanumeric
my $quantity   = "qty";       #### Always numeric

open(IN, 'ItemsFile.rtf') or die("can't open input file\n");
open(OUT, '>results.txt') or die("can't open output file\n");

while (<IN>) {
    # $' holds the text after the match; substr takes a fixed slice of it
    print OUT "0 ", substr($', 3, 17), "; " if ($_ =~ /\b$transactID\b/i);
    print OUT "1 ", substr($', 2, 12), "; " if ($_ =~ /\b$itemname\b/i);
    print OUT "2 ", substr($', 13, 3), "; " if ($_ =~ /\b$quantity\b/i);
}

close(IN);
close(OUT);

Note that start sometimes contains spaces.

Replies are listed 'Best First'.
Re: Extracting variable-length strings between delimiters
by 7stud (Deacon) on Feb 12, 2010 at 04:21 UTC
    use strict;
    use warnings;
    use 5.010;

    my $string = "startHELLO WORLDend";
    $string =~ /start(.*?)end/;
    say $1;

    --output:--
    HELLO WORLD

    ========

    use strict;
    use warnings;
    use 5.010;

    my @strings = (
        'startHELLO WORLDend',
        'startHIend',
        'startOK BYEend',
    );

    for (@strings) {
        /start(.*?)end/;
        say $1;
    }

    --output:--
    HELLO WORLD
    HI
    OK BYE
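A small addition to the loop above: $1 is only set by a successful match, so it is safer to test the match before printing. The sketch below also uses /g in list context to collect every start...end pair on a line. The sub name extract_between and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every piece of text found between
# the literal markers "start" and "end" in one string.
sub extract_between {
    my ($text) = @_;
    # In list context, /g returns all captures; an empty list means no match.
    my @found = $text =~ /start(.*?)end/g;
    return @found;
}

for my $line ( 'startHELLO WORLDend', 'no markers here', 'startAendstartB Cend' ) {
    my @found = extract_between($line);
    if (@found) {
        print join( ' | ', @found ), "\n";
    }
    else {
        print "(no match)\n";
    }
}
```

The non-greedy (.*?) is what keeps the two pairs on the last line separate; a greedy (.*) would run from the first start to the last end.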
Re: Extracting variable-length strings between delimiters
by Anonymous Monk on Feb 12, 2010 at 00:54 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;

    #~ Main(@ARGV);
    Main( 'start', 'end', 'start something between end' );
    exit(0);

    sub Main {
        my ( $st, $en, $in ) = @_;
        print "$1\n" if $in =~ /\Q$st\E(.+?)\Q$en\E/;
        print join "\n", map { "{$_}" } split /(\Q$st\E)(.+?)(\Q$en\E)/, $in;
    }    ## end sub Main
    __END__
     something between
    {}
    {start}
    { something between }
    {end}
    perlintro, perlretut, perlre
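The \Q...\E in the code above matters once the delimiters contain spaces or regex metacharacters such as "(" or "#". Here is a minimal sketch of the same idea wrapped in a sub. The name between is hypothetical, and the sample line is taken from the data posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first run of text between two literal
# delimiters, or undef when the pair is not found.
sub between {
    my ( $start, $end, $text ) = @_;
    # \Q...\E quotes metacharacters so "(", "#" and spaces match literally.
    # The /s flag lets "." cross newlines in case the text spans lines.
    return $text =~ /\Q$start\E(.+?)\Q$end\E/s ? $1 : undef;
}

my $line = '<span class="small">(Transaction ID #5HN04039SW052A35R)</span>';
my $id   = between( 'Transaction ID #', ')', $line );
print defined $id ? "$id\n" : "not found\n";
```

Without the quoting, a delimiter like ")" would be read as regex syntax and the pattern would fail to compile.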
Re: Extracting variable-length strings between delimiters
by ahmad (Hermit) on Feb 12, 2010 at 02:04 UTC

    Where's your data sample?

      While I'm looking at the responses, here's the requested data sample (read in from a file with a consistent name):

      ...<tr><td colspan="3">
      <span><strong>SomeNameHere</strong></span>
      <span class="small">(Transaction ID #5HN04039SW052A35R)</span><br><br>
      </td></tr>
      <tr><td colspan="3"><hr class="dotted"></td></tr>
      <tr><td colspan="3"><br class="h10"></td></tr>
      ...<td class="item-title" width="40%">Name of the first item</td>
      <td align="center" class="qty" width="9%">14</td>...
      <tr><td colspan="3"><hr class="dotted"></td></tr>
      <tr><td colspan="3"><br class="h10"></td></tr>
      ...<td class="item-title" width="40%">Name of the second item</td>
      <td align="center" class="qty" width="9%">12</td>...

      From which I want to extract:

      5HN04039SW052A35R

      Name of the first item 14

      Name of the second item 12

      ...

      As you can see, there are varying lengths of unpredictable, non-unique characters, quotation marks and angle brackets between the start string and the desired text. What is known is that the desired values are always preceded somewhere by "Transaction ID", "item-title" and "qty", and are terminated by a closing parenthesis for the transaction ID and by a "</td>" tag for the item and quantity.
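Under the assumptions just stated, a regex-only sketch might look like the following. It assumes the markup always has the shape of the sample, that each item-title cell is followed by its qty cell, and that the whole file fits in memory. The variable names are made up for illustration. Regexes over HTML are brittle, so treat this as a starting point rather than a robust solution.

```perl
use strict;
use warnings;

# Sample text shaped like the data above (assumed representative).
my $html = <<'END_HTML';
<span class="small">(Transaction ID #5HN04039SW052A35R)</span><br><br>
<td class="item-title" width="40%">Name of the first item</td>
<td align="center" class="qty" width="9%">14</td>
<td class="item-title" width="40%">Name of the second item</td>
<td align="center" class="qty" width="9%">12</td>
END_HTML

# Transaction ID: everything after "Transaction ID #" up to the ")".
my ($transact_id) = $html =~ /Transaction ID #([^)]+)\)/;

# Item titles and quantities: skip the rest of the opening tag ([^>]*>),
# then capture up to the closing </td>.
my @names = $html =~ m{class="item-title"[^>]*>(.*?)</td>}gs;
my @qtys  = $html =~ m{class="qty"[^>]*>\s*(\d+)\s*</td>}gs;

print "$transact_id\n" if defined $transact_id;
for my $i ( 0 .. $#names ) {
    # Pair by position; '?' marks an item with no matching quantity.
    my $qty = defined $qtys[$i] ? $qtys[$i] : '?';
    print "$names[$i] $qty\n";
}
```

The [^>]*> step is what absorbs the variable run of attributes and quotation marks between the known marker and the wanted text.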

        Since what you have is HTML, I think it's better to use an HTML parser module for this job.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TokeParser;

        my $Text;
        {
            local $/;
            $Text = <DATA>;
        }

        my $p = HTML::TokeParser->new( \$Text );

        while ( my $token = $p->get_tag('td') ) {
            my $txt = $p->get_trimmed_text("/td");
            print $txt,"\n";
        }

        __DATA__
        ...<tr><td colspan="3">
        <span><strong>SomeNameHere</strong></span>
        <span class="small">(Transaction ID #5HN04039SW052A35R)</span><br><br>
        </td></tr>
        <tr><td colspan="3"><hr class="dotted"></td></tr>
        <tr><td colspan="3"><br class="h10"></td></tr>
        ...<td class="item-title" width="40%">Name of the first item</td>
        <td align="center" class="qty" width="9%">14</td>...
        <tr><td colspan="3"><hr class="dotted"></td></tr>
        <tr><td colspan="3"><br class="h10"></td></tr>
        ...<td class="item-title" width="40%">Name of the second item</td>
        <td align="center" class="qty" width="9%">12</td>...

        Untested