Extracting variable-length strings between delimiters

PMReader has asked for the wisdom of the Perl Monks concerning the following question:

I had a quick-and-dirty script that worked with the substring function as long as the offset and length of text were the same. This is no longer true. Now the length of text to be extracted is variable (and sometimes contains spaces).

I can determine the text around the substring. Let's say start and end.

I thought I could more generically extract the substring with split /start($_)end/ but I can't get the syntax right and it won't compile.

Here's the original script. It works, finding one $transactID and a dozen or so item/quantity instances in each file, but it's inflexible:

#!/usr/bin/perl -w

use strict;
use warnings;

my $transactID = "item=";      #### Usually a number
my $itemname   = "itemname";   #### Usually alphanumeric
my $quantity   = "qty";        #### Always numeric


open(IN,'ItemsFile.rtf') or die("can't open input file\n");
open(OUT,'>results.txt') or die("can't open output file\n");  

while(<IN>){

  print OUT "0 ", substr($',3,17), ";  " if($_ =~/\b$transactID\b/i); 

  print OUT "1 ", substr($',2,12), ";  " if($_ =~/\b$itemname\b/i); 

  print OUT "2 ", substr($',13,3), ";  " if($_ =~/\b$quantity\b/i);   

}

close(IN);
close(OUT);

Note that start sometimes contains spaces.


Replies are listed 'Best First'.
Re: Extracting variable-length strings between delimiters by 7stud (Deacon) on Feb 12, 2010 at 04:21 UTC
use strict;
use warnings;
use 5.010;

my $string = "startHELLO WORLDend";
$string =~ /start(.*?)end/;
say $1;

--output:--
HELLO WORLD

========

use strict;
use warnings;
use 5.010;

my @strings = (
    'startHELLO WORLDend',
    'startHIend',
    'startOK BYEend',
);

for (@strings) {
    /start(.*?)end/;
    say $1;
}

--output:--
HELLO WORLD
HI
OK BYE
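The examples above hold one start/end pair per string. When a single line contains several pairs, the difference between a greedy and a non-greedy capture starts to matter. The following is a minimal sketch of that difference; the helper names `greedy_between` and `all_between` are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; the names are not from the thread.

# Greedy: (.*) runs on to the LAST "end" in the text.
sub greedy_between {
    my ($text) = @_;
    return $text =~ /start(.*)end/ ? $1 : undef;
}

# Non-greedy with /g in list context: one capture per start/end pair.
sub all_between {
    my ($text) = @_;
    my @found = $text =~ /start(.*?)end/g;
    return @found;
}

my $line = 'startHELLO WORLDend junk startHIend';
print greedy_between($line), "\n";     # swallows everything up to the final "end"
print "$_\n" for all_between($line);   # one line per pair
```

With the sample line, the greedy version returns `HELLO WORLDend junk startHI`, while the non-greedy version returns `HELLO WORLD` and `HI` separately.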
Re: Extracting variable-length strings between delimiters by Anonymous Monk on Feb 12, 2010 at 00:54 UTC
#!/usr/bin/perl --
use strict;
use warnings;

#~ Main(@ARGV);
Main( 'start', 'end', 'start something between end' );
exit(0);

sub Main {
    my ( $st, $en, $in ) = @_;
    print "$1\n" if $in =~ /\Q$st\E(.+?)\Q$en\E/;
    print join "\n", map { "{$_}" } split /(\Q$st\E)(.+?)(\Q$en\E)/, $in;
} ## end sub Main
__END__
 something between 
{}
{start}
{ something between }
{end}

perlintro, perlretut, perlre
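The `\Q...\E` pair in the code above matters once a delimiter contains regex metacharacters such as parentheses. The sketch below assumes a hypothetical helper named `between` and uses a delimiter string modelled on the kind of text the original poster describes; it is an illustration, not a drop-in solution.

```perl
use strict;
use warnings;

# Hypothetical helper: first substring between two LITERAL delimiters, or undef.
sub between {
    my ($start, $end, $text) = @_;
    # \Q...\E quotes metacharacters, so "(" and ")" match themselves
    # instead of opening or closing a capture group.
    return $text =~ /\Q$start\E(.+?)\Q$end\E/s ? $1 : undef;
}

my $text = '<span class="small">(Transaction ID #5HN04039SW052A35R)</span>';
my $id   = between('(Transaction ID #', ')', $text);
print defined $id ? "$id\n" : "no match\n";
```

Without `\Q...\E`, the opening parenthesis in the start delimiter would be read as the start of a group and the pattern would fail to compile.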
Re: Extracting variable-length strings between delimiters by ahmad (Hermit) on Feb 12, 2010 at 02:04 UTC
Where's your data sample?
Re^2: Extracting variable-length strings between delimiters by PMReader (Initiate) on Feb 12, 2010 at 06:31 UTC
While I'm looking at the responses, here's the requested data sample (read in from a file with a consistent name):

...<tr><td colspan="3">
<span><strong>SomeNameHere</strong></span>
<span class="small">(Transaction ID #5HN04039SW052A35R)</span><br><br>
</td></tr>
<tr><td colspan="3"><hr class="dotted"></td></tr>
<tr><td colspan="3"><br class="h10"></td></tr>
...<td class="item-title" width="40%">Name of the first item</td>
<td align="center" class="qty" width="9%">14</td>...
<tr><td colspan="3"><hr class="dotted"></td></tr>
<tr><td colspan="3"><br class="h10"></td></tr>
...<td class="item-title" width="40%">Name of the second item</td>
<td align="center" class="qty" width="9%">12</td>...

From which I want to extract:

5HN04039SW052A35R
Name of the first item
14
Name of the second item
12
...

As you can see, there are varying lengths of unpredictable, non-unique characters, quotation marks and angle brackets between the start string and the desired text. What is known is that the values are always preceded somewhere by "Transaction ID", "item-title" and "qty", and are terminated by a closing parenthesis for the transaction ID and by a "</td>" tag for the item and quantity.
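Given those markers, one possible regex-only approach is sketched below. It assumes the markup always looks like the sample (the `class` attribute precedes the closing `>` of each cell, and every `item-title` cell is followed by its `qty` cell); the helper name `extract_order` and the inline sample are hypothetical. Regexes on HTML are fragile, so treat this as a sketch under those assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the transaction ID followed by [item, quantity] pairs.
sub extract_order {
    my ($html) = @_;

    # The ID sits between "Transaction ID #" and the closing parenthesis.
    my ($id) = $html =~ /Transaction ID\s*#(\w+)\)/;

    # Each item-title cell is assumed to be followed by its qty cell.
    my @items;
    while ($html =~ m{class="item-title"[^>]*>(.*?)</td>.*?class="qty"[^>]*>\s*(\d+)\s*</td>}gs) {
        push @items, [$1, $2];
    }
    return ($id, @items);
}

# Inline sample modelled on the data above; real input would be read from the file.
my $sample = <<'HTML';
<span class="small">(Transaction ID #5HN04039SW052A35R)</span><br><br>
<td class="item-title" width="40%">Name of the first item</td>
<td align="center" class="qty" width="9%">14</td>
<td class="item-title" width="40%">Name of the second item</td>
<td align="center" class="qty" width="9%">12</td>
HTML

my ($id, @items) = extract_order($sample);
print "$id\n" if defined $id;
print "$_->[0]\n$_->[1]\n" for @items;
```

The `/s` modifier lets `.` cross line breaks, which is needed because the title and quantity cells sit on different lines; the whole file therefore has to be read into one string rather than processed line by line.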
Re^3: Extracting variable-length strings between delimiters by ahmad (Hermit) on Feb 12, 2010 at 08:39 UTC
Since what you have is HTML, I think it's better to use an HTML parser module for this job:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Slurp the whole sample from the DATA section
my $Text;
{
    local $/;
    $Text = <DATA>;
}

my $p = HTML::TokeParser->new( \$Text );

# Print the trimmed text of every <td> cell
while ( my $token = $p->get_tag('td') ) {
    my $txt = $p->get_trimmed_text("/td");
    print $txt, "\n";
}

__DATA__
...<tr><td colspan="3">
<span><strong>SomeNameHere</strong></span>
<span class="small">(Transaction ID #5HN04039SW052A35R)</span><br><br>
</td></tr>
<tr><td colspan="3"><hr class="dotted"></td></tr>
<tr><td colspan="3"><br class="h10"></td></tr>
...<td class="item-title" width="40%">Name of the first item</td>
<td align="center" class="qty" width="9%">14</td>...
<tr><td colspan="3"><hr class="dotted"></td></tr>
<tr><td colspan="3"><br class="h10"></td></tr>
...<td class="item-title" width="40%">Name of the second item</td>
<td align="center" class="qty" width="9%">12</td>...

Untested