in reply to Communication of program(s) with regex codeblocks

What would be an example of an ideal container?

Re^2: Communication of program(s) with regex codeblocks
by erix (Prior) on Oct 26, 2004 at 15:46 UTC

    The ideal container program would slurp the regexfile, then apply it to whatever text is to be searched. In the case of a match, it would somehow (and this is my problem) be able to retrieve the captured values (in the proper order) from the regex. This does work in the regex example above, but it is ugly because the program needs to know what to do with '$hashname'.

    My example regex above has this silly ${$hashname} stuff, which (in the test container program I have) is changed into a real variable name for the hash. I have copied the sub below and added some comments; the variables contain what their names suggest.

    And it may well be that it is just not feasible without resorting to globals.

    This sub is called before compiling the regexes. The substitution is its main function.

    sub get_regexes_prepare {           # replaces all ${$hashname}{'abc'}
        no strict 'refs';
        my ($pckg, $rregexes) = @_;     # packagename and hashref are passed
        my @regexes   = @{$rregexes};
        my @hashnames = ();
        $#hashnames   = $#regexes;      # same size
        for (my $i = 0; $i < $#regexes + 1; $i++) {
            my $hashname = "${pckg}::hashname" . $i;    # construct hashname
            $hashnames[$i] = $hashname;
            $regexes[$i] =~ s/\$hashname/\$$hashname/g;
            tie %${$hashname}, "Tie::IxHash";           # keep order
        }
        return (\@regexes, \@hashnames);
    }
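
    A minimal sketch of how a container program might call this sub: slurp the regexfile, prepare and compile the regexes, then match. It assumes the regexfile holds one regex string per line, that the container lives in package 'main', and that the code blocks inside the prepared regexes store their captures in the package hashes named in @hashnames; none of that is shown in the sub above. Note that use re 'eval' is needed to compile interpolated patterns that contain (?{ ... }) blocks.

    use strict;
    use warnings;
    use re 'eval';       # the slurped patterns contain (?{ ... }) code blocks
    use Tie::IxHash;

    # Hypothetical regexfile name; one regex string per line is assumed.
    open my $rxfh, '<', 'article.rx' or die "article.rx: $!";
    chomp(my @raw = <$rxfh>);
    close $rxfh;

    my ($rregexes, $rhashnames) = get_regexes_prepare('main', \@raw);

    my @compiled = map { qr/$_/ } @{$rregexes};   # compile each regex once

    my $text = do { local $/; <STDIN> };          # the text to be searched

    no strict 'refs';                             # hashes are reached by name
    for my $i (0 .. $#compiled) {
        next unless $text =~ $compiled[$i];
        my $name = $rhashnames->[$i];
        # Assuming the regex code blocks filled %{$name} during the match:
        for my $key (keys %{$name}) {             # Tie::IxHash keeps the order
            my $val = ${$name}{$key};
            print "$key: $val\n";
        }
    }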

    I'll post more complete code later (tomorrow or so). But as you can probably guess from this sub, it needs some cleaning up and the removal of some experimental stuff :)

    Thanks

      The ideal container program would slurp the regexfile, then apply it to whatever text that is to be searched. In the case of a match, it would somehow be able to retrieve the captured values in the proper order.

      Doesn't the following do just that?

      my $re = do {
          local *FILE;
          open(FILE, '<', $regexp_file_name)
              or die('...');
          local $/;
          qr/@{[<FILE>]}/   # Compile only once.
      };

      while (<DATA>) {
          if (@captures = $_ =~ $re) {
              print(join(', ', @captures), $/);
          }
      }

      __DATA__
      abd 123 sdafas 231 gdabd
      7364
      112 sdafas 785

      regexp file  (Matches lines with two words of exactly 3 digits.)
      ===========
      \b(\d{3})\b.*?\b(\d{3})\b

      output
      ======
      123, 231
      112, 785

        My 'specification' was incomplete, and I apologise for that. I will give some context to make clear what I am trying to figure out.

        Ultimate goal: extracting information from a wide variety of text files, representing published articles.

        With a largish set of text files, most of them needing their own set of regexes (to split them into usable parts), it would seem to make sense to store each regexfile alongside its textfile. I foresee that many of the strings thus extracted will subsequently need unique code to get that final info-nugget for my database. It would make sense to keep such code associated with the textfile as well. For this, the codeblocks are one candidate; modules are another, and snippet files for a future eval yet another.

        (A concrete example of the textbase: all published articles from one author, 600 text files, ranging from 1 page to 100+ pages, published over a period of 40 years in several journals.)
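
        To make that layout concrete, here is a small sketch of keeping each textfile's regexes, and any article-specific code, right beside it; the directory name and the .rx/.pl extensions are made up for the illustration.

        use strict;
        use warnings;

        sub slurp {
            my ($file) = @_;
            open my $fh, '<', $file or die "$file: $!";
            local $/;                     # read the whole file at once
            return scalar <$fh>;
        }

        for my $textfile (glob 'articles/*.txt') {
            (my $regexfile   = $textfile) =~ s/\.txt\z/.rx/;
            (my $snippetfile = $textfile) =~ s/\.txt\z/.pl/;

            next unless -e $regexfile;    # article without its own regexes yet

            my $text  = slurp($textfile);
            my $regex = slurp($regexfile);

            # Article-specific code kept next to the text, for a later eval.
            my $extra = -e $snippetfile ? slurp($snippetfile) : '';

            # ... apply $regex (and possibly eval $extra) against $text ...
        }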

        So that is the context in which I am trying to find out how 'heavy' those regex codeblocks can be 'loaded' with code, and what communication/steering/instrumentation can be dreamed up. I initially hoped that DBI searches would be possible. This turns out to be almost certainly impossible/unwise, because absolutely no regex can be used inside a codeblock. But I believe subroutines and closures can be defined and called.
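
        As a self-contained illustration of that last point (the pattern, the field names and the sample text are made up, not taken from the project): a code block inside a slurped regex can call an ordinary subroutine, which stores the captures in a Tie::IxHash hash so that their order is preserved.

        use strict;
        use warnings;
        use re 'eval';          # allow (?{ ... }) in an interpolated pattern
        use Tie::IxHash;

        our %captured;
        tie %captured, 'Tie::IxHash';   # keeps fields in the order captured

        sub stash { my ($field, $value) = @_; $captured{$field} = $value; }

        # Stand-in for a pattern slurped from a per-article regexfile.
        my $pattern = q{
            (\d{4}) (?{ stash(year    => $^N) }) \s+
            (\w+)   (?{ stash(journal => $^N) })
        };
        my $re = qr/$pattern/x;

        my $text = "1987 Nature";
        if ($text =~ $re) {
            print "$_ => $captured{$_}\n" for keys %captured;
        }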

        So what I'm trying to do is logically split very 'specific' code across a text-centric system with many disparate textfiles. It may be a Bad Thing - I am not advocating it, but I hope to gain from the experience of others on similar exploits.