regex help

jai has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file with contents similar to the following:

5.1. GetTagBytestoWrite
 This function returns the number of bytes taken by the Tag of the ASN object. It Scans through the BER/DER encoded String and finds the no of bytes taken by the Tag of a given ASN.1 Object.

Prototype:
 int
GetTagbytestoWrite(unsigned char *tstr,int *count)
Parameters:
*tstr Contents
of the ASN.1 Object in a string
*count a
pointer to an integer to hold the address of the variable holding the number of
bytes the tag value takes to store itself.

5.2. GetLenBytestoWrite
This function returns the no of Octets taken by the Length
field of a given ASN.1 Object. It Scans through the BER/DER encoded String and finds the no of bytes taken by the Length field of a given ASN.1 Object.
Prototype:
 int
GetLenbytestoWrite(unsigned char *pstr,int *count)
Parameters:
*pstr Contents
..

I need to extract the function definition from the text file. The output should be similar to the following.

5.1. GetTagBytestoWrite
 This function returns the number of bytes taken by the Tag of the ASN object. It Scans through the BER/DER encoded String and finds the no of bytes taken by the Tag of a given ASN.1 Object.

5.2. GetLenBytestoWrite
This function returns the no of Octets taken by the Length
field of a given ASN.1 Object. It Scans through the BER/DER encoded String and finds the no of bytes taken by the Length field of a given ASN.1 Object.
..

I did something like this..

#!/usr/bin/perl

my ($buf);

open (FILE,"./ASN_tech.htm") or die "Unable to open: $!";
$buf=join '',<FILE>;
close FILE;

$buf=~s/(\d*\.\d*\.)\s*(\w+)(.*)?(Prototype:)/print "$1 $2\n"/gem;

But this doesn't seem to work.. any help would be greatly appreciated..

jai


Replies are listed 'Best First'.
Re: regex help by tachyon (Chancellor) on Sep 16, 2003 at 07:10 UTC
In a nutshell, you seem to assume .* will match newlines. It won't without /s. Once you add /s, .* will match everything, so you generally need .*? to make it non-greedy, but this involves backtracking and is inefficient. See Death To Dot Star!. Note that dot star does have its uses, but there are often better ways to skin your cat.

You also use m/\d*\.\d*\./, which will match '..', which is not what you want. It should be \d+ in the context shown.

Essentially all you are doing is grabbing two lines at a time (by the looks of it). Here are some example approaches. The best one depends on the exact format and consistency of the data. Where possible, using the input record separator to chunkify an input stream into records is often the easiest approach. You can often then just split the record to get the bits you want. Anyway:

# example 1
my $flag = 0;
while(<DATA>) {
    next unless $flag or m/^\d+\.\d+/;
    print;
    $flag ^= 1;
}

# example 2 depends on newline before Prototype section
# using input record separator to do the work
$/ = "\n\n";
while(<DATA>) {
    print if m/^\d+\.\d+\./;
}

# example 3 one of many possible REs
local $/;
$data = <DATA>;
@chunks = $data =~ m/^(\d+\.\d+\.[^\n]+\n[^\n]+\n)/gm;
print @chunks;

# example 4 yet another way to do it (LIKE YOUR ORIGINAL)
local $/;
$data = <DATA>;
@chunks = $data =~ m/(^\d+\.\d+\..*?(?=Prototype:))/gsm;
print @chunks;

__DATA__
5.1. GetTagBytestoWrite
This function returns the number of bytes taken by the Tag of the ASN object. It Scans through the BER/DER encoded String and finds the no of bytes taken by the Tag of a given ASN.1 Object.

Prototype:
int GetTagbytestoWrite(unsigned char *tstr,int *count)
Parameters:
*tstr Contents of the ASN.1 Object in a string
*count a pointer to an integer to hold the address of the variable holding the number of bytes the tag value takes to store itself.

5.2. GetLenBytestoWrite
This function returns the no of Octets taken by the Length field of a given ASN.1 Object.
It Scans through the BER/DER encoded String and finds the no of bytes taken by the Length field of a given ASN.1 Object.
Prototype:
int GetLenbytestoWrite(unsigned char *pstr,int *count)
Parameters:
*pstr Contents
..

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
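To make the /s and greediness points above concrete, here is a minimal sketch. The sample text and the helper name first_capture are made up for illustration and are not the poster's real file. It runs the same pattern three ways: without /s, greedy with /s, and non-greedy with /s.

```perl
use strict;
use warnings;

# Hypothetical two-record sample, not the poster's real file.
my $text = "1.1. Foo\nfoo text\nPrototype:\nint Foo()\n"
         . "1.2. Bar\nbar text\nPrototype:\nint Bar()\n";

# Hypothetical helper: first capture of $re against $text, or undef on no match.
sub first_capture {
    my ($re) = @_;
    return $text =~ $re ? $1 : undef;
}

# Without /s the dot stops at the newline, so "Prototype:" is never reached.
my $no_s   = first_capture(qr/^(\d+\.\d+\..*)Prototype:/m);
# With /s the greedy .* runs on to the LAST "Prototype:" in the text.
my $greedy = first_capture(qr/^(\d+\.\d+\..*)Prototype:/ms);
# With /s the non-greedy .*? stops at the FIRST "Prototype:".
my $lazy   = first_capture(qr/^(\d+\.\d+\..*?)Prototype:/ms);

print defined $no_s ? "matched without /s\n" : "no match without /s\n";
print "greedy captured ", length($greedy), " chars, lazy captured ", length($lazy), "\n";
print $lazy;
```

The greedy capture swallows the second record's heading as well, which is why the non-greedy form (or a record-based approach) is usually what you want here.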
Re: regex help by leriksen (Curate) on Sep 16, 2003 at 07:55 UTC
Another take, using many of the same constructs as the OP, but also keeping tachyon's advice on .* in mind. Generally, if you use a *, think twice. E.g. you originally used `\d*\.\d*\.`, but if you think about it, you probably don't want to match '..', which \d* allows (zero or more matches).

#!/usr/bin/perl -w
use strict;

my @buf = <DATA>;
chomp @buf; # chop off all the line endings
my $doc = join ' ', @buf; # one nice juicy line
my $doc_id = qr(\d+\.\d+\.); # precomp regex - also self documenting

while ($doc =~ /($doc_id)\s+(\w+)(.*?)Prototype:.*?(?=$doc_id)?/g) {
    print "$1 $2 $3\n";
}

__DATA__
5.1. GetAFunctionHere
This is the description of a function
Which spans a few lines
Prototype:
int GetAFunctionHere(int count)
Parameters:
count probably some kind of spelling object
5.2. GetMoreFunctionsHere
This is the description of another function
Which spans a few more lines
Prototype:
int GetMoreFunctionHere(float crash)
Parameters:
crash probably some kind of rowing object

The (?=...) is a fancy way of saying 'look ahead till you find another doc_id string, but also start the next match from that doc_id'. It is called a zero-width positive look-ahead assertion. Zero-width means it does not go into the $& match string (I think); strings are normally consumed by $& as they match. Positive means find this pattern in the source string (as opposed to 'find not this pattern in the string'). Look-ahead means peek forward, but do not move the regex internal position counter (accessible via pos()). And assertion, well, that just seems redundant, like you'd go to all that trouble and then say 'but that's just a suggestion'.

I put the trailing ? after the (?=...) to catch the last doc string.
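The zero-width behaviour described above can be checked with pos(). This is a sketch only: the sample string is made up, and it swaps the trailing ? for an explicit |\z alternative inside the look-ahead, which is a substitution of mine rather than what the post uses. After each match it records where the next match attempt will begin.

```perl
use strict;
use warnings;

# Hypothetical one-line document, in the same shape as the joined $doc above.
my $doc    = "5.1. Alpha first text Prototype: int Alpha() "
           . "5.2. Beta second text Prototype: int Beta()";
my $doc_id = qr/\d+\.\d+\./;

my (@records, @stops);
# The look-ahead succeeds at the next doc_id or at end of string,
# but consumes nothing, so the next match can start on that doc_id.
while ($doc =~ /($doc_id)\s+(\w+)\s+(.*?)\s*Prototype:.*?(?=$doc_id|\z)/g) {
    push @records, "$1 $2: $3";
    push @stops, pos($doc);    # offset where the next match attempt begins
}

print "$_\n" for @records;
print "text at first stop: ", substr($doc, $stops[0], 4), "\n";
```

The first recorded offset points at the start of the second section number, which shows that the look-ahead found it without moving past it.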
Re: regex help by delirium (Chaplain) on Sep 16, 2003 at 12:46 UTC
This would be a great job for a Perl one-liner with paragraph mode enabled. If there were a blank line between all the function definitions and the prototypes, you could use a simple: `perl -n00e 'print if /^\d/;' filename`

However, since the second prototype isn't preceded by a blank line, you need to take that into account, with something like: `perl -n00e 'print $1 if /^(\d(?:(?!\nPrototype).)+)/ms' filename`

This will search for any paragraph that starts with a number, then take as many characters as it can, stopping just before the first \nPrototype. The look-ahead is checked before each character is consumed, so the last character of the description is kept. If the file doesn't have a blank line before each function definition (except the first), then this won't work, though.

I noticed your code references an HTML file. Are we looking at the raw file here, or a cut-n-paste from a browser? If the raw file is laid out with HTML tags, that will complicate things, and these solutions may not work.
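For anyone who prefers a script to a one-liner, the paragraph-mode idea above can be sketched as follows. The sub name extract_descriptions and the sample text are hypothetical, and the input is read from an in-memory filehandle only so that the example is self-contained. Setting $/ to the empty string is the script equivalent of the -00 switch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the numbered description blocks found in $text.
sub extract_descriptions {
    my ($text) = @_;
    my @found;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '';    # paragraph mode, same effect as -00 on the command line
    while (my $para = <$fh>) {
        # Keep paragraphs that start with a section number,
        # cutting the capture off just before "\nPrototype" if present.
        if ($para =~ /^(\d+\.\d+\.(?:(?!\nPrototype).)+)/s) {
            (my $desc = $1) =~ s/\s+\z//;    # drop trailing blank lines
            push @found, $desc;
        }
    }
    close $fh;
    return @found;
}

# Made-up sample: the second record has no blank line before "Prototype:".
my $sample = <<'END';
5.1. First
First description.

Prototype:
int First(void)

5.2. Second
Second description.
Prototype:
int Second(void)
END

print "$_\n---\n" for extract_descriptions($sample);
```

As with the one-liners, this still relies on a blank line before each numbered heading.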