Drigan has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks, I'm looking through a large amount of data on a regular basis to make sure that no copyrights get missed. To spot a line that has a copyright in it, I'm currently looking for this:

^(?=.*\Wcopyright(\W|ed\W)|\s\(c\)\s)(?=.*\D(19|20)\d\d\D)

This catches anything with a year from 1900 to 2099 and the words "Copyright", "Copyrighted", or "(c)". This does the job, but... I'd like to crank up the speed any way that I can, and looking through a 100,000-character (binary) line twice seems a little unnecessary when I know that the year will be within 200 characters of the copyright announcement.
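For reference, here is a small sketch that runs the pattern above against a few lines. The /i flag, the helper name matches_original, and the sample lines are assumptions made for illustration; none of them come from the real data. It also shows some edge cases of the pattern as written: the "(c)" alternative is not preceded by .*, so it is only tried at the start of the line, and the \W and \D guards need an actual character before the word and after the year.

```perl
use strict;
use warnings;

# The pattern from the question; /i is assumed so that "Copyright" matches.
my $orig = qr{^(?=.*\Wcopyright(\W|ed\W)|\s\(c\)\s)(?=.*\D(19|20)\d\d\D)}i;

# Hypothetical helper: 1 if the line matches the original pattern, else 0.
sub matches_original {
    my ($line) = @_;
    return $line =~ $orig ? 1 : 0;
}

# Made-up sample lines.
my @samples = (
    'Portions are Copyright 2009 Example Corp.',    # matches
    'Portions are copyrighted, 1999, by someone.',  # matches
    ' (c) 2009 Example Corp.',                      # matches: "(c)" at line start
    'Copyright 2009 Example Corp.',                 # no: nothing before "Copyright" for \W
    'Text (c) 2009 Example Corp.',                  # no: "(c)" branch has no leading .*
    'Portions are Copyright 2009',                  # no: nothing after the year for \D
);

for my $line (@samples) {
    printf "%-3s %s\n", ( matches_original($line) ? 'yes' : 'no' ), $line;
}
```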

I thought I could anchor off of the copyright announcement, and then look at my previous 200 and next 200 characters to see if they contain a year, eliminating nearly half of my search zone for long strings.

Something like this:
^.*\Wcopyright(\W|ed\W)|\s\(c\)\s((?=.{0,200}\D(19|20)\d\d\D)|(?<=.{0,200}\D(19|20)\d\d\D.{0,200}))

When I tried this, I got "Variable length lookbehind not implemented in regex"

How can I make this search more efficient?

Replies are listed 'Best First'.
Re: Lookaround Assertions
by moritz (Cardinal) on Jul 21, 2009 at 21:39 UTC
    You can try not using look-around assertions at all, for example by splitting the pattern into two ordinary alternatives:
    my $copyright = qr{\b copyright (?:ed)? \b }xi;
    my $date = qr{\b (?:19|20) \d\d \b }x;
    if ($str =~ m/$copyright.{0,200}$date|$date.{0,200}$copyright/) {...}

    You'd have to check with your real data whether it's actually faster, but chances are that it will be, because it doesn't try a look-around at every string position.
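A minimal, self-contained sketch of the two-alternative idea follows. The "(c)" branch is an addition taken from the original question rather than from the snippet above, and the helper name has_notice is hypothetical. Note that . does not match a newline unless /s is added.

```perl
use strict;
use warnings;

# Copyright marker: the word (optionally with "ed") or a literal "(c)".
# The "(c)" branch is an assumption based on the question.
my $copyright = qr{ (?: \b copyright (?:ed)? \b | \(c\) ) }xi;
my $date      = qr{ \b (?:19|20) \d\d \b }x;

# Either order, with at most 200 characters between the two tokens.
my $notice = qr{ $copyright .{0,200} $date | $date .{0,200} $copyright }x;

# Hypothetical helper: 1 if a marker and a year are close together, else 0.
sub has_notice {
    my ($line) = @_;
    return $line =~ $notice ? 1 : 0;
}

print has_notice('Copyright 2009 Example Corp.') ? "found\n" : "not found\n";
```

Whether this beats the two look-aheads depends on the data, so it is worth timing both on real lines (the core Benchmark module is one way to do that).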

    If you want to stick with your look-around approach, you can use look-aheads, which support variable width regexes.
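If the look-around style is preferred, one possible sketch is to anchor on whichever token comes first and look ahead for the other, so no look-behind is needed. The helper name first_token_offset is hypothetical, and the sub-patterns are the same assumptions as in the reply above.

```perl
use strict;
use warnings;

my $copyright = qr{\bcopyright(?:ed)?\b}i;
my $date      = qr{\b(?:19|20)\d\d\b};

# Match the first token, then require the other one within the next
# 200 characters using a (variable-width) look-ahead.
my $notice = qr{
      $copyright (?= .{0,200} $date )
    | $date      (?= .{0,200} $copyright )
}x;

# Hypothetical helper: offset of the first token of a notice, or undef.
sub first_token_offset {
    my ($line) = @_;
    return $line =~ $notice ? $-[0] : undef;
}

my $offset = first_token_offset('xx Copyright 2009');
print defined $offset ? "notice starts at offset $offset\n" : "no notice\n";
```

Because the look-ahead consumes nothing, only the first token is part of the match, which can be handy if the match position is needed afterwards.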

OT: copyright parsing
by Your Mother (Archbishop) on Jul 21, 2009 at 21:56 UTC

    I'm the resident wet-blanket. I'm not sure what your project covers but putting a notice in the text is not a legal requirement to secure copyrights. So any text lacking the things you are looking for is quite likely to be copyrighted too. Be careful and if you or your employer is doing a serious project, get a lawyer to sign off on all your practices.

Re: Lookaround Assertions
by johngg (Canon) on Jul 21, 2009 at 21:40 UTC

    How about using substr to look only in a window around the copyright text, something like (not tested)

    my $pos;
    if ( $string =~ m{\Wcopyright(?:\W|ed\W)|\s\(c\)\s} ) {
        # $-[0] is the offset where the match began
        # (pos() is only set by //g matches)
        $pos = $-[0] - 200;
        $pos = 0 if $pos < 0;
    }
    else {
        warn qq{Copyright text not found\n};
        next;    # or whatever
    }
    my $date;
    $date = $1 if substr( $string, $pos, 400 ) =~ m{\D((?:19|20)\d\d)\D};
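The same windowing idea can be wrapped in a function, sketched below. The name year_near_notice, the $radius parameter, and the use of \b instead of \W and \D are assumptions for illustration. Only the first copyright marker in the string is examined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the year found within $radius characters of
# the first copyright marker, or undef if there is no marker or no year.
sub year_near_notice {
    my ( $string, $radius ) = @_;
    $radius = 200 unless defined $radius;

    return unless $string =~ m{\bcopyright(?:ed)?\b|\(c\)}i;

    # @- and @+ hold the start and end offsets of the last successful match.
    my ( $start, $end ) = ( $-[0], $+[0] );
    my $from = $start - $radius;
    $from = 0 if $from < 0;

    # Window: $radius characters before the marker through $radius after it.
    my $window = substr( $string, $from, $end + $radius - $from );

    return $window =~ m{\b((?:19|20)\d\d)\b} ? $1 : undef;
}

my $year = year_near_notice('Copyright 2009 Example Corp.');
print defined $year ? "year: $year\n" : "no year near a notice\n";
```

The year search then only ever scans a few hundred characters, regardless of how long the line is. One caveat: cutting the window can split a longer digit run at its edge, so a stricter check may be wanted for data with long numbers.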

    I hope this is useful.

    Cheers,

    JohnGG

    Update: Corrected warning line, I'd mixed double quotes and a qq{...} quoting construct :-/

Re: Lookaround Assertions
by AnomalousMonk (Archbishop) on Jul 22, 2009 at 01:27 UTC
    I like moritz's approach of completely avoiding the use of look-around assertions.

    However, if you are using 5.10 and absolutely must have a variable-width look-behind, the \K "keep" assertion (if that's the proper term) provides a way. See (?<=pattern) \K in the Look-Around Assertions section of perlre.
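A rough sketch of how \K might be applied to the "year before the copyright word" case follows. The helper name notice_after_year and the sub-patterns are assumptions. Unlike a true look-behind, the text before \K is still consumed by the match, which matters for /g loops and overlapping matches.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

my $date = qr{\b(?:19|20)\d\d\b};

# Match a year plus up to 200 characters, then \K drops everything matched
# so far from the reported match, leaving only the copyright word in $&.
my $after_year = qr{ $date .{0,200}? \K \b copyright (?:ed)? \b }xi;

# Hypothetical helper: the copyright word that follows a year, or undef.
sub notice_after_year {
    my ($line) = @_;
    return $line =~ $after_year ? $& : undef;
}

my $word = notice_after_year('In 2009 this was copyrighted by X');
print defined $word ? "matched: $word\n" : "no match\n";
```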