Re: generating regexes?

Great problem. It could also be used to figure out which portions of a web page change, so a bot could rip out stories from news sites.

I suggest thinking of this along the lines of a diff. First determine your record delimiter. In a diff the delimiter is a newline; with regular expressions you might go with whitespace (or make it a command-line option). Look up how diff works and use a similar algorithm. Once you find the components that differ, look at the differences: would they both fit in the same character class? Maybe just go down the following list, from most specific to most general, and take the first pattern that describes both.

/^[0-9]$/
/^[0-9]+$/
/^[0-9]*$/
/^[0-9A-Za-z]$/
/^[0-9A-Za-z]+$/
/^[0-9A-Za-z]*$/
/^[0-9A-Za-z.,]$/
/^[0-9A-Za-z.,]+$/
/^[0-9A-Za-z.,]*$/
/^.$/
/^.+$/

#Giving up
/^.*$/

This might be good for a first pass.
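To make the idea concrete, here is a rough sketch of that first pass in Perl. It is only an illustration under some assumptions: the names `first_common_body` and `generalize` are made up, the two examples are split on whitespace, and fields are compared position by position rather than with a real diff (so it simply gives up when the field counts differ). Something like Algorithm::Diff from CPAN could stand in for the alignment step.

```perl
use strict;
use warnings;

# The ladder from the list above, most specific first, stored without
# the ^...$ anchors so the bodies can be reused in a larger regex.
my @ladder = (
    '[0-9]',          '[0-9]+',          '[0-9]*',
    '[0-9A-Za-z]',    '[0-9A-Za-z]+',    '[0-9A-Za-z]*',
    '[0-9A-Za-z.,]',  '[0-9A-Za-z.,]+',  '[0-9A-Za-z.,]*',
    '.',              '.+',
    '.*',             # giving up
);

# Return the first ladder entry that matches every sample, or nothing
# if none does (possible only when a sample has an embedded newline).
sub first_common_body {
    my @samples = @_;
    for my $body (@ladder) {
        my $re = qr/^$body$/;
        return $body unless grep { $_ !~ $re } @samples;
    }
    return;
}

# Compare two examples field by field. Equal fields stay literal,
# differing fields are replaced by the first ladder entry that fits both.
sub generalize {
    my ($x, $y) = @_;
    my @x = split ' ', $x;    # whitespace is the assumed record delimiter
    my @y = split ' ', $y;
    return unless @x == @y;   # a real diff would cope with inserts/deletes
    my @parts;
    for my $i (0 .. $#x) {
        if ($x[$i] eq $y[$i]) {
            push @parts, quotemeta $x[$i];
        }
        else {
            my $body = first_common_body($x[$i], $y[$i]);
            return unless defined $body;
            push @parts, $body;
        }
    }
    return '^' . join('\s+', @parts) . '$';
}

print generalize('Order 1234 shipped', 'Order 98 shipped'), "\n";
# prints: ^Order\s+[0-9]+\s+shipped$
```

Keeping the ladder as plain strings instead of `qr//` objects is a deliberate choice here: it lets the chosen class be spliced straight into the generated pattern.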
----
I always wanted to be somebody... I guess I should have been more specific.


Replies are listed 'Best First'.
Re: Re: generating regexes? by mortis (Pilgrim) on Nov 20, 2001 at 00:04 UTC
Actually, that is what I want it for. I've got code that's parsing apart web pages to extract data, and I want it to know when it's not extracting the data correctly. The prototype regex code is used so the parser can generate a 'signature' (a regex) that describes the data to be extracted (based on an example set), which it can then use to validate that further data matches the same 'signature'.

As far as the parsing logic, we're using landmark-based location identification: move forward past 'New Questions', move forward past 'lastnode_id', move forward past '>', extract to '<'. And so on...

Kyle
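For readers following along, the landmark walk and the signature check might look roughly like this in Perl. This is a sketch, not the actual parser: `extract_after_landmarks` and `matches_signature` are hypothetical names, the page fragments are made up, and the signature `^[0-9]+$` is assumed to have been generated from earlier examples.

```perl
use strict;
use warnings;

# Move forward past each landmark in order, then return the text up to
# the stop string. Returns undef if a landmark or the stop string is
# missing, which already suggests the page layout has changed.
sub extract_after_landmarks {
    my ($text, $landmarks, $stop) = @_;
    my $pos = 0;
    for my $mark (@$landmarks) {
        my $at = index($text, $mark, $pos);
        return if $at < 0;
        $pos = $at + length $mark;
    }
    my $end = index($text, $stop, $pos);
    return if $end < 0;
    return substr($text, $pos, $end - $pos);
}

# True only if something was extracted and it fits the signature.
sub matches_signature {
    my ($value, $signature) = @_;
    return defined $value && $value =~ $signature ? 1 : 0;
}

my @landmarks = ('New Questions', 'lastnode_id', '>');
my $signature = qr/^[0-9]+$/;    # assumed: learned from an example set

# Made-up fragments: one in the expected layout, one after a redesign.
my $good = 'New Questions <a href="?lastnode_id=479">126696</a>';
my $bad  = 'New Questions <a href="?lastnode_id=479"><b>Log in</b></a>';

for my $page ($good, $bad) {
    my $value = extract_after_landmarks($page, \@landmarks, '<');
    my $shown = defined $value ? "'$value'" : 'nothing';
    my $ok    = matches_signature($value, $signature);
    print "extracted $shown: ", ($ok ? 'ok' : 'signature mismatch'), "\n";
}
```

In the second fragment the landmarks are all still present, so the walk succeeds but extracts an empty string; only the signature check catches that the data is wrong.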
