PerlMonks  

Re: generating regexes?

by atlantageek (Monk)
on Nov 19, 2001 at 23:55 UTC ( [id://126359] )


in reply to generating regexes?

Great problem. It could also be used to figure out which portions of a web page change, so a bot could rip out stories from news sites. I suggest thinking of this along the lines of a diff. First determine your record delimiter. In a diff the delimiter is a newline. With regular expressions, however, you might go with whitespace (or this could be a command-line option). Look up diff and use a similar algorithm. Once you find the components that differ, look at the differences. Would they both fit in the same character class? Maybe just go down the following list and take the first pattern that describes both.
/^[0-9]$/
/^[0-9]+$/
/^[0-9]*$/
/^[0-9A-Za-z]$/
/^[0-9A-Za-z]+$/
/^[0-9A-Za-z]*$/
/^[0-9A-Za-z.,]$/
/^[0-9A-Za-z.,]+$/
/^[0-9A-Za-z.,]*$/
/^.$/
/^.+$/
#Giving up
/^.*$/
This might be good for a first pass.
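A rough sketch of that first pass, in case it helps. It is not a real diff: it assumes both examples split on whitespace into the same number of fields and simply compares them position by position, so insertions and deletions are not handled. The names first_fit and signature, and the sample strings, are made up for illustration.

```perl
use strict;
use warnings;

# Candidate patterns, most specific first (the list from the post).
my @ladder = (
    '[0-9]', '[0-9]+', '[0-9]*',
    '[0-9A-Za-z]', '[0-9A-Za-z]+', '[0-9A-Za-z]*',
    '[0-9A-Za-z.,]', '[0-9A-Za-z.,]+', '[0-9A-Za-z.,]*',
    '.', '.+',
    '.*',    # giving up
);

# Return the first ladder entry that matches every sample in full.
sub first_fit {
    my @samples = @_;
    for my $pat (@ladder) {
        my $re = qr/^$pat$/;
        return $pat unless grep { $_ !~ $re } @samples;
    }
    return '.*';
}

# Build a signature from two whitespace-delimited example strings.
# Assumption: both have the same number of fields. A proper diff
# would be needed to line up fields that were added or removed.
sub signature {
    my ($x, $y) = @_;
    my @left  = split ' ', $x;
    my @right = split ' ', $y;
    die "field counts differ\n" unless @left == @right;

    my @parts;
    for my $i (0 .. $#left) {
        if ($left[$i] eq $right[$i]) {
            push @parts, quotemeta $left[$i];    # unchanged field: keep it literal
        }
        else {
            push @parts, first_fit($left[$i], $right[$i]);
        }
    }
    return '^' . join('\s+', @parts) . '$';
}

# Hypothetical example data.
print signature('Price: 42 USD', 'Price: 1999 USD'), "\n";
```

Fields that are identical in both examples stay literal, and only the fields that differ are generalized, which keeps the resulting regex as tight as the ladder allows.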
----
I always wanted to be somebody... I guess I should have been more specific.

Replies are listed 'Best First'.
Re: Re: generating regexes?
by mortis (Pilgrim) on Nov 20, 2001 at 00:04 UTC
    Actually, that is what I want it for. I've got code that's parsing apart web pages to extract data, and I want it to know when it's not extracting the data correctly. The prototype regex code is used so the parser can generate a 'signature' (regex) that describes the data to be extracted (based on an example set), which it can then use to validate that further information matches the same 'signature'.

    As for the parsing logic, we're using landmark-based location identification: move forward past 'New Questions', move forward past 'lastnode_id', move forward past '>', extract to '<'. And so on...

    Kyle
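
    For readers who have not seen landmark-based extraction, the steps described above might look roughly like the sketch below. This is only an illustration built on index and substr, not the actual parser from the post; the sub name extract_after and the sample HTML are invented.

```perl
use strict;
use warnings;

# Walk past each landmark in order, then return the text up to $stop.
# Returns undef if any landmark or the stop string is missing.
sub extract_after {
    my ($text, $landmarks, $stop) = @_;
    my $pos = 0;
    for my $mark (@$landmarks) {
        my $at = index($text, $mark, $pos);
        return if $at < 0;
        $pos = $at + length($mark);
    }
    my $end = index($text, $stop, $pos);
    return if $end < 0;
    return substr($text, $pos, $end - $pos);
}

# Hypothetical page fragment, made up for this example.
my $html = '<b>New Questions</b> <a href="?lastnode_id=479">generating regexes?</a>';

my $title = extract_after($html, ['New Questions', 'lastnode_id', '>'], '<');
print defined $title ? "$title\n" : "landmark not found\n";
```

    A missing landmark yields undef, which catches pages whose layout has changed; a generated 'signature' regex would cover the other case, where the landmarks are still found but the extracted text no longer looks like the examples.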
