You might also want to consider a different approach. It's really hard to define what a "valid" URL is. Maybe you only need http://, or maybe http:// and ftp://, and so on. Then there's the problem of nonstandard URLs that I'm sure someone is using or will start to use. For instance, if Microsoft released a product that used a URL like bill://, you might have to support it, even if it's not in a standard.

Rather than trying to validate the entire URL with one regex, break it into parts, then test them. For instance, test the bit you think is a host name by running gethostbyname() and test the part that names the protocol by running getservbyname().
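Here's a rough sketch of what I mean, not a finished validator. The names split_url and check_url, and the scheme_check/host_check options, are made up for this example. It assumes the protocol is listed as a tcp service in your services database, and it ignores userinfo (user:pass@host) and bracketed IPv6 hosts entirely.

```perl
use strict;
use warnings;

# Pull out just the scheme and the host with a deliberately loose regex.
# Returns an empty list if the string doesn't look like scheme://host at all.
sub split_url {
    my ($url) = @_;
    return unless defined $url;
    my ($scheme, $host) =
        $url =~ m{\A([A-Za-z][A-Za-z0-9+.-]*)://([^/:?\#\s]+)}
        or return;
    return (lc $scheme, lc $host);
}

# Returns 1 if both parts pass their lookup, 0 otherwise.
# The two checks can be swapped out (handy for testing without a network).
sub check_url {
    my ($url, %opt) = @_;

    # Default scheme check: is the name a known tcp service?
    my $scheme_ok = $opt{scheme_check} || sub {
        my $port = getservbyname($_[0], 'tcp');   # scalar context: port or undef
        return defined $port;
    };

    # Default host check: does the name resolve?
    my $host_ok = $opt{host_check} || sub {
        my $addr = gethostbyname($_[0]);          # scalar context: address or undef
        return defined $addr;
    };

    my ($scheme, $host) = split_url($url) or return 0;
    return 0 unless $scheme_ok->($scheme);
    return 0 unless $host_ok->($host);
    return 1;
}

# Check whatever URLs were given on the command line.
for my $url (@ARGV) {
    printf "%-40s %s\n", $url, check_url($url) ? 'plausible' : 'rejected';
}
```

The regex only has to find the two pieces; the system's own lookups decide whether they mean anything. Keep in mind that the host check does a real name lookup, so it needs a working resolver and can be slow.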

This takes some of the strain off your regex. The best part is, you don't have to update your script to keep up with changes in the world. If a new bill:// protocol comes out (and you keep your /etc/services file up to date), your script won't miss a beat. Even more likely is a new top-level domain, which the host name lookup handles without any change on your end.

Of course, this will impact performance, so you need to ask yourself how fast this needs to be and how thoroughly it needs to check the URL. If letting a bad URL through is just a little annoying, it might be easiest to cull the really egregious offenders and let the slippery ones pass. If, on the other hand, you really suffer when a bad URL makes it past this test, it might be worth the clock cycles.

In reply to Re: regex to match URLs by pileofrogs
in thread regex to match URLs by Anonymous Monk



	PerlMonks