
Re: Untaint IP address/hostname question

by Juerd (Abbot)
on Mar 08, 2004 at 17:15 UTC ( [id://334877]=note )


in reply to Untaint IP address/hostname question

Regexp::Common's $RE{net}{IPv4} and $RE{net}{domain}{-nospace}

Note that 2130706433 is in fact a valid IP address (equal to 127.0.0.1), so you might just want to try inet_ntoa inet_aton $ip instead. (Both functions can be found in the standard module Socket.)
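As a minimal sketch of the inet_ntoa/inet_aton round trip mentioned above: the helper name normalize_ipv4 is hypothetical and not from the original post. Note that Socket's inet_aton also accepts hostnames and may resolve them, so this normalizes more than just numeric addresses. Whether the system library accepts forms such as 2130706433 depends on the platform, and the sketch makes no claim about taint status.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hypothetical helper: turn whatever inet_aton accepts into dotted-quad form.
# Returns undef when the input cannot be parsed or resolved.
sub normalize_ipv4 {
    my ($ip) = @_;
    my $packed = inet_aton($ip);    # 4-byte packed address, or undef
    return undef unless defined $packed;
    return inet_ntoa($packed);      # always dotted-quad text
}

my $normalized = normalize_ipv4('127.0.0.1');
print defined $normalized ? "$normalized\n" : "not an address\n";
```

Passing the result on, rather than the original string, means later code only ever sees the dotted-quad form.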

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: Re: Untaint IP address/hostname question
by Taulmarill (Deacon) on Mar 08, 2004 at 17:26 UTC
    Remember that umlauts (äöüÄÖÜ) have been valid in German URLs (.de) since 1 March.
    I don't think Regexp::Common covers that.
      Regexp::Common covers what's in RFC 2396 and RFC 2616 when it comes to HTTP URIs. If those RFCs are superseded, I'd be interested in hearing about it.

      Abigail

        It's RFC 3490, Internationalizing Domain Names in Applications (IDNA).
        IDNA adds a mechanism for encoding non-ASCII characters using only the characters already allowed in domain names. The question is whether Regexp::Common should match the encoded or the non-encoded domain names. I would say that the RE should not be changed and that higher-level code should do the translation.
        Just look for "Internationalized Domain Names" in your favorite internet search engine.
        I'm too lazy to look for RFCs right now.
Re: Re: Untaint IP address/hostname question
by iburrell (Chaplain) on Mar 08, 2004 at 19:51 UTC
    2130706433 is not a valid IP address as most people think of them. It is the decimal integer corresponding to the binary IP address for 127.0.0.1. The Unix inet_aton accepts all kinds of non-standard forms for IP addresses. Everyone else thinks that IP addresses are represented as four decimal numbers separated by periods. Using anything else will confuse people and programs that expect the standard form.
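The correspondence between the integer and the dotted form is plain base-256 arithmetic: 127 * 256**3 + 0 * 256**2 + 0 * 256 + 1 = 2130706433. A small sketch using only pack and unpack follows; the function names are hypothetical, chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the 32-bit integer <-> dotted-quad mapping.

# Pack the integer as a big-endian 32-bit value, then read back its four bytes.
sub int_to_dotted_quad {
    my ($n) = @_;
    return join '.', unpack('C4', pack('N', $n));
}

# Pack the four octets as bytes, then read them back as one big-endian integer.
sub dotted_quad_to_int {
    my ($ip) = @_;
    return unpack('N', pack('C4', split(/\./, $ip)));
}

print int_to_dotted_quad(2130706433), "\n";    # 127.0.0.1
print dotted_quad_to_int('127.0.0.1'), "\n";   # 2130706433
```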

      2130706433 is not a valid IP address as most people think of them.

      Likewise, "login=juerd" is not a valid cookie as most people think of them. They expect them to be edible. What most people think and what is technically correct aren't always the same.

      The Unix inet_aton accepts all kinds of non-standard forms for IP addresses.

      Yes, like the ones formed like "127.0.0.1". This is only a de-facto standard, not an official one. It happens to be accepted by almost everything that takes an IP address. Decimal numbers like "2130706433" are also a de-facto standard; they are just not used as much. The libraries found in Unix, Linux, Windows and Mac OS all think "2130706433" and "127.0.0.1" are the same address.

      Everyone else thinks that IP addresses are represented as four decimal numbers separated by periods. Using anything else will confuse people and programs that expect the standard form.

      We could argue about the meaning of "everyone else" or about "anything else", or even about who you think "people" are. Or we could just stick to your point and discuss the "standard" status of dotted decimal IP addresses. That some applications and even some protocols require IP addresses to be stringified like that does not mean that it is the only standard - or that it even is a standard.

      Should you have an STD, RFC or another official document that says more on this subject, I'll be happy to hear about it.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        It is a historical standard because it was implemented in the BSD inet_aton and copied into other implementations. It may even be standardized in POSIX.

        No RFC describes the long form IP address. The RFCs I know that describe grammars for IPv4 addresses only support dotted quad form. This includes URLs.

        You can see a few places where differences between expectations create problems. For example, most web browsers parse out the host portion of the http URL and pass it to inet_aton. So they accept "long form" addresses even when the RFCs say they shouldn't. This is seen with scammers writing URLs like: http://www.example.com@0x7F000001/. They use the username and unexpected IP address syntax to hide the destination.
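To make the trick concrete, here is a rough sketch that pulls the real host out of such a URL and decodes the hex form. Both helper names are hypothetical, and the regex is a simplification for illustration, not a full URI parser.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Hypothetical helper: return the host part of a URL, ignoring any
# "user@" prefix and ":port" suffix. Simplified; not a complete URI parser.
sub host_of_url {
    my ($url) = @_;
    my ($authority) = $url =~ m{\A[a-z][a-z0-9+.\-]*://([^/?#]*)}i
        or return;
    (my $host = $authority) =~ s/\A.*\@//s;    # drop everything up to the last '@'
    $host =~ s/:\d*\z//;                       # drop an optional port
    return $host;
}

# Hypothetical helper: decode a single hex number such as 0x7F000001.
sub hex_host_to_dotted_quad {
    my ($host) = @_;
    return unless $host =~ /\A0x[0-9a-f]{1,8}\z/i;
    return inet_ntoa(pack('N', hex($host)));
}

my $host = host_of_url('http://www.example.com@0x7F000001/');
print "$host\n";                               # 0x7F000001
print hex_host_to_dotted_quad($host), "\n";    # 127.0.0.1
```

The part before the "@" is only a username, so the request goes to 127.0.0.1 and not to www.example.com.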

        Including the long form IP addresses in a regular expression makes it much more complicated. The regex has to match one to three components that could be decimal, hex, or octal numbers, just to accept a format that is only used by a few people.
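For a sense of what such a pattern involves, here is a rough sketch of the syntax only. It assumes the inet_aton convention of one to four dot-separated numbers, each decimal, hex (0x prefix) or octal (leading 0), and it does not check the numeric range of any component. The variable and function names are hypothetical.

```perl
use strict;
use warnings;

# One component: hex (0x...), octal (leading 0), or decimal.
my $component = qr/(?:0[xX][0-9a-fA-F]+|0[0-7]*|[1-9][0-9]*)/;

# One to four components separated by dots. Syntax only: no range checks.
my $inet_aton_like = qr/\A$component(?:\.$component){0,3}\z/;

# Hypothetical helper: true if the string has inet_aton-style syntax.
sub looks_like_inet_aton_form {
    my ($s) = @_;
    return $s =~ $inet_aton_like ? 1 : 0;
}

for my $candidate ('127.0.0.1', '2130706433', '0x7F000001', '0177.0.0.1', '08') {
    printf "%-12s %s\n", $candidate,
        looks_like_inet_aton_form($candidate) ? 'syntax ok' : 'rejected';
}
```

A complete version would still need per-position range checks, since the last component covers all remaining bytes of the address.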
