Remember that umlauts (äöüÄÖÜ) have been valid in German (.de) URLs since 1 March.
I don't think Regexp::Common covers that. | [reply] |
Regexp::Common covers what's in RFC 2396 and RFC 2616 when it comes to HTTP URIs. If those RFCs have been superseded, I'd be interested in hearing about them.
Abigail
| [reply] |
It's RFC 3490, Internationalizing Domain Names in Applications (IDNA).
| [reply] |
Internationalized Domain Names adds a mechanism to encode non-ASCII characters using only the characters already allowed in domain names. The question is whether Regexp::Common should match the encoded or the non-encoded domain names. I would say that the regex should not be changed and that higher-level code should do the translation.
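To make the "translate first, then match" idea concrete, here is a minimal sketch in Python rather than Perl. It assumes Python's built-in idna codec (which implements RFC 3490) for the translation. The ASCII_HOST pattern and both helper names are hypothetical stand-ins for whatever RFC-based regex the application already uses; they are not Regexp::Common's actual pattern.

```python
import re

# Hypothetical ASCII-only hostname pattern (letters, digits, hyphens, dots).
# It stands in for an existing RFC-based regex and is deliberately left unchanged.
ASCII_HOST = re.compile(
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}"
)

def to_ascii_host(host: str) -> str:
    """Translate a Unicode hostname to its ASCII (Punycode) form."""
    return host.encode("idna").decode("ascii")

def is_valid_host(host: str) -> bool:
    """Translate first, then match against the ASCII-only pattern."""
    try:
        ascii_host = to_ascii_host(host)
    except UnicodeError:
        # Raised for labels that cannot be encoded (empty, too long, ...).
        return False
    return ASCII_HOST.fullmatch(ascii_host) is not None

print(to_ascii_host("bücher.de"))  # xn--bcher-kva.de
print(is_valid_host("bücher.de"))  # True
```

With this split, the regex never needs to know about umlauts; only the translation layer does.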
| [reply] |
Just look for "Internationalized Domain Names" in your favorite internet search engine.
I'm too lazy to look up RFCs right now.
| [reply] |
2130706433 is not a valid IP address as most people think of them. It is the decimal integer corresponding to the 32-bit binary form of 127.0.0.1. The Unix inet_aton accepts all kinds of non-standard forms for IP addresses. Everyone else thinks that IP addresses are represented as four decimal numbers separated by periods. Using anything else will confuse people and programs that expect the standard form.
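For anyone who wants to check the arithmetic, the correspondence between the two forms can be sketched like this. The function names are made up for illustration; each of the four octets is simply one byte of a 32-bit integer, most significant first.

```python
def quad_to_int(dotted: str) -> int:
    """Pack four decimal octets into one 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def int_to_quad(value: int) -> str:
    """Unpack a 32-bit integer into dotted-decimal notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(quad_to_int("127.0.0.1"))  # 2130706433
print(int_to_quad(2130706433))   # 127.0.0.1
```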
| [reply] |
2130706433 is not a valid IP address as most people think of them.
Likewise, "login=juerd" is not a valid cookie as most people think of them. They expect them to be edible. What most people think and what is technically correct aren't always the same.
The Unix inet_aton accepts all kinds of non-standard forms for IP addresses.
Yes, like the ones formed like "127.0.0.1". This is only a de-facto standard, not an official one. It happens to be accepted by almost everything that takes an IP address. Decimal numbers like "2130706433" are also a de-facto standard; they are just not used as much. The libraries found in Unix, Linux, Windows and Mac OS all think "2130706433" and "127.0.0.1" are the same address.
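A quick way to try this claim is through Python's socket module, which hands the string to the platform's own address parser. This is only a sketch; the result depends on the C library underneath, so treat it as a check on your own system rather than a guarantee for every platform.

```python
import socket

# Both calls go through the system's inet_aton-style parser.
from_dotted = socket.inet_aton("127.0.0.1")
from_integer = socket.inet_aton("2130706433")

# If the library treats both spellings as the same address,
# the packed 4-byte results are identical.
print(from_dotted == from_integer)
print(socket.inet_ntoa(from_integer))
```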
Everyone else thinks that IP addresses are represented as four decimal numbers separated by periods. Using anything else will confuse people and programs that expect the standard form.
We could argue about the meaning of "everyone else" or about "anything else", or even about who you think "people" are. Or we could just stick to your point and discuss the "standard" status of dotted decimal IP addresses. That some applications and even some protocols require IP addresses to be stringified like that does not mean that it is the only standard - or that it even is a standard.
Should you have an STD, RFC or another official document that says more on this subject, I'll be happy to hear about it.
| [reply] |
It is a historical standard because it was implemented in the BSD inet_aton and copied into other implementations. It may even be standardized in POSIX.
No RFC describes the long form IP address. The RFCs I know that describe grammars for IPv4 addresses only support dotted quad form. This includes URLs.
You can see a few places where differences between expectations create problems. For example, most web browsers parse out the host portion of the HTTP URL and pass it to inet_aton. So they accept "long form" addresses even when the RFCs say they shouldn't. This is seen with scammers writing URLs like: http://www.example.com@0x7F000001/. They use the username and unexpected IP address syntax to hide the destination.
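To see how that URL splits, here is a small sketch using Python's urllib.parse. The helper single_number_to_quad is hypothetical and only handles the single-number form used in this example (decimal or 0x-prefixed hex), not the full range of inputs inet_aton accepts.

```python
import ipaddress
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com@0x7F000001/")

# Everything before "@" is userinfo, not the host.
print(parts.username)  # www.example.com
print(parts.hostname)  # 0x7f000001

def single_number_to_quad(host: str) -> str:
    """Convert a one-component address (decimal or 0x hex) to dotted decimal."""
    # Base 0 lets int() honor a 0x prefix; C-style octal (leading 0) is not handled.
    return str(ipaddress.IPv4Address(int(host, 0)))

print(single_number_to_quad(parts.hostname))  # 127.0.0.1
```

The part that looks like a hostname is really the username, and the real destination is the loopback address.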
Including the long-form IP addresses in a regular expression makes it much more complicated. The regex has to match one to three components that could be decimal, hex, or octal numbers, just to accept a format that is used by only a few people.
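To give a feel for the difference, here is a rough comparison in Python. Both patterns are illustrative sketches, not what Regexp::Common ships. The loose pattern checks syntax only; the per-component range limits that inet_aton enforces would still need code outside the regex.

```python
import re

# Strict form: exactly four decimal octets, each 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
STRICT = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

# Loose form: one to four components, each hex (0x...), octal (leading 0),
# or decimal. Value ranges are NOT checked here.
NUM = r"(?:0[xX][0-9A-Fa-f]+|0[0-7]*|[1-9][0-9]*)"
LOOSE = re.compile(rf"{NUM}(?:\.{NUM}){{0,3}}")

for candidate in ("127.0.0.1", "2130706433", "0x7F000001", "0177.1", "256.0.0.1"):
    print(candidate,
          bool(STRICT.fullmatch(candidate)),
          bool(LOOSE.fullmatch(candidate)))
```

Note that the loose pattern happily accepts "256.0.0.1", which is one reason it cannot replace a real parser.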
| [reply] |