Re: Reliably parsing an integer (updated)
by haukex (Archbishop) on Jun 29, 2020 at 14:42 UTC
|
What the string is: 99999999999999999999
Value of $strAsInteger: 1e+20
How printf sees it: 18446744073709551615
The results are puzzling.
Perl automatically upgrades large integers to floating point.
Perl's maximum values for integers can be determined as I showed in this post.
If you want to work with integers larger than that, use Math::BigInt.
To validate number formats, use Regexp::Common::number.
Update: If two integers are the same length (or are zero-padded to be the same length), Perl's string comparisons can be used.
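A minimal sketch of the upgrade described above, assuming only core Perl. It uses `~0` to obtain the largest unsigned integer of the running perl and then steps one past it. The variable names are illustrative, and the exact values printed depend on how your perl was built.

```perl
use strict;
use warnings;
use Config;

# ~0 flips every bit of 0, giving the largest unsigned integer (UV)
# this perl can store; its width depends on how perl was built.
my $uv_max = ~0;
printf "ivsize: %d bytes, UV max: %s\n", $Config{ivsize}, $uv_max;

# One past the maximum no longer fits in an integer, so Perl stores
# the result as a floating-point number (NV) instead.
my $over = $uv_max + 1;
print "UV max + 1: $over\n";
```

On a perl with 64-bit integers and 64-bit doubles, the second line is typically printed in exponent notation, which is the same effect as the 1e+20 shown in the question.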
| [reply] [d/l] |
|
First of all, thanks for your quick answer.
Perl's maximum values for integers can be determined
That is brilliant, thanks.
If you want to work with integers larger than that, use Math::BigInt.
I do not really want to work with larger integers. I want to allow up to the maximum unsigned integer value in the current Perl. I would rather avoid the overhead of using Math::BigInt if I can.
To validate number formats, use Regexp::Common::number.
How are these regular expressions going to help? Say the current Perl uses 32-bit integers. How is a regular expression going to help me accept 4294967295 (UINT32_MAX) but not 4294967296 (UINT32_MAX+1)?
Say Perl automatically "upgrades" UINT64_MAX + 1 to a floating-point value, and the platform uses 64-bit integers but also 64-bit floating point values. Is that not going to lose some precision? How can I then accept exactly up to UINT64_MAX but not UINT64_MAX + 1 ?
| [reply] |
|
How is a regular expression going to help me accept 4294967295 (UINT32_MAX) but not 4294967296 (UINT32_MAX+1)?
AnomalousMonk answered this one; I suggested the module because it's good for verifying various number formats.
I would rather avoid the overhead of using Math::BigInt if I can.
Well, if you want to go right up to that limit*, then I think the safest way is to use Math::BigInt, because it'll correctly handle strings that are clearly over the limit. It's a core module and it's not really that much overhead: once you've used the module to confirm that the number will work as a normal integer without loss of precision, you no longer need the object and can just work with a plain Perl scalar afterwards. Just an example:
use warnings;
use strict;
use feature 'state';
use Carp;
use Math::BigInt;
use Config;
use Regexp::Common qw/number/;
sub validate_int {
my $str = shift;
state $max = Math::BigInt->new(
eval $Config{nv_overflows_integers_at} );
croak "not an integer"
unless defined $str && $str=~/\A$RE{num}{int}\z/;
my $num = Math::BigInt->new($str);
croak "integer too small" if $num < 0;
croak "integer too big" if $num > $max;
return $num->numify;
}
use Test::More;
sub exception (&) { eval { shift->(); 1 } ? undef : ($@ || die) }
is validate_int(0), 0;
is validate_int(1), 1;
is validate_int(3), 3;
ok exception { validate_int(undef) };
ok exception { validate_int("") };
ok exception { validate_int("x") };
ok exception { validate_int("123y") };
ok exception { validate_int(-1) };
ok exception { validate_int("-9999999999999999999999999999999") };
my $x = Math::BigInt->new(eval $Config{nv_overflows_integers_at})-1;
is validate_int("$x"), 0+$x->numify, "'$x' works (max-1)";
$x++;
is validate_int("$x"), 0+$x->numify, "'$x' works (max)";
$x++;
ok exception { validate_int("$x") }, "'$x' fails (max+1)";
ok exception { validate_int("999999999999999999999999999999999999") };
done_testing;
* Update: I named several integer limits in the post I linked to. Depending on which of those limits you want to use, hippo's suggestion from here is of course much easier.
| [reply] [d/l] |
|
To validate number formats, use Regexp::Common::number. [emphasis added]
How are these regular expressions going to help? ... How is a regular expression going to help me accept 4294967295 ... but not 4294967296 ...?
Regexes can help validate number formats, but not, in general, ranges. (It's quite often possible to construct a regex to discriminate a number range, but this is usually more of an academic exercise than a practical solution. Common exceptions are for decimal octet and year/month/day ranges.)
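As an illustration of the decimal octet exception, here is a rough sketch of a regex that discriminates the range 0..255. The helper name `is_octet` is made up for this example, and rejecting leading zeros is my own assumption about what one would want.

```perl
use strict;
use warnings;

# Matches a decimal octet, 0..255, without leading zeros.
# The alternatives cover the range from the top down:
#   25[0-5]        250..255
#   2[0-4][0-9]    200..249
#   1[0-9][0-9]    100..199
#   [1-9]?[0-9]    0..99
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])/;

# Hypothetical helper: returns 1 if $str is exactly one decimal octet,
# 0 otherwise.
sub is_octet {
    my ($str) = @_;
    return defined $str && $str =~ /\A$octet\z/ ? 1 : 0;
}

for my $candidate (qw/ 0 9 99 255 256 007 1000 /) {
    printf "%-5s %s\n", $candidate, is_octet($candidate) ? "ok" : "rejected";
}
```

Even for a range this small the pattern needs four alternatives, which is why the same approach gets unwieldy for something like UINT32_MAX.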
... command-line option --resume-from-line ...
This quote from the OP suggests the user is to enter a simple line number of a file. Are you really dealing with source/data/whatever files of more than 4,000,000,000 (or 18,446,744,073,709,551,615 or, God help us all, 99,999,999,999,999,999,999) lines? If not, what do you care if your Perl is UINT32_MAX or UINT64_MAX? Why not just use a validation test something like
$n !~ /\D/ && $n < 4_000_000_000
(or some more reasonable upper limit) and be done with it?
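A slightly more defensive sketch of the same idea, in case it helps. The sub name `parse_line_number` and the constant `$MAX_LINE` are made up for this example. Unlike the one-liner, it also rejects undef and the empty string, and it checks the number of digits before comparing numerically.

```perl
use strict;
use warnings;

# Hypothetical upper limit; pick whatever is reasonable for your files.
my $MAX_LINE = 4_000_000_000;

# Returns the line number as a plain Perl number, or nothing (undef in
# scalar context) if $n is not an acceptable value.
sub parse_line_number {
    my ($n) = @_;
    return unless defined $n;
    # Digits only, at least one of them, no trailing newline.
    return unless $n =~ /\A[0-9]+\z/;
    # Drop leading zeros so the length check below is meaningful.
    (my $digits = $n) =~ s/\A0+(?=[0-9])//;
    # More digits than the limit means the value is too big; checking the
    # length first keeps huge inputs away from the numeric comparison.
    return if length($digits) > length($MAX_LINE);
    return if $digits >= $MAX_LINE;
    return 0 + $digits;
}
```

With these assumptions, `parse_line_number("42")` gives 42, while `""`, `"12a"` and `"99999999999999999999"` are all rejected.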
Or is your question intended to address a more general case?
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Reliably parsing an integer
by haukex (Archbishop) on Jun 30, 2020 at 08:46 UTC
|
TIMTOWTDI (and d'oh): For integers matching /^[0-9]+$/, Perl's string comparisons are perfectly fine, provided that one pads the shorter number with zeros.
use feature 'state';
use Carp;
use Config;
sub validate_int {
my $str = shift;
state $max = int(eval $Config{nv_overflows_integers_at} or die);
croak "not an integer" unless defined $str && $str=~/\A[0-9]+\z/;
$str =~ s/\A0+(?=[1-9])//;
croak "integer too big" if length $str > length $max
|| sprintf("%0*s", length $max, $str) gt $max;
return 0+$str;
}
Update: A quick test shows that this is faster than the regex version even if I use string comparisons ((?{ $^N gt $max })) instead of Math::BigInt. /Update
| [reply] [d/l] [select] |
Re: Reliably parsing an integer
by nysus (Parson) on Jun 29, 2020 at 15:05 UTC
|
| [reply] |
Re: Reliably parsing an integer
by leszekdubiel (Scribe) on Jul 03, 2020 at 16:12 UTC
|
Try adapting this solution to your needs: it reliably parses a floating-point number, so change it to accept integers instead.
#!/usr/bin/perl
use strict;
use Config;
# check architecture and die if not as expected
$Config{nvtype} eq "double" && $Config{doublesize} == 8
&& $Config{ivtype} eq "long" && $Config{longsize} == 8
or die "not a compatible perl";
my $a = "-000123.456000";
# format good or die
$a =~ /\G([-+]?(?=[0-9]|\.[0-9])[0-9]*(\.[0-9]*)?)/gc or die "not a number";
my $n = $1;
# max length is 15, because architecture is such... convert string to number
length $n <= 15 + 2 or die "number too long";
$n = 0 + $n;
# check if number fits within 15 digits
abs $n > 99999999999999.9 and die "number out of range";
print "$a ---> $n is OK\n";
....
root@orion:~# perl /tmp/aaaa
-000123.456000 ---> -123.456 is OK
| [reply] [d/l] |
Re: Reliably parsing an integer
by harangzsolt33 (Chaplain) on Jun 29, 2020 at 18:54 UTC
|
What you need is a BigInt Compare function that allows you to tell if your potentially big integer stored in a string is bigger than Perl's max value. And if it is, then just ignore it or use the max value. You could also just check the length of the string. That would be the fastest solution. If the string is longer than, let's say, 8 bytes, then it may be a number that is too big. So, instead of using that number, you just use 99,999,999. Anything above that value gets cut off. Or you could use this:
##################################################
# v2019.9.27
# Compares two large positive integers.
# The integers can be binary, octal,
# decimal, or hexadecimal.
#
# NOTE: Both numbers must be in the same base.
# The numbers should not contain spaces, tabs, line breaks,
# minus sign, decimal points, or anything other than digits!
# Illegal characters can mess up the result.
#
# Returns: 0 if they are equal
# 1 if the first one is greater
# 2 if the second one is greater
#
# Special cases:
# * When comparing zero against an empty string or
# undefined value, the zero will be greater.
# * When comparing an undefined value against
# an empty string, they will be equal.
#
# Usage: INTEGER = CMP(STRING, STRING)
#
sub CMP
{
my $A = defined $_[0] ? uc($_[0]) : '';
my $B = defined $_[1] ? uc($_[1]) : '';
my $AL = length($A);
my $BL = length($B);
return 2 if ($AL < $BL);
return 1 if ($AL > $BL);
return 0 if ($A eq $B);
# At this point, we know that both numbers have the
# same length, and one of them is greater than the other.
my $DIFF = 0;
for (my $i = 0; $DIFF == 0 && $i < $AL; $i++)
{
$DIFF = vec($A, $i, 8) - vec($B, $i, 8);
}
return ($DIFF > 0) ? 1 : 2;
}
DISCLAIMER: I am a beginner perl programmer. I wrote this sub last year, and it may have some bugs! For example, the most obvious one is that if you compare two strings "003" and "13" the result will be that the first one is greater. Why? Because it's longer. Lol :P
| [reply] [d/l] |
|
$DIFF = vec($A, $i, 8) - vec($B, $i, 8);
As I warned you over a year ago, vec on strings that happen to contain Unicode code points is now a fatal error: as of the newly released 5.32, it dies with "Use of strings with code points over 0xFF as arguments to vec is forbidden". Simply documenting "Illegal characters can mess up the result" is not robust. Sorry, but I've commented on it often enough: while you're free to code as you like, I can no longer recommend that anyone use your "reinvented wheel" code.
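For comparison, a rough sketch of how two digit strings could be compared without vec at all, validating the input up front. The sub name `cmp_decimal_strings` is made up for this example, it only handles decimal digits (not hex), and it returns -1/0/1 like Perl's own comparison operators rather than the 0/1/2 of the code above.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: compares two non-negative decimal integers given
# as strings. Returns -1, 0 or 1, like Perl's <=> and cmp operators.
sub cmp_decimal_strings {
    my @nums = @_[0, 1];    # copies, so the caller's values stay untouched
    for my $s (@nums) {
        # Reject anything but ASCII digits up front, instead of only
        # documenting that other characters give wrong results.
        croak "not a non-negative decimal integer"
            unless defined $s && $s =~ /\A[0-9]+\z/;
        # Leading zeros carry no value but would skew the length check.
        $s =~ s/\A0+(?=[0-9])//;
    }
    my ($x, $y) = @nums;
    # More digits means a bigger number; with equal lengths, string
    # order and numeric order agree.
    return length($x) <=> length($y) || $x cmp $y;
}
```

With this, "003" compares as smaller than "13", and a string containing a non-ASCII digit causes an exception instead of a silently wrong result.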
Update: Added more links.
Update 2:
DISCLAIMER: ...
Please mark your updates as such.
| [reply] [d/l] [select] |
|
Okay. I couldn't sleep until I corrected my error. This CMP sub works now!! Run the test and see it for yourself! Btw using vec() is not a mistake. If someone is trying to run UNICODE letters through this sub, then there's a serious error in the code, and it *should* fail. The programmer needs to test each string to make sure it contains nothing else but plain digits before trying to compare the two. Maybe I should include a line which converts a UNICODE string to plain ASCII string, but I don't know how to do that magic... :D
#!/usr/bin/perl -w
use strict;
use warnings;
print CMP("0", $b);
print CMP("", $b);
print CMP("0", "");
print CMP("", "000");
print CMP("", "55");
print CMP("111", "55");
print CMP("8,000,021", "7,999,999");
print CMP("003", "1");
print CMP("001", "2");
print CMP("003", "11");
print CMP("54", "45");
print CMP("123", "32");
print CMP("5", "5");
print CMP("1222225", "001222225");
print CMP(" 15", "15");
print CMP("0010", "100");
print CMP("C97F", "C97E");
print CMP("2E", "AE");
print CMP("00101 ", "00101");
exit;
##################################################
# v2020.06.30
# Compares two large positive integers.
# The integers can be binary (ones and zeros),
# octal, decimal, or hexadecimal.
#
# NOTE: Both numbers must be in the same base.
# You shouldn't try to compare a binary number such
# as "10001101" to a hex number like "C4"
# as this will give a bad result.
#
# Returns: 0 if the numbers are equal
# 1 if the first one is greater
# 2 if the second one is greater
#
# Special cases:
# * When comparing an undefined value against
# an empty string or zero, they will be equal.
# * Minus signs are always ignored!
#
# Usage: INTEGER = CMP(STRING, STRING)
#
sub CMP
{
my $A = defined $_[0] ? uc($_[0]) : '';
my $B = defined $_[1] ? uc($_[1]) : '';
my $A2 = length($A);
my $B2 = length($B);
my ($A1, $B1, $CA, $CB, $DIFF) = (0, 0, 48, 48, 0);
# SHOW WHAT'S HAPPENING:
print "\n\nString1=|$A|\nString2=|$B| RET=";
# Find the first significant digit or starting pointer for each string.
# We will call this A1 and B1. In case the string starts with zeros,
# spaces, tabs, new line characters, - and + signs, or other special
# characters, we skip through those. We ignore them.
while ($A1 < $A2 && vec($A, $A1, 8) < 49) { $A1++; }
while ($B1 < $B2 && vec($B, $B1, 8) < 49) { $B1++; }
# Find last significant digit or ending pointer for each string.
# We will call this A2 and B2.
while ($A2 > $A1 && vec($A, --$A2, 8) < 48) {} $A2++;
while ($B2 > $B1 && vec($B, --$B2, 8) < 48) {} $B2++;
# Calculate the number of digits in each number.
my $AL = $A2 - $A1;
my $BL = $B2 - $B1;
# Are both numbers the same length?
if ($AL == $BL)
{
# Compare from left to right, incrementing
# pointers A1 and B1 as we walk through all the digits.
while ($A1 < $A2)
{
$CA = vec($A, $A1++, 8); # Get digit from string A
$CB = vec($B, $B1++, 8); # Get digit from string B
$DIFF = $CA - $CB;
if ($DIFF) { return $DIFF < 0 ? 2 : 1; }
}
return 0;
}
return 1 if ($AL > $BL);
return 2 if ($AL < $BL);
return 0;
}
| [reply] [d/l] |
|
| [reply] |