http://qs321.pair.com?node_id=11118650

rdiez has asked for the wisdom of the Perl Monks concerning the following question:

Hi all:

I am writing the best Perl script since sliced bread:

https://github.com/rdiez/Tools/tree/master/RDChecksum

But I have other, equally super scripts around. Top of the pops.

Anyway, I do not really trust my pesky users. Some of them have been known to pass some rubbish in command-line option --resume-from-line, instead of an integer. You know, something like --resume-from-line='from yesterday, you know what I mean'. The script is also reading a file which contains file sizes, and those sizes should be integers too. But who knows what the script might encounter there.

So I decided to do some validation, like you do in C++ or in Java: read as string, convert it to integer, and if that fails, tell the user: "Invalid integer. Sod off."

I was also hoping that, if the number is a real integer, Perl will run faster. In any case, I need to write those values to a file, so the output must look like a valid integer too.

Alas, I am finding it difficult to do a simple integer validation in Perl. The first thing I did was to write a routine, has_non_digits(), to discard anything that is obviously not an integer (see the script I mentioned above). But it looks like that is not enough. If you try the following test code:

my $str = "99999999999999999999";

print "What the string is: $str\n";

my $strAsInteger = int( $str );

print "Value of \$strAsInteger: $strAsInteger\n";

printf "How printf sees it: %u\n", $strAsInteger;

You will get this result:

What the string is: 99999999999999999999
Value of $strAsInteger: 1e+20
How printf sees it: 18446744073709551615

The results are puzzling. Note that there is no sign of a hint, a warning, or anything helpful there.

Now, let's assume for a moment that any of us gives a damn about writing perfect Perl scripts. What would be the best way to parse an integer?

Because my script is not doing any arithmetic, I would like to accept integers up to the maximum integer limit. This particular script is fussy and only wants to run on Perl versions compiled with 64-bit integer support, but I have other scripts that are not so choosy, so the integer size might be 32 bits. In any case, the maximum value is not a round base-10 value like "9999".

In the name of all users that have the privilege of running my fine scripts, I thank you all.
     rdiez

Replies are listed 'Best First'.
Re: Reliably parsing an integer (updated)
by haukex (Archbishop) on Jun 29, 2020 at 14:42 UTC
    What the string is: 99999999999999999999
    Value of $strAsInteger: 1e+20
    How printf sees it: 18446744073709551615
    The results are puzzling.

    Perl automatically upgrades large integers to floating point.

    Perl's maximum values for integers can be determined as I showed in this post.
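The linked post is not reproduced here, so the following is only a rough sketch of one way to inspect these limits, not necessarily what that post shows. The helper name int_limits is hypothetical; ~0 and the %Config keys ivsize and nv_overflows_integers_at are standard Perl.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: collect the integer limits of the running perl.
sub int_limits {
    my $uv_max = ~0;             # all bits set: largest unsigned integer (UV)
    my $iv_max = $uv_max >> 1;   # largest signed integer (IV)

    # Config stores this as an expression string, hence the eval.
    # It is the largest integer value an NV (floating point) can hold.
    my $nv_limit = eval $Config{nv_overflows_integers_at};

    return (
        ivsize          => $Config{ivsize},   # integer size in bytes
        uv_max          => $uv_max,
        iv_max          => $iv_max,
        nv_overflows_at => $nv_limit,
    );
}

my %limits = int_limits();
printf "%-16s %s\n", $_, $limits{$_} for sort keys %limits;
```

The printed values depend on how the perl was built, which is why the limits are queried at run time rather than hard-coded.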

    If you want to work with integers larger than that, use Math::BigInt.

    To validate number formats, use Regexp::Common::number.

    Update: If two integers are the same length (or are zero-padded to be the same length), Perl's string comparisons can be used.

      First of all, thanks for your quick answer.

      Perl's maximum values for integers can be determined

      That is brilliant, thanks.

      If you want to work with integers larger than that, use Math::BigInt.

      I do not really want to work with larger integers. I want to allow up to the maximum unsigned integer value in the current Perl. I would rather avoid the overhead of using Math::BigInt if I can.

      To validate number formats, use Regexp::Common::number.

      How are these regular expressions going to help? Say the current Perl uses 32-bit integers. How is a regular expression going to help me accept 4294967295 (UINT32_MAX) but not 4294967296 (UINT32_MAX+1)?

      Say Perl automatically "upgrades" UINT64_MAX + 1 to a floating-point value, and the platform uses 64-bit integers but also 64-bit floating point values. Is that not going to lose some precision? How can I then accept exactly up to UINT64_MAX but not UINT64_MAX + 1 ?

        How is a regular expression going to help me accept 4294967295 (UINT32_MAX) but not 4294967296 (UINT32_MAX+1)?

        AnomalousMonk answered this one; I suggested the module because it's good for verifying various number formats.

        I would rather avoid the overhead of using Math::BigInt if I can.

        Well, if you want to go right up to that limit*, then I think the safest way is to use Math::BigInt, because it'll correctly handle strings that are clearly over the limit. It's a core module and it's not really that much overhead: once you've used the module to confirm that the number will work as a normal integer without loss of precision, you no longer need the object and can just work with a plain Perl scalar afterwards. Just an example:

        use warnings;
        use strict;
        use feature 'state';
        use Carp;
        use Math::BigInt;
        use Config;
        use Regexp::Common qw/number/;

        sub validate_int {
            my $str = shift;
            state $max = Math::BigInt->new( eval $Config{nv_overflows_integers_at} );
            croak "not an integer" unless defined $str && $str=~/\A$RE{num}{int}\z/;
            my $num = Math::BigInt->new($str);
            croak "integer too small" if $num < 0;
            croak "integer too big" if $num > $max;
            return $num->numify;
        }

        use Test::More;
        sub exception (&) { eval { shift->(); 1 } ? undef : ($@ || die) }

        is validate_int(0), 0;
        is validate_int(1), 1;
        is validate_int(3), 3;
        ok exception { validate_int(undef) };
        ok exception { validate_int("") };
        ok exception { validate_int("x") };
        ok exception { validate_int("123y") };
        ok exception { validate_int(-1) };
        ok exception { validate_int("-9999999999999999999999999999999") };

        my $x = Math::BigInt->new(eval $Config{nv_overflows_integers_at})-1;
        is validate_int("$x"), 0+$x->numify, "'$x' works (max-1)";
        $x++;
        is validate_int("$x"), 0+$x->numify, "'$x' works (max)";
        $x++;
        ok exception { validate_int("$x") }, "'$x' fails (max+1)";
        ok exception { validate_int("999999999999999999999999999999999999") };

        done_testing;

        * Update: I named several integer limits in the post I linked to. Depending on which of those limits you want to use, hippo's suggestion from here is of course much easier.

        To validate number formats, use Regexp::Common::number. [emphasis added]
        How are these regular expressions going to help? ... How is a regular expression going to help me accept 4294967295 ... but not 4294967296 ...?
        Regexes can help validate number formats, but not, in general, ranges. (It's quite often possible to construct a regex to discriminate a number range, but this is usually more of an academic exercise than a practical solution. Common exceptions are for decimal octet and year/month/day ranges.)
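To illustrate the decimal octet exception mentioned above, here is a minimal sketch of a regex that checks both format and range for 0 to 255. The names $octet_re and is_octet are made up for this example; nothing here comes from Regexp::Common.

```perl
use strict;
use warnings;

# Decimal 0..255 without leading zeros:
#   25[0-5]        250..255
#   2[0-4][0-9]    200..249
#   1[0-9][0-9]    100..199
#   [1-9]?[0-9]    0..99
my $octet_re = qr/\A(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\z/;

# Hypothetical helper: returns 1 if $str is a decimal octet, else 0.
sub is_octet {
    my ($str) = @_;
    return defined $str && $str =~ $octet_re ? 1 : 0;
}

for my $candidate (qw(0 7 99 255 256 01 1000)) {
    printf "%-5s %s\n", $candidate, is_octet($candidate) ? "ok" : "rejected";
}
```

Even for a range this small the pattern needs one alternative per digit-count case, which is why the same approach gets unwieldy for something like UINT32_MAX.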

        ... command-line option --resume-from-line ...

        This quote from the OP suggests the user is to enter a simple line number of a file. Are you really dealing with source/data/whatever files of more than 4,000,000,000 (or 18,446,744,073,709,551,615 or, God help us all, 99,999,999,999,999,999,999) lines? If not, what do you care if your Perl is UINT32_MAX or UINT64_MAX? Why not just use a validation test something like
            $n !~ /\D/ && $n < 4_000_000_000
        (or some more reasonable upper limit) and be done with it?
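A slightly more defensive sketch of that test, assuming the 4_000_000_000 ceiling from above is acceptable. The names $MAX_LINE and parse_line_number are hypothetical. Compared with the one-liner, it also rejects the empty string (which contains no non-digit and would otherwise slip through) and checks the length first, so that very long digit strings are never converted to a number at all.

```perl
use strict;
use warnings;

my $MAX_LINE = 4_000_000_000;   # arbitrary ceiling, as suggested above

# Hypothetical helper: returns the line number, or undef if $str is not
# a plain non-negative decimal integer below $MAX_LINE.
sub parse_line_number {
    my ($str) = @_;
    return undef unless defined $str && $str =~ /\A[0-9]+\z/;

    # Drop redundant leading zeros so the length check below is meaningful.
    (my $digits = $str) =~ s/\A0+(?=[0-9])//;

    # Anything with more digits than the ceiling is certainly too big.
    return undef if length($digits) > length($MAX_LINE);

    my $n = 0 + $digits;
    return $n < $MAX_LINE ? $n : undef;
}

for my $arg ("42", "0042", "", "from yesterday", "99999999999999999999") {
    my $n = parse_line_number($arg);
    printf "%-22s => %s\n", "'$arg'", defined $n ? $n : "invalid";
}
```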

        Or is your question intended to address a more general case?


        Give a man a fish:  <%-{-{-{-<

        «I do not really want to work with larger integers.»

        That’s OK. As an alternative you may try sliced bread. Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: Reliably parsing an integer
by haukex (Archbishop) on Jun 30, 2020 at 08:46 UTC

    TIMTOWTDI (and d'oh): For integers matching /^[0-9]+$/, Perl's string comparisons are perfectly fine, provided that one pads the shorter number with zeros.

    sub validate_int {
        my $str = shift;
        state $max = int(eval $Config{nv_overflows_integers_at} or die);
        croak "not an integer" unless defined $str && $str=~/\A[0-9]+\z/;
        $str =~ s/\A0+(?=[1-9])//;
        croak "integer too big" if length $str > length $max
            || sprintf("%0*s", length $max, $str) gt $max;
        return 0+$str;
    }

    Update: A quick test shows that this is faster than the regex version even if I use string comparisons ((?{ $^N gt $max })) instead of Math::BigInt. /Update
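The same string-comparison idea can be sketched with the limit passed in as a string, which speaks to the earlier UINT32_MAX question: the value is never converted to a number, so no precision is lost at the boundary. The function name uint_str_le is invented for this example, and the limit is assumed to be a string of ASCII digits without leading zeros.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the decimal string $str is an unsigned
# integer less than or equal to $max, else 0. $max must itself be a string
# of ASCII digits without leading zeros.
sub uint_str_le {
    my ($str, $max) = @_;
    return 0 unless defined $str && $str =~ /\A[0-9]+\z/;

    # Drop redundant leading zeros (but keep a lone "0").
    (my $digits = $str) =~ s/\A0+(?=[0-9])//;

    return 0 if length($digits) > length($max);   # more digits: bigger
    return 1 if length($digits) < length($max);   # fewer digits: smaller

    # Same length: string order is the same as numeric order.
    return $digits le $max ? 1 : 0;
}

my $uint32_max = "4294967295";
print uint_str_le("4294967295", $uint32_max) ? "accepted\n" : "rejected\n";
print uint_str_le("4294967296", $uint32_max) ? "accepted\n" : "rejected\n";

# For the running perl, ~0 stringifies to the largest unsigned integer.
my $uv_max = "" . ~0;
print "UV max here: $uv_max\n";
```

Comparing lengths first replaces the zero-padding step; which limit to pass (the UV maximum or the point where floating point stops being exact) is a separate decision.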

Re: Reliably parsing an integer
by nysus (Parson) on Jun 29, 2020 at 15:05 UTC
Re: Reliably parsing an integer
by leszekdubiel (Scribe) on Jul 03, 2020 at 16:12 UTC
    Try to craft this solution to your needs -- it reliably parses floating point, change that to integer.
    #!/usr/bin/perl
    use strict;
    use Config;

    # check architecture and die if not as expected
    $Config{nvtype} eq "double" && $Config{doublesize} == 8 &&
    $Config{ivtype} eq "long" && $Config{longsize} == 8
        or die "not compatible perl";

    my $a = "-000123.456000";

    # format good or die
    $a =~ /\G([-+]?(?=[0-9]|\.[0-9])[0-9]*(\.[0-9]*)?)/gc or die "not a number";
    my $n = $1;

    # max length is 15, because architecture is such... convert string to number
    length $n <= 15 + 2 or die "number too long";
    $n = 0 + $n;

    # check if number fits within 15 digits
    abs $n > 99999999999999.9 and die "number out of range";

    print "$a ---> $n is OK\n";

    ....

    root@orion:~# perl /tmp/aaaa
    -000123.456000 ---> -123.456 is OK
Re: Reliably parsing an integer
by harangzsolt33 (Chaplain) on Jun 29, 2020 at 18:54 UTC
    What you need is a BigInt Compare function that allows you to tell if your potentially big integer stored in a string is bigger than Perl's max value. And if it is, then just ignore it or use the max value. You could also just check the length of the string. That would be the fastest solution. If the string is longer than, let's say, 8 bytes, then it may be a number that is too big. So, instead of using that number, you just use 99,999,999. Anything above that value gets cut off. Or you could use this:

    ##################################################
    # v2019.9.27
    # Compares two large positive integers.
    # The integers can be binary, octal,
    # decimal, or hexadecimal.
    #
    # NOTE: Both numbers must be in the same base.
    # The numbers should not contain spaces, tabs, line breaks,
    # minus sign, decimal points, or anything other than digits!
    # Illegal characters can mess up the result.
    #
    # Returns: 0 if they are equal
    #          1 if the first one is greater
    #          2 if the second one is greater
    #
    # Special cases:
    # * When comparing zero against an empty string or
    #   undefined value, the zero will be greater.
    # * When comparing an undefined value against
    #   an empty string, they will be equal.
    #
    # Usage: INTEGER = CMP(STRING, STRING)
    #
    sub CMP
    {
      my $A = defined $_[0] ? uc($_[0]) : '';
      my $B = defined $_[1] ? uc($_[1]) : '';
      my $AL = length($A);
      my $BL = length($B);
      return 2 if ($AL < $BL);
      return 1 if ($AL > $BL);
      return 0 if ($A eq $B);

      # At this point, we know that both numbers have the
      # same length, and one of them is greater than the other.

      my $DIFF = 0;
      for (my $i = 0; $DIFF == 0 && $i < $AL; $i++)
      {
        $DIFF = vec($A, $i, 8) - vec($B, $i, 8);
      }
      return ($DIFF > 0) ? 1 : 2;
    }

    DISCLAIMER: I am a beginner perl programmer. I wrote this sub last year, and it may have some bugs! For example, the most obvious one is that if you compare two strings "003" and "13" the result will be that the first one is greater. Why? Because it's longer. Lol :P

      $DIFF = vec($A, $i, 8) - vec($B, $i, 8);

      As I warned you over a year ago, vec on strings that happen to contain Unicode code points is now a fatal error: as of the newly released 5.32, it dies with "Use of strings with code points over 0xFF as arguments to vec is forbidden". Simply documenting "Illegal characters can mess up the result" is not robust. Sorry, but I've commented on it often enough: while you're free to code as you like, I can no longer recommend that anyone use your "reinvented wheel" code.

      Update: Added more links.

      Update 2:

      DISCLAIMER: ...

      Please mark your updates as such.

        Okay. I couldn't sleep until I corrected my error. This CMP sub works now!! Run the test and see it for yourself! Btw using vec() is not a mistake. If someone is trying to run UNICODE letters through this sub, then there's a serious error in the code, and it *should* fail. The programmer needs to test each string to make sure it contains nothing else but plain digits before trying to compare the two. Maybe I should include a line which converts a UNICODE string to plain ASCII string, but I don't know how to do that magic... :D

        #!/usr/bin/perl -w

        use strict;
        use warnings;

        print CMP("0", $b);
        print CMP("", $b);
        print CMP("0", "");
        print CMP("", "000");
        print CMP("", "55");
        print CMP("111", "55");
        print CMP("8,000,021", "7,999,999");
        print CMP("003", "1");
        print CMP("001", "2");
        print CMP("003", "11");
        print CMP("54", "45");
        print CMP("123", "32");
        print CMP("5", "5");
        print CMP("1222225", "001222225");
        print CMP(" 15", "15");
        print CMP("0010", "100");
        print CMP("C97F", "C97E");
        print CMP("2E", "AE");
        print CMP("00101 ", "00101");
        exit;

        ##################################################
        # v2020.06.30
        # Compares two large positive integers.
        # The integers can be binary (ones and zeros),
        # octal, decimal, or hexadecimal.
        #
        # NOTE: Both numbers must be in the same base.
        # You shouldn't try to compare a binary number such
        # as "10001101" to a hex number like "C4"
        # as this will give a bad result.
        #
        # Returns: 0 if the numbers are equal
        #          1 if the first one is greater
        #          2 if the second one is greater
        #
        # Special cases:
        # * When comparing an undefined value against
        #   an empty string or zero, they will be equal.
        # * Minus signs are always ignored!
        #
        # Usage: INTEGER = CMP(STRING, STRING)
        #
        sub CMP
        {
          my $A = defined $_[0] ? uc($_[0]) : '';
          my $B = defined $_[1] ? uc($_[1]) : '';
          my $A2 = length($A);
          my $B2 = length($B);
          my ($A1, $B1, $CA, $CB, $DIFF) = (0, 0, 48, 48, 0);

          # SHOW WHAT'S HAPPENING:
          print "\n\nString1=|$A|\nString2=|$B| RET=";

          # Find the first significant digit or starting pointer for each string.
          # We will call this A1 and B1. In case the string starts with zeros,
          # spaces, tabs, new line characters, - and + signs, or other special
          # characters, we skip through those. We ignore them.
          while ($A1 < $A2 && vec($A, $A1, 8) < 49) { $A1++; }
          while ($B1 < $B2 && vec($B, $B1, 8) < 49) { $B1++; }

          # Find last significant digit or ending pointer for each string.
          # We will call this A2 and B2.
          while ($A2 > $A1 && vec($A, --$A2, 8) < 48) {}  $A2++;
          while ($B2 > $B1 && vec($B, --$B2, 8) < 48) {}  $B2++;

          # Calculate the number of digits in each number.
          my $AL = $A2 - $A1;
          my $BL = $B2 - $B1;

          # Are both numbers the same length?
          if ($AL == $BL)
          {
            # Compare from left to right, incrementing
            # pointers A1 and B1 as we walk through all the digits.
            while ($A1 < $A2)
            {
              $CA = vec($A, $A1++, 8);  # Get digit from string A
              $CB = vec($B, $B1++, 8);  # Get digit from string B
              $DIFF = $CA - $CB;
              if ($DIFF) { return $DIFF < 0 ? 2 : 1; }
            }
            return 0;
          }

          return 1 if ($AL > $BL);
          return 2 if ($AL < $BL);
          return 0;
        }
        Oops.. Sorry!
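On the point raised above about testing each string for plain digits before it reaches vec: one possible check, sketched here with a made-up helper name is_ascii_digits, is an explicit [0-9] character class. Note that \d is not equivalent on Unicode strings, because without the /a modifier it also matches digits from other scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 only for non-empty strings made up
# solely of the ASCII digits 0-9, else 0.
sub is_ascii_digits {
    my ($str) = @_;
    return defined $str && $str =~ /\A[0-9]+\z/ ? 1 : 0;
}

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, a Unicode digit

print "matched by \\d\n"    if $arabic_three =~ /\A\d+\z/;
print "matched by [0-9]\n" if is_ascii_digits($arabic_three);
print "plain digits ok\n"  if is_ascii_digits("20200630");
```

A string that passes this check contains no code points above 0xFF, so it cannot trigger the vec error quoted earlier in the thread.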