On Mon, 28 Feb 2022, Marcin Błażejowski wrote:
Hi,
why does the script below work fine for integers \in [2,53] but not for
integers >= 54?
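That's because a 64-bit floating-point value reserves 1 bit for the
sign and 11 bits for the exponent, leaving 52 stored bits for the
mantissa, or 53 bits of precision counting the implicit leading bit.
Integers up to 2^53 can be represented exactly as doubles; beyond
that, not every integer can be. With 54 binary digits all set to 1
the value is 2^54 - 1, which needs 54 bits of precision and so cannot
be stored exactly.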
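A minimal check of that limit, written as a sketch rather than taken
from the script below (the names a and b are only for illustration,
and the stated results assume IEEE 754 doubles with round-to-nearest):

<hansl>
scalar a = 2^53
scalar b = 2^54
# 2^53 + 1 is not representable, so the sum rounds back to 2^53
# (should print 1)
printf "%d\n", (a + 1 == a)
# 2^54 - 1 needs 54 significant bits, so it rounds up to 2^54
# (should print 1)
printf "%d\n", (b - 1 == b)
# below the limit, neighbouring integers stay distinct
# (should print 0)
printf "%d\n", (a - 1 == a)
</hansl>

If the second comparison holds, the value that the script passes to its
binary conversion with 54 digits is 2^54 rather than 2^54 - 1, and 2^54
has 55 binary digits, one more than the result vector can hold.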
Allin
<hansl>
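# Convert the non-negative integer num to a 1 x n row vector of binary
# digits, most significant digit first
function matrix dectobitconvert (const scalar num, const scalar n)
    x = num
    ret = zeros(n, 1)
    i = 1
    # peel off the least significant bit on each pass
    loop while x>0 --quiet
        y = x % 2
        ret[i++] = y
        x -= y
        x /= 2
    endloop
    return transp(mreverse(ret))
end function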
############################
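# Convert a 1 x k row vector of binary digits (most significant digit
# first) to its decimal value
function scalar bintodecconvert (const matrix mod_struct, const scalar k)
    return sum(mod_struct .* 2 .^ seq(k-1, 0))
end function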
############################
lenght = 53 # values of 54 or more give errors
bindigit = ones(1, lenght)
digit = bintodecconvert(bindigit, lenght)
bindigit2 = dectobitconvert(digit, lenght)
printf "Control1: %d\n", sum(bindigit - bindigit2)
digit2 = bintodecconvert(bindigit2, lenght)
printf "Control2: %d\n", digit - digit2
</hansl>