
Why do double floating point operations lose precision?

步履不停 (Original)
2019-06-26 09:15:40


Preface: At work, whenever addition, subtraction, multiplication, or division involves decimal points, people reach for BigDecimal. Yet many are unsure why double or float loses precision in the first place, and how BigDecimal solves the problem. Without further ado, let's get started.

1. What is a floating point number?

Floating point numbers are a data type that computers use to represent numbers with a fractional part, stored in a binary form of scientific notation. In Java, double is a double-precision, 64-bit floating point number whose default value is 0.0d; float is a single-precision, 32-bit floating point number whose default value is 0.0f.


Storage in memory

Type     Sign bit   Exponent   Mantissa
float    1 bit      8 bits     23 bits
double   1 bit      11 bits    52 bits
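The layout above can be inspected directly from Java. The following is a minimal sketch, not code from the article: the class name FloatLayout and the method name fields are made up for illustration, while Float.floatToIntBits and Double.doubleToLongBits are standard JDK calls that expose the raw bits.

```java
// Sketch: split a float or double into its sign, exponent, and mantissa fields.
// FloatLayout and fields(...) are illustrative names, not part of the article.
final class FloatLayout {

    // Returns {sign, stored (biased) exponent, mantissa} of a 32-bit float.
    static int[] fields(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = bits >>> 31;               // 1 bit
        int exponent = (bits >>> 23) & 0xFF;  // 8 bits
        int mantissa = bits & 0x7FFFFF;       // 23 bits
        return new int[] {sign, exponent, mantissa};
    }

    // Returns {sign, stored (biased) exponent, mantissa} of a 64-bit double.
    static long[] fields(double value) {
        long bits = Double.doubleToLongBits(value);
        long sign = bits >>> 63;                  // 1 bit
        long exponent = (bits >>> 52) & 0x7FFL;   // 11 bits
        long mantissa = bits & 0xFFFFFFFFFFFFFL;  // 52 bits
        return new long[] {sign, exponent, mantissa};
    }

    public static void main(String[] args) {
        // -0.75 is -1.1 (binary) * 2^-1, so the sign bit is set and
        // only the top mantissa bit is 1.
        int[] f = fields(-0.75f);
        System.out.println("float : sign=" + f[0] + " exponent=" + f[1]
                + " mantissa=0x" + Integer.toHexString(f[2]));
        long[] d = fields(-0.75);
        System.out.println("double: sign=" + d[0] + " exponent=" + d[1]
                + " mantissa=0x" + Long.toHexString(d[2]));
    }
}
```

The exponent values printed here are the stored codes (126 for float, 1022 for double), not the true exponent -1.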


The exponent field of float occupies 8 bits in memory. It does not store the exponent directly but a biased (offset) exponent: if the true exponent is e and the stored exponent code is E, then E = e + (2^(n-1) - 1), where n is the width of the exponent field and 2^(n-1) - 1 is the exponent bias specified by the IEEE 754 standard. For float this gives 2^(8-1) - 1 = 127, and for double 2^(11-1) - 1 = 1023. The all-zeros and all-ones codes are reserved (for zero and subnormal numbers, and for infinity and NaN), so the exponent of a normalized float ranges from -126 to 127, and that of a normalized double from -1022 to 1023. The negative end of the exponent range determines the normalized non-zero number with the smallest absolute value that a floating-point number can express, while the positive end determines the number with the largest absolute value, and with it the value range of the type.


The range of float is roughly -2^128 ~ 2^128, that is, about -3.40E+38 ~ 3.40E+38;
The range of double is roughly -2^1024 ~ 2^1024, that is, about -1.79E+308 ~ 1.79E+308.
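The bias formula and the resulting limits can be checked against the constants the JDK exposes. This is a small sketch under that assumption; the class name FloatRange and the helpers bias and trueExponent are hypothetical names introduced here.

```java
// Sketch: exponent bias, exponent limits, and value range of float and double.
// FloatRange, bias, and trueExponent are illustrative names.
final class FloatRange {

    // Exponent bias 2^(n-1) - 1 for an n-bit exponent field.
    static int bias(int exponentBits) {
        return (1 << (exponentBits - 1)) - 1;
    }

    // True exponent e = E - bias of a normalized float, read from its raw bits.
    static int trueExponent(float value) {
        int stored = (Float.floatToIntBits(value) >>> 23) & 0xFF;
        return stored - bias(8);
    }

    public static void main(String[] args) {
        System.out.println(bias(8));    // 127
        System.out.println(bias(11));   // 1023

        // Exponent limits of normalized values, as defined by the JDK.
        System.out.println(Float.MIN_EXPONENT + " ~ " + Float.MAX_EXPONENT);    // -126 ~ 127
        System.out.println(Double.MIN_EXPONENT + " ~ " + Double.MAX_EXPONENT);  // -1022 ~ 1023

        // 8.0f is 1.0 * 2^3, so its true exponent is 3.
        System.out.println(trueExponent(8.0f));  // 3

        // Largest finite magnitudes of each type.
        System.out.println(Float.MAX_VALUE);     // 3.4028235E38
        System.out.println(Double.MAX_VALUE);    // 1.7976931348623157E308
    }
}
```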

2. Into distortion: scientific notation

Let's talk about scientific notation first. Scientific notation is a way of simplifying how numbers are written, used to approximately represent very large or very small numbers with many digits. It has no advantage for values with few digits, but for values with many digits the benefit is obvious. For example, the speed of light is 300000000 meters/second, and the world's population is approximately 6100000000. Numbers like these are inconvenient to read and write, so the speed of light can be written as 3*10^8 and the world's population as 6.1*10^9. A calculator shows these in scientific notation as 3E8 and 6.1E9.

When we were kids, many of us liked to play with calculators, adding or multiplying like crazy until the display switched to a result in scientific notation.

A display of -4.86E11, for instance, stands for -4.86*10^11 = -486000000000. Decimal scientific notation requires that the integer part of the significand lie in the interval [1, 9].
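Java prints floating point values in the same style. As a rough illustration, Double.toString switches to this "E" notation once the magnitude reaches 10^7, and BigDecimal.toPlainString expands it back into plain digits. The class name ScientificNotation and the helper toPlain are hypothetical.

```java
import java.math.BigDecimal;
import java.util.Locale;

// Sketch: scientific notation as Java prints and parses it.
// ScientificNotation and toPlain are illustrative names.
final class ScientificNotation {

    // Expands a number written in scientific notation into plain decimal digits.
    static String toPlain(String scientific) {
        return new BigDecimal(scientific).toPlainString();
    }

    public static void main(String[] args) {
        // Double.toString uses scientific notation for magnitudes >= 10^7.
        System.out.println(Double.toString(300000000.0));    // 3.0E8
        System.out.println(Double.toString(6100000000.0));   // 6.1E9

        // Explicit formatting with two digits after the decimal point.
        System.out.println(String.format(Locale.ROOT, "%.2E", 300000000.0));  // 3.00E+08

        // The calculator example: -4.86E11 written out in full.
        System.out.println(toPlain("-4.86E11"));             // -486000000000
    }
}
```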

3. Into distortion: precision

When computers process data, they convert between units and between number bases (such as binary and decimal) and carry out many other operations. Many divisions never terminate: 10÷3 = 3.3333... continues forever, and because precision is limited, 3.3333333 × 3 is not equal to 10. Data obtained after such processing is therefore inexact, and the more digits of precision are kept, the closer the result is to the true value.

The precision of float and double is determined by the number of bits in the mantissa. The integer part of a normalized value is always an implicit "1"; since it never changes, it does not need to be stored. float: 2^23 = 8388608, seven decimal digits in total. Because the leading bit is implicit, there are actually 24 significant bits: 2 × 8388608 = 2^24 = 16777216, which has eight digits. Not every 8-digit number can be represented, however, so only about 6 to 7 significant decimal digits of a float are reliable. double: 2^52 = 4503599627370496 and, with the implicit bit, 2^53 = 9007199254740992, both 16 digits long; by the same reasoning, about 15 to 16 significant decimal digits of a double are reliable.
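The limits 2^24 and 2^53 are easy to observe: beyond them, adding 1 no longer changes the value, because the result cannot be stored in the available significant bits. The sketch below illustrates this; PrecisionLimits, absorbsOne, and decimalDigits are names invented for this example.

```java
// Sketch: where float and double run out of significant bits.
// PrecisionLimits, absorbsOne, and decimalDigits are illustrative names.
final class PrecisionLimits {

    // True if adding 1 to the float value is lost to rounding.
    static boolean absorbsOne(float value) {
        return value + 1f == value;
    }

    // True if adding 1 to the double value is lost to rounding.
    static boolean absorbsOne(double value) {
        return value + 1d == value;
    }

    // Approximate number of decimal digits carried by the given number of bits.
    static double decimalDigits(int significantBits) {
        return significantBits * Math.log10(2);
    }

    public static void main(String[] args) {
        System.out.println(absorbsOne(16777215f));          // false: 2^24 - 1 is still exact
        System.out.println(absorbsOne(16777216f));          // true:  2^24 + 1 is not representable
        System.out.println(absorbsOne(9007199254740991d));  // false: 2^53 - 1 is still exact
        System.out.println(absorbsOne(9007199254740992d));  // true:  2^53 + 1 is not representable

        System.out.println(decimalDigits(24));  // roughly 7.2 decimal digits for float
        System.out.println(decimalDigits(53));  // roughly 15.95 decimal digits for double
    }
}
```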


Once a value reaches a certain magnitude, it is automatically shown in scientific notation with an integer exponent and only as many significant digits as the precision allows, so the result is an approximation. In addition, some decimal fractions cannot be expressed exactly in binary. Because only a limited number of bits is available, an error may be introduced as soon as the value is stored. To convert a decimal fraction to binary, use the multiply-by-2 method: multiply the fraction by 2, take the integer part as the next binary digit, and continue with the remaining fraction until it becomes 0.

Consider computing 0.3 - 0.1 with the double type: the output is 0.19999999999999998 rather than 0.2.

To carry out the operation, 0.3 must first be converted to binary:


0.3 * 2 = 0.6 => .0     (.6)  take 0 and leave 0.6
0.6 * 2 = 1.2 => .01    (.2)  take 1 and leave 0.2
0.2 * 2 = 0.4 => .010   (.4)  take 0 and leave 0.4
0.4 * 2 = 0.8 => .0100  (.8)  take 0 and leave 0.8
0.8 * 2 = 1.6 => .01001 (.6)  take 1 and leave 0.6
.............

The remainder is back at 0.6, so the digits 1001 repeat forever and the conversion never terminates.
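The multiply-by-2 steps above can be sketched in a few lines of Java. The example uses BigDecimal for the intermediate values so that the conversion itself is exact; the class name BinaryFraction and the method toBinary are hypothetical, and the number of bits to produce is capped because the expansion of 0.3 never ends.

```java
import java.math.BigDecimal;

// Sketch of the multiply-by-2 method for converting a decimal fraction to binary.
// BinaryFraction and toBinary are illustrative names.
final class BinaryFraction {

    // Returns up to maxBits binary digits of a decimal fraction in [0, 1),
    // stopping early if the remaining fraction becomes zero.
    static String toBinary(String decimalFraction, int maxBits) {
        BigDecimal two = BigDecimal.valueOf(2);
        BigDecimal rest = new BigDecimal(decimalFraction);
        StringBuilder bits = new StringBuilder("0.");
        for (int i = 0; i < maxBits && rest.signum() != 0; i++) {
            rest = rest.multiply(two);
            if (rest.compareTo(BigDecimal.ONE) >= 0) {
                bits.append('1');                       // take 1, keep the fraction
                rest = rest.subtract(BigDecimal.ONE);
            } else {
                bits.append('0');                       // take 0
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("0.375", 20));  // 0.011 (terminates)
        System.out.println(toBinary("0.3", 20));    // 0.01001100110011001100 (cut off, repeats)
        System.out.println(0.3 - 0.1);              // 0.19999999999999998
    }
}
```

A fraction such as 0.375 terminates after three bits, whereas 0.3 has to be cut off and rounded to fit the mantissa, which is where the error in 0.3 - 0.1 comes from.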


4. Summary

After reading the above, it should be clear why floating point numbers have precision problems. Simply put, the float and double types are designed mainly for scientific and engineering calculations. They perform binary floating-point arithmetic, which is carefully designed to provide fast, accurate approximations over a wide range of magnitudes. However, they do not provide exact results and should not be used where exact results are required. Floating point numbers beyond a certain size are shown in scientific notation, and that representation is only an approximation of the real number, not equal to it. Converting a decimal fraction to binary may also produce an infinitely repeating expansion, or one longer than the floating-point mantissa, which then has to be rounded.

5. So how do we use BigDecimal to solve it?

Look at the two outputs below, obtained by creating a BigDecimal from the double value 0.3 and from the String "0.3":

Output results:

0.29999999999999988897769753748434595763683319091796875

0.3
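The two outputs can be reproduced with a sketch along the following lines. It is a reconstruction based on the outputs and the explanation in this section, not the article's original code; the class name BigDecimalConstructors and its helper methods are hypothetical.

```java
import java.math.BigDecimal;

// Sketch: BigDecimal created from a double versus from a String.
// BigDecimalConstructors and its methods are illustrative names.
final class BigDecimalConstructors {

    // Converts the exact binary value stored in the double, error included.
    static BigDecimal fromDouble(double value) {
        return new BigDecimal(value);
    }

    // Parses the decimal digits exactly as written.
    static BigDecimal fromString(String value) {
        return new BigDecimal(value);
    }

    public static void main(String[] args) {
        BigDecimal a = fromDouble(0.3);
        BigDecimal b = fromString("0.3");
        System.out.println(a);                   // long expansion slightly below 0.3
        System.out.println(b);                   // 0.3
        System.out.println(a.compareTo(b) < 0);  // true: the double is below 0.3

        // If only a double is at hand, BigDecimal.valueOf goes through
        // Double.toString and therefore also yields 0.3.
        System.out.println(BigDecimal.valueOf(0.3));  // 0.3
    }
}
```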

Alibaba's code constraint plug-in flags the double constructor with a warning and asks for the constructor that takes a String parameter instead. Because 0.3 cannot be represented exactly as a double (or as any finite-length binary fraction), the value passed to the constructor is not exactly equal to 0.3. When using BigDecimal, create it with the constructor that takes a String parameter.

At this point, some curious readers may ask: what is the principle behind BigDecimal, and why does it not have the same problem? The principle is quite simple. BigDecimal is immutable and can represent signed decimal numbers of arbitrary precision. The problem with double arises because the decimal fraction is converted to binary and precision is lost. BigDecimal instead scales the decimal number up by a power of ten so that the arithmetic is done on integers, and it keeps the scale alongside as precision information. For the details of how BigDecimal stores its value, read the source code.
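The "integer plus precision information" idea is visible through BigDecimal's public accessors unscaledValue() and scale(). The following is a small sketch of that, with the class name BigDecimalInternals and the helper subtract invented for illustration.

```java
import java.math.BigDecimal;

// Sketch: BigDecimal as an integer value plus a decimal scale.
// BigDecimalInternals and subtract are illustrative names.
final class BigDecimalInternals {

    // Exact decimal subtraction of two numbers given as strings.
    static BigDecimal subtract(String a, String b) {
        return new BigDecimal(a).subtract(new BigDecimal(b));
    }

    public static void main(String[] args) {
        BigDecimal x = new BigDecimal("12.345");
        // 12.345 is held as the integer 12345 with a scale of 3 (12345 * 10^-3).
        System.out.println(x.unscaledValue());  // 12345
        System.out.println(x.scale());          // 3

        // The subtraction that went wrong with double is exact here.
        System.out.println(subtract("0.3", "0.1"));  // 0.2

        // BigDecimal is immutable: add returns a new object, x is unchanged.
        BigDecimal y = x.add(BigDecimal.ONE);
        System.out.println(x);  // 12.345
        System.out.println(y);  // 13.345
    }
}
```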

