I remember being told in AP Computer Science never to use single-precision floats and to always use doubles, since computers are so fast nowadays, with so much memory, that it doesn't matter. That still seems to be the general advice; the same dogma came up in a conversation about numerical methods I had just a few hours before writing this article. If you'd asked me a few months ago, I would have said single precision might not be a bad idea, and half precision even more so. Well, I now say only half of that previous statement is true, and it's not for the reasons I might have guessed. For most things, aside from this specific example of numerical methods, single precision is plenty, and half precision would probably even be enough in cases like AI. The catch is that my numerical methods so far run on a CPU and are 90% sequential. Why is that a problem? Shouldn't half precision still be lighter than full precision, since there are fewer bits?

Yes! But also no. The issue lies in the x86 architecture that modern x86_64/AMD64 processors are extended from. x86 grew up as a 32-bit architecture, and mainstream x86-64 chips still have no native instructions for 16-bit float arithmetic; at best they convert half precision up to single precision, do the math there, and convert back, so float16 often ends up slower rather than faster. Those 64-bit extensions are also why double precision doesn't take much of a hit on CPUs.
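If you want to see that for yourself, here's a quick sketch (just an illustration, assuming Python 3 with NumPy installed, and not anything from my actual numerical methods code): time the same element-wise arithmetic at each precision on the CPU.

```python
# Quick CPU timing sketch (assumes Python 3 with NumPy installed).
# Most x86-64 CPUs have no native float16 arithmetic, so NumPy emulates it
# by converting through float32; expect float16 to come in last here.
import time
import numpy as np

N = 10_000_000
for dtype in (np.float16, np.float32, np.float64):
    a = np.random.rand(N).astype(dtype)
    b = np.random.rand(N).astype(dtype)
    start = time.perf_counter()
    c = a * b + a  # simple element-wise math, nothing fancy
    elapsed = time.perf_counter() - start
    print(f"{np.dtype(dtype).name}: {elapsed * 1000:.1f} ms")
```

On a typical desktop CPU the float16 row should come out slowest of the three, which is exactly the opposite of what the bit count suggests.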
Move over to GPUs, and things get a little messier. An Nvidia Tesla P100 has 9.526 teraflops of float32 compute performance. A GTX 1080 Ti has 11.34 teraflops of float32. My RTX 2080 Super has 11.2 teraflops of float32 performance. That means my GPU is better than a P100 and about equal to a 1080 Ti, right? Yes and no! Let's look at double precision and see if that sheds any light. My 2080 Super and the 1080 Ti both manage approximately 350 gigaflops of float64 performance, a 1:32 performance slash. The P100? It runs a cool 4.7 teraflops, half its single-precision spec. What's going on? Modern flagship Tesla GPUs have dedicated FP64 cores that boost their double-precision speeds. Gaming cards lack these, and as such the two GeForce cards pale in comparison.

Now let's see how float16 goes down. The P100 scores 19.05 teraflops, my RTX 2080 Super gets 22.5 teraflops, and the 1080 Ti gets ... 177 gigaflops. What? The first two doubled their float32 numbers, but the 1080 Ti dropped to a 1:64 ratio, even worse than double precision's hit! The 1080 Ti has essentially no hardware dedicated to float16 and has to do some voodoo to get it to work at all. The P100 is designed to process half precision, and the 2080 Super has Tensor Cores, something new with the Volta and Turing architectures that improves upon basic half-precision cores. Things aren't so black and white once you actually look at them.

A bit of trivia/bonus for any early adopters of RTX out there: Turing has one more trick up its sleeve, simultaneous INT32 and FP32 processing. Each Turing streaming multiprocessor (SM) supports concurrent integer and floating-point math, a design Ampere later reworked into a mix of dedicated floating-point cores and dual-purpose cores. That's another reason why the 2080 Super, and even the basic 2080, can beat the 1080 Ti. It's not always as simple as it seems, but on modern hardware the basic intuition does prove correct: fewer bits = more speed!
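You can poke at the GPU side the same way. Here's a rough sketch (again just an illustration, assuming PyTorch built with CUDA support is installed; it's not a rigorous benchmark): time a big matrix multiply at each precision and turn it into a teraflops estimate. On a Turing card like my 2080 Super the float16 case can land on the Tensor Cores; on a Pascal GeForce card like the 1080 Ti it falls off a cliff instead.

```python
# Rough GPU throughput sketch (assumes PyTorch built with CUDA support).
# Times a large matrix multiply at each precision and reports TFLOPS.
import time
import torch

assert torch.cuda.is_available(), "needs a CUDA-capable GPU"
N = 4096
reps = 10
for dtype in (torch.float16, torch.float32, torch.float64):
    a = torch.randn(N, N, device="cuda", dtype=dtype)
    b = torch.randn(N, N, device="cuda", dtype=dtype)
    c = a @ b                      # warm-up so kernel selection isn't timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        c = a @ b
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / reps
    flops = 2 * N ** 3 / elapsed   # a matmul is roughly 2*N^3 operations
    print(f"{str(dtype)}: {flops / 1e12:.2f} TFLOPS")
```

The absolute numbers won't match the spec sheets exactly, but the pattern should: float16 well ahead of float32 on cards built for it (with Tensor Cores, often by more than the 2:1 spec ratio), and float64 crawling on anything without dedicated FP64 cores.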