AI Chip Wars: LPUs, TPUs & GPUs

Guests:
Ram Ahluwalia & Jonathan Ross
Date:
12/02/2023

Thank you for listening to this episode!

Support our podcast by spreading the word to new listeners. We deeply appreciate your support!

Episode Description

Cut through the noise with Jonathan Ross, designer of the first TPU at Google, now building Groq to democratize AI access.

Episode Transcript

[00:00:00] Hey Jonathan, how are you? I'm great, thanks for having me. Pleasure to host you. Really excited to dig in. Jonathan is the CEO and founder of Groq, which has built a high-performance AI chip. Jonathan, we should get a quick demo of that at the outset as well. I think it'd be fun. And just to tee up Jonathan: Google has this policy enabling engineers to have 20% of their time to work on these crazy skunkworks projects.

And Jonathan, while he was there, was one of the crazy engineers, and I mean that with all the affection in the world, who developed the TPU. That's correct. The TPU is a rival to the GPU, and we'll get into the differences between GPUs and TPUs, CPUs, which dominated from the 60s through the 90s, and LPUs as well. We're going to talk about the winners and losers in AI, the future of AI, when we should expect AGI, [00:01:00] and we'll also get into the personal motivations that Jonathan has for developing Groq.

So why don't we start with that demo? I thought this was wildly fast. Is there a way to share a screen here? Can you see it now? There it is. There you go. Okay. Why don't you ask me a question? While you're thinking of a question, I'll explain. I've got a question. Oh really? Okay, but let me explain so people understand what you're looking at.

So this is Llama 2 70 billion. This is the largest model that Meta has open sourced. Meta has been in the news a little bit for making LLMs available to everyone, right? But typically they run slow because the big models are expensive to run. So typically you'll get maybe 10 words per second or something out of this.

And now this one is running on our hardware, so it'll be a little faster. So go ahead and ask a question. What semiconductor stock should I purchase? Oh, and by the way, can you still [00:02:00] see it or did I mess it up? I can still see it. Okay, perfect. What semiconductor stock should I purchase? Yep.

And there, it's done. That was fast. That was lightning fast. That was speedy. The typical responses that we get, and that are acceptable, are "wow" or a string of expletives. If I asked OpenAI's GPT to do that, it would probably have finished maybe four seconds ago... it would still be processing.

Yeah, it would still be going. We have some comparisons that we've put up, though it's not a perfectly fair comparison. Their model is technically bigger, but they've also shrunk it by distilling. We haven't distilled, so it's fair. One thing that's clear: GPT on our hardware would probably be about this fast.

I think that the main takeaway here is that the winners and losers at the application layer are not yet resolved. We're still in the early [00:03:00] innings. If you think back to the late 90s with the search wars, where we had AltaVista and Lycos and Excite, it wasn't until a few years later that Google had a technical breakthrough with PageRank, and that then led them to crush the competition.

Yahoo's gone. So I think it was quite interesting. This is an example of a technical breakthrough, and that stems from your engineering background. This is a good segue into Groq. So why don't you introduce Groq? Is it a software play? Is it a hardware play? Who do you comp to? Yeah. So trying to compare us is a little bit of a problem because we're not exactly like any other company.

We... You're a unicorn, but financially, market cap, and also positioning, go ahead. So we build the chips, but we're really offering the service as an API. We are willing to sell the systems and the chips, but we're mostly selling the API at the moment, so just like you would buy tokens of capacity from an [00:04:00] OpenAI, from an Anthropic, and so on. Now, we're running the open source models, right?
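As a concrete picture of what buying tokens of capacity through an API looks like, here is a minimal sketch of a chat-completion request in Python. The endpoint URL, model name, environment variable, and response shape are illustrative assumptions, not details taken from the conversation.

```python
# Minimal sketch of calling a hosted LLM inference API for tokens of capacity.
# The base URL, model name, API-key variable, and response shape below are
# illustrative assumptions, not details confirmed in the conversation.
import os
import requests

API_BASE = "https://api.example-inference-provider.com/v1"  # hypothetical endpoint
API_KEY = os.environ["INFERENCE_API_KEY"]                    # hypothetical env var

response = requests.post(
    f"{API_BASE}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-2-70b",  # the open-source model shown in the demo
        "messages": [
            {"role": "user", "content": "What semiconductor stock should I purchase?"}
        ],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```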

There are some people who want an open source model for philosophical reasons, just inherently, that's all they're comfortable with, and there are some people who need speed. And this is where we have a huge advantage. We don't directly compete with NVIDIA or AMD or anyone else.

In fact, I think two weeks ago someone came to us and their use case was not latency sensitive, and it would have been less expensive for them to run on us than on GPUs, but we declined the business because we're focused on those who need latency, where they just can't get their problem solved anywhere else.

Finance has plenty of latency demands. I would say customer representative tools like live coaching, anything involving the loading of web pages, that's [00:05:00] typically very time sensitive, ads, very time-sensitive applications. Got it.

Got it. So you expose an API and you're primarily targeting enterprise customers? That's correct. And have you designed your own chip? That's right. So actually I have one here. So this is one of your chips? That's correct. So this chip right here, this is the first peta-op chip, that's one quadrillion operations per second, that was ever built.

And so we built that. What's a peta-op? A peta-op. Peta-op. I learned something new today. So you've heard of giga-ops. You've heard of tera-ops. This is a peta-op, so a one followed by, I think, 15 zeros, but I'm not even sure, it's so big. How does that compare to the state of the art in the market? So I think the latest GPUs can do a peta-op, but there's a difference, right?
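As a quick aside on the prefixes, the arithmetic is simple and not specific to any particular chip:

```python
# Quick sanity check on the metric prefixes: a peta-op chip performs
# one quadrillion (10**15) operations per second.
GIGA = 10**9    # billion operations per second
TERA = 10**12   # trillion operations per second
PETA = 10**15   # quadrillion operations per second: a 1 followed by 15 zeros

assert PETA == 1_000 * TERA == 1_000_000 * GIGA
print(f"1 peta-op/s = {PETA:,} operations per second")
```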

The [00:06:00] difference is when you're using a GPU. GPUs are really great at training models. When someone wants to train a model, I'm like, just use GPUs, don't talk to us, because they can get the same number of ops as we can. But the big difference is when you're running one of these models, not training them, running them after they've already been made.

You can't produce the 100th word until you've produced the 99th, and so there is a sequential component that you just simply can't get out of a GPU, and that's why they're so slow. So it's how quickly you complete the computation, not just how many computations you can do in parallel, and we do the computations much faster.
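To make that sequential constraint concrete, here is a minimal sketch of an autoregressive decoding loop. The `model_next_token` callable is a hypothetical placeholder for whatever model is being served, not a real Groq or Meta API.

```python
# Minimal sketch of autoregressive decoding: each new token depends on every
# token generated so far, so the steps cannot be run in parallel.
# `model_next_token` is a hypothetical placeholder, not a real API.
from typing import Callable, List

def generate(prompt_tokens: List[int],
             model_next_token: Callable[[List[int]], int],
             max_new_tokens: int = 100) -> List[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The 100th token cannot be computed until the 99th exists, because
        # the model's input at each step is the entire sequence so far.
        tokens.append(model_next_token(tokens))
    return tokens
```

However much parallel hardware you have, this loop still runs step by step; the only lever is how quickly each iteration completes, which is the latency argument being made here.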

Understood. And is that the LPU design? Is that what informed it? Do you want to frame up the distinctions, maybe even start with the CPU briefly, and then GPU, TPU, and LPU, if we can get like a three-minute primer on the main differentiation here? [00:07:00] Sure. CPUs first became popular in the seventies, and the value of a CPU, why you're using them today, why they became such a big deal, is that they were easy to program.

Before CPUs, you physically had to rewire a computer. It took a lot of time, a lot of expertise. You couldn't just download the configuration. You had to rewire them. What GPUs did was, with the same amount of silicon as that thing that I showed you, the chip, while they couldn't do things very fast in sequence, they could do a lot of dumb operations in parallel.

Why that was important was, as we started getting graphical user interfaces, like the buttons that are on your phone and in web pages and all that, they made it much easier to use computers, but they were very expensive to render, and so you needed these special chips to do that.

The LPU, as I mentioned, doesn't just do computation in parallel. It also does it in a short time period. [00:08:00] So it gives you a little bit of both. So when you have a sequence of things, when you have text, when you're trying to complete the next sequence in a genome or in DNA, when you're trying to figure out the next move in a strategy game, or the next note in a song.

Interesting. All of that depends on what came before and you can't do it in parallel. Each pixel on the screen is rendered independently. They don't need to rely on each other. But when it comes to language, every word follows a previous word. Where ordering matters, when the sequencing matters, that's where the LPU has an advantage over a GPU.
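The pixel case is the opposite: each pixel's value is a function of its own coordinates only, so all of them can be computed at once. A toy illustration, not real rendering code, assuming a trivial gradient shader:

```python
# Toy illustration of embarrassingly parallel work: each pixel depends only on
# its own coordinates, unlike the token loop above where step N needs step N-1.
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 64, 64

def shade(xy):
    x, y = xy
    # A trivial gradient "shader": no pixel reads any other pixel's result.
    return (x * 255 // (WIDTH - 1), y * 255 // (HEIGHT - 1), 128)

coords = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
with ThreadPoolExecutor() as pool:
    framebuffer = list(pool.map(shade, coords))  # order-independent work
```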

A GPU is rendering pixels on a screen, you can parallelize that, but with an LPU, the sequence matters. That's correct. And literally people just go, wow. But from a business point of view, to understand why this actually matters: we're in the MySpace age of LLMs, where MySpace actually measured the number of accounts signed up, whereas Facebook measured monthly active and then eventually [00:09:00] daily active users.

Engagement, right? The first principle of engagement is latency. I want you to imagine going to a website and it's loading, loading. You've already bounced. You're like, what's wrong? I'm going to go somewhere else. So the fact that it takes a minute to get an answer right now is one of the reasons that the engagement on these large language models is slow.

Absolutely. Absolutely. I agree. It's a source of frustration not being able to get an answer instantly. Of course, you know better than anyone else, Google invested a lot of time in reducing the latency in the response of search because they saw more usage from that investment. That's right. So they started off with enough quality that it was usable, and then it was all about the latency.

And I don't know if you remember, there was a little time stamp for how long it took to give an answer on Google in the early days. That was not there for your benefit. That was to make sure that the engineers inside of Google were always thinking about it and keeping it low. Yeah. Fascinating. [00:10:00] So let's jump back to the TPU and then we'll come back to LPU use cases, especially around genomics and defense and encryption.

I'm sure those are some questions you're probably getting to, but so what's the driver for the development of the TPU? How does it compare to the GPU? We built the TPU at Google because we had actually trained models that outperformed human beings. And we did the math, actually this engineer named Jeff Dean, very famous at Google, had done the math, and he did a presentation to the leadership team, and it was two slides.

And the first slide was: good news, machine learning finally works. We have a speech recognition model that outperforms human beings. Slide number two: bad news, we can't afford it. We need to double or triple our global data center footprint. That's another 20 to 40 billion dollars at the time, it's more now, and that's for speech recognition.

If we want to do anything else, use it in ads or search or something, it's going to cost us a lot [00:11:00] more. That was internally branded the success disaster: success in that AI worked, disaster as in, we can't do it. As a 20% side project, I happened to sit near the speech recognition team in New York.

That's where they were. And we were having lunch one day and they were telling me about this and I'm like, I could try and get this working on an FPGA. And they're like, great, do it. And they gave me a desk, and I started working on it at night and got it working. And then Jeff Dean, to his credit, did a back-of-the-envelope calculation and he's like, you know what, we should just build our own chip.

It's going to be way cheaper than buying these FPGAs. He was right. So we built a chip. We're like, how hard could that possibly be? Turns out it's very hard, but when you don't know that it's going to be hard, you get into it. So we built that first chip with about, I think, 15 people. And so when starting Groq, the question that we had... we weren't initially going to do a chip.

That wasn't the idea. What happened was, right around that time, [00:12:00] the TPU paper came out, and so that's all VCs wanted to ask us about. And I remember one of them asked, what would you do differently? And I'm like, oh, I'd fix the software. Because the software is completely broken, like you're writing assembly code by hand, it takes forever.

And they're like, do you think you could fix it? As compared to CUDA, for example, from NVIDIA? That's exactly the same thing. So NVIDIA has, most people don't realize this, NVIDIA has about 10,000-ish people who write these things called CUDA kernels. It's not CUDA, it's CUDA kernels. Got it. And for example, there is a way to implement machine learning models called PyTorch.

It's the most popular way at the moment. I think there are about 60 different what are called convolutional kernels just for the latest generation of NVIDIA GPU. They're all handwritten, and they're all like, ooh, when this parameter is like this, do this one; when this parameter is like that, do that one.
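To give a flavor of the kind of hand-written, parameter-by-parameter kernel selection being described, here is a deliberately simplified sketch. The function names and thresholds are invented for illustration and do not correspond to actual CUDA or PyTorch kernels.

```python
# Deliberately simplified sketch of hand-tuned kernel dispatch: a different
# specialized implementation is chosen depending on the shape of the problem.
# All names and thresholds here are invented for illustration only.

def conv_small_channels(x, w):
    ...  # hypothetical variant tuned for few input channels

def conv_large_batch(x, w):
    ...  # hypothetical variant tuned for big batches

def conv_generic(x, w):
    ...  # hypothetical fallback

def pick_conv_kernel(batch_size: int, in_channels: int):
    # "When this parameter is like this, do this one..."
    if in_channels <= 8:
        return conv_small_channels
    # "...when it's like that, do that one."
    if batch_size >= 256:
        return conv_large_batch
    return conv_generic
```

Multiply that style of case analysis across dozens of operations and hardware generations, and the scale of the hand-written kernel effort being described becomes clear.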