Data Skeptic

[MINI] Is the Internet Secure?

Duration:: 26m
Broadcast on:: 31 Oct 2014
Audio Format:: other

This episode explores the basis of why we can trust encryption. Suprisingly, a discussion of looking up a word in the dictionary (binary search) and efficiently going wine tasting (the travelling salesman problem) help introduce computational complexity as well as the P ?= NP question, which is paramount to the trustworthiness RSA encryption.

With a high level foundation of computational theory, we talk about NP problems, and why prime factorization is a difficult problem, thus making it a great basis for the RSA encryption algorithm, which most of the internet uses to encrypt data. Unlike the encryption scheme Ray Romano used in "Everybody Loves Raymond", RSA has nice theoretical foundations.

It should be noted that although this episode gives good reason to trust that properly encrypted data, based on well choosen public/private keys where the private key is not compromised, is safe. However, having safe encryption doesn't necessarily mean that the Internet is secure. Topics like Man in the Middle attacks as well as the Snowden revelations are a topic for another day, not for this record length "mini" episode.

[music] So welcome to our spooky Halloween mini episode of the Data Skeptic Podcast. Linda, for international listeners, can you tell them what Halloween is? Uh, I don't really know the history, but people dress up. [laughs] Fair enough. [laughs] And the reason it's our spooky episode is we're gonna talk about things that scare us, so what scares you? Breaking really fast in an LA traffic. Oh, that's a good one. Anything else scare you? What scares me? Creepy, sketchy streets in LA, walking down them. All your fears are LA-based, why do we live here? [laughs] Anything worth doing, worth taking your risk. Oh, you know what scares me a little bit, is whether or not the internet is secure, which is the equivalent of my even deeper fear, does P equal NP? Does that keep you up at night? I don't know what that means, so no. Well, it means, is computational complexity class P equal to computational complexity class NP, or is it a proper subset? I still don't understand what you said. Alright, so maybe we should break down what computational complexity is. What if I handed you a book that was filled with random numbers, and I asked you to figure out what was the largest number in the book? How would you do it? Well, if I have unlimited time and if the reward is great, I could just flip every page and look at it and then determine. That's right. You have to look at every number in the whole book and keep track of what's the biggest one you've seen so far. Right. And keep moving it up. Alright, and not until you finish the book, do you have the answer? Alright. What if I handed you a dictionary, and I asked you to figure out if a certain word was in the dictionary? How would you do that? Well, the dictionary is alphabetical, so you could just go right to that spot, assuming you know the alphabet and how to spell it. Well, how do you go right to the spot? You just magically open to the right page? No, you have to know the alphabet and know how to spell it. Well, walk me through that in a little more detail. Okay, well, if your word is cat, it starts with a C, so I get a C section, the next letter is A. Well, how do you go to the C section? I just flip it. Sometimes a dictionary, at least my old, old school dictionary when I grew up with, had divots that had the letters imprinted on it at the side, so I knew where the C section was. Okay, from there, once you're in the C section, how do you get to the A section of the C section? Well, I'd go to the front because A is the first letter in the alphabet. And then how about the T? Well, T, I don't know exactly where it stands, but I'd be like, eh, in the middle, so then I'd start flipping pages and start jumping around till I get closer to T. Describe how you jump around. You go to the middle and let's say it's C-A-L. So if you're an L, then I would say keep going. To the right. To the T. Yeah. Forward. And what if you go forward and you end up at C-A-Z? Then Z is the end, so you've gone too far, so go backwards. But don't go past where you've already been. Right. So the classic lookup, it's called a binary search that you do with the dictionary, is you first go to the middle, is my word before or after this word. If it's before, you go, well, if it happens to be your word, then you have it, but if it's before, you just stick to the early half of the book, and if it's after you stick to the later half of the book. So you just cut the problem in half in one step, and then you can do the same procedure. Go to the middle of whatever the section you're in there, and you just cut the problem in half again. So at every step you take, the size of the data you might have to look at is cut in half. So we could call that a log of n problem. That sounds to me like a solution, not a problem. Yeah, that's the algorithm to solve it. The algorithm is called log of n problem? The algorithm that I just described is called binary search. The time it takes to execute that takes log of n time. Let me back up. So the first thing you have to consider is how big is your data set? So for example, did you ever have a kid's dictionary? No. When you were in like first grade or anything? We weren't rich. You only have a grown up dictionary. Any kid word you could find in the grown up dictionary. That's true, because the kid dictionary is a subset, it's smaller. There are less words in it, therefore it should be easier to find the word to determine if the word you're looking for is in it. The bigger your dictionary gets, you can still apply the same algorithm. You go to the middle, and then you pick which half you're going to look in. At every step, you cut the problem size in half. And the problem size is how many total words you might have to look through. Back to my question, do you know what logarithms are? They're an equation. Yeah. Essentially, the analogy I usually use is the Richter scale, which is for earthquakes. A Richter scale of four is not just a little bit bigger than three. It's like way bigger that they scale off exponentially. That's the logarithms scale. Basically, as you go up in numbers, logarithms go up smaller and smaller increments. That's exactly what happens in a binary search, because even if they make the dictionary bigger and bigger, because you can cut it in half each time, it's pretty quick to find your answer, at least compared to the first problem I gave you, which is a book filled with random numbers, where if I keep making it bigger and bigger, you'll have more and more numbers to look through. So if you have these two problems, the find if a word is in the dictionary problem, and the find if a random number is in a book of random numbers, the dictionary one will get obviously a little bit more difficult if you get a bigger dictionary, but not proportionate, because it goes up at the size of log n instead of the size of n. What is n? What is the size of your input set? Input as defined as what? In the case of the dictionary problem, it's the size of the dictionary, how many words are in there? In the case of the random number maximization problem, it's how many numbers are in the book of random numbers. We call this in computer science, big O notation, which just is, actually I don't know why they call it that, but it's the terminology you use to say, what's the computational complexity? So the dictionary problem is very nice, because even as it gets bigger and bigger, you can still solve it pretty quick, because it's cutting it in half every time. Whereas the find the maximum number, the bigger the set is, the more numbers you have to look at. It's in linear time, so it's not quite as fast of a problem to solve. And all algorithms, all approaches to solving problems can be analyzed in this way. For example, can you describe for me how you would solve a word search problem? You know, one of those things where there's a bunch of jumbled letters, and they tell you you need to find like five or six words that are hidden in there that can be, you know, horizontal, diagonal, or vertical? I think usually a word starts to stand out at you first off, so you just circle that one. And then from there, you start going, you read it from left to right is the most natural, and then we don't see left to right, then you go up and down, they start looking diagonal, backwards. In the worst case scenario, if the word doesn't immediately jump out to you, you might, let's say, have to take the first letter of the word, and then go row by row and say, have I found that letter? And then when you find the letter, say like, oh, could this be the start of the word, and look all around it, correct? Yeah. That might not be the most efficient way to solve a crossword puzzle. In fact, I suspect there are some clever computer scientists out there who have really good word search solving algorithms that are better than what I'm going to describe. But nonetheless, so you can scan a row by row column by column and try and find the first letter of the word you're looking for. And then you can kind of look and say, oh, does it go horizontal, diagonal, or vertical off of this? And then for every word, you do that every time. If the grid is like 20 by 20, that's your n, that's your input. As that grid gets bigger and bigger, the problem actually gets harder and harder as n squared, because now you have more and more places to look for it. So can you see how a word search that gets really, really big is actually even harder than the what's the biggest number in this book, harder? That it will take more and more time to solve it, because you might have to potentially look at all the rows and all the columns. In the book, you just have to go one time through it. I mean, yeah, if you're viewing it by how many numbers you look at, I guess. Right. So the key is always about your input size. How many rows there are squared or rows times columns, so if they're equal size, it's n squared. These are what we call polynomial time problems, because you know what a polynomial is, right? Is it a shape? No, it's like when you have a function or an equation, it's like x to the 2 plus 3x kind of thing. That's been a long time. Polynomials are, yeah, anything that's kind of like some variable raised to an integer value like x to the 9, let's say, any algorithm that takes an amount of time that's polynomial to the input is called the polynomial time algorithm. And actually, there are problems that are worse than polynomial. There's some that are like exponential, or like whatever the size of the input is, it'll take like 5 to the n, where n is the input, which are like these really difficult to solve problems, or at least really time consuming. So let me bring you back to another example. You like to go wine tasting, right? Yes, I assume you do too. I love it. Where do we go? What are some of the places we have or and/or will go? We like to go to the central coast of Southern California, such as Passo Robles, St. Louis Bismol. Good one. And I can't name all the other small cities. You mentioned the first place I took you. There's Templeton. Well, the first place you took me, Temecula is not in the central coast. Fair enough. So, imagine this. What do you like to do when you go wine tasting? I like to taste wine. But more than that, you don't just go to one winery and taste a bunch of wines, right? That's step one. Okay. What's step two? Repeat at a different winery. And how many wineries might be in an area? Oh, there's a lot. Right. So many. Over 30, for sure. Yeah. What would be a better use of time? Being at a winery or driving between wineries? Well, being at a winery is more fun than driving. So it would make sense that we minimize our driving, right? We want to strategize, that we have to drive the least amount of time. Sure. If all the wineries were on one single road, you could just start on the furthest east and drive west and stop at the one by one, right? But rarely are they laid out so well, correct? Often they're like spread all over the place. Well, the whole point of a wine country is that there's hills and valleys. Yeah. So the roads aren't just orthogonal. Right. Oh, good word choice, by the way. It actually can be a hard problem to decide the right order you want to go in if you want to minimize your travel time. This is the equivalent of a problem in computer science called the traveling salesman problem, where there's a traveling salesman who wants to visit every city but not have to go through the same city twice in his travels. So you want to visit as many vineyards as you can without visiting one twice and minimizing the time you spend traveling. How would you go about planning that out? Well, you can look at a map. Right. You can map it out and pick the one that has the least amount of mileage. Okay. You just trivially trivialize the problem when you said you just map it out. That's the hard part, Linda. Google Maps does it for me. Google Maps does help you out, but I don't believe Google Maps can solve the traveling salesman problem. And it's because this is a provably hard problem. Now like I said, there are some situations like if all the wineries are on one single road where it's trivial. But in general, in the average case, this is a really hard problem. It takes worse than polynomial time to solve it. So as the number of vineyards grows, you take significantly more and more time to find the best path. Now there are some approximations where you can find a pretty good path in a decent amount of time. But as that size gets bigger and bigger, it's really hard to find the optimal solution. In fact, this is a class of problems that are called NP complete. Do you remember a minute ago when we talked about polynomial time problems? P? Well, NP are we suspect harder than P. Just as we know that P problems, polynomial problems are harder than logarithmic problems. And there's this class of problems in P that are all known to be equally difficult. And at a glance, they sometimes seem like they don't relate to one another. For example, if you wanted a path to go on a trip and you have one suitcase and you have lots of options you can put in it, right? Now some things have different weights and you don't want to go over the weight limit and some things have different values to you. For example, you definitely want one toothbrush, but you don't need two toothbrushes. And this is what's called the knabsack problem. How do you fit in the most amount of stuff that keeps under a weight limit but gives you the most value? It's actually equally as hard as the wine tasting problem we described. Pretty crazy, right? Wow. No wonder it takes me so long to pack and play in a trip. Yeah. Because a computer cannot do it better. Well, no, computer can definitely do it better. It's just these are provably hard problems and they're all roughly equivalent. And there's a lot of these. Another one is prime factorization. Is that one that you face as often as you're packing and or wine tasting? I don't know what that is. Prime factorization is when you have a large number or presumably a large number and you want to find out what prime numbers can be multiplied to arrive at that number. So let's do an example. Let's say your number is 20 and you want to find what prime numbers can multiply to get you 20. What are some of the prime numbers below 20? Seven. Seven. Yep. Seven's prime. Seven. Three. I think two. Fifteen. Seventeen. I think that's it. So you can do two times five. Those are all prime numbers that get you to 20. That's a prime factorization. Oh, so it could be any number. Yeah. Yeah. But you see how we had to think about it. It is a NP problem. It is a hard problem to solve. Now there's this open thing in computer science. We're not actually sure if NP problems are definitely harder than P problems. It could be that these very smart personal one day show us that P equals NP. Do you know what that would mean? No. That would mean that these hard problems like the prime factorization or the knapsack problem or the traveling salesman problem can be solved in polynomial time the same way a word search can be. But as of today, no one knows how that could be done. So we can say for sure if you're good at solving word searches, solving something like prime factorization is much harder and it takes much more time given your input set. Now why is this important to the internet? Well, we're worried about the internet being encrypted, right? Some people are. Do you buy anything online ever? Always. Right. One King's Lane, for example. I haven't bought anything from there. That's true. I haven't seen that on the bill for a while. What have I seen lately? Amazon.com. I don't even notice Amazon because that could be me or you. That's your best way of being anonymous by Samsung. I don't know which of this it is. But anyway, you would hope that our credit card is not being read by anyone at random on the internet while we're doing that transaction. Correct. Right. What makes you so sure it's not being read by other people? The browser where the URL is has a green lock icon. That's right. And it's supposed to mean it's safe. It's supposed to mean it's safe. How is it safe? It uses SSL technology. Yeah, or actually the SSL, we're going to move away from that eventually. But that's a topic for another day. Oh, okay. It's encrypted is the important thing to say. Do you, by chance, have a vernacular description of encryption you could share with our listeners? So basically, they take information like your credit card number and they replace those numbers with something else, some of their values. But they do this in an automated fashion. They each create their own algorithm or whatever kind of code it is. And then in addition, they add more information so that the length is even longer because if you just had your original length, people would know like if you're spelling out numbers or letters, people always know what the value of thousands. That's how code breakers work. So they add an additional information so that no one can figure out what your original number is. You're essentially correct. Yeah. Encryption. There's lots of different types of encryption in theory. Have you ever encrypted anything? No. No, never. Not even your kid like you had a decoder ring or anything? No. Okay. Well, did you by chance catch that classic cryptography episode of Everyone Loves Raymond? No. I never watched the show. Oh, yeah. There was an episode where he tried to encrypt his diary when it was a kid. His strategy was whatever word he wanted to write, he would take the first letter of the word and move it to the end of the word. So if he wanted to encrypt the word podcast, he would write "oddcast," puh. Well, that would be really obvious after a while. Yeah. In fact, his mom came up with a proper decryption algorithm. So that's bad for Ray, right, because that meant his encryption was broken trivially. What we really like when things are encrypted is that, first of all, it's easy for us to encrypt it because we don't want to wait all day to buy something off Amazon, but we want to make it really, really, really hard for anyone to decrypt that. There's another type of encryption you can do where you just swap out letters. So instead of A, maybe you write T instead of B, you write L, and you just kind of have this permutation of letter replacements. How good of an encryption do you think that is? Not that good because, like I said, vowels, they always figure out. Yeah, you can do frequency analysis. The most common letter is probably the replacement for E. And then you're on your way to decoding that. That's also a bad encryption. And like I said, the goal for encryption is that it's relatively easy to encrypt it, but really, really hard to decrypt it. And the way that's most commonly done is with an algorithm called RSA. Have you heard of that? Nope. Really? You never encountered RSA? Maybe, but probably in one ear out the other. All right. Have you heard about public keys and private keys? Yes. Okay. Do you know what those are? I think it's like an encryption method. So everyone creates their own, it's an algorithm, I assume, and so everyone creates their own so that when someone intercepts that feed of data, they cannot easily figure out what it is. And then the public ones are just public and it's not as secure because everyone knows what they are and that if someone intercepts your information, they could use the public key to unlock it. That's close, but not quite right. Public private key encryption, which is the basis of RSA, is the idea that you have a public key which you share with anybody. You just give it away. Oh yeah, I remember this. Yeah. And that tells you how to encode something. And there's two steps. We're going to go through them because they're actually really important. You know what exponents are, right? Yeah. Can you share with the listeners? They are the tiny numbers next to the big numbers. Right. And you know what modulo is? No. What remainder? Reminder. Yeah. It's like a fraction thing. Division thing. So, modulo is the mathematics word or the, let's say like after eighth grade word for describing the remainder of division. So if I said, what is 12 modulo five? Do you know the answer? I don't know what you mean. Well, what's 12 divided by five? Oh, well, it's decimal, right? So do the best you can off top of your head, 12 divided by five, 12 divided by 2.5. Pretty close. You're getting there. What is it? Okay. So if you want to do division this way and not go into a decimal, then you say it's, it fits in twice evenly. So five times two is 10. You can't stick another five in there. So you say it 12 divided by five is two, and then the remainder is two that I got these two extra things I don't know what to do with, right? So if I told you I did a problem and the remainder was two, what did I start with? You can't tell. It could be that I started with 12 divided by five, like this case. Or I could have done nine divided by seven, also remainder two. So the module of operation is essentially destructive to the data. So here's the way RSA works. You take whatever your message is and you raise it to a certain power, you take it to an exponent and that number is there in the public key so anybody can do it. And then you take that number and you modulo the other part of the public key. So now what's, the result is, is something that looks like nonsense. If it was just the exponent, someone could reverse that because exponent is reversible with logarithm. But once you do modulo, the information is very effectively encrypted. And the only way you can decrypt it is if you have also have the private key. Because the public and the private key together are what can tell you how to get this unique valued number that gets hidden by the modulo. So that's kind of the, the really secret part about it is it's hidden there. But then when I later go back to decrypt it, I have the companion of the public key that would only have arrived at one possible value after the modulo operation. Essentially because both the numbers come from the same prime factorization. So if you don't have the private key, you would have to try every possible key and brute force attack it, at least presumably. Could someone be really clever and really smart and good at guessing the accompanying pair to the public key? Could they guess the private key some kind of way? Well, if, if they could do that, that would mean that they are really good at prime factorization. Yet, we've proven that prime factorization is a really, really hard problem. That doesn't mean someone couldn't come along and find a really efficient solution for it. But no one has yet, and some of the smartest people in the world have studied this problem. Not only have they studied this one, but they've studied all the other problems that it's equivalent to, and as of yet, no one's been able to find a polynomial time version of solving NP problems. And in fact, I spent a lot of time in this in grad school. I am in lieu of a proof pretty much utterly convinced that P is a proper subset of NP, that these truly are hard problems. And that any solution involves exploring the entire search base of possible solutions, which is the equivalent of a brute force attack. It is groundbreaking front page headline news if and when it finally happens. Could the government do this or secretly have done it? It's really unlikely because that would mean that they have a staff of incredibly smart people who solved one of the hardest problems in mathematics, and they have kept and continue to keep it a big secret. And no one else has independently replicated the same result or is on the same path. So that's possible, but if we're going to trust anything, trusting that factorization of prime numbers is a really hard problem is a good way to go. And it's worth noting here that with all the Edward Snowden stuff, that doesn't relate to decryption. That was where the government was able to snoop on data before the encryption actually took place. So while encrypted data was transmitted over the wire, and as long as it was properly encrypted, we can trust that it's safe if you can read the data before the encryption takes place, you have essentially cheated, you cut in line. So I know we're running really, really long for many episodes. I guess we can wrap this up kind of quick and just say that it's a scary idea that P might be equal to NP, but it seems really, really unlikely given how many people have looked at that problem. And if we assume that that's not true, then in fact prime number factorization is a really hard problem that can only be solved with brute force. And if you pick a big enough key, brute force will take a really, really long time. And that's why we can trust RSA. [MUSIC]