Data Skeptic

Shakespeare, Abiogenesis, and Exoplanets

Duration:: 58m
Broadcast on:: 25 Sep 2015

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown in the show notes turned out to be a bit dubious and Dave Spiegel joins us in this episode to set the record straight.

In addition to that, our discussion explores a number of interesting topics in astronomy and astrophysics. This includes a paper Dave wrote with Ed Turner titled "Bayesian analysis of the astrobiological implications of life's early emergence on Earth" as well as exoplanet discovery.

[ Music ] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes. [ Music ] Dave Spiegel holds a PhD in astronomy and astrophysics from Columbia University. He went on to do a postdoc position at Princeton and also spent a few years as a fellow at the Institute for Advanced Study. His work is focused on numerous aspects of exoplanet discovery, and he's recently transited the country to join the Stitch Fix team in San Francisco. Dave, welcome to Data Skeptic. Thanks, Kyle. Happy to be here. Oh, yeah, really glad to have you. So, I think we're actually going to end up discussing a couple of different, shall we say, coincidences or the likelihood of improbable events today in a couple of different contexts, but for listeners who follow the show regularly, I can start off by mentioning one, and that's that Jake VanderPlas, who was the last interview guest, is going to be speaking, or by the time this has been recorded, will have spoken at your new home at Stitch Fix. So, while I can't ask you how that went, unless you want to lie to me, but interesting that this all kind of cosmically lined up, I guess. Yeah, I guess an interesting astro-heavy portion of the Data Skeptic podcast. Definitely. But we can move on to maybe more noteworthy coincidences, and the reason I actually first got in touch with you and asked you to come on the show was a response you'd written to me, a really great one, to my episode 29, a mini episode with my wife, where we discussed this classic idea of could monkeys banging on typewriters, just outputting random letters, eventually produce the works of Shakespeare? And you had some excellent criticism for me on that, so I thought it would make a great episode, at least to start with, discussing some of those aspects. Sure, so I guess I want to start by saying I've been listening to the Data Skeptic podcast for a while, and I really enjoy it.
I love the conversations that you have with your wife, and I think it's a really good format to make sure that the podcast is accessible to people at a range of levels. And I thought the one on monkeys typing on typewriters was a really interesting one, is a problem that I thought a little bit about myself. The funny thing is, I think, I mean, you were technically correct that there is a non-zero probability that the monkeys will type all of Shakespeare's works, but I think your wife was actually a little closer to correct. On that one, because she was saying, "It'll just never happen." And the probability is exceedingly low, at least the way that I did the calculation. Yeah, your first point to me, I think was actually the most important one, and it was one that I wasn't real rigorous about in the episode. But your first kind of reply to me was about, "Well, what's the actual question we're asking here? Could you share your perspective on what's the interesting question to be asked about this concept?" So, like, obviously, there are a bunch of questions you can ask. I mean, you could ask how long until monkeys have typed all the letters that Shakespeare used, and so, you know, then it'll take, like, somewhere a little more than 26 character strokes until you get all the letters. I think the most interesting question is, how long until one of the monkeys writes all of Shakespeare's works? Because the way that you had it set up, I just, I found a little confusing to think about how long it takes for all million monkeys to type the works of Shakespeare, because I wasn't sure how the output of the million monkeys' keyboards should be ordered. That's a good point. Yeah, there's kind of like a research, what we, I don't know if you know this term, but researcher degrees of freedom, where if you don't well specify your problem, then you kind of post-construct it like, yeah, are they all streaming data into one pipeline or are we concatenating their outputs? 
It's a bit arbitrary. Yeah, so if you think about how long until one monkey writes all of Shakespeare's works? The relevant question, I guess, is, how many characters are there in all of Shakespeare's works? You know, that's how many characters a monkey needs to have typed, and then how many choices are there each time a monkey hits a key? If we imagine just a really restricted keyboard that has like 26 characters on it, so we're not even requiring spaces to come between the words, then the total number of possible works that length that a monkey could have written is 26 to the number of total characters in Shakespeare's works, which I think we said there was somewhere around 4 million characters in Shakespeare's works, and so 26 to that power ends up being something like 10 to the power of 10 million, and that's a really huge number. So, you know, you could work it out and say, like, okay, I need that many trials, and, you know, how fast does a monkey hit a key, you know, maybe like one second per character, so you could work it out and get an answer in seconds, and then you could try to transform that answer from seconds to like days or years. The really funky thing about the number 10 to the 10 million is that it's a number that's so big that it actually doesn't really have units attached to it. In other words, 10 to the 10 million seconds versus 10 to the 10 million days are basically the same thing, right? Because there are like 100,000 seconds in a day, basically, which is 10 to the 5, so 10 to the 10 million seconds is 10 to the 10 million minus 5 days, which is, you know, if we're just accurate to like that first digit, like 10 million, then 10 million minus 5 is still basically 10 million. It's a kind of funny thing, if numbers get big enough, then they don't really have units attached to them. Like 10 to the 10 million seconds is the same thing as 10 to the 10 million years.
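The arithmetic in this exchange can be checked by working with exponents only, since the numbers themselves are far too large to store. The sketch below is a rough illustration using the figures quoted in the conversation (about 4 million characters, a 26-key keyboard, a universe about 14 billion years old); the function names are made up for this example. With these inputs the exponent comes out nearer 5.7 million than 10 million, which is the same "something like" scale, and changing units from seconds to days or even Hubble times moves the exponent by only about 5 or about 18.

```python
import math

# Figures quoted in the conversation (rough, illustrative values).
N_CHARS = 4_000_000          # approximate character count of Shakespeare's works
ALPHABET = 26                # restricted keyboard: letters only, no spaces

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY
SECONDS_PER_HUBBLE_TIME = 14e9 * SECONDS_PER_YEAR  # ~14 billion years


def log10_sequences(n_chars=N_CHARS, alphabet=ALPHABET):
    """Return log10 of the number of distinct strings of length n_chars."""
    return n_chars * math.log10(alphabet)


def convert_exponent(log10_seconds, seconds_per_unit):
    """Return the exponent of the same duration expressed in a larger unit."""
    return log10_seconds - math.log10(seconds_per_unit)


exponent_seconds = log10_sequences()
print(f"10^{exponent_seconds:,.0f} seconds")
print(f"10^{convert_exponent(exponent_seconds, SECONDS_PER_DAY):,.0f} days")
print(f"10^{convert_exponent(exponent_seconds, SECONDS_PER_HUBBLE_TIME):,.0f} Hubble times")
```

All three printed exponents agree to their leading digits, which is the sense in which the number is effectively unitless.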
Yeah, in addition to the arithmetic errors I had made that you captured, you also caught me on what other point is I picked this epsilon value, and it was very arbitrary, which, you know, most epsilon sort of values are, but as it happened, my analysis was quite sensitive to my selection, which makes it a rather, let's just say, unrobust choice. But then you kind of reframe the question in what I thought was sort of a clever way. Could you describe a bit about how you went into figuring out that we were actually should be calculating the number of monkeys to do this rather than the likelihood we'd produce it? So, okay, so we've got, you could refer to it as a space that is 10 to the 10 million. And what I mean by that is that's the number of possible sequences of characters that are the length that we're thinking of, whatever we said, like four or five million characters long. And so what we want to do, like one of those 10 to the 10 million strings of characters corresponds to the works of Shakespeare and all the rest correspond to things that might actually be really interesting, like the Bible is in there and like books by Stephen King are in there, but they're not the works of Shakespeare. So if we're grabbing sequences of characters from that space, then we have a probability of one in 10 to the 10 million of grabbing the right one each time. So the question that the way that I was thinking of it is how many times do we need to grab a sequence from that space in order to have a reasonable probability of grabbing the one that we're looking for, namely, you know, that needle in the haystack, the Shakespeare string of characters. One kind of natural thing that you might say is, okay, if the space is 10 to the 10 million, then I want to grab about 10 to the 10 million strings. And it turns out that's basically right. Like, let's just take a more manageable number. 
Let's say we have like a thousand balls in a hopper, then I draw a thousand balls with replacement from that hopper, and I can ask what fraction of the balls did I end up touching during my process of drawing them. And the answer ends up being, it's like close to two-thirds. The fraction that I don't touch in my thousand draws is pretty close to one over E, and the more balls there are, like if there were 10,000, it gets closer to one over E and so on. So with 10 to the 10 million, it's basically if I take 10 to the 10 million draws, then I've missed one over E of those sequences. And so if I want to make sure that I actually get all of the sequences, and I'm not missing like 37% of them, then I've got to draw 10 to the 10 million draws several times to drive down the fraction that I've never touched to an acceptably low fraction. But if I just do it once, then the probability that I got all of Shakespeare's works was something like 63% close to two-thirds. Yeah, and then as we kind of back into how many monkeys would be required, what's interesting about that is you cancel out the problem I was clearly struggling with in my notes was, you know, the precision of these large and small numbers. But if we're just calculating these exponents, it's kind of easy to work the math out a bit, and you ultimately drove that down to calculating the amount of Hubble time it would take for this event to happen. To start with, could you tell us about what Hubble time is? So a so-called Hubble time is the age of the universe. So right now, the age of the universe is according to our best models, it's somewhere around 14 billion years. So the Big Bang happened about 14 billion years ago, that's one Hubble time. And so this is where that thing about the number being unitless comes in, because it turns out that it doesn't really matter how you phrase the question in the end. Is it a million monkeys or is it one monkey? 
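The hopper example above can be checked numerically. This is a minimal sketch, assuming every ball is equally likely on each draw; the function names are invented for illustration. It compares the closed-form expectation, 1 - (1 - 1/N)^N, with a small simulation, and shows how repeating the N draws several times shrinks the untouched fraction.

```python
import math
import random


def expected_fraction_touched(n_balls, n_draws):
    """Expected fraction of distinct balls seen in n_draws draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n_balls) ** n_draws


def simulated_fraction_touched(n_balls, n_draws, seed=0):
    """Fraction of distinct balls seen in one simulated run (seed fixed for repeatability)."""
    rng = random.Random(seed)
    seen = {rng.randrange(n_balls) for _ in range(n_draws)}
    return len(seen) / n_balls


for n in (1_000, 10_000):
    print(n, expected_fraction_touched(n, n), simulated_fraction_touched(n, n))

print("limit 1 - 1/e:", 1.0 - 1.0 / math.e)

# Repeating the N draws k times leaves roughly e**(-k) of the balls untouched.
for k in (1, 2, 5):
    print(k, 1.0 - expected_fraction_touched(1_000, k * 1_000), math.exp(-k))
```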
Is the monkey typing, you know, a character per second or a sonnet per second? Kind of no matter how you phrase it, the answer to how long it takes is just 10 to the 10 million. Like, it takes a million monkeys 10 to the 10 million seconds, or it takes one monkey 10 to the 10 million Hubble times, or any combination thereof. As long as the time unit doesn't get in the vicinity of 10 to the 10 million seconds, then the answer is just going to be like 10 to the 10 million. That's how many Hubble times, that's how many seconds. I found this kind of funny, because I got that the answer was about 10 to the 10 million seconds, and I was thinking, okay, I'd like to rephrase this in terms of the age of the universe, because 10 to the 10 million is such a huge number. But then when I tried to reframe it, the answer is still 10 to the 10 million. Yeah, so we're looking at actually a pretty unlikely event, it would seem correct. I would say that's a pretty unlikely event, like I'd say that it's basically true that that will never happen. Right, so your analysis is quite good, and I'm going to go ahead and update that page. Well, I'll leave my original there for, I guess, posterity sake, but point to this episode and our discussion here and correct some of the errors. But it also made me think another interesting question to pivot on here is more about, you know, we will see in any sufficiently long randomized string, we will see subsets that appear to be regular, have some configuration to them. So looking for the entire works of Shakespeare is a rather lofty expectation because it's a very specific sequence, but maybe looking for any work of any author within that string, we'd see much different probabilities, kind of the same way the, I don't know if you're familiar with the birthday paradox. What's the birthday paradox? 
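As a numerical aside on the birthday paradox raised here, the two probabilities being contrasted are easy to compute. The sketch assumes 365 equally likely birthdays and ignores leap years; the function names are made up for this example. Under those assumptions the 50% crossover for "any two people share a birthday" falls at 23 people.

```python
def prob_any_shared_birthday(n_people, days=365):
    """Probability that at least two of n_people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n_people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def prob_someone_shares_mine(n_others, days=365):
    """Probability that at least one of n_others has one specific birthday."""
    return 1.0 - ((days - 1) / days) ** n_others


def crossover_group_size(days=365):
    """Smallest group size where a shared birthday is at least 50% likely."""
    n = 1
    while prob_any_shared_birthday(n, days) < 0.5:
        n += 1
    return n


print("crossover:", crossover_group_size())
print("any pair among 50:", prob_any_shared_birthday(50))
print("someone matches you among 49 others:", prob_someone_shares_mine(49))
```

The gap between the last two numbers is the same distinction as in the Shakespeare discussion: asking for one specific match is far more demanding than asking for any match at all.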
In a crowded room, let's say there's 50 people, the odds that someone shares your birthday are quite low, but the odds that two arbitrary people in that room share a birthday are quite high. Oh, yeah, the crossover point is what's somewhere around 20 or 25 people or so? Yeah, I seem to remember like 30 being sort of the number where you get to that like, there's a 50% chance mark, something along those lines. But yeah, so I guess my hubris, if you will, was forgetting that I'm asking for a very specific event and kind of commingling that with the idea that we will see sub-sequences that are non-random in a sufficiently long random string. Yeah, that's right. Most of the strings that long, I guess are total gibberish, but way more than the number of works of Shakespeare are like the number of possible sequences containing English words. Yeah. This actually reminds me of Calvin and Hobbes cartoon that I saw once. Calvin is the tiger, right? Yes. Oh, wait. Where's Hobbes? I think that's Hobbes, yeah. Okay. Well, whichever it is, the kid goes to the tiger and says, "Hey, I'd like you to take a look at this poem I just wrote." And the tiger reads it and says, "You know how they say if you had 100 monkeys at 100 keyboards, it would take them infinitely long to type all the works of Shakespeare?" And the kid says, "Yeah." And he says, "Your poem, five monkeys, three typewriters." Oh, I love it. So I was really interested that you came down to Hubble time with this, and it's something that admittedly I'm not well versed in, but have always kind of meant to go learn more about. I'm curious if you could share a bit about how we actually measure Hubble times. I have a rudimentary understanding of relativity. So it just seems a bit odd to me that we can say the universe is a certain age when I know that different objects in the universe are experiencing time at different rates. Am I looking at that correctly? So that's a really interesting point. 
There are some objects in the universe that have experienced a very different length of time in the universe than others. Photons, for example, like if you could imagine a little person that physicists listening to this might get kind of upset at this analogy. If you could imagine someone like riding on a sunbeam, riding on a photon, for that person, you could imagine time would pass for that person, but for someone else watching that person zip by, the person who's stationary relative to the sunbeam wouldn't see the person on the sunbeam aging at all. So what you're getting at with, by talking about relativity, right, is that things that are moving relative to one another experience time differently. So when there's a velocity between two observers, each one sees time passing more slowly on the other one. It feels kind of like a paradox. How can each observer see the other observer aging more slowly? But it turns out it's actually not a paradox, and when you try to frame it in a way where there would be some observational weirdness, it turns out that the universe gives us a way of understanding what actually happens. So the standard example that people try to set up for a paradox is called the twins paradox. So the idea here is that two twins are born on Earth, and we take one of them and send it off on a spaceship at a large fraction of the speed of light. Like the reason I was talking about a sunbeam is because the closer you get to the speed of light, the slower time appears to go for someone who's not, you know, the person who's watching you travel. So we put one of the twins on the spaceship going out at a large fraction of the speed of light for like 30 years, and then that twin turns around and comes back to Earth for another 30 years at a large fraction of the speed of light. And then the two twins are next to each other, and which one is older? Because according to what I said a moment ago, each one saw the other one aging... 
I mean, if they could actually see each other right there, one's in a spaceship. But like if they could see each other, each one would see the other one aging slowly relative to him or herself. But, you know, when they're back next to each other, that's an observation where they're both at the same location in spacetime, and so you can actually do a comparison. It turns out that if you're careful about how the relativity works out, there is an answer, and the one on Earth is the one who's older. Basically, all the aging for the one on Earth happens as the one in the spaceship that was going out was turning around. This kind of leads actually from special relativity into general relativity. What's happening is that the one in the spaceship is experiencing a very large acceleration, right? Because he or she is going from close to the speed of light going one direction to slowing all the way down to zero and then turning around and going close to the speed of light the other direction. And if that happens within a fairly short amount of time, like, you know, a few days or a year or something, that's an enormous acceleration that that twin is experiencing. General relativity is what comes out of recognizing that accelerations and gravitational fields are indistinguishable from one another. You can work this out entirely in special relativity, but you can also understand this as during the turnaround, that twin was experiencing a huge gravitational field. As you remember, if you saw Interstellar this past summer, and I guess a little spoiler alert here, there's a planet very close to a black hole in Interstellar. When you're very close to a black hole or when you're very close to any strong gravitational source, that's another circumstance in which clocks move slowly. If you're sitting on a planet that's right near a black hole, someone who's far from the black hole would age a long time while you're aging just a little bit.
And actually, there's some really interesting stuff that wasn't totally explicit in the movie. I could get back to it if you're interested about that planet near the black hole. But in the movie on the planet near the black hole, like one hour on the planet corresponds to like seven years for someone who's far away from the black hole. But so with the twins paradox, the twin who turns around is during that turnaround phase in an enormous gravitational field. And that's basically when all the relative aging happens. It's kind of the same idea as the one hour to seven years. When the twin gets back to Earth, the one who was traveling really fast has aged only a little bit, but the one who was on Earth has aged a full 60 years. This is all by way of saying, yes, there are paradoxes about stuff moving. However, there's what's called a rest frame of the universe. We sitting here on Earth are in a frame of reference that is not moving very much with respect to the so-called rest frame of the universe. I mean, we're moving at like a few hundred kilometers per second relative to it. So that's like kind of fast by human standards or really fast by human standards. But it's not fast relative to the speed of light. We have experienced basically as much time in the age of the universe as is possible to experience. Other things that are moving fast relative to the rest frame of the universe have experienced less time since the Big Bang than we have. So it seems that we can conclude from our discussion about the monkeys and typewriters that while some unlikely events do happen, like, you know, it's unlikely either you or I will win the lottery, someone will win a lottery. And so there's kind of a dichotomy here. In my mind, there's like a natural transition into your work in exobiology. We have one example of abiogenesis here on Earth.
Yet we don't want to fall into the anthropic principle of claiming that just because it's happened, that means that it was destined to happen because we're in that situation. And I really enjoyed your paper, "Bayesian analysis of the astrobiological implications of life's early emergence on Earth," in PNAS. I was wondering if you could just give us a summary of the work you did there. The idea there was that people in the astrobiology community and the origin of life community have, I should say, some people have kind of taken it as a given that since life emerged fairly early on the young Earth, on like a geological timescale, that means that life is pretty common in the universe. And so I wanted to take a critical reexamination of that from a proper Bayesian standpoint. So just to go into that kind of intuition behind what people think a little more, it's not a crazy intuition. The facts are the Earth has been around for like four and a half billion years. For the first few hundred million years of the Earth's existence, it was getting belted by a lot of stuff. Like a lot of asteroids were hitting the Earth. Maybe life was arising then, but even if it was, first of all, we wouldn't have any record of it now. Second, it probably would have been sterilized by comets and asteroids hitting the young Earth. I mean, one of the really giant collisions was the one that created the Moon. And that happened close to four and a half billion years ago. That was when the Earth was really young. We were hit by something about the size of Mars and the Moon was spit out of that collision, or the stuff that ended up coalescing into and cooling to form the Moon was spit out. But for like the first 600 million years or so, until like 3.9 billion years ago, there was a lot of crap hitting the young Earth. However, since about 3.9 or so billion years ago until the present, the Earth has been basically continuously habitable.
And, you know, it's really hard to evaluate geological evidence from a really long time ago, but you could argue that there would be at least a chance that for a good fraction of that 3.9 billion years, if there was life that it might have had a chance of leaving evidence until the present. So when did life actually arise? Well, we don't know. And I'm not a biologist, I'm not a paleobiologist or geologist or anything. But from talking to these people and reading papers, it seems that basically everyone agrees that by 3.2 billion years ago, there was life. A lot of people think that by 3.5 billion years ago, there was life. And there are optimistic and not crazy readings of the data that suggest that by about 3.7 billion years ago, life had arisen. And so if we go with that, then that means that life had arisen within the first 200 million years of the Earth being habitable, being continuously habitable. If that's the case, and we'll assume that that is the case now, if that's the case, that doesn't mean that life actually came into existence at 3.7 billion years ago, because it would be kind of weird if our earliest evidence of life was actually evidence of the earliest life. Like, life probably had to be around for a while until it built up enough chemical pollution, you could say, of the biosphere, created enough of a biosphere that evidence of that biosphere would last almost 4 billion years later. So, you know, life might have arisen 200 million years after the young Earth allowed for it, or it might have been like 10 million years. We just don't know, but somewhere in that 200 million year span, life showed up. That's what we'll just assume. That process of life coming from no life is often referred to as abiogenesis. So I might slip into calling that abiogenesis. So the reason why it's not sort of a crazy thing to look at, so again, the intuition is that 200 million years is a small fraction of 4 billion years. So life showed up early.
It didn't have to show up early. Since it did, that means that abiogenesis is probably a probable process. So my collaborator in that paper, that was a professor at Princeton University named Ed Turner, what he and I realized is that that kind of standard intuition is not crazy. If I want to know how likely it is that there will be a deer outside my window, something that it's not crazy to do is just stand by the window and count how long I have to wait until the first deer comes into sight. If I were in Princeton, New Jersey, which is where I was thinking about this, I'd probably have to stand by the window for a couple hours and a deer would come into sight because there's a lot of deer around there. And if I were in Manhattan, which is where I lived after Princeton, I might have to wait like 100 years or, you know, a thousand years because there are just very few deer wandering down the streets of Manhattan. And so I could rightly infer from that that the probability of deer arising outside a window in Princeton is much higher than the probability of deer arising outside a window in Manhattan. That's the basis behind the argument. However, what we realized is that there's a potentially very strong selection effect that's not present in the deer example when we think about abiogenesis. We are not just on a planet where life arose, we're on a planet where life arose, and a whole lot of evolution happened after that that eventually led to creatures that can wonder about probabilities, the probability of abiogenesis, you know, namely people. And so how long did it take until creatures that could think about this question showed up? Well, that was 3.7 billion years or whatever, like almost 4 billion years. So what's the minimum evolutionary time scale that has to pass from the earliest microbes until creatures that can wonder about abiogenesis? Well, we have no idea. We have some idea.
It probably can't be like 10 minutes, probably can't be 10 years from microbes until people. And it doesn't have to be people, like it could have been really smart dolphins or any kind of alien. But maybe it couldn't happen in a million years, probably not. If the minimum evolutionary time scale is like 3 billion years, then no matter what the probability of abiogenesis is, we couldn't possibly have found ourselves on a planet that had late abiogenesis. Because on those planets, there's no one thinking about abiogenesis. It might be that very few planets in the universe are inhabited. I hope that's not true, but it could be. However, on all the inhabited planets that have creatures wondering about abiogenesis, life arose early on all of those, because it had to in order for intelligent observers to show up 3 or 4 billion years later. So basically, the point of our paper was taking into account the relative influence of what we're getting from the data and what we're getting from a possible prior idea that we might have about the probability of life arising. We wanted to use a very uninformative prior. When you talk about Bayesian statistics, basically what it is, is you come in in any circumstance, you come in with some idea of the probability of something that you're trying to evaluate, and then maybe you gather some data, or maybe not. And then after you maybe gather some data, you think, "Okay, now, what is the probability of that thing?" If you didn't gather any data, then your posterior idea is the same as your prior idea. If you did gather some data, then what you want is for your prior to have been flexible enough to be influenced by the data. If your prior is of a very rigid form, then your posterior, after you gathered the data, will be the same as the prior. And then you wasted your time gathering the data, because it wasn't going to change your mind anyway.
Since Ed Turner and I are astrophysicists, we wanted to have a really flexible prior, because we didn't have a strong idea about the probability of abiogenesis. And just to give you some motivation for this, I went and spoke to some expert researchers in origin of life. So people who are chemists and biochemists, who actually by virtue of their professional training and their research, ought to have a much more informed professional idea of what the probability of life arising from a prebiotic chemical soup is. And I asked people, "Suppose you didn't know that life showed up within the first few hundred million years of Earth being habitable. You didn't know anything about paleogeology or biology. What would you estimate just based on what you know about biochemistry is the probability that life would show up within, say, a billion years, given early Earth-like conditions?" And I got answers that ranged from 95% on the high end to 10 to the minus 70 on the low end. With that kind of range, if that's the range that exists among experts in the field, then certainly Ed and I didn't want to be putting our thumb on the scale with whatever prior we choose. And what we found was that people had taken a look into this question before, but they hadn't done it from a Bayesian perspective. They had drawn a Bayesian conclusion, namely trying to estimate what actually is the probability of life arising, but they did it from a frequentist analysis. So they weren't clearly distinguishing between what's the prior and what's the weight of the evidence. It turns out that they had sort of a hidden prior in what they were doing that guaranteed the answer that they were going to get. And so what Ed and I realized was that with the previous analysis that had been done, so they estimated that there was like a 90% chance that life would arise within a billion years on the basis of the early emergence of life on Earth.
But what we realized was that even if they hadn't gathered any data, in other words, we didn't know that life arose within a few hundred million years on the young Earth. That didn't go into their calculation. Their prior or their posterior rather on account of their prior still would have said there's a 90% chance that life shows up within a billion years, because that answer was baked into the form of their prior. So it was not a very helpful way of setting it up because they had an overly informative prior. So when we used an appropriately uninformative prior, what we found was that the fact that life showed up fairly early on the young Earth does increase our posterior estimate of the probability of abiogenesis. However, it's still completely consistent with abiogenesis being exceedingly rare. Like we could be the only life in our galaxy. We could be the only life in our observable universe. And that is not at all inconsistent with life having shown up within a few hundred million years. And in some sense, some people when I've talked to them about this say, well, it's kind of obvious. Like, why are you writing a whole paper about this? You have one example. You can't draw like any strong conclusions from one example. And I get that except if you just pretend not that there would be any way of knowing this, but if you just pretend that there were really strong evidence that within, say, 10 years of the young Earth becoming habitable, there was life. So, like abiogenesis takes like 10 years, you know, you take a beaker and fill it with distilled water and a few chemicals. And like 10 years later, life shows up. Then we could be totally confident, even though we had only one example here on the young Earth, you could be very, very confident that there's life elsewhere in the universe. The key is that a couple hundred million years is like fairly short on a biological or geological time scale, but it's not that short.
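The point about priors can be illustrated with a toy calculation. This is a rough sketch under stated assumptions and is not the model from the paper: abiogenesis is treated as a Poisson process with an unknown rate per billion years, the selection effect is represented by conditioning on life having started before some latest workable time, and every number other than the 200 million years mentioned in the conversation is a hypothetical choice made for illustration (`T_REQUIRED`, the rate range, the threshold of one event per billion years). The function names are likewise invented.

```python
import math

T_EMERGE = 0.2    # Gyr of habitability before life appeared (figure from the conversation)
T_REQUIRED = 1.4  # Gyr; HYPOTHETICAL latest start that still leaves time for observers


def likelihood(rate):
    """P(life by T_EMERGE | life by T_REQUIRED) for a Poisson process with this rate."""
    return math.expm1(-rate * T_EMERGE) / math.expm1(-rate * T_REQUIRED)


def prob_rate_above(threshold, prior_weight, use_data, n=6001, lo=-3.0, hi=3.0):
    """P(rate > threshold) on a log-spaced grid of rates from 10**lo to 10**hi per Gyr."""
    rates = [10 ** (lo + (hi - lo) * i / (n - 1)) for i in range(n)]
    weights = [prior_weight(r) * (likelihood(r) if use_data else 1.0) for r in rates]
    above = sum(w for r, w in zip(rates, weights) if r > threshold)
    return above / sum(weights)


def log_uniform(rate):
    """Flexible prior: equal weight per decade of rate."""
    return 1.0


def uniform(rate):
    """Prior flat in the rate itself; on a log-spaced grid each point gets weight ~ rate."""
    return rate


for name, prior in (("uniform in rate", uniform), ("log-uniform", log_uniform)):
    before = prob_rate_above(1.0, prior, use_data=False)
    after = prob_rate_above(1.0, prior, use_data=True)
    print(f"{name}: P(rate > 1/Gyr) prior={before:.3f} posterior={after:.3f}")
```

In this toy setup the prior that is flat in the rate already says life is almost certainly common before any data is used, so the data cannot change the answer. The log-uniform prior starts undecided and moves toward "common" once the early emergence is included, while still leaving real weight on very low rates.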
And so it's completely consistent with life being extremely rare. So, that's sort of the depressing part. If you're an optimist about aliens, which I am, in the sense that, like, I hope that we're not alone, that there's some sort of cosmic party that we can someday tune into. Then we have more confidence that that's actually the case because of what happened on the young Earth and what we've learned about what happened, but we can't be totally confident. However, the good news is that it's an empirical question over the next maybe 20 or so years as astronomers build bigger and better and more sensitive telescopes. And as we refine our ability to infer composition of exoplanets, planets around other stars from spectra, it's something that we're going to be able to actually test. And, you know, as we survey planets around other stars, if we find strong evidence of a biosphere, that'll be, first of all, incredibly exciting. One of the most exciting discoveries in human history just to find a single other example of life elsewhere beyond our solar system. But also, it turns out that just finding a single other example dramatically increases our posterior estimate of the probability of abiogenesis. And so if we find even one other example, that'll make us virtually certain that there are many other examples in our galaxy and, of course, in the whole universe. Yeah, with that in mind, it's, in my opinion, a very worthwhile search to be looking for signs of intelligent life or abiogenesis elsewhere in the universe. Two episodes ago, I talked with Jake about the transit method, but I believe there are other methods that we could get into here, for example, spectroscopy. Could you describe a bit about that process of the search? The transit method, which you talked about with Jake, is how most exoplanets that are being discovered now are found. But when the first exoplanets were found, some people say the first planet around a sun-like star was found in 1995.
I think it's actually arguable that the first one was found in 1988. But anyway, sometime in the past, like, 30 years or so, we first started... I'm not an observer, but astronomers first started finding exoplanets. The first several hundred that were found were almost entirely found by something called the radial velocity method, which is done with spectroscopy. So the idea there is that if we take a spectrum of an object and we measure that spectrum over time, we can see if that object is moving relative to us. The way we can do that is by looking for a Doppler shift in that spectrum. So the classic example is you're standing on a street, and a siren on an ambulance is coming towards you, and you hear it at a high pitch, and then as it goes past you, you hear it at a low pitch. So it goes, "Noooo!" And that's a Doppler shift of sound. The same sort of thing happens with light. And so there can be absorption features that we kind of fix onto with, like, a ruler in a spectrum. And those absorption features, due to elements in a star's atmosphere, end up shifting back and forth over time, or they can shift only in one direction. They can be shifted to slightly higher frequencies, or to slightly lower frequencies for some stars. And so the stars where they're shifted to higher frequencies are stars that are moving towards us, and stars where it's shifted to lower frequencies are stars that are moving away from us. But some stars have a periodicity to their Doppler shift, and sometimes they seem like they're moving towards us, and sometimes they seem like they're moving away, and they do this in a periodic fashion. Nobody can think of any reason why a star on its own would sometimes move towards us and sometimes move away. But the key thing is that what we're measuring, we're not measuring the whole position of the star, we're just measuring the radial component of it. In other words, how it's moving along our line of sight. 
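To give a feel for how small these shifts are, here is a minimal sketch of the non-relativistic Doppler formula, shift = wavelength × v / c, applied to a spectral line. The 500-nanometer rest wavelength and the velocities are illustrative assumptions; the sign convention (positive velocity means the star is moving away, so the line moves to longer wavelength and lower frequency) is also a choice made for this sketch.

```python
C_M_PER_S = 299_792_458.0  # speed of light

def doppler_shift_nm(rest_wavelength_nm, radial_velocity_m_s):
    """Wavelength shift of a spectral line for a given line-of-sight velocity.

    Positive velocity = moving away from us = shift to longer wavelength.
    Uses the non-relativistic approximation, which is fine for v << c.
    """
    return rest_wavelength_nm * radial_velocity_m_s / C_M_PER_S

# Illustrative line-of-sight speeds in meters per second.
for label, v in [("receding at 3 m/s", 3.0),
                 ("approaching at 3 m/s", -3.0),
                 ("receding at 10 cm/s", 0.1)]:
    shift = doppler_shift_nm(500.0, v)
    print(f"{label}: shift = {shift:.2e} nm")
```

A star moving at walking pace shifts a 500-nanometer line by only a few millionths of a nanometer, which is why the periodic signal is measured across many absorption features at once.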
So if you imagine that for some reason, the star is moving in a little circle, or a little ellipse, then sometimes in that circular or elliptical motion, it'll be moving towards us and sometimes it's moving away. And so that's a model that would explain the data that we see, if only there were a reason why a star would sometimes move in a circle or an ellipse. And it turns out that there is a reason, which is that there could be an unseen planetary companion that the star is orbiting. Usually you think about planets orbiting stars, but actually what happens is, since both objects have non-zero mass, they're both orbiting the common center of mass. And so the planet moves a lot more than the star does, but the star also moves. For example, in response to the Earth, the sun is moving at around 10 centimeters per second. So if you took out all the other planets in our solar system and you just had the Earth moving around the sun, the Earth is moving around the sun at like 30 kilometers per second. But the sun would also be moving, or is moving, around the center of mass of the Earth-Sun system at a speed of like 10 centimeters per second due to the Earth. If there's a more massive planet, then it'll make the star move faster. And starting in the late 80s or the mid 90s, astronomers started monitoring a lot of stars, looking for these periodic shifts in the star's velocity. And I swept over a lot of potential complications with this. It's a really, really hard thing to do. But astronomers were able to get down to a precision of a few meters per second, which is astounding when you think about the fact that, you know, that's sort of the speed of someone walking down a hallway, and we're measuring the speed of the whole star relative to us with that precision. 
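The 30 kilometers per second and 10 centimeters per second figures can be checked with a short sketch. It assumes a circular orbit and a planet much lighter than its star, so that balancing momentum about the common center of mass gives star speed ≈ planet speed × (planet mass / star mass). The Jupiter row is not from the conversation; it is added only to illustrate the "more massive planet" remark, using standard approximate values.

```python
import math

AU_M = 1.495978707e11          # astronomical unit in meters
YEAR_S = 365.25 * 86400.0      # Julian year in seconds
M_SUN_KG = 1.989e30

def orbital_speed_m_s(semi_major_axis_m, period_s):
    """Speed of a planet on a circular orbit: circumference / period."""
    return 2.0 * math.pi * semi_major_axis_m / period_s

def stellar_reflex_speed_m_s(planet_mass_kg, planet_speed_m_s, star_mass_kg=M_SUN_KG):
    """Star's speed about the center of mass, assuming planet mass << star mass."""
    return planet_speed_m_s * planet_mass_kg / star_mass_kg

earth_speed = orbital_speed_m_s(1.0 * AU_M, 1.0 * YEAR_S)
sun_due_to_earth = stellar_reflex_speed_m_s(5.972e24, earth_speed)

jupiter_speed = orbital_speed_m_s(5.2 * AU_M, 11.86 * YEAR_S)
sun_due_to_jupiter = stellar_reflex_speed_m_s(1.898e27, jupiter_speed)

print(f"Earth orbital speed:    {earth_speed / 1000:.1f} km/s")
print(f"Sun wobble (Earth):     {sun_due_to_earth * 100:.1f} cm/s")
print(f"Sun wobble (Jupiter):   {sun_due_to_jupiter:.1f} m/s")
```

Under these assumptions the Earth comes out near 30 km/s and the sun's wobble near 9 cm/s, while a Jupiter-like planet would produce a wobble of order 10 m/s, which is why giant planets were within reach of the early surveys.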
And that's even though there are giant convective plumes rising and falling in the atmosphere of the star at speeds of like a kilometer per second, but they sort of average out and we can still measure how fast the star is moving on average. Also, one other complication is, as I mentioned, the Earth is moving around the sun at 30 kilometers per second, so much, much faster than a few meters per second. So, astronomers had to be really careful and subtract out the influence of the Earth moving around the sun. And in fact, this actually ties to something else we talked about earlier, which is gravitational time dilation or gravitational redshift. So, another source of changing the apparent frequency of a source of light is not just motion, but also traveling near a gravitational field. So, another thing that astronomers had to take into account is how the distant star's light is traveling with respect to where the sun is. Because if that light passes near the sun, then it ends up being affected just by passing near our star. It can be gravitationally influenced, and the location of spectral features changes depending on how close to the sun that light beam from the distant star ends up passing. So, there are all these complications that go into it, but astronomers figured out how to handle a whole lot of them and got down to several meter per second accuracy by the mid-90s. And they're now down to less than one meter per second accuracy, and the hope is that in the next 10, 20 years or so, we're going to get down to having close to 10 centimeter per second accuracy in order to see the evidence of Earth-like planets at Earth-like distances around sun-like stars. That's incredible. So, I don't know who to properly attribute this to, but I learned about it from the writings of Carl Sagan, that there's a good idea that we can look at the composition of what we're seeing in spectroscopy from other astro-bodies for what we'd expect. 
A signature of abiogenesis would be that biological creatures require oxygen and emit methane at rates that we wouldn't find naturally otherwise in the universe. Do I have that framed correctly? Yeah, yeah, so that's exactly right. The kind of standard idea is, by measuring a spectrum, this time not from a star, but from a planet near a star, we want to identify what's in the planet's atmosphere. And so, this is another really hard thing. All these things from modern astronomy feel kind of like science fiction. I was trying to give you a flavor of all the difficulties with measuring the masses of planets, which is what the radial velocity method does. Now, if we want to measure the composition of a planet, the standard example you've probably heard is that it's like looking at a firefly next to a spotlight. The star is just so much brighter than the planet, and it's contributing basically noise that we have to subtract out to get the planet's spectrum. And that's the reason why we need really big telescopes, and eventually probably really big space telescopes, to look for evidence of a biosphere. But, exactly what you're getting at: what are we actually looking for when I say we're looking for a biosphere? The basic idea is we want to look for so-called biosignature combinations of molecules. So, the idea of a biosignature or a biomarker is it's a chemical or a combination of chemicals that could arise as a result of life and could not arise, or would be very unlikely to arise, without life. So, the classic example, which Carl Sagan spoke a lot about and wrote a lot about, is exactly as you said, the coexistence of methane and oxygen in a planet's atmosphere. Because it turns out that, if you just put a whole lot of methane and put a whole lot of oxygen in a planet's atmosphere and then let it go, the methane reacts very quickly and very easily with the oxygen. 
And so, you only end up continuing to have both methane and oxygen in the planet's atmosphere for like a few years or a few decades or so. After a few decades, you're done with one or the other, whichever there was less of. So, if you put more oxygen in than methane, all the methane ends up reacting with the oxygen to form carbon dioxide and water, and then you're left with carbon dioxide, water, and oxygen. And oxygen itself actually is very reactive, even without methane, and so that'll continue to react with other things. So, the idea is that you need both the oxygen and the methane to be continually replenished at a pretty fast rate in order to explain an observation, which we haven't seen yet, but that's what we're going to be looking for, of both methane and oxygen in an exoplanet's atmosphere. And so this is kind of the classical idea of what the biosignature would be. If we see both oxygen and methane in a planet's atmosphere, a lot of astronomers would say, that must mean that there's life. And there are other potential combinations of chemicals that could count as a biosignature as well. But this actually gets into something else that I thought about with Hanno Rein and Yuka Fujii. Yeah, this is where I was going with it, actually. Yeah, there's an exciting potential telltale, but there's a reason to be skeptical, I think. Is that correct? Yeah, that's exactly the idea. So, it fits in well with the theme of the podcast. The idea is, so suppose you need two chemicals for a robust biosignature. Some astronomers think just having molecular oxygen in a planet's atmosphere on its own would be enough of a biosignature, because oxygen is so reactive that if there weren't something continuously replenishing the oxygen, then it would all disappear. We wouldn't see it. So, I mean, that much is certainly true. And the only thing that some people can think of that would continuously replenish it fast enough is life. 
And it might be true that the only way to get a lot of oxygen in a planet's atmosphere at a continual rate is with biology. But that might not be the case, and actually a paper came out just in the last couple weeks suggesting a new abiological mechanism of producing oxygen in a planet's atmosphere, and a few people have talked about this. So, I'm not sure that the score is totally settled that molecular oxygen on its own is a biosignature. And if it's not, and you need, say, methane or some other chemical together with oxygen in order to infer the presence of life, then you can never definitively infer the presence of life, or at least this is what we argued, because there's a potential source of false positives. That is, if there were some other body that has one of the chemicals, say, methane, together with a planet that has, say, oxygen, and we didn't know that there were two bodies in the system. So, the example we gave was a planet and a moon. So if the planet has oxygen and the moon has methane, then looking from many light years away, we might not know that that planet has a moon around it. We just see that there are both oxygen lines and methane lines in the spectrum coming from where we know that there's a planet, because we see, say, a 10 centimeter per second radial velocity wobble in the star's velocity, or maybe the planet transits. If the planet transits, I should point out, then there's a better chance that we might notice there's a moon, because we could see potentially that the moon is transiting also. But if it's not a transiting system, and most systems like the Earth around the Sun will not be transiting, since the probability of an Earth-Sun clone transiting, if it just has random orientation, is only about one in 200. So if it's a non-transiting system, there's basically no way of noticing that there's a moon. 
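The "about one in 200" figure follows from simple geometry. A minimal sketch, assuming a circular orbit, a randomly oriented orbital plane, and a planet much smaller than its star: the chance that the planet crosses the stellar disk as seen from a random direction is approximately the star's radius divided by the orbital distance.

```python
R_SUN_M = 6.957e8            # solar radius in meters
AU_M = 1.495978707e11        # Earth-Sun distance in meters

def transit_probability(star_radius_m, orbital_distance_m):
    """Geometric transit probability for a circular, randomly oriented orbit.

    Approximation valid when the planet's radius is negligible next to the star's.
    """
    return star_radius_m / orbital_distance_m

p = transit_probability(R_SUN_M, AU_M)
print(f"P(transit) = {p:.5f}, i.e. about 1 in {1.0 / p:.0f}")
```

For an Earth-Sun clone this gives a little under half a percent, consistent with the rough one-in-200 quoted above.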
But when we look at the spectrum, we could see both oxygen and methane, we would infer, ah, oxygen and methane on the same object, there must be life. But it turns out that it was not on the same object, there were two different objects that were just near each other. Very interesting. And I can't recall if it was that paper or another, but I had another note of something particularly interesting you'd written about that I wanted to ask around was that if you even imagined that we had an idealized telescope. So, you know, forget Kepler and Hubble and James Webb, like the best thing we could put up in space, potentially, and we could observe, there's still some upper bound limits on observability. Do I have that correct? Yes, that's right. Basically, there's a certain expected rate that we would be getting photons from a planet, and that has to do with how bright the planet is, and that has to do with how, maybe how hot it is, or it essentially comes down to how far it is from a star and what kind of star it's orbiting. So, if it's sort of in the habitable temperature range, like 300 Kelvin, sort of like the Earth is, so, you know, sort of normal temperatures for life as we know it, then that tells us basically what the production rate of thermal photons will be. I should mention what thermal photons are. Everything that has a non-zero temperature is emitting photons simply on account of having a non-zero temperature. So, the sun is emitting photons in the visible part of the spectrum, and also in the UV and the X-ray and the infrared, so it's emitting photons in a lot of parts of the spectrum, but most of its photons are in what we call the visible part of the spectrum. And it's no accident that we call it the visible part of the spectrum, because we're creatures that evolved on a planet around the sun. So, of course, we evolved eyes that can see the kind of photons that the sun spits out the most of. 
It's sending out this distribution of photons that peaks where our eyes can see, again, because that's how our eyes evolved. Light bulbs also, again, it's no accident, spit out photons that peak in the visible part of the spectrum. Incandescent lights also put out a fair amount in the infrared, but we don't see that. People are a factor of like 20 cooler than the sun, so the sun is like 6,000 Kelvin, people are somewhere around 300 Kelvin. We are also emitting a thermal spectrum, but we spit out most of our photons at a wavelength that's about 20 times longer than where the sun's photon spectrum peaks. So, the sun's spectrum peaks at like 500 nanometers, right in the middle of the visible part of the spectrum. The spectrum from a person or from the Earth itself peaks at around 10,000 nanometers, about 20 times 500. So, everything is emitting light. And if we're looking at thermal photons, then that just depends on how hot the planet is, and, again, if it's Earth-like temperatures, we know the peak will be around 10,000 nanometers. If we're looking at reflected photons, those will peak right where the sun's photon spectrum peaks, so around 500 nanometers. But regardless of where the photons are distributed in their spectrum, there's a certain expected rate. And it's basically a stochastic process that produces the photons. It ultimately comes down to quantum mechanical transitions in either the planet or the star that's producing the photons that eventually we end up seeing. So, it's not an exactly regular rate, but there's an expected rate, and there'll be variations around that. And those variations are described by Poisson statistics. That imposes a fundamental sort of noise floor on how accurately we can measure anything, like the brightness of a star or a planet. Or if we're measuring a spectrum, basically what we're doing is measuring the brightness in many different color channels or many different wavelength channels. 
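The peak wavelengths quoted above can be reproduced with Wien's displacement law, which says a thermal (blackbody) spectrum peaks at a wavelength inversely proportional to temperature. This is a minimal sketch using the rounded temperatures from the conversation (6,000 K and 300 K); treating the sun, a person, or a planet as an ideal blackbody is an approximation.

```python
WIEN_B_M_K = 2.897771955e-3   # Wien displacement constant, meter-kelvin

def wien_peak_nm(temperature_k):
    """Wavelength (in nanometers) where a blackbody's spectrum peaks."""
    return WIEN_B_M_K / temperature_k * 1e9

sun_peak = wien_peak_nm(6000.0)      # rounded solar surface temperature
earth_peak = wien_peak_nm(300.0)     # roughly room or Earth-like temperature

print(f"6000 K peaks near {sun_peak:.0f} nm")
print(f" 300 K peaks near {earth_peak:.0f} nm")
print(f"ratio of peak wavelengths: {earth_peak / sun_peak:.0f}")
```

Because the peak scales as one over temperature, a body 20 times cooler peaks at a wavelength 20 times longer, which is the factor of 20 between roughly 500 and roughly 10,000 nanometers.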
It's basically like carrying out many observations of how bright something is all at the same time, measuring how bright it is from 500 to 600 nanometers and 600 to 700 nanometers and so on, and that gives us a spectrum. In each of those bins, there's an expected rate, which is what we're trying to determine by taking the observation. We're trying to estimate what is the average rate of photons that we're going to see. And then within that distribution of average photon rates, we can look for things like absorption features or emission features due to molecules or elements. But in order to look for those, we need to have a good estimate of what the mean or the expected or average rate of photons in each frequency bin is. There's noise being added to this, again, just by Poisson statistics. And so if it's a pretty dim object, or if we want to split up the light into many frequency channels, then there's going to be a small expected rate. So we need to observe for either a really long time or with an enormous telescope in order to drive the statistical uncertainty down to a low enough level where we can make an accurate determination of whether there are molecular or atomic features in a spectrum. We have sort of a fixed size of telescope, like, it's probably going to be really hard for us to have a space telescope bigger than like 10 meters. That tells us, you know, if an object is like five or 10 or 30 light years away, the minimum amount of time that we need to observe that object for in order to have a reasonable chance of determining spectral features. The fact is that even with a really large space telescope, a planet around a star that's like 10 or 30 light years away is not trivial to see, especially when you fold in the fact that we're not looking at just the photons from the planet. 
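The trade-off between dimness, number of wavelength channels, and observing time can be sketched with Poisson statistics: if a bin collects N photons on average, the scatter is the square root of N. The rates below are hypothetical numbers chosen for illustration, and `background_rate` is an assumed stand-in for unwanted photons landing in the same bin (for example, leftover starlight).

```python
import math

def snr(planet_rate, exposure_s, background_rate=0.0):
    """Signal-to-noise ratio in one wavelength bin under Poisson noise.

    Rates are photons per second. The signal is the planet's photon count;
    the noise is the square root of all photons collected in the bin.
    """
    signal = planet_rate * exposure_s
    noise = math.sqrt((planet_rate + background_rate) * exposure_s)
    return signal / noise

def exposure_for_snr(target_snr, planet_rate, background_rate=0.0):
    """Observing time needed to reach a target SNR (solves snr() for time)."""
    return target_snr ** 2 * (planet_rate + background_rate) / planet_rate ** 2

# Hypothetical rates, purely illustrative.
print(exposure_for_snr(10, planet_rate=1.0))                         # planet alone
print(exposure_for_snr(10, planet_rate=0.1))                         # light split into 10 bins
print(exposure_for_snr(10, planet_rate=1.0, background_rate=99.0))   # with stray starlight
```

In this sketch, splitting the same light into ten times as many channels takes ten times the observing time for the same precision per channel, and a background 99 times brighter than the planet takes a hundred times longer, which is why suppressing the star's glare matters so much.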
You know, as I mentioned earlier with the firefly next to the spotlight, we're getting noise not only from the planet itself, you know, its own Poisson statistics, but from the Poisson statistics of the much brighter source right next to it, which is the star. And so the whole thing is that it's really helpful if you can block out the light from the star, and astronomers are working on some pretty sophisticated ways of doing this, where you can dramatically reduce the glare caused by the star and focus on just the photons from the planet. But even when you do that, you know, you can't like look for just a few seconds and understand what the spectrum is, because there's a lot of statistical uncertainty. So you have to observe for a while, even with a fairly large space telescope that doesn't have, you know, noise from the Earth's atmosphere, whatever, like it's the cleanest observing environment we can get. And it's still a very difficult measurement to get a good estimate of the spectrum. Yeah, it definitely sounds like it. Well, maybe to wrap up, you could share a little bit about your story and moving into being a data scientist. So I had a great time working in academic astrophysics. I did it for a number of years, over a decade, and got to work with some really incredibly brilliant people, some of whom I mentioned in this podcast, and I feel really lucky to have had the opportunity to do that. But, you know, I got into astronomy because what got me interested in the first place was the possibility of life in space. After a while doing astronomy, I guess two things happened. One was I started to think that the probability that we're going to actually find life on a planet around another star is a lot lower than I thought that it was when I initially started doing astronomy. So this actually kind of relates to the monkeys on keyboards calculation. 
So in what I told you about earlier, from paleogeology, like we don't have any reason to be confident that there's life commonly in the universe. And furthermore, what does it take to get life started? Well, it takes like getting a bunch of RNA or DNA base pairs in the right sequence, or just nucleotide bases. So this is a problem kind of like the monkeys on typewriters problem, like how long a chain do you need to get life started? And I won't go through the whole calculation, but I did a calculation of this. And it seemed to me a lot less likely that nucleotide bases would come together in the right sequence in order to produce a self-replicating, information-bearing molecule, given a prebiotic soup, than I used to think was the case. And I could be totally wrong in this, and I hope I am, because I do hope that the universe is filled with aliens. But what I was thinking is that there's a reasonable chance that we are alone, not only in our galaxy, but in our observable universe. And if that's the case, then the whole project of looking for exo-life is a little less interesting to be a part of. Now, anyway, I find astrophysics fascinating even if we never find life elsewhere, and I hope that we do. But so that was one thing that was going on for me, and the other thing was I started thinking that, you know, I have a lot of interests, and I'm interested in people, as well as things that are like light years away from Earth. And I started wanting to have my work have a little more relevance to humanity. So I started looking at what a lot of my astrophysics friends have started looking at also, which is the field of data science, which is basically modeling, or a large part of data science, as I understand it, involves modeling human behavior. And it seems to me that that's another really interesting thing that you can spend your time thinking about. And it turns out that the tools of astrophysics are a pretty good preparation for data science. 
You learn how to deal with statistics and uncertainty and large data sets and making statistical inferences from data and so on. So I took a look around, and I didn't know if I would even be employable beyond academia. But I applied for a few jobs and I ended up finding a job at a company called Project Florida. It's a kind of funny name, but what we were doing was building a wristband, something kind of like a Fitbit. So it would help people monitor their activity, and we were also measuring heart rate. And by measuring heart rate pretty accurately, we were able to identify people's calm and stress events. So as you go through the course of your day, sometimes you feel stressed. And it turns out when you feel stressed, the way that your heart beats changes. You might think, "Oh, your heart rate goes up." But it turns out that's actually not the telltale feature. It turns out that that can happen. But the really telltale thing is when you're stressed, your heart beats like a metronome, like perfectly regularly. Whereas when you're calm, there's a lot more variation in how your heart beats. I found this really counterintuitive when I started working on it. But it turns out it's true. And a kind of funky thing is that our company, Project Florida, ended up meeting the fate of most startups. And we ran into funding problems and we ended up going out of business at the end of May of this year. On the day that the CEO sat us down and told us that unfortunately we were out of money and we were going to be out of business soon, all of us in the company were wearing prototypes of our device. We never ended up actually selling the device, but we had prototypes. And our back-end system later that day registered a lot of stress events. Like all of us there in that room were feeling stressed. And when I saw that the data had indicated that, it was kind of heartbreaking that we weren't able to bring this product to market, because it worked really well. 
But unfortunately, to have something successful in business, you need not just a product that works well, but the finances need to work out also. So anyway, that was an awesome transition from astrophysics to data science. After that, I ended up finding a job at a company doing something totally different, a company called Stitch Fix. It's a subscription clothing company. We specialize in women's clothing. So we have professional stylists who pick out the sort of clothing to be mailed to our clients every month. And the professional stylists come in with a huge amount of expertise in what sorts of clothing look good on particular clients, what sorts of combinations of clothing work well together, and so on. We want to rely totally on human expertise where human expertise is much better than a computer. However, we also want to make use of the data that we have available where we think that can help our business. So we have a very active data science team at Stitch Fix. It's doing things to try to make the work of the professional stylists easier, where we can try to estimate what sorts of merchandise are doing well, and what sorts of merchandise we should have more of, and stuff like that. And I'm learning a whole different aspect of data science from what I was learning at Project Florida. And, I don't know, I've been on the job for about a month right now, and I'm really enjoying working on the team there at Stitch Fix. Well, excellent. You guys run a really prolific blog that I like to follow. So it seems like you're amongst good company. Yeah, we try to. Obviously, any data scientist makes use of a lot of open source software and Wikipedia and Stack Overflow and, you know, Statistics Stack Exchange. So every data scientist makes use of a lot of stuff that's publicly available all the time. And we want to make sure that we're contributing to that world of knowledge that other data scientists can look to. 
And so, yeah, I haven't written any blog posts myself, but we have a pretty active blog there, as you said. And every week or so, some data scientist at Stitch Fix will write a blog post on some machine learning technique or statistical inference technique that he or she has found to be particularly useful in doing our work and try to, you know, explain it so that both the broader community at large, and in particular, other data scientists can use it and find it helpful in their own work. Awesome. Well, Dave, I've really enjoyed our chat. I want to thank you, first and foremost, for your peer review, if you will. I think it was a very good commentary. I'm glad we revisited the topic and, as well, thanks for sharing all your expertise in exoplanet exploration and what have you. So thank you so much for coming on the show. Absolutely. Had a great time. Thanks for having me on. Excellent. And until next time, I want to remind all the listeners to keep thinking skeptically of and with data. [MUSIC PLAYING]