(upbeat music)

- Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism.

(upbeat music)

- Slater Victoroff is CEO of Indico Data Solutions, a company whose services turn raw text and image data into human insight. He and his co-founders studied at Olin College of Engineering, where Indico was born. Indico was then accepted into the Techstars accelerator program in the fall of 2014 and went on to raise $3 million in seed funding. His recent essay, "Big Data Doesn't Exist," received a lot of traction on TechCrunch, and I've invited Slater to join me today to discuss his perspective and touch on a few topics in the machine learning space as well. Slater, welcome to Data Skeptic.

- Thank you for having me, Kyle.

- Oh, it's my pleasure. So, yeah, I first caught wind of you and your company and everything that was going on when that TechCrunch article came out.

- Oh, thank you.

- Your anecdote rang true with me. It's an experience I've had as well, where customers commonly exaggerate how much data they have.

- Absolutely.

- I'm curious if you have any thoughts on why this is so common.

- Yeah, if I had to guess, there was some transition that happened in the last couple of years where data went from something that was a nice-to-have, maybe a little cherry on top for legitimacy, to something that could, in many cases, entirely replace it. So you had companies showing up that, if you looked objectively at their revenue numbers, at their user growth, were not very attractive companies, but people sort of wised up to the fact that if you added data to that, it became the magic sauce that turned an average company into the next Google. And I think that kind of spiraled out of control. Everyone saw that more data was better, and when you saw the rise of Hadoop and Spark and YARN and all of those frameworks that made it very easy to deal with large amounts of data, people saw it very much as a badge of honor, a sign of legitimacy, to be working with more data. What I've found is that when you actually dig behind the curtain a little bit, the data stores out there on the scale people typically claim for their own data, you know, terabytes and petabytes, are just very few and far between. Even if you looked at all of the text data presented on every social media channel, Facebook, Twitter, Tumblr, Disqus, WordPress, and keep going down the list, you might get to terabyte-level data. You're still not going to hit petabytes, even with all of the text data that exists in the world. I think there's a piece here where the reality and the perception have diverged. People hear those petabyte and exabyte numbers all the time, and they assume that if they're not at that level, then they're missing out.

- Yeah, it makes sense. There's something intuitively appealing about "more is better," right? But I'm also a big fan of the law of diminishing returns. So I'm curious, how do you think about what's the right size of data for a company or an organization to hold?

- So I think it depends very much on which side of the equation you're on. There's the consumer of whatever machine learning algorithm is being produced, and there's the trainer of whatever's being produced. We have a rule of thumb that we use internally.
On the training side, where we're building these models ourselves, it's that for an image model you need approximately 10,000 examples to get a reasonable signal. On the text side, across the board, we usually just add another factor of 10x; text is just generally much harder to work with than images are. That said, on the consumer side, if you're not the one actually building those models, you can take that number down quite dramatically. Ideally, you should be able to get actionable insights from a few hundred examples. If you go north of that, obviously there are a lot of benefits that can come from analyzing large amounts of data, but we've found that once you get to 100, you're really getting a large portion of the analysis that you'll ever get. And it's much more useful to have 100 tweets from each of 1,000 users than 1,000 tweets from a single person.

- Yeah, it makes a lot of sense. It seems to me that the big data discussion is also kind of a distraction from the value we might find in small data. Perhaps this is a bit philosophical, but I'm curious to hear your thoughts on how we measure the value of a data set, both big and small.

- That's an excellent, excellent question, and a very hard one to answer appropriately. The truth is that people today still have no clue what their data is worth, and it has very little to do with size. The best example I can think of, and one I always go back to, is clickstream data. Clickstream data is a common case where people have gone out and gathered terabytes and petabytes of data, and been left wringing their hands, saying, "Now what on earth do I use this for?" Even though that is an extraordinarily massive data store, it's worth almost nothing. On the flip side, look at ImageNet. ImageNet is approximately 1,000 images in each of 1,000 categories, so it's about a million examples, a few dozen gigabytes. The value of ImageNet comes from the fact that every image is tagged with that metadata: this is a picture of a dog, this is a picture of a cat. ImageNet has fueled image research in academia for close to a decade now. It's effectively a priceless data set, even though it's far smaller. So I would say the value of data is not related so much to size; it's far more related to the quality of what's in the data set. For a given quality, obviously more is better, but the value you get from ImageNet going from zero to a million examples is far more than you'd get going from a million to 10 million. To get that same jump in value again, you'd have to go from a million to a hundred million examples, and then it would be maybe twice as useful as the original. It again gets very much into those log-scale diminishing returns, as you said.

- There's a paper I frequently quote by Banko and Brill from Microsoft Research titled...

- "Scaling..."

- Absolutely. Yeah, so you know it. "Scaling to Very Very Large Corpora for Natural Language Disambiguation." Given your expertise in NLP, would you maybe summarize the paper and share any additional thoughts you have on the result?

- Absolutely. So the paper is relatively simple. They take a number of NLP algorithms and see, effectively, how the accuracy increases with larger amounts of data. Big surprise: what they find is that the more data they use, the more accurate the models get.
What you might not realize, and what most people don't realize, is that the amount of data they're dealing with in that paper, even at the very summit of what they're doing, is still very, very small. The largest model they train is on a billion tokens, which sounds like a massive number, but if you start putting that back into bits and bytes, it's less than half the size of Wikipedia. It's really just a tiny portion. So even though they call that "very, very large" data, the first thing I would say is, yes, you definitely do get more value the more data you put into something. But already at that relatively small scale, somewhere around 10 million or 100 million tokens, you see the returns start to flatten off. The curve everything takes is a very quick acceleration to a decent level of accuracy, and then a very slow progression from there. You get a roughly linear increase in accuracy for an exponential increase in data. So usually what you'll see is that if you 10x the amount of data you have, you'll consistently get another 4% of accuracy, and if you 10x it again, you'll get another 4%. That would actually be quite good, but it's very clearly diminishing returns, and it becomes much less about how we end up getting more data. I mean, that is obviously a huge problem, but after one or two orders of magnitude you very quickly get beyond the amount of data that exists. So it becomes all about pushing that linear growth as far down the data pipeline as possible, which is why you see a lot of people moving to algorithms that can make better use of larger data. But even then, and I just need to say this, the biggest text models in the world today are barely at the edge where it even makes sense to use Hadoop.

- Yeah, there's always that overhead, I guess.

- Yeah, absolutely. Hadoop's own internal guideline is that unless you're dealing with at least five terabytes of data, you're going to see better performance out of Postgres or MySQL or a more traditional data store. We've run a lot of tests internally that show that to be approximately true. There are a lot of niceties that Hadoop gives you, but I think it's a place where people have let the buzz get ahead of the actual value it provides. People use Hadoop all the time for text purposes, but to be perfectly honest, you're not doing yourself any favors unless you're working at a Twitter or a Facebook, where you actually are dealing with terabytes of text.

- Yeah, it makes sense. In a similar vein, a lot of deep learning researchers claim that some of the impressive advancements in their area are due in large part to large labeled data sets.

- Yeah.

- Yet as you pointed out in your article, those can be prohibitively expensive.

- Absolutely.

- I'm curious if you have any high-level advice for an entrepreneur or CIO trying to right-size their data strategy.

- Well, let me see. You have to hustle. You really have to hustle. There's no free lunch when it comes to data out there. The advances we've seen in deep learning from getting to larger data sources are few and far between, not because the advantages aren't there, but because these large labeled data sources are, exactly as you've said, prohibitively expensive.
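To make the diminishing-returns shape concrete, here is a toy sketch. The "roughly 4 points per 10x" figure is Slater's rule of thumb from above; the starting accuracy and starting data size in the code are made-up numbers purely for illustration, not results from the paper.

```python
# Toy illustration of "linear accuracy gain for exponential data growth".
# base_acc and n_base are hypothetical; gain_per_10x is the rule of thumb
# mentioned in the interview.
import math

def toy_accuracy(n_examples, base_acc=0.70, gain_per_10x=0.04, n_base=10_000):
    """Hypothetical learning curve: +4 points of accuracy per 10x more data."""
    return base_acc + gain_per_10x * math.log10(n_examples / n_base)

for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>12,} examples -> ~{toy_accuracy(n):.0%} accuracy")
# 10,000 -> ~70%, 100,000 -> ~74%, 1,000,000 -> ~78%, 10,000,000 -> ~82%
```

On a curve like this, a thousandfold increase in labeling effort buys only about twelve points of accuracy, which is the economics behind the "you have to hustle" advice that follows.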
The honest advice I would give any company trying to train their own models, trying to right-size their data solution, is that if you're relying on crowd labelers, you're probably making a mistake. Academic sources from five or ten years ago, because they had nice grants and could afford to throw money around, have already gone through and made a lot of these data sets for us. So first, you can train on those, prototype, and see how much value that extra data gets you. The other piece, going back again to diminishing returns, is that as an individual you can label a thousand pieces of data and it's not going to cost you that much money. Moving from 1,000 to 100,000 is going to cost you, if you're a really small company, approximately all of the funds at your disposal. And chances are that you can get from that 1,000 to 100,000 by doing data mining, by thinking very carefully about where that data already exists. So honestly, the advice I would give to entrepreneurs is: don't crowd-label. Not only is the quality low, but the expense is extreme. You're only really going to have a tractable problem if you can find an existing data store out there that works well, and thinking cleverly about where that data store can come from is the hard part.

- You know, it's interesting, I've had similar experiences, having grand hopes that I could use Mechanical Turk workers to label my data sets, having that not quite work out, and saying, "Well, let me do a best three out of five." And that was maybe a tiny bit better, but still not perfect.

- And then you've already spent five times the money.

- Yeah. And I'm not trying to pick on the Turkers in particular, but is there something wrong with crowdsourcing? Do we need to have experts, you think, with backgrounds in taxonomy doing labeling? Or is there a middle ground? What's one to do if they want to create a labeled data set?

- So I totally see where you're coming from, and I would flip that on its head. There's an example that I always go back to. Earlier in my career, I did a lot of work in education technology, and education technology is no stranger to taxonomy. I was working at Pearson, and Pearson was trying to do this whole push to move a lot of their printed resources into the digital world. One of the things they did to make that transition a reality was to go out to a whole bunch of teachers. They called these people subject matter experts: people who theoretically know very well how course material fits together. And they tried to make this big taxonomy. It was a prerequisite graph saying you have to learn pre-algebra before you learn algebra, broken down into a couple thousand very, very modular pieces. But here was the fascinating thing: when you made this dependency graph of what you had to learn first, you got cycles. I don't know exactly how technical all the listeners are, but that means it's impossible to learn. It means that, fundamentally, the experts did not know what was happening, and the experts didn't agree. So what I would say is that if experts don't agree, what's the chance that an average person on the street is going to? And I would take that a step further and say the solution is not to try to improve that quality, but instead to learn how to perform with lower-quality data.
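As an aside on that prerequisite-graph anecdote: checking whether a set of expert-supplied prerequisites can be ordered at all is a standard topological-sort problem, and a cycle is exactly the failure Slater describes. A minimal sketch, with hypothetical course names and edges:

```python
# A prerequisite graph ("learn pre-algebra before algebra") must be a DAG.
# If expert-provided edges contain a cycle, no valid learning order exists.
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

prerequisites = {
    "algebra": {"pre-algebra"},
    "pre-algebra": {"arithmetic"},
    # An erroneous expert label that closes a loop:
    "arithmetic": {"algebra"},
}

try:
    order = list(TopologicalSorter(prerequisites).static_order())
    print("Valid learning order:", order)
except CycleError as err:
    print("Cycle found, curriculum is not learnable as specified:", err.args[1])
```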
So there's been a really large migration away from the so-called gold standard corpora, the ones that are human-labeled, often with three to five layers of checking and double-checking. That was really good for the old style of machine learning, where you were very sensitive to individual examples and people were going after 100% accuracy. But we've realized that's just not how it works: humans are not 100% accurate, and it's a flaw to believe that they are. So I believe what you should do instead is learn how to deal with messy data. This is another piece that's really driving a lot of deep learning forward: when you look at models that are robust to bad pieces of data, deep learning typically comes out ahead of the competition. Not always, you know; in the world of machine learning there are no absolutes. But these more general-purpose models, which aren't necessarily relying on handcrafted features but are learning a lot of that internally, learn how to filter out the noise on their own. And people are actually shockingly bad at filtering out that noise themselves. As a rule of thumb for sentiment analysis, we say people are usually about 95% accurate. We have a model that hits 93.5%. At that point, clearly the problem is not that we need more examples labeled at 95% accuracy; it's much more that we need significantly more lower-quality data.

- Yeah, that makes sense. Maybe we could unpack that a tiny bit for people who might not be totally familiar with sentiment analysis. Could you walk through what the problem is and why it would be that humans are only at 95%?

- Absolutely. So the problem of sentiment analysis is taking a piece of text and guessing whether it is positive or negative. The best example I can give is, if you look at Amazon, imagine reading a review and guessing what the star rating was. That's effectively sentiment analysis. It's something that humans are, not surprisingly, really quite good at; it's something where we can guess with about 95% accuracy how many stars someone actually gave. But the issue is that there's what people call context dependency, or subjectivity. There are a hundred different names for it, but the truth of the matter is that it's not a definitive one-to-one mapping. I might think that the word "incredulous" is extremely positive, while someone else might have a much more negative connotation for it. Just because I was incredulous at what this product was actually able to deliver, and I mean that very positively, someone else might come along and read it as, "Wow, this person cannot believe this was such a bad product." They interpret it slightly differently, and so you see those errors. It's usually around the more neutral classes that you see people getting really confused. IMDB movie reviews are one of the canonical problems in the field, one of the academic data sets we were talking about earlier. What you see in some of these reviews is long philosophical diatribes about how the movie made them recall their childhood, but maybe their childhood wasn't that good, or was a little bit dystopian. At the end you're left thinking, okay, there were some positive points and some negative points and I didn't understand a third of what he said, so... three stars? And the person's like, "No, actually I quite liked the movie. I gave it seven stars." There's enough subjectivity in a lot of these problems.
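A toy illustration of the "learn from messy labels" point, using a fully synthetic corpus and an off-the-shelf scikit-learn classifier rather than anything Indico-specific: even with a fifth of the training labels flipped, a simple bag-of-words model still recovers most of the signal on a clean test set.

```python
# Entirely synthetic sketch: corrupt 20% of training labels and check that a
# basic TF-IDF + logistic regression pipeline still generalizes.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

random.seed(0)
POS = ["great", "loved", "excellent", "wonderful", "perfect", "amazing"]
NEG = ["terrible", "hated", "awful", "broken", "disappointing", "waste"]

def fake_review(label):
    """Generate a crude positive (1) or negative (0) 'review'."""
    vocab = POS if label == 1 else NEG
    return " ".join(random.choices(vocab, k=8))

labels = [random.randint(0, 1) for _ in range(2000)]
texts = [fake_review(y) for y in labels]
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25, random_state=0)

# Flip 20% of the training labels to mimic noisy, non-expert annotation.
y_noisy = [1 - y if random.random() < 0.2 else y for y in y_tr]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_tr), y_noisy)
print("accuracy on clean test set:",
      accuracy_score(y_te, clf.predict(vec.transform(X_te))))
```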
You can actually start asking a question that gets a little bit philosophical: is the goal of the machine learning algorithm to reflect some kind of objective reality, or to mimic whatever people say? And I think, by and large, as an industry we've gravitated towards the second of those two options.

- Yeah. You'd said something, I can't recall if it was in the article or elsewhere in my research, but an astute comment, I thought, was that bigger data sets aren't necessarily better, but having metadata is going to drive a lot of value. So we've talked a bit about labeling. Are there other aspects of metadata that you've been able to leverage that bring enhancements to your analysis?

- This gets a little bit into the technical details, but a lot of what Indico really believes in is this newer notion within the deep learning buzzword, if you will, called transfer learning. The concept of transfer learning is that the majority of what a machine learning algorithm learns as you teach it is not actually specific to the problem itself. There's a useful paper on this; I can send it to you afterwards. It's actually very fascinating. For a long time there was this data set that's just cats versus dogs. I believe the highest accuracy we were able to achieve on it was about 85%. The problem is: show a computer a picture of a cat or a dog, and it has to guess which one it is. Extremely hard problem; 85%, and we were really quite happy with that. Then someone looked at the internals of the model and had this idea. They said, okay, I'm going to take a different data set. So they used ImageNet, which is a thousand classes: planes, trains, cars, boats, also a lot of cats and dogs, for what it's worth. They trained a model on that, and it didn't do extremely well, but that wasn't really the point. What they did then was take that model trained on ImageNet and keep all of its internal state, basically how the model had learned to look at images, the higher-level features it recognized. They chopped off the input and output layers and said, okay, now we're going to give this a much smaller portion of the original cats-versus-dogs data set. And what they found was that the accuracy actually skyrocketed. I want to say that with approximately a hundredth of the original data set, they were able to hit 95% accuracy. We'll have to double-check those numbers; I'm at least 80% confident those are right.

- That's fascinating.

- Obviously with more data it continued to increase, but the idea here is what happens when you take something, a convolutional neural net specifically in this case. The first layers of the convolutional neural net learn to recognize very low-level features, you know, granular areas, smooth areas, nothing that's really understandable to a human. But then you stack these layers and you stack these layers, and, you know, this is why it's called deep learning, you keep stacking. At each level you get a little bit higher, a little bit more abstract, and a little bit closer to something that really matters. So you go from, oh, this is a corner detector, this is a curve detector, maybe a circle detector, all the way up to, oh, that's a stripe detector, that's a tail detector.
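A minimal sketch of the layer-reuse recipe being described here, written with PyTorch and torchvision (tools the interview does not mention, chosen only for illustration, and not the original cats-vs-dogs experiment): freeze a network pretrained on ImageNet and train just a small new output head on your own, much smaller labeled set.

```python
# Transfer-learning sketch: reuse pretrained ImageNet features, retrain only
# a tiny new classification head. Requires torchvision >= 0.13.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

for param in model.parameters():      # keep the pretrained feature layers fixed
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)   # new head, e.g. cat vs. dog

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training loop over your small labeled dataset (dataloader not shown):
# for images, labels in dataloader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

The design point is that the frozen layers already encode generic visual features, so the only parameters your small data set has to pin down are the handful in the new head.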
Those last-layer features respond to really high-level properties, like how cat-like the image is. And what they found is that once you get to that last layer, where you're responding to those really high-level features, those features can actually be used for a wide variety of tasks. You have to use far fewer images to train your model, because it already understands how to look at images. And in a roundabout way, what that means is that any metadata is useful. Take that original image case: if we want to build a beach classifier, say, the obvious way to approach the problem is, okay, let me find 10,000 or 100,000 examples of beaches and not-beaches, forests, deserts, things like that. Deserts might be a little hard. But what this transfer learning approach says is that all metadata is useful. You can go out and train a model on a thousand different classes, none of which are beaches or forests or anything even related to the problem you're working on, and actually use what you learned classifying that other metadata and apply it to your original problem. So, you know, you could go out and, let's say, scrape a bunch of Tumblr blogs, and Tumblr's actually pretty restrictive, so it's probably not too easy to do, but get the tags associated with all those images and say, okay, I'm going to build a classifier on that, because that will teach my model to recognize things that are intuitive to people. And then use what that model learned, even though it was very different metadata, to train the beach classifier you originally wanted to make. That's a sample of how that metadata really gets used: it allows your machine learning models to learn in a way that, even though it might not directly be the problem you're solving, is very critical to the underlying state of the model. There are all sorts of other pieces around geolocation data and timestamp data that can be useful for fundamentally different kinds of analysis. The point is that text data in and of itself is nearly useless to anyone doing training. If you want to do interpretation on top of it, well, that's where you turn to an Indico, and we don't necessarily need the metadata, because we're giving you new kinds of metadata. But the short version is: without metadata, there's shockingly little value.

- Yeah. Let's talk a bit about Indico. Maybe to start, can you tell me what inspired you and your co-founders to start the company?

- To be honest, it was a little bit of an accident. I mentioned that I worked in ed tech at Pearson, and one of my real claims to glory was that I came up with a system that hit state-of-the-art in named entity recognition. It's certainly not state-of-the-art any longer. But then I got a little bit jaded, because I had built this thing and put my heart and soul into it, and it got locked up in a Pearson IP vault, which is their right, but it was discouraging to me. And then my co-founder found me, and he was thrilled. He had just found this new site, Kaggle.

- Sure, of course, yeah.

- This was one of the very early whale sound competitions, and he reached out and said, "Hey, I heard you know a little bit about machine learning. I'm trying to use a neural net to recognize where in these recordings there are and are not whale sounds." I didn't actually know neural nets; I was a big support vector machine guy back then. I've since changed camps.
I gave him some advice, but he was so passionate about it that he drew me back in. And actually, this co-founder is Alec Radford. A lot of people, I'm sure, know him; he's very popular in the deep learning community. But the point is he was so passionate that, even though we had no idea what we were doing in the early days, we were like, "Look, we have to do something here." So we started recognizing the same problems again and again, and our first step was just saying, "Okay, we don't want to build the same thing again for multiple customers." But then we realized that that betrayed a much deeper problem, which was that there was such a deep lack of knowledge and understanding about machine learning. And this was totally new for us. We had been so immersed in this world for such a long time that we assumed everyone knew about it. And this was something that had given us so much joy in our lives, something we both believed in really deeply. So the mission of Indico became, as we went through those first couple of contracting jobs: how do we educate people about this? How do we get this out to the world? How do we build tools that will allow a typical developer to interact with this without having to know the details of a support vector machine? Luckily we were developing in Python, so a lot of these pieces were easy, but a lot of this also came right as the major libraries were being established. You know, when we first started work, Theano was something with 50 GitHub stars. We started really exploring this world, and that's the trajectory we've been following ever since: we want to bring machine learning to everyone. It's something that's given me and my now three co-founders immense joy in our day-to-day lives, something we immensely enjoy looking into and working on, and something we want to bring to the rest of the world. We see Indico as the best way to do that.

- Excellent. So can you tell me a bit about your specific services? How can people leverage you guys?

- Yeah, absolutely. So we focus on text and image analysis. At the very highest level, what we're attempting to do is take text and turn it back into people. One of the things we say internally is that the digital age turned people into 140 characters; Indico is how you go back. Given some text written by a person, the idea is that Indico can return all sorts of actually useful human information about them: their political alignment, what they're interested in, whether they're talking positively or negatively. So sentiment analysis is just the tip of the iceberg, if you will. As for the main verticals we've found this applies well in, marketing is a very obvious one. A lot of people making marketing tools are interested in not just knowing, oh, tweet at 8 p.m., don't tweet at 2 in the morning, but actually, hey, your followers are really engaged whenever you talk about investment, especially when there's really negative sentiment associated with it, so you should find more things to talk about in that realm, because people seem to be enjoying it a lot. That's sort of what we offer on the text side. On the image side, a lot of those transfer learning pieces are really what we're playing with.
We've found that it's actually possible, once we've gone out and gathered very large data sets with useful associated metadata, to make a really powerful base model that then allows people to train models with a tiny, tiny amount of data. I mean, we've put together demos where you can get models that will tell the difference between two people from fewer than 10 images. That one's very much in beta, but it's something we're exploring with a couple of select companies. We really see it as one of the big futures for images, since it is so cost-prohibitive to get these large data sets. The idea is that we will spend the time and energy to get them and create models that allow our users not to have to.

- How big of a company would I need to be before I'm past the barrier to entry to try out your services?

- That's actually one thing that we feel very strongly about: anyone can try our services, from an individual working in a dorm room, and we've actually found that a lot of our tools are easy enough to use that even high school students have used them to win hackathons, all the way up to Fortune 500 companies. We really do believe that the core abilities of machine learning are applicable to everyone. The only difference is that bigger companies have more data.

- So certainly I'd expect data scientists to be able to make use of APIs. How accessible have you found your APIs to be for software engineers without a background in machine learning?

- That's almost everyone who ends up using our product. We care very, very deeply about usability. One of the things that made us very frustrated when we were first getting into this space is that it's not like there was no one else with offerings like this, right? There's Lexalytics, Semantria, AlchemyAPI, a few of those other big players, but the usability was just awful. One of them in particular has this huge queueing system where you're waiting several seconds to get back an API response, and you have to write 30 lines of Python to actually get the thing to work. We cut basically all of that out. First, we made everything available through a very simple call-and-response REST API. We figured, okay, that's a good first step for usability, but we decided that wasn't enough. So then we went and made client libraries. What the client libraries do, for most major languages (we support six today), is wrap that REST API so you can interact with it just as you would a normal function call. The idea is you call indicoio.sentiment on whatever string of text you want, and it just works. Our typical integration time with a customer is less than a day. We get a lot of traction at hackathons just because it is so easy to get up and running. And I can tell you, these are not people with degrees in machine learning; these are not people who even understand what the models are behind the scenes. That's exactly what we think is so powerful about it: without having any understanding of the underlying models, you can still understand the value and you can still use something that's high quality. And that's something we didn't find being offered elsewhere.

- So what's on the cutting edge for you guys? What's coming next?

- There's a lot of really cool stuff on the horizon. One piece is a large push for multilingual support.
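As a usage sketch of the client-library style just described: the `indicoio.sentiment` call is quoted directly from the interview, but the API-key configuration line and the return format shown here are assumptions, so treat this as illustrative rather than current documentation.

```python
# Illustrative only; configuration style and return value are assumed.
import indicoio

indicoio.config.api_key = "YOUR_API_KEY"   # assumed configuration mechanism

score = indicoio.sentiment("The onboarding took five minutes and it just worked.")
print(score)   # presumably a score near 1.0 for positive text, near 0.0 for negative
```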
I'm sure you're aware, but almost all of the literature today in machine learning is English-centric. Maybe you get lucky and find a paper on part-of-speech tagging in Arabic, maybe something in Chinese, but by and large it's all English, and there's been very little study of how these techniques actually scale across languages. A lot of companies have intentionally used older techniques because the path to multilingual support is clearer. What we've found is that you don't actually have to do that. We've gone out and found our own sources of relatively high-quality data that is natively produced by people in those languages. So we plan on continuing that push into the multilingual realm, not just adding more languages, but also adding multilingual support across multiple APIs. Today we support sentiment analysis and keyword analysis in approximately 10 languages: Arabic, Chinese, Japanese, Italian, German, Spanish, I'm sure several others I'm forgetting, and English, of course. And we've found sources for several more. So that's one big piece. The other piece is around visualization and data ingress. What we've found is that the predictive piece of the model is good and useful, and those APIs, we still believe, provide a huge amount of value. But we've also found that almost everyone still has two issues, and these are really at the core of machine learning as a field. One is data ingress: I have data somewhere else, how on earth do I turn it into a reasonable format? Twitter is a really great example, where everyone wants to run these algorithms on Twitter data. There's a lot of very useful information trapped there, but the API is awful to work with. Gnip is significantly better than the public Twitter API, but a lot of people have huge difficulty sorting through it. And it's even worse if you go to individual service providers. So, you know, I talked about Twitter because marketing was really the first area we went after, but we're also doing a lot of exploration into customer support, and in the customer support space, learning how to integrate with the Zendesk or Help Scout API isn't actually trivial. So making that process much easier is something we see as very valuable. In addition to the data ingress piece, we also believe that visualizations are extraordinarily powerful. We actually have one demo of this up already, at indico.io/thumbprint. What we believe is that getting the human insight is step one; step two is translating that into something that can be understood by totally non-technical people. So we're making a series of APIs available that will not only do sentiment analysis, but also return very attractive visualizations based on that sentiment analysis. You could, say, ingest your customer support data and immediately get a map back showing you the hottest pain points in your product, or feed in a person's Twitter handle and get back a report of which subjects they are most active in and what sentiment they associate with each one. And I suppose the third area is something we've started calling near-human accuracy. We're generally very dissatisfied with the quality of the services offered today. If you look at a lot of the common players, they're talking about accuracies in the realm of 75%, and that's just not useful if you actually want to do analysis on small amounts of data.
We see improving the accuracy of models as an absolute requirement for making small data useful. So we're now striving for what we call near-human accuracy for all of our APIs, meaning something that performs similarly to a human in the majority of cases. The idea is that if you look at the response, you say, yeah, that's pretty close to what I would have said as a person, in the majority of cases.

- Excellent. Well, I'll be sure to put some links in the show notes to some of the visualization stuff you've mentioned, if it's available, as well as to how people can sign up and try out some of the APIs if they're interested.

- Absolutely.

- Cool. Slater, thanks so much for joining me today. I think this has been really interesting, and I'm actually gonna go play around with a few things and maybe put some of what I tried in the show notes.

- Ah, great, I'm glad to hear that. If you need any help, just ping us. I happen to still personally handle support, and we respond within 24 hours to 99.7% of tickets.

- Fantastic. All right, well, thanks again. Have a good day.

- You too, Kyle.

- For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.

(upbeat music)