Data Skeptic

[MINI] Structured and Unstructured Data

Duration:
13m
Broadcast on:
21 Aug 2015

Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describes recipes.

[ Music ] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes just like this one. Our topic for today is structured and unstructured data. So data tends to be put into these two bins, structured and unstructured. Do you have any sense of what those are, Linda? Structured means it has a framework to work on and unstructured means it doesn't. But what does that mean for data? Well, that's a good question. So for data, it basically means that when it's structured, at a high level it's organized in some way. I would put some other constraints on that. Some of them are like hard constraints. Some of them are just sort of usually true. First and most importantly, that the data is machine readable and machine interpretable, so that a computer can know how to open the data and look at it. That's the readable part, and then the interpretable part means that it's organized in some meaningful way, that information is structured in there. Imagine if you had some data like, let's say you've been logging your bike miles, do you do that? No. Well, if you did and you wanted it in a file, what do you imagine that file might look like? Just the date and then the miles. Yeah. Maybe time. So date, miles, duration, and then on each line, then the next line would be the next day, something like that. Yep. Yeah, see, that's very structured because it's all clear how to read the information. An unstructured version of the same data might be if you just wrote down in English, like, today I rode 6.5 miles. It was hot out, you know, and then the next day you wrote like, oh, I can't believe I had a flat tire today. I rode this far and I got it at this location. And the same information could be there, but it's no longer machine readable because now it's in English. So even though you and I can get that information, the data is not stored in a structured way. So does unstructured mean it can be put in a structured way? Oh, good question. Yeah.
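
As a rough illustration of the bike log described above, here is a minimal Python sketch. The column names, dates, and numbers are made up for the example and do not come from the episode. Because every line has the same fields in the same order, a program can total the miles directly, while the diary-style text carries similar information with nothing to tell a program where the numbers are.

```python
import csv
import io

# Hypothetical structured log: one ride per line, same fields in the same order.
STRUCTURED_LOG = """date,miles,duration_minutes
2015-08-17,6.5,32
2015-08-18,4.0,21
2015-08-19,8.2,41
"""

# The same kind of information written as free text (unstructured).
DIARY = "Today I rode 6.5 miles. It was hot out. Next day I had a flat tire."


def total_miles(log_text):
    """Sum the miles column; this works because every row has a 'miles' field."""
    reader = csv.DictReader(io.StringIO(log_text))
    return sum(float(row["miles"]) for row in reader)


if __name__ == "__main__":
    print(total_miles(STRUCTURED_LOG))
```

Nothing comparable can be done with `DIARY` without first writing code that interprets English.
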
A lot of time is spent turning unstructured data into structured data, or at least extracting information or some representation of information from unstructured data. So I got a couple more things to say about structured data. It generally has some metadata describing it or usually like a schema. So it says what's contained in the data, like how many fields there are, what their valid values are, whether that means like it can be true, false or male, female, or it can be some integer value from 0 to 10. Because all these things are there, it makes the data processable and searchable and you can do lots of cool stuff with it. A good example of structured data or the classic example is relational databases. Never heard of. You never heard about like developers you work with talking about tables and schemas and MySQL and stuff like that? I talk about that, yeah. Yeah, that's structured data. That's relational databases. There's other types of structured data too, like a lot of XML is structured data. And one of my favorites is RDF, the resource description framework, which is a subset of XML and it's very highly specified schemas about how to store data to consistently describe often complex ideas. Let's talk more about unstructured data. Maybe photos. A photo, one could claim it's structured data in that basically the photo is stored in a file format and the computer knows how to open that and render it on the screen. So the pixel data is structured. There's no ambiguity about like how wide the image is, how tall it is or what color goes where. But let's just say you want to know something about what's in the image. Unless there's some extra metadata like someone has tagged that photo and it's hidden in the metadata of the file, then it's totally unstructured. The faces in it, what people are doing in it, there's no way a machine can extract that data directly. 
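
The schema idea mentioned above (a fixed set of fields, each with valid values such as true/false or an integer from 0 to 10) can be sketched with Python's built-in `sqlite3` module. The `survey` table and its columns are hypothetical, chosen only to mirror the examples in the conversation; this is an illustration, not how any particular system is set up.

```python
import sqlite3

# In-memory relational database with a small, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE survey (
        respondent_id INTEGER PRIMARY KEY,
        subscribed    INTEGER NOT NULL CHECK (subscribed IN (0, 1)),
        rating        INTEGER NOT NULL CHECK (rating BETWEEN 0 AND 10)
    )
    """
)


def try_insert(subscribed, rating):
    """Insert a row; return False if the schema's constraints reject it."""
    try:
        conn.execute(
            "INSERT INTO survey (subscribed, rating) VALUES (?, ?)",
            (subscribed, rating),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def row_count():
    """Number of rows the table has accepted so far."""
    return conn.execute("SELECT COUNT(*) FROM survey").fetchone()[0]
```

A call such as `try_insert(1, 7)` is accepted, while `try_insert(1, 11)` is rejected because 11 falls outside the declared range. The schema is what lets the database, rather than a human reader, decide what counts as a valid record.
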
But eventually when they do facial recognition, does that mean they have extracted it into structured data? More or less, it depends on what they're doing. So like let's say what you wanted was a database of photos that did and did not have faces. And you had really good facial recognition software and you ran it across all your photos and then you marked them, you know, true or false with has faces. Or maybe you marked them with a number of faces, where zero obviously means no faces. Then that's a process of extracting: as long as you store it in a structured way, you've extracted structured data from unstructured data. Data scientists spend, maybe not a lot of time, but a fair amount of time extracting structured information, because most of the machine learning algorithms or statistical approaches we want to take will require data to be well structured. So there's often a process of cleaning things up. Some other examples of unstructured data are like application logs. So every web app or mobile app has the option of writing log information, and usually that's just a developer that's putting a message in there. There doesn't have to be a format to what those messages look like, although there could be. Also PowerPoints, audio. Audio is a good one. There's all types of structured and unstructured data. So let me ask you a few quiz questions. Which do you think is more popular, structured or unstructured data? Unstructured. Unstructured? Well, yes, no, I guess that's the correct answer I was looking for. But now that I've thought about it, my word choice was wrong. What's more popular would imply what do people like? And I think people would tend to like structured data because it's easier to work with. But more common indeed, it's much more common for data to be unstructured. Do you have any guess as to the ratio? Ten to one. Ten to one, that's not a bad guess. I hear people sometimes guess 80%. Yeah, I don't know.
It's really hard to measure that. But I think that's a good guess. So what do you think about data in Excel? Would that be structured or unstructured? Oh, that's structured. Yeah, almost certainly structured. I mean, you could also go into Excel and just put a bunch of garbage in there. I guess you could put a diary entry in Excel. Or you could take pictures of your data and paste them into Excel. And then you'd have unstructured data that just happens to be in an Excel cell. But by and large, yeah, if data is in Excel and someone has done even a minimal amount of keying it in correctly, that's structured data. What about a recipe? It seems structured to me. They have instructions. They have measurements. And then what it is. I picked this one on purpose because it's a little tricky. The answer is it really depends. If you just have a book of recipes and, well, let's actually assume you OCR'd them, so you have the full text of all the recipes, the data is actually unstructured, because even though there are, like, ingredients, they're not in a schema. They're not laid out into... What do you mean? There's an ingredient list. They're not like in a schema. It's just words. They'll say like two and a half cups of flour. But some other recipe could say flour, 2.5 cups. Oh, seriously? It has to be told exactly the same way? Well, so a list of ingredients. Well, that's an array. That's a list of ingredients. And a single ingredient is a description of what it is, a unit of measure, and a numeric quantity of the measurement. So like flour is the description, the unit is cups, and 1.5 is the numeric value, if someone were to parse out all the ingredients in that way. So let's go back to your bike data. You wanted to record the date, the duration and the distance, right? Yeah. Do you think we should always do that in the same order or just mix them up? I'm pretty sure you do it in the same order. Yeah, exactly.
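
To make the ingredient discussion above concrete, here is a minimal Python sketch of what "parsing out" an ingredient into description, unit, and numeric quantity might look like. The `Ingredient` class and the two regular expressions are assumptions made for this example; they cover only two simple phrasings, and real recipe text varies far more than this.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ingredient:
    """One structured ingredient record: the three fields named in the episode."""
    description: str
    quantity: float
    unit: str


# Hypothetical patterns for two phrasings only.
QTY_FIRST = re.compile(
    r"^(?P<qty>\d+(?:\.\d+)?)\s+(?P<unit>\w+)\s+(?:of\s+)?(?P<desc>.+)$"
)
DESC_FIRST = re.compile(
    r"^(?P<desc>[^.,\d]+)[.,]\s*(?P<qty>\d+(?:\.\d+)?)\s+(?P<unit>\w+)$"
)


def parse_ingredient(text):
    """Return an Ingredient, or None if the text matches neither pattern."""
    text = text.strip().lower()
    for pattern in (QTY_FIRST, DESC_FIRST):
        match = pattern.match(text)
        if match:
            return Ingredient(
                description=match["desc"].strip(),
                quantity=float(match["qty"]),
                unit=match["unit"],
            )
    return None
```

With this sketch, "2.5 cups of flour" and "Flour. 2.5 cups" both become the same record, while "two and a half cups of flour" returns `None`. That gap is a small picture of why free-text recipes count as unstructured until someone does the extraction work.
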
So it would be the same if you had structured recipe or ingredient list data. It doesn't actually matter what the order is, as long as it's specified in advance. Like description, numeric value, unit. That would be okay. Recipes could be partially structured data. But then you get down to like how to cook it. And it's usually sort of conversational, right? Yeah, I don't know. I think instructions are meant to be written in a way that's very succinct. Obviously instructional. Conversation means two-way. They're not asking your opinion. Like, if you feel like it, please pulse the eggs in a blender or something? Like, no. It's like, if you want to make the recipe, they're instructing you to do this. So I wouldn't say conversational, so no, I disagree. Well, there's a little bit of structure in that they're generally procedural. One could even call it an algorithm because it's just steps. It's a process you execute. But if it were to have structure, it would need to be more quantifiable and also kind of searchable. So if the recipe instructions were totally structured, I'd be able to query your database and say, bring back any recipes that require stirring. But you'd have to like parse the text and say, does this mention stirring? How about stirs? How about stir? And it would be like this natural language processing project. So that for me is why it's unstructured, because you can't sort of query against it. You can't take statistical measurements on it. You can't say like, what's the average cook time? There's not a field that says cook time unless that was part of the metadata. Yeah, I mean, do you want to define what you mean by field? Ah, good question. So a field is a distinct part of structured data. Structured data is a collection of objects or records. And those records generally have a set of fields and maybe nested fields.
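
The two queries mentioned above, "which recipes require stirring" and "what's the average cook time", can be contrasted in a small Python sketch. The recipe records below are invented for illustration, with a hypothetical `cook_minutes` field stored as metadata and the steps left as free text.

```python
import re
from statistics import mean

# Hypothetical records: cook time is a field, the steps are free text.
RECIPES = [
    {"name": "pancakes", "cook_minutes": 15,
     "steps": "Whisk the batter, then stir in the milk."},
    {"name": "risotto", "cook_minutes": 35,
     "steps": "Keep stirring while adding stock."},
    {"name": "roast potatoes", "cook_minutes": 40,
     "steps": "Toss in oil and roast until crisp."},
]

# A crude stand-in for language processing: match stir, stirs, stirred, stirring.
STIR = re.compile(r"\bstir(?:s|red|ring)?\b", re.IGNORECASE)


def recipes_mentioning_stirring(recipes):
    """Text search over unstructured steps; misses synonyms such as 'mix'."""
    return [r["name"] for r in recipes if STIR.search(r["steps"])]


def average_cook_minutes(recipes):
    """A direct calculation, possible only because cook_minutes is a field."""
    return mean(r["cook_minutes"] for r in recipes)
```

The average is a one-line calculation because the value lives in a field. The stirring query depends on guessing every way a recipe author might phrase the step, which is the natural language processing problem described in the conversation.
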
A field is a single element, and that element has a known type, where type is sort of like the specification of what it can be, like it can be true or false or it can be a date or it can be a number or a decimal or an integer or something. So the unit's always the same? Yeah, we would call that the data type. And sometimes it would have some constraints on it, like it can only be one of a couple of values like red, green or blue, or it can only be within a certain minimum or maximum value. So an example I thought of, if there's maybe one takeaway for remembering what's structured and what's unstructured: have you ever filled out an application on a clipboard for something? Yeah, long time ago. Yeah, what? Probably like voter registration. Registration? How about a credit card application? I haven't filled out one of those on paper. I don't think ever. Really? I guess, yeah, you're right on the line. You might be. Maybe. Maybe once, but I can't remember. I only remember doing it like once or twice and you're not that much younger than me. So we might just be on the tail end of that. Anyway, you filled out certain stuff and there's like a fixed number of blocks and you have to write like one letter in each block, and then like when it comes to the date, it'll be block block slash, block block slash, which like forces you to put the date in correctly. You know what I'm talking about? Sure. That is a good analogy for structured data. Obviously you can have bad handwriting, it can be hard to read, but more or less there's no way to not rigidly put the correct values into that form. Whereas if instead the form was like, tell us in an essay why you think you should win this car or get this credit card, that would be unstructured. And actually, I think, so this is an important distinction for people to know, but more and more, I feel like we live in this world of what I would call semi-structured data.
And I don't think I'm coining that phrase, but it comes around a little bit more. Like recipes are a good example. The ingredient list could be very structured, but the instructions list, the steps, is pretty much by definition unstructured, unless you came up with this crazy sort of hierarchy of how to specify every possible action people wanted to take to capture it and stuff like that. Well, it'd have to have, you know, a noun, a verb... you know, I'm not a grammar person. Yeah. Yeah, so you can detect the grammar, but you can't necessarily read the meaning. So you're saying you can't think of a way? Well, I'm debating this actually now. It's kind of interesting. I feel like if I started this project, after a week I'd have like 70% of the cases covered, but I feel like I could never finalize it to cover every variation of what you'd need to do in a recipe. There's always going to be something weird, right? Some step that is totally unique. Yeah, I mean, I feel like some of the cookbooks have some personality, like, and now celebrate, kick back and relax or something like that, you know. See, wouldn't you call that conversational? I mean, some are, but the recipes I read are... Vindicated. Mine aren't. Where do you get your recipes? Just somewhere online. Epicurious. Yeah. Yeah, that's where I've seen most of your links coming from when we're going to cook together. Anyway, thank you, as always, Linda, for helping me think through how I would describe structured versus unstructured data, and hopefully people will also consider that some data is semi-structured. And there's a lot to be said for how we store and persist this data and how it got to be structured or unstructured in the first place, but those are topics for another day. And until that day, I want to remind everyone to keep thinking skeptically of and with data. Thanks for joining me, Linda. Thank you. [music]