Archive FM

Data Skeptic

ContentMine

Duration:
53m
Broadcast on:
28 Aug 2015
Audio Format:
other

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of humankind. The project's founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

[ Music ] >> Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini episodes. [ Music ] >> Peter Murray-Rust holds a doctorate from Oxford University with interests in crystallography and informatics. He is currently a reader emeritus at the University of Cambridge, and formerly a senior research fellow at Churchill College. In addition to his work in chemistry, he's also known for his advocacy of open access and open data, which led him to found ContentMine, a project that we'll discuss today, which uses machines to liberate more than 100 million facts from scientific journals. Peter, welcome to Data Skeptic. >> And thank you very much for inviting me. >> I'm really glad to have you. Well, maybe we could jump right in and you could share the purpose and the mission of ContentMine. >> Well, first of all, ContentMine is about justice. The access to scientific information in the world is fundamentally unjust, with only a very small proportion of the world's population being able to access any significant amount. And ContentMine is there to redress the balance as far as it can. More generally, we're creating information from the whole of the scientific literature, making it available to everybody. In this process, we hope to build a community. We think communities are very important in the current century, and this will be a community not only of scientists, but of anyone with what people call a curious mind, and they will come from all walks of life because they're keen to get scientific information and make it useful. Finally, much of the information is currently only available to sighted humans. If you cannot read it, you cannot access it properly, and we're interested in increasing the machine processability and accessibility. >> Yeah, I think this is a fantastic mission, and in my opinion, a necessary one in the current day and age. For anyone who might not be familiar with scientific journals and the process of publishing, could you summarize how scientific findings go from a researcher or a laboratory where they're discovered into academic publications? >> Most people are either funded by a number of public or sometimes private funders, or they're students studying for masters or doctorate degrees. Regardless of that, every scientist should keep a lab book where they record everything they do, both their observations and also their conclusions, and when they feel that they have got something that the world needs to know about, they then write a draft paper. Now, this is usually done in a group. So it normally includes a supervisor, some colleagues, and maybe people from other institutions. So although there are a number of single author papers, most of them are published by a group of people. They write a draft. When they're happy with it, they choose a journal that they think is useful to publish it in. They submit it. The journal sends it out for review to a number of reviewers, somewhere between two and five, who then make comments. Now, sometimes the comments are that this paper is not acceptable for publication, in which case the authors may challenge this and you will get an extended debate. Sometimes the reviewers suggest minor corrections and the authors correct them. Sometimes the reviewers say that there should be more experimental work done and so on. Finally, the authors will have a draft which the journal will accept, and then the draft is submitted to the publisher, who then turns it into a production copy.
Now, that's the current position. I have serious concerns with some of it. The idea that you only publish at particular points during research is something which is increasingly out of step with the way that this century thinks. So, in software, we publish our software several times a day to repositories. And there's a growing movement of people, myself included, who want to do open notebook science, where the notebooks are visible to the whole world as the experiment is done. And we hope to do this in ContentMine. We hope that anybody can see what we're doing at any stage, comment and get involved. But the mainstream, 99.9% of science, is done in groups who do pieces of work, write it up and then send it off to journals. I should also say that publication is not guaranteed. Many journals have a rejection rate of somewhere around about 50% or more, and some of them reject 95% or more. The rejection isn't always because the work is poor. It's because the work doesn't fit the excitement that this journal requires from its authors. And again, many of us feel this is a huge waste of effort to require people to resubmit papers several times until they find a journal which finds their work fit for publication. Interesting. Open notebooks are an especially novel concept, I feel. And although I imagine you can't speak for every scientist, is there a concern that someone working on interesting stuff might have their work, you know, sort of scooped out from under them by someone who's able to work a little faster because they're sharing notes? There is certainly concern about that. It will take a long time before the majority of people come around to an open view. And indeed, it may well be that the majority never does come around. But one example which is being pursued at the moment is Matt Todd, who's a chemist in Sydney, Australia. And he's running an open malaria project developing new chemical compounds to fight malaria, that is, fighting the malaria parasite. This project is run completely openly. And as a result, he has a lot of people contributing who wouldn't normally be involved in a scientific project. Yeah, that's a very valuable trade-off, I would say. Let's for a moment make the obviously incorrect assumption that people have access to scientific publications, which I guess would usually mean I either have a printed version of the journal or a PDF or PostScript file. And those file formats are pretty easy to view with free software. What else might someone want to have available that isn't readily available in that final publication version of a paper? PDF is the normal mechanism that people use to communicate their published research. I note that it isn't what they use in theses, for example, where they use Word as an authoring tool, or in the more physics and maths communities, where they use LaTeX. But generally, the result is a PDF. Now PDF has one advantage, which is that it looks pretty and is immutable. It's often referred to as the version of record, but it has many disadvantages. It's a very difficult format to pin down, so having a PDF doesn't mean that you can cut and paste from it. It could mean that it's simply a photograph, a scanned copy of some typescript or whatever. So it's a very, very general term indeed. The main thing that it doesn't offer is machine processability. So machines generally can't process PDF and extract useful information from it. Since it requires human eyes, it can be very slow to access the data.
To give an example, we have one group we're collaborating with, which has to read 30,000 papers a year for systematic reviews of trials in the literature. And that means one paper every three minutes. Now, since some papers are 20 pages, you can see that this isn't effective. And so we're building software which can read this and come up with the key points in two or three seconds. The final thing you don't get is supporting data. So PDF is not a good way of publishing data; even if the data is in the PDF, you don't know where it is from a machine point of view. It can't be cut and pasted. Some journals, not all, support this with supporting data or supplemental information. We are able to download this systematically, and that's often very valuable. Yeah, absolutely. From what I understand, there's also trouble with tables; I guess if I had infinite time, I could copy, you know, sort of transcribe a table. But figures often leave me wanting the data that's there. Are there tools available that can help me liberate the data that's behind a plot that appears in a document? This depends on the format that it's in. And so we should say that documents can range all the way from handwritten or typescript manuscripts, through text in PDF, through to machine-processable files. It depends very much what they are. The most general ones that we deal with, from the last 15 years of publication, contain PDF text, which is extractable. But here you have to remember that PDF has no sense of order. So if I take the word table, you read this as T-A-B-L-E, right? And in Word or HTML, those letters are produced in that order. In PDF, they can be in any order. It's where they are on the page that matters, which means that in the first instance, you have to create a tool which works out where they are on the page, and whether there's enough space between them to separate words or not. So it can be very difficult. When you come to tables, it's even more difficult, because you've got things in columns, but there's nothing in the PDF that says this is a column. Sometimes you have some lines between columns and that helps a lot, but often they're just justified with white space running down the columns. Many people have thrown themselves at extracting tables. I would estimate that probably several hundred person-years have been spent in the world on trying to extract tables from PDFs. Yeah, I've contributed some of that time myself. Yeah, exactly. Can you share a little bit about what tools are currently available, if someone is struggling with that at the moment and they might be able to leverage some of the things you guys have developed? It's probably a good idea to come and find our tools, because we deal with most of the cases. The problem is that some PDFs are quite easy to deal with and some are extremely difficult. And until you're familiar with that, you may flounder. So what you will find out by visiting us is: is it only available as a bitmap or raster? Does it have characters inside it? Does it have any form of organized order and spacing of characters in the words? Are the figures present as a bitmap or as a vector diagram? The latter is really valuable. If you actually have vector diagrams, which you do if you create the documents in Word or in LaTeX, then you can get a huge amount of data out. But unfortunately, the publication process often turns these into rasters, because the publishers think that's a better way, and that destroys all the vector information, making it much less valuable.
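To make the reading-order problem Peter describes concrete, here is a minimal, illustrative Python sketch of the very first step such a tool has to perform; it is not ContentMine's code, and a real extractor would also need font metrics, page rotation and column detection. It groups glyphs into lines by their y coordinate, sorts each line left to right by x, and inserts a word break wherever the horizontal gap between glyphs is large.

# Illustrative only: recover words from PDF-style glyphs given as
# (character, x, y, width) tuples in page coordinates, which may arrive
# in any order in the content stream.

def glyphs_to_words(glyphs, line_tol=2.0, gap_ratio=0.4):
    """Return the words on the page, top line first, left to right."""
    # Group glyphs into lines: glyphs whose y positions round to the same bucket
    lines = {}
    for ch, x, y, w in glyphs:
        lines.setdefault(round(y / line_tol), []).append((x, ch, w))

    words = []
    for key in sorted(lines, reverse=True):     # larger y = nearer the top of the page
        line = sorted(lines[key])               # left to right by x
        avg_w = sum(w for _, _, w in line) / len(line)
        current = line[0][1]
        for (x0, _, w0), (x1, ch1, _) in zip(line, line[1:]):
            # A gap much wider than a typical glyph means a word break
            if x1 - (x0 + w0) > gap_ratio * avg_w:
                words.append(current)
                current = ch1
            else:
                current += ch1
        words.append(current)
    return words

# The letters of "table" supplied out of order, as a PDF is free to do:
sample = [("l", 30, 700, 5), ("t", 10, 700, 5), ("b", 22, 700, 6),
          ("a", 16, 700, 6), ("e", 36, 700, 5)]
print(glyphs_to_words(sample))   # -> ['table']

Even this toy version shows why tables are so hard: a column is just more white space, and nothing in the file marks it as structure.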
Interesting. So while we're on the topic, is contentmine.org the best place to go for people to learn more about the tools and resources available? Absolutely. We started off by running workshops, which we still do. The idea was that to find out how to do content mining, you need to come to a workshop and we will show you the tools and so on. However, our tools have progressed, our documentation has progressed, and we've been delighted with how quickly people pick this up, so that we're getting an increasing number of people who are coming directly to the site and finding out how to do things. As a result, we are frantically writing documentation to cover this new community who come straight to us and use the site rather than workshops. So I believe you coined the phrase "the right to read is the right to mine", and personally I agree with this statement very strongly. Can you share some of your thoughts on why you think this is the case? There are ethical and moral aspects to this. First of all, I actually think that everybody should have a right to the published scientific literature. There was a case earlier this year where the Ministry of Health in Liberia said that they'd found a paper, 35 years old, which, if they had known about it, could have been used to prevent or limit the Ebola outbreak, and they simply hadn't been able to find it because it was behind a paywall. That's the most important thing. I would say there is a right to read the scientific literature. But having said that, if you have a right to read it, and by that I mean a legal right, then we believe you should have a legal right to mine it. Unfortunately, a lot of content owners, as I would call them, are contesting that and saying if you have a legal right to read it, then you have to pay us more money for permission to mine it, even though they hadn't even realized five years ago that this was a valuable thing to do. So they want to create a new market out of simply sitting on their established information. It's rather like people who make new works by taking clips of old movies or mashing up music: the original owners will say, okay, well, you've got to pay more money because you're doing more exciting things with it. So that's why we challenge it. The phrasing of this comes from an important protest movement in the 1930s in the UK, where people were not able to walk on the mountains of Scotland and England, so they went on a mass trespass. And their phrase was the right to roam; that became embedded in law in the UK, and I'm very pleased to see that. And the right to mine is becoming used in the discussions of legislators on copyright and content mining. Yeah, I was very disappointed when I left university; unbeknownst to me, I had benefited from all these payments that my university was taking care of, and as sort of an independent person I no longer had all the access I used to. Now I don't want to put you in the sort of awkward position of being asked to defend a point of view you don't agree with, but why don't we have more access to science? Is there any legitimate reason someone might think that paywalls need to stay in place? The only reason that anybody cites is that we need a well-funded publishing industry to communicate science properly.
Nobody except a very few argues that paywalls should be used to limit access to those people who need to know; we've had it argued that only doctors should be able to read the medical literature and that paywalls are a good way of limiting it. Paywalls are not set up to do this; they are there to raise revenue for the publishers, and they've never been used, and I don't think they ever should be used, as a way of controlling access. I should say that there are certain types of information where it's legitimate to control them, but not through paywalls. Those are things like human subject data, the sites of rare breeding animals, possibly things related to certain types of security problem and things like that, but generally there's no excuse for not making it available to anyone. And if, for example, you are a patient with a rare disease, you may very well know more about that disease by reading the literature than 99.9% of the medical profession. Yeah, and the Ebola example you gave is an especially striking one; it's sad to hear that human suffering is a result of these paywalls. You wrote, or were an author on, a paper that I'll link to in the show notes titled "Responsible Content Mining", where you outlined a number of really good best practices for people who plan to crawl or otherwise assemble large data sets. Can you share some of that advice? So content mining is meant to be a universally accepted legal operation. But there are some places where it isn't yet legal or where it's a grey area; in the UK it is now legal to carry out content mining for non-commercial research purposes. We want to stay within the law; this is not a war against publishers, but it is a challenge to those people who wish to limit us carrying out our legal rights. And the first thing is that we should have good web server etiquette. In other words, if we're going to read something from a publisher website, then we should give it a relatively low load. We shouldn't hit it with something that could be seen as a denial of service attack. In practice an interval of a few seconds is more than enough, because if you're doing this from a number of publishers you can alternate between publishers and so forth. In practice the amount of load on publisher servers is minute compared with, say, a single mention on Reddit or Slashdot, which, you know, can bring a publisher site down. So there's no excuse for the claim that mining will burn their servers out. The second is that we should have good scientific practice and good manners. So if we mine something from a site, we should say where it came from. And if there's a danger that this doesn't represent the purpose of the site properly, then we should go out of our way to make that clear. So for example, if we mine something from a site and, let us say, there are a few errors in what we do, particularly if we're doing OCR, then we should make it clear that this is not the definitive version, that it may contain up to, let us say, 1% of incorrect characters, something of that sort. In which case the readers know that we are not claiming that this is the precise article, so that if you needed to go into court to justify something, you know, copyright, plagiarism or something, you wouldn't do it with our copy, and so on. With collaborating publishers we have no problems, and I think good manners and scientific practice will ensure that. The final thing is observing copyright. Now, it is very hard to be 100% compliant, because copyright is so complex and it differs in every country.
It is unclear for many documents whether they are under copyright and, if so, what kind. So we take the view that, on copyright, there are toll access publishers, where copyright is non-permissive, and there are open access publishers, where copyright is permissive, normally CC BY. And if we have CC BY, then we can download and do more or less whatever we like with the paper, so long as, as I say, we respect the scientific process. When we are downloading something we are potentially infringing copyright, because we're copying it, even if we never make it available to anybody else. And in the UK we have the right to do this. Now we have to be careful that large amounts of, or indeed any, copyrighted material don't by mistake get published, because we're not in the business of being pirates. Although some of the publishers believe that mining is a threat in this direction, if the pirates wanted to copy the whole of this, they almost certainly have already done it. So it's a red herring to introduce piracy, but we would try and make sure that anything we do was either held to one side or, more likely, was actually only copied transiently, processed and then discarded. I'm glad we're talking about some of the legal challenges, because I think it's, in my opinion, ambiguous what is or is not permitted, and maybe some of that is because of technology changing so fast. But there's also a precedent, I would say, at least here in the United States: there was the case of Aaron Swartz, whom I would imagine you're familiar with, who was, yeah, a university student who was just trying to download and liberate, if I recall correctly, JSTOR papers, and very strict and stringent criminal charges were brought against him. So I don't know if that was done deliberately to say this is how we're going to handle things, but that's sort of the precedent. Do you think that people are trying to scare individuals away from data mining? Well, I hesitate to speak authoritatively on American law and justice. My own view is that this was deliberately over the top, and that it is unclear whether Swartz had committed any offense in fact. He was never brought to trial; he was arrested by, you know, the federal authorities, and not pursued through the civil courts or civil processes. And there's some suggestion that the prosecutor wanted to make a thing of this, either as an example or for their political career; I can't comment on that, I'm simply relying on what other people have said. But it is certainly, as I say, unclear whether Swartz committed a crime or not, or even whether he broke copyright. So I'll go on to the things that we have to be careful about. First of all, I don't think any of this constitutes a criminal offense. There are things to do with DRM which I believe are potentially criminal offenses, so if, for example, we try and break DRM on some of these sites, then it may be that that can be pursued by the government and authorities. I'm also speaking here generally, because laws in different countries are different, and you have to realize that this is incredibly complicated, because there are probably 100 different countries' jurisdictions to deal with and some of them are incredibly arcane. Generally there's a question of three pieces of law that you have to worry about. One is copyright. The second is contract law, and the third, which only holds in Europe, is the so-called sui generis database right. Let's start with copyright.
Copyright says that you do not have permission to copy a document unless you obtain the permission of the copyright owner. In some countries there is a doctrine of fair use which says that you can copy bits of it and so on, but even in those cases it can be difficult to know whether you have the right to copy it. We in ContentMine are only copying it for the purpose of mining it for research purposes, and we are only copying it temporarily, so we're not republishing copyrightable parts. Now, of course, nobody knows what is a copyrightable part of a document. We have the general doctrine that facts are not copyrightable, that's in the Berne Convention, so if I say the temperature outside is 22 degrees Celsius, that is not a copyrightable statement, because there's no other way of expressing it. If, on the other hand, I say "oh, what a lovely morning", I might be infringing Rodgers and Hammerstein. So you have to realize that we're sticking with factual material, which is not copyrightable. It's unclear, for example, whether an abstract of a paper is copyrightable; some publishers will probably claim it as copyright, so we do not reproduce abstracts by default, that sort of thing. Because it's unclear, what generally happens is that if a publisher challenges something, then they can use the DMCA and ask for a takedown, and this is granted automatically, and the alleged offender has to take the material down and has to argue their case as to why this is not in fact an infringement of the law. This, in my view, is vastly weighted towards the copyright owner; it's guilty until proved innocent, and in some countries like France, if you offend three times as alleged by the owner, then you can be banned from the Internet, the so-called Hadopi law. Yes, so it is a very difficult area; however, as I say, in the UK we have a legal right to do it. The only other country which has this right is Japan. In the US there are, I would say, de facto rights, but they're not necessarily legal rights, so fair use is something that you may well be able to argue, but Larry Lessig has said that it's the right to call a lawyer in your defense, you know, it's no stronger than that. It might very well mitigate your sentence or whatever. The second is contracts. Now, most universities have signed contracts with publishers which forbid text and data mining, and they have a phrase something like "notwithstanding X, you may not crawl, spider, index, bulk download, etc., etc., etc." And the universities have by and large signed these. Now, first of all, I think they've been highly irresponsible in doing this; they haven't brought it to public attention. It's against natural justice. And secondly, in the UK, the new legislation expressly says that that has no legal force, so in other words we are going to mine stuff from the University of Cambridge regardless of what has been signed with the publishers, and we have the university library on our side with this. We cannot be held to be breaking contracts, because Hargreaves has explicitly said that the new legislation overrides any contract. Exactly what it does with DRM'd material we don't know. And the final thing is sui generis, which is only applicable in the European Union. The sui generis database right was passed about 15 years ago, and it says that a database and its contents are protectable by this law, effectively copyrighting it, and this might mean that something like, you know, a collection of telephone numbers was copyrightable in Europe but not necessarily copyrightable in the US.
Again, I think it's a highly skewed law; I don't think it brings any benefit, and what it does is hold back progress. We argue, and this has not been tested in court to my knowledge, that a journal is not a database, you know, it is a collection of documentary material for other purposes, and is therefore not protectable by sui generis. I would suspect, just me guessing, that the reason for any protection like that would be that if, let's say, I spent a year of my time and went out and did field research and built up that data, I should have maybe some oversight over the data; not just that authors have submitted their papers to a publisher who now is kind of building a wall around them. Do you think that's a fair perspective on things? It's a very commonly held one, and I think that the conventional way of doing science is that if you collect data, you have a right to use that data until you've extracted anything useful out of it, and then you republish it. That view is under great threat at the moment, and rightly so. As one example, in the UK, a scientist at Queen's University Belfast had collected data on tree rings, dendrochronology, over 30 years. He had retired, and was asked for this data through freedom of information requests, and the university declined to release it. The Information Commissioner then overruled this and said that in fact the data belonged to the university and not to the scientist, and that the university had a duty to release this data, so you can see that the balance is changing. It's also true that funders are extremely keen that their data should be made available. In my opinion, the accessibility of research results should be dictated by the researchers themselves and perhaps to some degree by those that fund them. Yet it seems like the publishers have the most control in that situation. Do you find that to be correct, and also, what's your personal opinion about who should have control over access to scientific literature and data? Well, I am not so strong on saying that the control of access to research should rest with the researchers themselves, particularly if it's publicly funded. As you probably know, in this country we had a big public storm called Climategate, where people wanted climate data from the University of East Anglia to reanalyze, and the university declined to let them have it, and there was a great deal of bad feeling, emails which had unfortunate sentiments in them and so forth. Now, my view would be that if that project were being funded now, the funders would probably put much more stringent explicit requirements that data be made available. Yeah, that makes sense to me, and I think maybe as the open access movement takes off, we'll see projects being funded from the start that way, and perhaps that'll enable better collaboration and community efforts. One such project that's caught my attention is the Text2Genome project, which annotates the human genome with, as I understand it, papers relevant to specific areas. So if one researcher is kind of looking at one part of the genome, they might learn who else has published about that area, and I could see how this would be tremendously valuable to them. I think that can only be available due to data mining and that sort of annotation. Do you see that as a success in the same way I have, and are there other similar success stories you're aware of?
So I know the people involved in that, Max Haeussler, who's now at San Diego, and Casey Bergman, who's at Manchester, and yes, it's a very useful project. There are a lot of people who are looking for textual annotation of biomedical literature, roughly 50/50 between the new genome stuff and medicine in general. So lots of people want to analyze the literature to annotate genomes, because the sequence of the genome is known but not necessarily all the functions of it, what all the regions of it do and so on. And this is often described in free text in the literature, so that you get something like "this region regulates the expression of some protein" or whatever, and that's in textual format. So you want to be able to tie that protein to that region of the genome. That's one thing; at the other end, people want to look through medical records and clinical trials to come up with patterns of disease or treatment or whatever. So we're working with the clinical trials group OpenTrials to look at how we can help, and also with the Cochrane Collaboration, to see how we can help with systematic reviews of the medical literature, to pull out those pieces of papers on trials which are sufficiently valuable to be systematized into a resource. So I've noticed a lot of things, in particular the arXiv coming out of Cornell, that have been a great resource to me personally and, I think, to a lot of other people. My sense is that the sentiments that ContentMine has are starting to become much more popular and that we're seeing perhaps the beginning of an open access movement. That doesn't mean we don't have a long way to go, still liberating data from paywalls and walled garden communities, but do you think the scientific community in general is starting to be on the right track for open access? Some days I think yes, some days I think no. I think it's probably true to say that none of the main toll access publishers is interested in having all of their literature open access and having this as the mainstream approach. And they will find ways, I believe, to keep closed access as a critically important part of what they are selling. It's a model which they're familiar with, which they know how to operate, and at the moment what they're doing is generally making their glamour journals closed access. So, in bioscience this is Cell, Nature, and Science. I am quite sure those will remain closed for a long period. Most publishers have open access offerings, which are competent, nothing wrong with that, but they're not where most scientists will aim to publish their important results. So, I don't actually think that we're going to see universal open access anytime soon. Having said that, the meme is out there, the funders are very keen on open access, and they want people to publish in open access and make their stuff available. So, there's a conflict here. It's one where money is one of the most important things; the publishers have now got a market of about 15 billion, with a B, dollars a year. And that means that if we were to go open access, we have to switch that amount of money from the closed subscription mechanism to author-side funding, and that's going to be incredibly complicated. And I don't see anybody stepping up to cut that Gordian knot. So whether it'll slowly change, I don't know. It's an awful lot of money to shift. Also, with the subscription model, the publishers have the say in who gets what and how things are rated.
The publishers are selling reputation. This is not based on any intrinsic measure of reputation. It's based on counting citations, which is about as valuable as counting the number of notes in a piece of music to tell you how good it is. But that's how it's done, and they will want to keep that because it's cheap to operate and very lucrative. On the other hand, I think arXiv is wonderful. To give you an example, it costs a total of about $7 to publish a paper in arXiv. And for many purposes, that's sufficient to communicate results to the community. Now, it needs community comment, call it peer review, but I would call it post-publication peer review. And in my view, that's the ideal way to publish, where you've got a publication, and then the world adds on what it thinks about it. And the cost of publication is trivial. Compare that with Nature, where they say, well, it costs us $40,000 to publish a paper. So that's $40,000 against $7. Something is wrong. Yeah. So I want to get back a little bit more to ContentMine. I can really appreciate some of the challenges you guys have. I've tried to do some liberation of data myself and I know how difficult it can be. And that's only in the particular fields I know and I'm interested in. I can't imagine how difficult it must be to scale out to lots of different academic pursuits. So I'm curious about how the variety of data might affect the challenges you guys face, especially when you want to extract things that vary by field. Very good question indeed. We're concentrating on the published scholarly literature, which is about 1.5 million articles a year, somewhere around about a few thousand a day. And they're published by a huge variety of publishers, let's say 1,000, with probably about 15 major publishers publishing the bulk of that. So that's the mechanism. I would also say we mustn't forget an incredibly valuable resource, which is student theses. Students put a lot of work into their theses. They're heavily peer reviewed; any of us who have been examiners know that. Many of them are not reused. Now, they often contain huge amounts of unpublished data, or other data which is not published elsewhere, because in many cases, you know, the student leaves, the supervisor moves elsewhere and, you know, they just don't write up all the work that the student has done. So the thesis is often the primary record of that. So I'm very keen on doing theses, and I'm particularly keen on countries like the Netherlands and France, which have got all their theses in one place. First of all, discovery is one of the challenges. It's remarkable that even though we spend $15 billion a year on publishing, we don't have an index of the published literature. We don't have an open one. We have Thomson Reuters' Web of Science, which comes out of Current Contents many years ago, ISI, but it is selective. It doesn't cover much of the global south. So what we desperately need, and it seems to me almost trivial to create if we had the will, is actually a record of what has been published. The best that we have at the moment is CrossRef, probably, but CrossRef is a publisher-funded organization. And although I get on very well with the CrossRef people, they're always subject to the fact that they're dependent on publisher funding. I really do believe that the world's scholarly library community has a duty to make an index, a believable index of the world's scientific literature, not just from the rich north. So that's the first thing, discovery.
When you've discovered it, then you want to be able to search it. In time, ContentMine results will be used to help with that. But what we're doing in the first instance is something called getpapers, G-E-T-P-A-P-E-R-S, which is a tool by Richard Smith-Unna from our group, which goes to a collection of papers and allows you to ask a query through their API. So for example, we support arXiv. We support Europe PubMed Central, or NIH. We support repositories such as CORE. If you have a repository where you've got a lot of stuff which is valuable to you, then getpapers is the place to start. And that means, for example, if you go to Europe PubMed Central and ask for dinosaurs, you'll get a few hundred papers on dinosaurs, but the getpapers tool will wrap them up in a way which is ideal for the next phase of the process. Excellent. So I'll go through the other tools; okay, there are another five. The next tool is called quickscrape, again by Richard, and quickscrape allows you to put a DOI or, more normally, a URL into the system and download everything associated with this. So you'll remember I talked about supplemental data. You can go to a paper. I'm going to take PLOS ONE as an example, because it's got the largest number of open access papers. You can go to PLOS ONE with a URL and ask it to download the PDF, the HTML, the XML, the figures, and the supplemental data, all in one go without putting any more effort into it. And you could download, for example, all the papers published today, which would come to somewhat over a hundred, and it will wrap them all up again in exactly the same form that we need for the next phase of the process. The next tool is called norma. Now, norma shouldn't be necessary. What norma does is normalize the publishers' formats into a common semantic form; by semantic I mean machine processable, so that a machine can read it without having to be told how to do it. It shouldn't be necessary, because authors create something close to semantic information, but commercial publishers, whether they're toll access or open access, convert it into something that they feel is right for their purposes. They'll turn the Word into PDF. They'll turn the images into bitmaps. They'll spray the page with things about how wonderful we are, discover other papers, find out what your rating is on Twitter and so on, nothing to do with science at all. And we strip all that stuff off, but what we also have to do in norma is turn PDFs into HTML, which is hard and lossy, turn the bitmap images into SVG, scalable vector graphics, and normalize the text so that we have sections that we understand. Now this is not a hundred percent lossless, but in many cases it's pretty good. So when we've finished norma, we've actually got something which is fit for purpose. I wouldn't suggest that anybody do content mining unless they've got something which does the same function as norma, because otherwise they have to write different tools for every journal. But like all our tools it is open source, Apache 2, so anybody can download it, they can use it in commercial programs, they can use it for any lawful purpose. Before I go on to the next ones, I should say that you raised the question of whether every journal is different, and unfortunately it is. We've had to build this into quickscrape and norma, so quickscrape has a per-journal scraper. This is not quite as bad as it sounds, because many journals are owned by the same publisher.
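To make the first two of those steps concrete, here is a minimal sketch of driving getpapers and quickscrape from Python; this is not taken from ContentMine's documentation. Both are Node.js command-line tools installable with npm; the flag names below are recalled from their help output and may differ between versions, and the scraper file and article URL are placeholders rather than guaranteed paths.

import subprocess

def fetch_papers(query="dinosaurs", outdir="dinosaurs"):
    """Query Europe PubMed Central and download results into a project folder."""
    subprocess.run(
        ["getpapers",
         "--query", query,
         "--api", "eupmc",      # Europe PubMed Central
         "--outdir", outdir,
         "--xml"],              # also fetch full-text XML where available
        check=True)

def scrape_article(url, outdir="scraped"):
    """Download the PDF, HTML, XML, figures and supplementary data for one
    article URL, using a per-journal scraper definition."""
    subprocess.run(
        ["quickscrape",
         "--url", url,
         "--scraper", "journal-scrapers/scrapers/plos.json",  # placeholder path
         "--output", outdir],
        check=True)

if __name__ == "__main__":
    fetch_papers()
    # placeholder article URL, for illustration only
    scrape_article("https://journals.plos.org/plosone/article?id=PLACEHOLDER-DOI")

The --scraper argument points at exactly the kind of per-journal scraper definition just described, which is why one scraper can cover every journal that shares a publisher's format.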
So if we've done a scraper for one BMC paper we've done it for the lot, so that by writing a scraper for one journal you often get a few hundred that will use exactly the same format and therefore be scrapable. However, there are journals which do their own thing, and we have to write scrapers for them, and this is where the community comes in. This is being done through community activity: people who know the journal are interested in mining it, so therefore, you know, they have an interest in building a mining tool. The next tool is called ami. Now, ami is basically a workflow for discipline-specific plugins. So this provides an easy way of mining. There are lots of different things you might want to mine, and at the moment we've got a whole list: we can do a bag of words, a word cloud on the paper, which tells you what it's about. We can do regular expressions, which will find out where words are in a piece of text. We can do chemistry. We can do phylogenetics. We can do species. We can do sequences. We can do identifiers. We can do genes. There's a whole lot of things we can do. There's a lot of things we can't do, but we've built a platform where the only thing you need to worry about is your discipline, how you turn text into your material. And you don't have to worry about the housekeeping. That's the advantage of a framework. So that's ami. And then the results of ami go into CAT, which is a catalog based on Elasticsearch. And there we have several million data entries stored, and you can search these using a variety of tools based on faceted search, which is what Elasticsearch provides. And it also makes it very easy to start looking for concepts which co-occur: so which author is connected with this journal and with which subject. That's the sort of thing which you can ask relatively straightforwardly in Elasticsearch, in CAT. At that stage we have a tie-up with Wikidata, and we're going to offer our data to Wikidata in case they haven't got it. But we're also going to search Wikidata for data to give us another chance of validating whether what we've extracted is correct. And we're going to start doing that on species. So that's basically our framework. There's a sixth tool called Canary, which is a UI to run the whole process from a graphical user interface. So you can put in your own URLs and say, please run norma and ami and put the result into CAT, that sort of thing. And in terms of community, what opportunities exist for people to volunteer or to make contributions? We've just set up a community process and platform. Our community manager is Graham Steel, who's been incredibly active in patient support organizations, in the open access arena and so forth. And the way we're doing it is inviting people to set up communities. We're extremely keen on setting up communities, which will have a large or even complete degree of autonomy over how they use the ContentMine tools to mine the literature. We will provide general support for them in terms of developing the next level of tools and providing material for documentation, and we'll also be running workshops. So we've currently got the following groups who are interested in having a community: clinical trials, animal tests, chemistry, high-energy physics, phylogenetics, taxonomy, plants, neuroscience and crystallography. The requirement for a community is that there should be some person who's going to lead it with energy, so that we know that it's going to be there in a month, in two months, whatever.
That they've got the ability to pull other members of the community around them, and that they will keep up the interest by, let's say, running some sort of heartbeat process of regular mailings, possibly stand-ups and other things which people use to develop this. Obviously, they won't all progress at the same rate, but that's the general way, and we will support it with probably two-monthly catch-ups with Graham, myself and other technical people, helping them get over technical and other social problems. We're keen that some of them will apply for funding to do this, because there's a lot of interest in content mining, and there's a lot of interest in open data and in making data available. We think in some communities it will be possible to have grants which help support this, and ContentMine is very keen to partner with people, usually as a minor partner, to help develop this type of process. So you've successfully liberated a lot of new data; how do you go about making that available to other people? Well, what we're not going to do is create a huge dump of the data with literally a billion facts in it. First of all, we don't have the resources to do it. Secondly, we might be challenged by people who have, or think they've got, the right to that data. What we're actually going to do is download and process the papers every day and then publish the factual metadata that comes out of it on our website. Now, that data may not hang around for more than a week, but other people can access it and build their own resources. So, for example, we're pulling out species, and one of the things that we want to investigate is endangered species, so we might make a daily list of all mentions of endangered species and the facts associated with them. And then people interested in this, conservationists, might very well scrape those off our site every day or every week into their own database, and that's actually an ideal way of doing it. It means that we don't have ongoing maintenance problems in lots of domains. The data don't get lost, of course, because these are in the primary literature, and so we can, in principle, always go back and do it again. It's not like an experiment where you've lost a logbook. Making it available on a daily basis is the primary tool. We will be doing some of the things that we are particularly interested in ourselves and making small resources available, and also we're going to work very closely with Wikidata on the one hand, and on the other with a group in Zurich and Geneva called Plazi, which is doing the taxonomy, and that data is going to be stored in Zenodo, which is a free database run by CERN, the high-energy physics community. So we're looking for all sorts of ways that we can make the data reliably available for a reasonable amount of time without costing the community anything other than marginal costs. That makes a lot of sense, and I really like the approach that there will hopefully be other satellite organizations that see the value there and start to mirror that, and where it makes sense for them, those can become sort of tertiary resources that pick up and kind of maintain some of those data sets. Absolutely. So the other question you asked was about volunteers. Yes, we are very keen to have volunteers. There are two types of volunteers, one concerned with domain-specific tasks.
So we would see people coming from the taxonomy community who are interested in building the semantic resources for extracting taxonomic data from the literature. And on the other hand, we'd see people who are interested in general information extraction: people who are interested in natural language processing, tools to extract tables, analyzing diagrams and images, people who want to port this thing to different types of architecture. It doesn't have to run on Node and Java, which is what we use at the moment; if people want to port it to Python or something like that, that would be great, and so forth. Well, excellent. Peter, this has been a really enriching conversation. I want to thank you so much for coming on the show. I encourage listeners to go and check out contentmine.org. That and many other things will be linked to in the show notes, and there, I'm sure, you can learn more. You can find out about volunteering or joining one of those communities we talked about, or otherwise just learn about the benefits of the tools and the facts that ContentMine has extracted and catalogued. And until next week, I want to remind everyone to keep thinking skeptically of and with data. Thanks a lot. (upbeat music) (light music)