SciDataCon 2014 TDM Workshop Intro Slides

Text and Data Mining (TDM) SciDataCon 2014 Workshop

Jenny Molloy (@jenny_molloy) | Puneet Kishor (@punkish)

https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014

1982

“Automatically generating logical representations of text passages... by means of an analysis of the coherence structure of the passages.”

Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833

1999

“(semi)automated discovery of trends and patterns across very large datasets”

“Use of large online text collections to discover new facts and trends...”

“(Automating) the tedious parts of the text manipulation process and (integrating) underlying computationally-driven text analysis with human-guided decision making within exploratory data analysis over text”

Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 3-10. DOI=10.3115/1034678.1034679 http://dx.doi.org/10.3115/1034678.1034679

2008

“The use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature.”

Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.

What is MINING?


What is CONTENT?

● Images

● Photos

● Graphs

● Figures

● Captions

● Sound

● Video

● Tables

● Datasets

● Supplementary information

● Metadata

● Text


101 uses for content mining (nearly)...

Which universities in SE Asia do scientists from Cambridge work with? (We get asked this sort of thing regularly by Vice-Chancellors.) By examining the list of authors of papers from Cambridge and the affiliations of their co-authors we can get a very good approximation. (Feasible now.)
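The co-author question above can be sketched in a few lines. This is only a minimal illustration: it assumes each paper has already been reduced to a list of (institution, country) pairs, and the `papers` structure, the function name `partner_institutions` and the exact-match test on the institution name are all hypothetical choices, not part of any particular tool.

```python
from collections import Counter

# Countries conventionally counted as South-East Asia.
SE_ASIA = frozenset({
    "Brunei", "Cambodia", "Indonesia", "Laos", "Malaysia", "Myanmar",
    "Philippines", "Singapore", "Thailand", "Timor-Leste", "Vietnam",
})

def partner_institutions(papers, home="University of Cambridge", countries=SE_ASIA):
    """Count papers that the home institution shares with each partner in `countries`.

    `papers` is assumed to be a list of dicts, each with an "affiliations" key
    holding (institution, country) tuples already extracted from the paper.
    """
    counts = Counter()
    for paper in papers:
        affiliations = paper["affiliations"]
        # Skip papers with no author from the home institution.
        if not any(inst == home for inst, _ in affiliations):
            continue
        # Use a set so each partner is counted once per paper.
        partners = {
            inst for inst, country in affiliations
            if inst != home and country in countries
        }
        counts.update(partners)
    return counts
```

In practice the hard part is normalising affiliation strings before this step; the counting itself is simple.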

Which papers contain grayscale images which could be interpreted as gels? A polyacrylamide gel (http://en.wikipedia.org/wiki/Polyacrylamide_gel) is a universal method of identifying proteins and other biomolecules.

[Image: a typical gel (Wikipedia, CC-BY-SA)]

Find me papers in subjects which are (not) editorials, news, corrections, retractions, reviews, etc. Slightly journal/publisher-dependent but otherwise very simple.

Find papers about chemistry in the German language. Highly tractable. A typical approach would be to find the 50 commonest words in a paper (e.g. “ein”, “das”, …) and show that their frequencies are very different from those in English (“one”, “the”, …).
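A rough sketch of the common-word approach follows. The two word lists are tiny illustrative samples and `guess_language` is a hypothetical helper; a real system would use larger curated lists or an established language-identification library.

```python
import re
from collections import Counter

# Small illustrative lists of very common words (assumptions, not curated data).
GERMAN_COMMON = {"der", "die", "das", "und", "ein", "eine", "ist", "nicht", "mit", "von", "zu", "den"}
ENGLISH_COMMON = {"the", "and", "of", "to", "a", "in", "is", "that", "with", "for", "one", "not"}

def guess_language(text, top_n=50):
    """Return "de", "en" or "unknown" from the text's most frequent words."""
    # Runs of letters only (Unicode-aware, so umlauts are kept).
    words = re.findall(r"[^\W\d_]+", text.lower())
    top = {word for word, _ in Counter(words).most_common(top_n)}
    german_hits = len(top & GERMAN_COMMON)
    english_hits = len(top & ENGLISH_COMMON)
    if german_hits == english_hits:
        return "unknown"
    return "de" if german_hits > english_hits else "en"
```

Comparing set overlap rather than full frequency distributions keeps the sketch short; it is a simplification of the frequency comparison described above.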

Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial to extract references and authors. It is more difficult, of course, to disambiguate them.

Find uses of the term “Open Data” before 2006. Remarkably, the term was almost unknown before 2006, when I started a Wikipedia article on it.

Find papers where authors come from chemistry department(s) and a linguistics department. Easyish, assuming the departments have reasonable names and you have some aliases (“Molecular Sciences”, “Biochemistry”, …).
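The alias idea can be sketched as plain substring matching on affiliation strings. The alias tuples and both function names below are hypothetical, and the sketch assumes the affiliations have already been extracted from each paper.

```python
# Hypothetical alias lists; extend them for real department names.
CHEMISTRY_ALIASES = ("chemistry", "molecular sciences", "biochemistry")
LINGUISTICS_ALIASES = ("linguistics",)

def mentions(affiliation, aliases):
    """True if any alias occurs in the affiliation string (case-insensitive)."""
    text = affiliation.lower()
    return any(alias in text for alias in aliases)

def spans_chemistry_and_linguistics(affiliations):
    """True if at least one affiliation matches each of the two alias lists."""
    has_chemistry = any(mentions(a, CHEMISTRY_ALIASES) for a in affiliations)
    has_linguistics = any(mentions(a, LINGUISTICS_ALIASES) for a in affiliations)
    return has_chemistry and has_linguistics
```

Substring matching will also accept names such as “Computational Linguistics”, which is probably what is wanted here, but it does nothing to handle abbreviations or misspellings.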

Find papers acknowledging support from the Wellcome Trust. (So we can check for OA compliance…)
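As a minimal sketch, funder acknowledgements can be found with a regular expression over the full text. The `papers` mapping (identifier to text) and the function name are assumptions for illustration; a real pipeline would restrict the search to the acknowledgements or funding section.

```python
import re

# Matches "Wellcome Trust" in any letter case, tolerating line breaks between the words.
FUNDER_PATTERN = re.compile(r"\bWellcome\s+Trust\b", re.IGNORECASE)

def papers_acknowledging_funder(papers, pattern=FUNDER_PATTERN):
    """Return the sorted identifiers of papers whose text matches the funder pattern.

    `papers` is assumed to be a dict mapping a paper identifier to its text.
    """
    return sorted(pid for pid, text in papers.items() if pattern.search(text))
```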

Find papers with supplemental data files. Journal-specific but easily scalable.

Find papers with embedded mathematics. Lots of possible approaches. Equations are often set off by whitespace, and the text contains non-ASCII characters (e.g. Greek letters, script letters, aleph, etc.). Heavy use of sub- and superscripts. A fun project for an enthusiast.
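One of the possible approaches, counting non-ASCII mathematical characters, might look like the sketch below. The scoring rule and the name `math_score` are illustrative assumptions; choosing a threshold for “contains mathematics” is left open, and whitespace layout is not considered.

```python
import unicodedata

# Substrings of Unicode character names treated as mathematical markers.
# "SCRIPT" also covers "SUPERSCRIPT" and "SUBSCRIPT".
NAME_MARKERS = ("GREEK", "SCRIPT", "ALEF SYMBOL")

def is_math_char(ch):
    """True for non-ASCII math symbols, Greek or script letters, aleph, sub/superscripts."""
    if ord(ch) < 128:
        return False
    if unicodedata.category(ch) == "Sm":  # Unicode "Symbol, math" category
        return True
    name = unicodedata.name(ch, "")
    return any(marker in name for marker in NAME_MARKERS)

def math_score(text):
    """Fraction of non-whitespace characters that look mathematical."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_math_char(ch) for ch in chars) / len(chars)
```

Ranking papers by this score and inspecting the top of the list by hand is probably a more realistic use than a hard cut-off.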