As presented at Great Wide Open on 02 April 2014 in Atlanta, GA
http://www.gwoapp.com/events/open-source-software-for-data-scientists
=========================
Harvard Business Review called it "the sexiest job of the 21st century." These days, data scientists face an onslaught of companies pitching products that promise to solve all their problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag? This talk will briefly discuss what data science is, argue why open source software is usually the right choice for data scientists, and examine some of the leading OSS tools for data science available today. Topics will include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials will be provided on the presentation's companion website: oss4ds.com
Text of Open Source Software for Data Scientists -- Great Wide Open 2014
Open Source Software for Data Scientists
Charlie Greenbacker, Director of Data Science
02 Apr 2014
Altamira Technologies Corporation 2014
Agenda
  What is a Data Scientist?
  Why use Open Source Software?
  Survey of Open Source Software Tools:
    Statistical Analysis
    Data Mining
    Machine Learning
    Natural Language Processing
    Social Network Analysis
    Data Visualization
About me: @greenbacker
  Theories: popular tripe
  Methods: sloppy
  Conclusions: highly questionable
photo: Columbia Pictures
Best reason for not finishing PhD
"A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking."
Paul Cooper, ITProPortal.com
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
[Diagram labels: Data Science; Computer Programming; Mathematics & Analytic Methodology; Distributed Computing & Big Data; Domain Knowledge & Communication Skills. Listed under Data Science: Statistical Analysis, Data Mining, Machine Learning, Natural Language Processing, Social Network Analysis, Data Visualization, etc.]
Why use Open Source Software?
"THERE ARE NO SILVER BULLETS."
photo: Karen (https://flic.kr/p/5njby2)
"IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT."
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
"BUDGETS DON'T SCALE."
photo: Valugi (http://bit.ly/1jrvVBC)
Survey of OSS Tools
Statistical Analysis
Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:
  Language & environment for statistical computing & visualization
  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more
  5000+ packages available in CRAN repository
Data Mining
Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:
  Data analysis workflow in Python
  DataFrame object for fast manipulation & indexing
  Tools for reading & writing data between formats
  Label-based slicing, indexing, and subsetting of data
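A minimal sketch of the DataFrame features listed on this slide (label-based slicing, subsetting, and reading and writing between formats), assuming pandas is installed. The column names and values are invented for illustration and are not from the talk.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame(
    {"city": ["Atlanta", "Boston", "Chicago"], "visits": [120, 85, 240]},
    index=["a", "b", "c"],
)

# Label-based slicing: a .loc label slice includes both endpoints.
subset = df.loc["a":"b", ["city", "visits"]]

# Boolean subsetting on a column.
busy = df[df["visits"] > 100]

# Write the frame out as CSV text, then read it back in.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
roundtrip = pd.read_csv(buf, index_col=0)
```

Writing to an in-memory buffer keeps the sketch self-contained; a file path would work the same way with to_csv and read_csv.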
Data Mining
Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:
  MPP query engine implemented on Hadoop
  Low latency, high concurrency SQL & BI queries
  Same interfaces as Apache Hive, but ~24x faster
  Written in C++; does not use MapReduce
Machine Learning
Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:
  ML library for Python built on NumPy, SciPy, matplotlib
  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
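As a rough illustration of the classification, preprocessing, and cross-validation support listed on this slide, the sketch below fits a k-NN classifier on the iris dataset bundled with scikit-learn. It assumes a recent scikit-learn release (the sklearn.model_selection module postdates this talk); the choice of dataset, k = 5, and 5 folds are arbitrary and not from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Preprocessing and classifier chained into one estimator, so that
# scaling is re-fit on the training portion of every fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Wrapping the scaler in a pipeline avoids leaking statistics from the held-out fold into the training step.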
Machine Learning + NLP
Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:
  Java-based Machine Learning for Language Toolkit
  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
  Efficient implementation of LDA for topic modeling
Natural Language Processing
Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:
  Natural Language Toolkit for Python
  Built-in support for dozens of corpora & trained models
  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
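A small sketch of the tokenization and stemming libraries mentioned on this slide, assuming NLTK is installed. It deliberately uses RegexpTokenizer and PorterStemmer because neither needs a separately downloaded corpus or model; the sample sentence is invented for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical input text, invented for illustration.
text = "Data scientists are running experiments and testing models."

# Regex-based tokenizer: keeps runs of word characters, drops punctuation.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Porter stemmer: strips common English suffixes from each token.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
```

Tagging and parsing, also listed on the slide, generally require trained models obtained through nltk.download, which this sketch leaves out.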
Natural Language Processing
Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:
  Suite of high-quality, Java-based NLP tools
  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
  Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:
  Extracts location names from text, resolves to gazetteer
  Employs context-based geospatial entity resolution
  ~75% accuracy, processes 1M documents per hour
  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:
  Python structures for graphs, digraphs, & multigraphs
  Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
  Provides standard graph algorithms & analysis metrics
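To make the "standard graph algorithms & analysis metrics" on this slide concrete, here is a minimal sketch that builds a small undirected graph and computes a few common measures, assuming NetworkX is installed. The node names and edges form a made-up social network, not data from the talk.

```python
import networkx as nx

# Hypothetical friendship network, invented for illustration.
G = nx.Graph()
G.add_edges_from(
    [
        ("ann", "bob"),
        ("bob", "cat"),
        ("cat", "dan"),
        ("bob", "dan"),
        ("dan", "eve"),
    ]
)

# Analysis metrics: each call returns a dict keyed by node.
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)

# A standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "ann", "eve")
```

For a directed network, nx.DiGraph could be used in place of nx.Graph with the same edge-adding calls.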
Social Network Analysis
Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:
  Network analysis and visualization package for Java
  Dynamic network analysis with temporal filtering
  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
Fusion, Analysis, and Visualization
Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:
  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
  Integrates structured data, text, images, video
  Cell-level security & access controls
  Live, shared collaborative workspaces
Final Thought
Save your $$$ for:
  People: salaries, training, etc.
  Resources: hardware, AWS, etc.
  Proprietary software: if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
open source software for data scientists oss4ds.com