Data Analytics And Analysis Support in Research Services
Kang Lee and Ben Rogers
2
Trends in Research Computing
• Traditional Needs Not Going Away
• Large Scale Data Analytics Growing Rapidly
• Changing Research Data Sets
• Collaborative Data Analytics
• Interactive Data & Method Publishing
3
Traditional Needs Not Going Away
4
Traditional Needs Not Going Away
[Figure: Total Research Storage (TB) by quarter, Q1 2011 through Q1 2018, on an axis from 0 to 10,000 TB. Series: Dedicated, HPC Infrastructure, RDSS, LSS.]
5
Large Scale Data Analytics Growing Rapidly
• What do I mean by data analytics?
  • Applied Statistics
  • Machine Learning
  • “Big Data”
• What is “large scale”? Data analytics that you can’t do efficiently on a standard desktop system.
6
How Do We Know It’s Growing?
• National GPU Resources Heavily Oversubscribed
• AWS Volta Spot Instances at $200/hour during SC17
• 59 People Attended Deep Learning Workshop
• MRI Grant with Deep Learning Focus – 35 Faculty Support Letters, $3.3 Million
• Support emails every week about R or Python
• One of the major areas of interest/discussion during Research Computing Council roadmap development
7
Applied Statistical Analysis On Large Data Sets
• Examples
  • Identifying significant trends in business data
  • Examining outcomes in epidemiological data
• Common Tools – R, Python
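As a rough illustration of "identifying significant trends", the sketch below runs a permutation test for a linear trend using only the Python standard library. The sales figures are made up, and the function names `pearson_r` and `trend_p_value` are hypothetical; in practice a statistics package in R or Python would usually do this work.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trend_p_value(values, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a linear trend over time."""
    rng = random.Random(seed)
    time_index = list(range(len(values)))
    observed = abs(pearson_r(time_index, values))
    shuffled = list(values)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling breaks any link between value and time.
        rng.shuffle(shuffled)
        if abs(pearson_r(time_index, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical quarterly sales figures (made-up numbers for illustration).
quarterly_sales = [102, 98, 110, 115, 109, 121, 127, 125, 134, 140, 138, 149]
p_value = trend_p_value(quarterly_sales)
print(f"permutation p-value for trend: {p_value:.4f}")
```

A small p-value suggests the upward drift is unlikely to be an artifact of ordering alone. The same idea scales to large data sets, where the repeated shuffles are the part that stops fitting on a desktop.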
8
Machine Learning
• Examples
  • Using Medical Imaging output to diagnose diseases.
  • Examining effectiveness of CAPTCHA in context of modern computer vision algorithms.
• Common Tools – Tensorflow, Caffe, Theano, Torch, scikit-learn, Python
9
“Big Data”
• Not gaining widespread traction (Hadoop, Spark, etc.)
• Campus Hadoop pilot use was ~90% coursework.
• Why?
  • Most structured research data sets are not large enough to require these tools on modern servers.
  • Disciplinary tools must support the new paradigm of data access and computation.
10
Changing Research Data Sets
• Multimodal Data
  • Data integration challenges
  • Larger data sets
• Passive Data Collection
  • Less controlled data collection (messier)
  • More missing data
• Data Reuse
  • May not have been collected with current purpose in mind
• Streaming Data
  • Desire for real-time analysis
  • Larger data sets
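Two of the points above, streaming data and missing data, can be sketched together. The example below is a minimal illustration under assumptions: missing readings arrive as `None` and are skipped rather than imputed, and the generator name `rolling_mean` and the readings are hypothetical.

```python
from collections import deque


def rolling_mean(stream, window=3):
    """Yield the mean of the last `window` valid readings after each one.

    Missing readings (None) are skipped rather than imputed.
    """
    recent = deque(maxlen=window)  # old readings fall off automatically
    for reading in stream:
        if reading is None:
            continue
        recent.append(reading)
        yield sum(recent) / len(recent)


# Made-up sensor readings with gaps, processed one at a time.
readings = [20.0, None, 22.0, 21.0, None, 25.0]
means = list(rolling_mean(readings))
print(means)
```

Because the generator keeps only the last few readings in memory, it can run against an unbounded stream. Whether skipping gaps is appropriate depends on the analysis.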
11
Collaborative Data Analytics and Interactive Data & Method Publishing
• Researchers increasingly want to collaborate directly on their data analytics in shared “electronic notebooks”.
• Researchers wish to be able to publish their work to the web with interactive mechanisms so that others can easily explore their results and/or data.
• Platforms - R Studio/Shiny, Jupyter/Jupyter Hub, Custom Web Applications
12
Challenges To Current Campus Services
• Exploratory Data Analytics requires interactivity
• Complex Software Stacks
  • Tensorflow
  • Spark/Hive
• Containers (Good & Bad)
• GPU costs
• Structured data store support
• Lack of some needed cloud integrations
• Lack of good service/funding models
• Data Analytics Training
• Interactive Data Analysis Environments
• Iowa Quantified Pilot Project
• Data Analytics Consulting Examples
• Social Media Analytics
14
Data Analytics Training
• Data analytics training workshops provided in 2017
• Data Science Institute Spring/Summer (Jan, Jun)
• Introduction to Python Data Analytics (Jun, Aug, Sep, Dec)
• Introduction to Python Data Analytics for the Tippie College of Business (Nov)
• NVIDIA Deep Learning Institute (Jul)
• Web Scraping with Python (Oct)
• XSEDE Big Data Workshop (May, Dec)
15
Data Analytics Training
[Figure: Turnouts by Department, chart on an axis from 0 to 10, annotated “A long tail”.]
16
Data Analytics Training
• Three dimensions of data analytics training
• Skill – data collection, refinement, exploration, modeling, visualization, publication
• Tool – widely-used open-source data analytics tools such as Python and R
• Level – introductory, intermediate, advanced
• We’re trying to meet the rising demand for a variety of data analytics training opportunities from a wide range of disciplines on campus
• UI3 provides regular coursework on informatics and data analytics
17
Data Analytics Training
• Big picture of the direction of data analytics training
[Diagram labels: General Introduction; Applied Statistics; Machine Learning; Deep Learning; Introductory Level; Intermediate Level; Advanced Level; Domain/Problem-Specific Topics.]
18
Interactive Data Analysis Environments
• Jupyter Notebook vs. Jupyter Hub
• RStudio Desktop vs. RStudio Server
• Running on a desktop vs. running on a server (or in the cloud)
• Web-based
19
Interactive Data Analysis Environments
• Jupyter Notebook vs. Jupyter Hub
• RStudio Desktop vs. RStudio Server
• Useful for individual work vs. useful for teamwork, teaching and publication
20
Interactive Data Analysis Environments
Funded by NSF (National Science Foundation)
• XSEDE: “XSEDE is a single virtual system that scientists can use to interactively share computing resources, data and expertise. People around the world use these resources and services — things like supercomputers, collections of data and new tools — to improve our planet.”
• Jetstream: “Jetstream, led by the Indiana University Pervasive Technology Institute (PTI), adds cloud-based, on-demand computing and data analysis resources to the national cyberinfrastructure.”
21
Interactive Data Analysis Environments
Quick demo of Jetstream Portal & Jupyter Hub
22
Interactive Data Analysis Environments
• What’s good for instructors?
  • They can easily create their own training environment, datasets and class materials and share them with trainees
    • Have all trainees start with the same environment
    • Minimize the time needed to tackle technical issues from each computer
  • The interactive environment is particularly useful for designing hands-on practice
  • They can have easy control over computing resources allocated to trainees
• What’s good for trainees?
  • They can benefit from the powerful computing resources of servers and won’t have to care about the computing power of their own computers
23
Interactive Data Analysis Environments
• Some drawbacks
  • Instructors need at least basic knowledge of server administration
    • How to set up a Linux/Windows server
    • How to create user accounts
    • How to install software
    • How to monitor system resources
  • Cloud services could be unavailable when you need them
    • Due to scheduled maintenance
    • Due to unexpected hardware failure
24
Iowa Quantified Pilot Project
• A framing question: “What would you do with 10,000, ~$10 wireless sensors?”
25
Iowa Quantified Pilot Project
A Cloud-Based Scientific Gateway for Internet of Things Data Analytics
26
Iowa Quantified Pilot Project
A Cloud-Based Scientific Gateway for Internet of Things Data Analytics
“Internet of Things” refers to the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity which enables these objects to connect and exchange data (from https://en.wikipedia.org/wiki/Internet_of_things)
27
Iowa Quantified Pilot Project
A Cloud-Based Scientific Gateway for Internet of Things Data Analytics
“Scientific Gateway” refers to an interface designed specifically to support a particular type of scientific research, with an emphasis on supporting the entire scientific process from start to finish (from https://kb.iu.edu/d/auwv)
28
Iowa Quantified Pilot Project
A Cloud-Based Scientific Gateway for Internet of Things Data Analytics
“Cloud-Based” means fully implemented in the AWS (Amazon Web Services) ecosystem using AWS IoT, Amazon S3, Amazon Elasticsearch Service, AWS Lambda, etc.
29
Iowa Quantified Pilot Project
• Sensor deployment
  • 20 volt solar panel
  • IP56 rugged enclosure
  • 12 volt deep cycle marine battery
  • 12 volt to 5 volt DC/DC step-down converter
  • Raspberry Pi 2 Model B
  • GSM cellular data modem
  • Arduino with LoRaWAN module
  • LoRaWAN data read over serial connection with Python code.
  • Python handles all gateway tasks and MQTT communication with
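The gateway role described above (read sensor lines from a serial connection, forward them over MQTT) might look roughly like the sketch below. It is not the project's code. The line format `sensor_id,temperature_c,humidity_pct`, the topic name, and the function names are all assumptions, and the serial port and MQTT client are replaced by a plain list and an injected `publish` callable so the example runs with the standard library alone.

```python
import json
import time


def parse_reading(line):
    """Parse one serial line of the assumed form 'sensor_id,temperature_c,humidity_pct'."""
    sensor_id, temperature, humidity = line.strip().split(",")
    return {
        "sensor_id": sensor_id,
        "temperature_c": float(temperature),
        "humidity_pct": float(humidity),
    }


def forward(lines, publish, topic="sensors/telemetry", clock=time.time):
    """Publish each well-formed line as JSON; return the count of skipped lines."""
    skipped = 0
    for line in lines:
        try:
            reading = parse_reading(line)
        except ValueError:
            skipped += 1  # wrong field count or non-numeric value
            continue
        reading["timestamp"] = clock()
        publish(topic, json.dumps(reading))
    return skipped


# Stand-ins for the serial port and the MQTT client (both hypothetical here).
sent = []
skipped = forward(
    ["node-01,21.5,40.0\n", "garbled\n"],
    publish=lambda topic, payload: sent.append((topic, payload)),
    clock=lambda: 0.0,
)
print(sent, skipped)
```

On real hardware, `lines` would be an open serial port and `publish` would be the publish method of an MQTT client library. Keeping both as parameters lets the parsing logic be tested without either.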
30
Iowa Quantified Pilot Project
• Architecture
[Diagram: Amazon Web Services (AWS) Ecosystem. Labeled components: IoT Things (Farm Telemetry, Wind Telemetry, Any IoT Things); Device Gateway (MQTT, HTTP, Any Protocol); AWS IoT (Message Broker, Rules Engine); Message Stream Processing (AWS Lambda); Database (Amazon Elasticsearch, Elasticsearch Index); Data Warehouse (Amazon S3 Bucket); Monitoring Dashboards (Kibana, Amazon Elasticsearch); Analytics (Amazon EC2, Jupyter Notebook, R Studio). Flow labels: Analysis, Backup, Monitoring, In-Depth Analysis.]
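The message stream processing step in this architecture can be pictured as a small function that receives one telemetry message and shapes it into a document for the search index. The sketch below follows the usual `handler(event, context)` shape of a Lambda function, but the event fields (`device`, `kind`, `value`, `epoch_s`) and the index naming scheme are assumptions made for illustration. It returns the document instead of writing it anywhere.

```python
from datetime import datetime, timezone


def handler(event, context=None):
    """Turn one telemetry message into an index name and a document.

    The event fields ('device', 'kind', 'value', 'epoch_s') are hypothetical.
    """
    when = datetime.fromtimestamp(event["epoch_s"], tz=timezone.utc)
    document = {
        "device": event["device"],
        "value": float(event["value"]),
        "@timestamp": when.isoformat(),
    }
    # One index per telemetry kind and month (an assumed naming scheme).
    index = f"{event['kind']}-{when:%Y.%m}"
    return {"index": index, "document": document}


# A made-up wind telemetry message.
result = handler({"device": "wind-07", "kind": "wind", "value": "5.2", "epoch_s": 1514764800})
print(result)
```

Normalizing types and timestamps at this stage is what lets the dashboards and notebooks downstream query farm and wind telemetry in the same way.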
31
Iowa Quantified Pilot Project
Quick demo of Kibana Dashboard
32
Iowa Quantified Pilot Project
• Future directions
  • Get the software infrastructure organized so that it is structured as a service on campus
  • Seek projects with funding that can support the infrastructure
  • Investigate the possibility of an institute that can help faculty develop their own sensors
33
Data Analytics Consulting Examples
• Consultations I’ve provided in 2017
Problem: Detail
• Data collection: Web scraping of news articles; Web scraping of academic papers; Web scraping of product information; Twitter data collection
• Data handling: Data extraction from the cloud
• Modeling: Student survey data classification
• Insight development: Awarded grant analysis; Data analysis strategies; Monitoring target audience on social media; Text analytics support for UI researchers; Network/topic analysis
• Software: Parallelization in Python; Variable scoping in MATLAB; Running Jupyter Notebook on HPC; Building a webserver for data visualization
34
Social Media Analytics
• Social media analysis depends heavily on data collection & management
  • Web scraping vs. API (Application Programming Interface)
  • A dedicated server needed for continuous data collection
  • Responsibility for the data falls on the user once it is collected
• Types of analytics
  • Statistics for understanding numbers
  • Text analytics for understanding text
  • Network analytics for understanding user/keyword networks
  • Geospatial analytics for understanding geographical or spatial characteristics
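The web scraping versus API distinction above comes down to where the structure comes from. In the sketch below, the HTML page, the `headline` class name, and the JSON response are all made up. Scraping has to recover a field from presentation markup, while an API hands over the same field already structured.

```python
import json
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self._in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline:
            self.headlines[-1] += data


# Scraping: recover structure from presentation markup (hypothetical page).
page = '<div><h2 class="headline">Flood warning issued</h2><p>Details</p></div>'
parser = HeadlineParser()
parser.feed(page)
scraped = parser.headlines

# API: the same information arrives already structured (hypothetical response).
response = '{"articles": [{"headline": "Flood warning issued"}]}'
from_api = [article["headline"] for article in json.loads(response)["articles"]]

print(scraped, from_api)
```

The scraper breaks whenever the page layout changes, which is one reason an API is usually preferable when one is offered and its terms allow the intended use.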
35
Social Media Analytics
• Social Media Interest Group
  • A diverse group with 9 faculty members, 3 staff members, and 1 graduate student who are interested in social media
  • Gathers once a month to share information and look for collaboration opportunities
  • Anyone interested is welcome to join
• Computational Psychiatry Interest Group
  • A diverse group led by Prof. Jacob Michaelson at Psychiatry with a number of faculty members and MDs
  • Focused on computational aspects of psychiatry
  • Gathers once a month to share information and look for collaboration opportunities
36
Social Media Analytics
• Twitter analysis
  • My own tweet stream collection for more than two years
  • +100K random English tweets (+100GB) per month
  • Contact me if you need to do quick analysis using Twitter data
  • We’re looking to provide Twitter data as a service on campus
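A quick analysis over a collected tweet stream often starts with a language filter, a keyword filter, and some counting. The sketch below assumes a simplified JSON-lines layout with only `lang` and `text` fields. The tweets are made up, and the hashtag co-occurrence count is just one simple form of keyword network analysis.

```python
import json
from collections import Counter
from itertools import combinations

# Made-up tweets in a simplified JSON-lines layout; real records carry many more fields.
raw_lines = [
    '{"lang": "en", "text": "Quitting #smoking is hard #addiction"}',
    '{"lang": "es", "text": "Hola #addiction"}',
    '{"lang": "en", "text": "New study on #addiction and #recovery"}',
    '{"lang": "en", "text": "Great game tonight"}',
]


def hashtags(text):
    """Return the sorted, lower-cased set of hashtags in a tweet."""
    return sorted({word.lower() for word in text.split() if word.startswith("#")})


# Keep English tweets that mention the keyword of interest.
english = [t for t in map(json.loads, raw_lines) if t["lang"] == "en"]
matching = [t for t in english if "addiction" in t["text"].lower()]

# Text analytics: how often each hashtag appears.
tag_counts = Counter(tag for t in matching for tag in hashtags(t["text"]))
# Network analytics: which hashtags appear together in the same tweet.
tag_pairs = Counter(pair for t in matching for pair in combinations(hashtags(t["text"]), 2))

print(tag_counts.most_common())
print(tag_pairs.most_common())
```

The pair counts can be read as weighted edges of a keyword network, which a graph library could then lay out or cluster.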
37
Social Media Analytics
Quick demo of a Twitter analysis example on addiction