Leveraging Artificial Intelligence and Big Data to Create Value
Director, INSITE Center for Business Intelligence and Analytics
Anheuser-Busch Professor of MIS, Entrepreneurship & Innovation
Professor of Computer Science
Eller College of Management
Email: [email protected]
Dr. Sudha Ram
August 19, 2020
EROSS-2020
BIG DATA: From Petabytes to Zettabytes
Meaning of “BIG”
Big Data – Traditionally Defined
• VOLUME
• VARIETY
• VELOCITY
• VERACITY
• VALUE
Diverse Sources of Data
Many Different Sources generating Data
An Internet Minute
PARADIGM SHIFT!
“Datafication” of the world
Sensors embedded in Physical Objects
IP-based communication
Health Internet of Things
Paradigm Shift
Temporal and Spatial Dimensions
Billions of Users and Objects
Leaving Massive Traces of Activity
“Laboratory” for understanding the pulse of humanity
QUEST for the HOLY GRAIL
Predicting the Future
INSITE Center for Business Intelligence and Analytics
• Interdisciplinary Research Center at University of Arizona
• www.insiteua.org
Creating a Smarter/Better World
• Data Science and Network Science
• Visualizations Using Time and Space
• Scalable techniques for network analysis and graph mining
• Predictive Modeling
• Train students in data science
• Work on interesting research projects with industry partners to solve real-world problems
RESEARCH PROJECTS
• Health Care
• Education
• News Media/Journalism
• Crowdfunding
• Crowdsourcing
• Internet of Things and Wearable Devices
• Social Media
SOCIAL IMPLICATIONS
Leveraging Data Science
• Define a problem/challenge
• Identify signals
• Use data science methods
• Solve the problem
Repurposing Data is Key
PREDICTION MODELS
Predict Emergency Department Visits in near Real Time Using Big Data
Freshman Retention Prediction
COVID-19 Research
Leverage Big Data
Big Data is not just about volume
• Social media
• Internet search
• Environmental sensors
• Wearable sensors
• Spatial and Temporal Dimensions
• Fine Grained - Spatial/Temporal
Focus on Asthma
• 25 million people affected in the United States
• 2 million emergency department (ED) visits
• 0.5 million hospitalizations
• 3,500 deaths
• 50 billion dollars in medical costs annually
• 11 million missed school days every year
• 14 million missed work days every year
Source: CDC Reports (2011, 2012)
Pediatric asthma ER Visits, USA, 2011
Our Research Objective
Develop robust models to predict asthma-related Emergency Department visits in near real time using Big Data
Partner: Parkland Center for Clinical Innovation
Joint work with Wenli Zhang, Dr. Yolande Pengetenze, and Max Williams, funded in part by Parkland Center for Clinical Innovation
EXTRACTING SIGNAL from Noisy Data
True asthma-related tweets vs. tweets not actually related to asthma
Asthma Related Tweets
Asthma Keywords
Asthma
Inhaler
Sneezing
Runny Nose
Wheezing
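A keyword list like the one above can serve as a first-pass filter on a tweet stream. The sketch below is only an illustration: the keywords come from the slide, but the matching rules (case-insensitive, whole words only) and the function names are assumptions, not the project's actual pipeline.

```python
import re

# Keywords taken from the slide; the matching rules are an assumption.
ASTHMA_KEYWORDS = ["asthma", "inhaler", "sneezing", "runny nose", "wheezing"]

# One case-insensitive pattern with word boundaries (whole-word matches only).
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ASTHMA_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matched_keywords(text):
    """Return the sorted, de-duplicated keywords that appear in a tweet."""
    return sorted({m.group(1).lower() for m in _PATTERN.finditer(text)})


def filter_stream(tweets):
    """Keep only tweets that mention at least one keyword."""
    return [t for t in tweets if matched_keywords(t)]
```

A filter like this only narrows the stream; it cannot tell whether a matching tweet is actually about asthma.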
Asthma-Related Stream
Twitter Asthma Stream - United States
Asthma-related tweets, United States (Asthma stream, 11 Oct 2013 – 31 Dec 2013)
Extracting Signals
Distinguish tweets that are relevant to asthma from tweets that mention asthma in an irrelevant context.
1. Tweets indicating awareness of the disease, e.g., “Hope I don’t get an asthma attack again today..”
2. Tweets using the disease as rhetoric, e.g., “He is so cute I think I got asthma”
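The slides do not say which classifier was used to separate the two kinds of tweets. As a hedged illustration of the idea, here is a tiny multinomial Naive Bayes text classifier in plain Python. The first and fourth training tweets are the slide's own examples; the other four, the label names, and the choice of Naive Bayes are assumptions made for this sketch.

```python
import math
import re
from collections import Counter

# Training tweets: items 1 and 4 are from the slide, the rest are hypothetical.
TRAIN = [
    ("hope i don't get an asthma attack again today", "relevant"),
    ("my asthma is acting up need my inhaler", "relevant"),
    ("asthma attack last night could not breathe", "relevant"),
    ("he is so cute i think i got asthma", "irrelevant"),
    ("this song is so good it gave me asthma", "irrelevant"),
    ("so cute i need an inhaler", "irrelevant"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.word_counts = {}          # label -> Counter of word frequencies
        self.doc_counts = Counter()    # label -> number of training tweets
        for text, label in examples:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus summed log likelihoods of the tweet's words.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAIN)
```

With six training tweets this is a toy; a real system would need a much larger labeled set and proper evaluation.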
Emergency Room Visits and Tweets
Air Quality Sensor Data
• Identify and include AQI data from a specific geographic region.
• Collected pollution data from 27 air quality sites around the Dallas area.
• Selected sites closest to the zip codes of the ED asthma patients in our ED visits dataset. Using this data, we calculated daily average AQI for our model.
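The site selection and daily averaging described above could be sketched as follows. This is an illustration under stated assumptions: the slides do not give the selection rule, so the code assumes one nearest site per zip-code centroid using great-circle distance. The function names and the input layouts (`zip_centroids`, `sites`, `readings`) are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_sites(zip_centroids, sites):
    """Return the set of site ids that are nearest to at least one zip code.

    zip_centroids: {zip_code: (lat, lon)}   (hypothetical layout)
    sites:         {site_id: (lat, lon)}    (hypothetical layout)
    """
    selected = set()
    for lat, lon in zip_centroids.values():
        selected.add(min(sites, key=lambda s: haversine_km(lat, lon, *sites[s])))
    return selected


def daily_average_aqi(readings, selected):
    """Average AQI per day over the selected sites.

    readings: iterable of (date_string, site_id, aqi_value) tuples.
    """
    by_day = defaultdict(list)
    for day, site_id, aqi in readings:
        if site_id in selected:
            by_day[day].append(aqi)
    return {day: mean(values) for day, values in by_day.items()}
```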
Pollutants
• CO: Carbon monoxide
• NO2: Nitrogen dioxide
• O3: Ozone
• Pb: Lead
• PM2.5: Atmospheric particulate matter, diameter of 2.5 micrometres or less
• PM10: Atmospheric particulate matter, diameter of 10 micrometres or less
• SO2: Sulfur dioxide
EPA Pollution Sensor Data and Emergency Visits
Prediction Models Using Streaming Data
• Air Quality Sensor data streams
• Tweets
• Google Trends search data
• Machine Learning Techniques to predict number of ED visits per day with high accuracy
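Before any learning method can be applied, the streams listed above have to be aligned on a common daily key. The sketch below shows one way to do that, assuming each stream has already been reduced to a `{date: value}` dictionary. The function name, field names, and the choice to keep only days present in every stream are assumptions; the slides do not describe this step or name the learning method.

```python
def build_daily_rows(tweet_counts, aqi, search_index, ed_visits):
    """Inner-join four {date: value} dicts into one row per day.

    Only days present in every stream are kept, so each row has a full
    feature set plus the target (the ED visit count for that day).
    """
    days = sorted(set(tweet_counts) & set(aqi) & set(search_index) & set(ed_visits))
    return [
        {
            "date": d,
            "tweets": tweet_counts[d],
            "aqi": aqi[d],
            "search": search_index[d],
            "ed_visits": ed_visits[d],
        }
        for d in days
    ]
```

The resulting rows can then be passed to whichever regression or classification method is chosen.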
Best Predictors
Successfully predicted with 80% accuracy
• Number of asthma tweets
• CO
• NO2
• PM2.5
USEFUL for Public Health NOTIFICATION
I. Epidemiologic surveillance of asthma disease activity in the community, e.g., by the Department of Health and Human Services (DHHS)
II. Stakeholder notifications of community-level asthma disease activity and risk factors
Hospital/ED Preparedness
Predicting asthma ED visits and staffing the ED accordingly
Targeted Patient Interventions
Targeted patient interventions using patient address and geolocation data for tweets, e.g., patient alerts about asthma risks and counseling on preventive methods.
Contributions
• Promising results
• Demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases
• Specific focus on asthma
• Relevant for other chronic conditions: diabetes, cardiac problems, obesity
Internet of Things and Big Data
Big Data for Improving Education
Internet of Things: Smart Cards, WiFi Logs, Mobile Apps
BUILDING A SMARTER CAMPUS
Combining Network Science and Machine Learning
Societal Challenge: Student Retention
Proactive prediction is very important
Social science theories indicate:
• Social Interactions
• Regularity of Routine
Objective
• Predict freshman retention at the individual level
• Make proactive predictions before knowing first-term GPA
• Learn students’ behavioral patterns from their CatCard transactions
• Provide actionable suggestions for retention management
BIG DATA
Institutional Student Dataset
• ~7,000 full-time registered freshmen; 6,500 remain after removing international students for whom SAT scores or high school GPAs were not available
• 479 (7.37%) dropped out after Fall and 843 (12.98%) dropped out at the end of Spring
SmartCard Transaction Dataset
• 1.8 million transactions made by freshmen from Aug 2012 through May 2013
• 271 different locations, including restaurants, vending machines, printers, parking, and labs
Behavior and Interactions
Patterns and Differences
Movement and Behavior
COMPUTATIONAL and NETWORK SCIENCE APPROACH
Fills gaps in behavioral and extant data-driven approaches
New prediction approach: CatCard transactions reveal implicit social networks and spatial sequences
Proactive prediction: predicting retention before the end of the 1st semester with 90% recall
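One way to derive implicit social networks and spatial sequences from card transactions is sketched below. The slides do not specify how the networks were built, so the co-occurrence rule here is an assumption: two students are linked each time they transact at the same location within a short time window. The `(student_id, location_id, unix_timestamp)` layout and the 60-second default are also hypothetical.

```python
from collections import Counter, defaultdict


def build_cooccurrence_network(transactions, window_seconds=60):
    """Return weighted edges {(student_a, student_b): co-occurrence count}.

    transactions: iterable of (student_id, location_id, unix_timestamp).
    """
    by_location = defaultdict(list)
    for student, location, ts in transactions:
        by_location[location].append((ts, student))

    edges = Counter()
    for events in by_location.values():
        events.sort()  # chronological order within each location
        for i, (ts_i, s_i) in enumerate(events):
            for ts_j, s_j in events[i + 1:]:
                if ts_j - ts_i > window_seconds:
                    break  # later events are even further away in time
                if s_i != s_j:
                    edges[tuple(sorted((s_i, s_j)))] += 1
    return edges


def spatial_sequences(transactions):
    """Return each student's locations in chronological order."""
    sequences = defaultdict(list)
    for student, location, _ts in sorted(transactions, key=lambda t: t[2]):
        sequences[student].append(location)
    return dict(sequences)
```

Edge weights and sequence regularity computed this way could then be used as features in a retention model.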
COVID-19 Related Research Projects
What is Contact Tracing?
Digital vs. Manual Methods
Three Different Methods
a. Manual contact tracing
b. Manual with digital assistance from a Prompted Mobility Pathway, aka Memory Jogger
c. Digital: Bluetooth app for exposure notification
Memory Jogger using Wifi Logs
Working with Jeremy Frumkin, Research and Discovery Technologies
• Using WiFi network logs with CatCard data to support strategic efforts related to congestion tracking on campus and managing campus foot traffic
• Understanding movement patterns among campus spaces
• Complementing app-based and manual contact tracing efforts with the additional insights that can be gained through the WiFi logs
• Designing a Memory Jogger, a prompted mobility pathway tool, to enhance manual contact tracing
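A prompted mobility pathway can be thought of as a per-person timeline of building stays, reconstructed from WiFi association records. The sketch below is a minimal illustration, not the actual tool. It assumes log rows of the form `(user_id, building, unix_timestamp)` and merges consecutive records in the same building when they are at most 15 minutes apart. Both the row layout and the gap threshold are hypothetical.

```python
def mobility_pathway(log_rows, user_id, max_gap_seconds=900):
    """Collapse one user's WiFi records into an ordered list of building stays.

    log_rows: iterable of (user_id, building, unix_timestamp).
    Returns a list of dicts with "building", "start", and "end" keys.
    """
    events = sorted((ts, building) for user, building, ts in log_rows if user == user_id)
    stays = []
    for ts, building in events:
        last = stays[-1] if stays else None
        if last and last["building"] == building and ts - last["end"] <= max_gap_seconds:
            last["end"] = ts  # same building, short gap: extend the current stay
        else:
            stays.append({"building": building, "start": ts, "end": ts})
    return stays
```

A contact tracer could read such a timeline back to the person being interviewed as a memory prompt.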
Traffic/Crowd Analysis
[Dashboard screenshot: filters for date (Feb 3, 2020), time (8 am-9 am), building, and user type]
Traffic on campus between 8 am and 9 am: top ten traffic spots visualized and compared with the selected building (in red)
Comparison of hourly traffic in the selected building
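The ranking behind a view like this can be sketched as a count of distinct users per building within one hour of one day. This is an illustration only. The `(user_id, building, iso_timestamp)` row layout, the function name, and the decision to count distinct users rather than raw log records are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def top_buildings(log_rows, day, hour, n=10):
    """Rank buildings by distinct users seen during one hour of one day.

    log_rows: iterable of (user_id, building, iso_timestamp_string).
    day:      date string such as "2020-02-03".
    hour:     hour of day (0-23); 8 means 8:00 to 8:59.
    """
    users = defaultdict(set)
    for user, building, ts in log_rows:
        t = datetime.fromisoformat(ts)
        if t.date().isoformat() == day and t.hour == hour:
            users[building].add(user)
    # Sort by descending user count, then by building name for stable ties.
    ranked = sorted(users.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(building, len(seen)) for building, seen in ranked[:n]]
```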
To compare the three methods for contact tracing and exposure notification:
• How do the three contact tracing approaches differ in their outcomes, such as timeliness, coverage of contacts, and other metrics?
• How do these methods complement each other, and what are their relative strengths and weaknesses?
• How do these methods perform overall in preserving privacy while allowing for comprehensive contact tracing? What are the tradeoffs?
• How acceptable are these three strategies to the community, and what is an effective path to deploying comprehensive contact tracing?
Some General Lessons
• Need for complex techniques?
• Is causality really necessary for prediction?
• What level of accuracy is good?
• Working with your stakeholders is important
• Research is very important in training next-generation scientists, end users, students, and others
Some General Lessons
• Focus on defining the problem carefully
• Out-of-the-box thinking
• Big Data: Don’t think of it as a single very large dataset
• Repurpose and combine different types of data
• Exploit the granularity of data, especially the spatial and temporal features: machine learning and network science
• Extracting Signal from Noise
Good News
McKinsey in 2015: predicted that by 2020 the number of data science jobs in the United States alone would exceed 500,000, but that there would be fewer than 200,000 data scientists available to fill these positions. Globally, demand for data scientists was projected to exceed supply by more than 50 percent by 2020.
IBM today: Annual demand for the fast-growing new roles of data scientists, data developers, and data engineers will reach nearly 700,000 openings by 2020.
CONCLUSION
• PARADIGM SHIFT
• BIG DATA HAS A LOT OF HIDDEN VALUE
• LET’S LEVERAGE IT USING AI TO CREATE A BETTER WORLD!
QUESTIONS??
TEDx Talk:
http://tedxtucson.com/portfolio/sudha-ram/
www.insiteua.org
Email: [email protected]
Twitter: @sudharam