
Healthcare Data Mining & Predictive Analytics

HIM 6665

Morsani College of Medicine

Course Director: Athanasios Tsalatsanis

(813) 396-2605 · (813) 905-8909 · Canvas integrated email system

Course Format: This course is delivered online through the USF Canvas Portal (https://usflearn.instructure.com).

Course Objectives: Healthcare Data Mining and Predictive Analytics (HIM 6665) is a graduate-level course designed to introduce students to various data mining concepts and algorithms. It emphasizes classifiers, clustering, and association analysis applicable to the distinct nature of healthcare data. The terms data mining and predictive analytics refer to the computational process of discovering patterns in large datasets using interdisciplinary methods such as artificial intelligence, machine learning, and statistics. The ultimate goal of data mining is to extract previously unknown information from data residing in a dataset, preferably an electronic one. This information can later be used to support healthcare decisions such as medical treatment, prognosis, and diagnosis, as well as other types of decisions such as targeted marketing, detection of money laundering, and credit card utilization analysis.

With regard to biomedical data, data mining techniques have been widely used in different areas of biomedicine, including genomics, proteomics, and medical diagnosis. In particular, owing to the predictive power of data mining techniques, several studies model the often non-linear relationships between dependent and independent variables. Examples include classification algorithms for the diagnosis of pigmented skin lesions, detection of spikes in EEGs to diagnose neurological disorders related to epilepsy, and classification of lung sound signals to assist diagnosis. With regard to molecular biology, clustering analysis is primarily applied to microarray gene expression data to identify groups of genes sharing similar expression profiles. Besides clustering, other predictive data mining techniques have been used to predict protein secondary structure, protein backbone angles, biological effects, and DNA binding. The most recent research efforts in healthcare data mining relate to text mining and natural language processing, where automated methodologies are used to scan non-indexed text and compile domain-specific information. Examples of published research on text mining include using simple morphological clues to recognize the names of proteins and other materials, extracting entities by classifying words into 24 entity classes, identifying genes and proteins in text, resolving the classification of a biological entity, and mapping between abbreviations and full names.

Proficiency in data mining methodologies is a fundamental requirement for the field of healthcare analytics. This course aims to provide both theoretical and practical coverage of data mining topics such as predictive modeling, association analysis, clustering, anomaly detection, and visualization, with a special focus on healthcare applications. After completing this course, students should be able to:

• Communicate using data mining terminology
• Distinguish between predictive and descriptive methods
• Explain various data mining methods
• Apply commonly used data mining methods in healthcare datasets
• Use data mining software

Instructor’s Office Hours: There are 2 modes. 1) Integrated chat tool – by appointment 2) Email – anytime – (The instructor will make every effort to respond within 2 business days)

Location: This is a web-based course hosted in Canvas. It can be accessed via https://usflearn.instructure.com/

Course Credit Hours: 3 credit hours

Course Prerequisites: The course is open to all graduate students, those admitted to the Graduate Certificate in Health Informatics and Master’s students in the Health Informatics Health Analytics concentration.

Who To Contact and How: For course content related questions - contact the instructor directly. For Canvas related technological support, please contact USF IT Help Desk at (813) 974-1222.

Course Format: This course is web-based. Course materials and assignments will be posted on the course website. The course is divided into weekly “Sessions” and includes the following elements:

Reading Assignments: Specific chapters in the textbook required for the course as well as research papers will be assigned for each session. The reading assignments are the primary means by which each student will acquire the core content of the course. It is essential that students complete the reading assignments for comprehension early in each session.

Quizzes: For each chapter in the reading assignment, a quiz will be posted with which students can assess their level of comprehension of the reading assignment. Grades will be posted in the grade book for each quiz, but the quiz grades will not be included in the calculation of the final course grade.

Presentations: Presentations in narrated PowerPoint format will be included for each session. These presentations are intended to emphasize the main topics of the reading assignments and their clinical importance as related to the particular session topic.

Assignments: Weekly assignments will be provided to students. These assignments will relate to and expand on the main topics of the reading assignment. The goal of each assignment is to enable students to research various topics outside the textbook and presentations. Assignment grades are included in the calculation of the final course grade.

Discussions: Class discussion topics will be introduced in each session. All students are expected to participate in the class discussions. Case studies, question-and-answer activities, and identification of valuable web resources will be the focus of the discussions. Participation in discussions is included in the calculation of the final course grade.

Exams: Exams will be administered on two occasions during the class to assess students’ understanding of the course material, which includes reading assignments, presentations, and weekly assignments. The midterm exam will be administered during the 4th week of the course and the final exam will be administered during the 8th week of the course. Exam grades are included in the calculation of the final grade.

Quizzes: There will be a quiz for each of the assigned chapters of the textbook. Each quiz will consist of questions randomly selected from a pool of questions that pertain to the specific session’s reading assignments. All quizzes will be delivered in Canvas. Each quiz can be taken once so that students can determine their level of understanding of the reading material. The quiz scores will not count as part of the grade. Quizzes will be available only during their respective week. For example, Quiz 1 will not be available during weeks 2 through 8.

Examinations: All examinations will be delivered in Canvas. There will be a total of two “Exams”, a comprehensive “Midterm Exam” and comprehensive “Final Exam”.

Comprehensive Midterm Exam: This exam will cover all material from Sessions 1-4 of the course. A minimum of 10% of the questions for the Midterm Exam will be taken from the pool of questions used for the quizzes. The midterm exam will be administered during the week of Session 4 and must be started at least one hour before Sunday midnight (EST) of Session 4. The exam will have 60 multiple choice and/or True/False questions and a time limit of 50 minutes. The Midterm Exam will be available only within the allotted timeframe, and students can take the exam once.

Comprehensive Final Exam: This exam will cover all material presented during the course. A minimum of 10% of the questions for the Comprehensive Final Exam will be taken from the pool of questions used for the quizzes. The final exam will be administered during the week of Session 8 and must be started at least one hour before Sunday midnight (EST) of Session 8. The exam will have 60 multiple choice and/or True/False questions and a time limit of 60 minutes. The Final Exam will be available only within the allotted timeframe, and students can take the exam once.

Discussions: Discussions are asynchronous, meaning that participants post messages to discussion lists. This is a lot like using a bulletin board. The advantage is that participants do not have to find a time when everyone can log in simultaneously. However, because the exchange of ideas is so important, participants will have to be working on the same topics at roughly the same time. It is therefore not possible for participants to work entirely at their own pace, for example by doing all coursework in the first few days of the course or by leaving all coursework until the end. It is imperative that you be able to participate in the discussions on a regular basis during the course. If you have questions about whether the course will be flexible enough for your purposes, please contact the course instructor. There is a minimum of three posts for each discussion topic. The primary post should answer the main question(s) of the discussion, and the two commentary posts should be used to comment on classmates’ posts. Primary posts of fewer than 300 words and commentary posts of fewer than 100 words are not sufficient and will not be accepted.

Assignments: Assignments are mandatory and must be submitted in Canvas. Every week, a template of the assignment will be emailed to the students. Students should research the assigned topics and populate the template document with their answers. Once completed, students must upload their work into Canvas. All submitted documents will be automatically reviewed for plagiarism using Turnitin, which will generate a plagiarism report. In order to view this report and modify their work if necessary, students must submit their assignments at least 3 hours before the assignment deadline. Turnitin scores greater than 15% (excluding the assignment’s template and references) may result in a zero grade due to plagiarism or lack of intellectual effort.

Due Dates: Each session’s deliverables must be submitted by the end of the session’s respective week (Sunday midnight). All course deadlines are in Eastern Standard Time.

Extensions and Makeup assignments/exams: Late submissions of assignments, incomplete assignments, or exam completions will NOT be accepted. Extensions are only given due to extreme circumstances or emergencies. Students are required to provide appropriate documentation, which the instructor must determine to be acceptable BEFORE an extension is granted and BEFORE the deadline. Absolutely no extension is granted beyond the end of the class (last Sunday of the 8th week). Examples of events qualifying for consideration include:

a. Illness of the student or immediate family (parent, spouse, child, sibling, or grandparents) of such severity or duration to preclude completion of the assignment(s) or exam(s) as confirmed in writing by a physician (M.D.).

b. Death in the immediate family (parent, spouse, child, sibling, or grandparents) as confirmed by documentation (death certificate, obituary) indicating the student’s relationship to the deceased.

c. Involuntary call to active military duty as confirmed by military orders.

d. A situation in which the University is in error as confirmed by an appropriate University official.

e. Other documented exceptional circumstances beyond the control of the student which precluded completion of the assignment(s) or exam(s) accompanied by explanatory letter and supporting documentation.

Student Performance and Final Course Grade Calculation: Your final course grade will be determined by weighting your percent scores for the Discussions, Assignments, Comprehensive Midterm Exam, and Comprehensive Final Exam as follows.

Discussions: 5%
Assignments: 30%
Comprehensive Midterm Exam: 30%
Comprehensive Final Exam: 35%
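The weighting above can be illustrated with a short calculation. The following is a minimal sketch, not an official grade calculator: the weights are taken from this syllabus, while the function name `final_percentage` and the example scores are hypothetical.

```python
# Weights as listed in the syllabus (they sum to 100%).
WEIGHTS = {
    "discussions": 0.05,
    "assignments": 0.30,
    "midterm_exam": 0.30,
    "final_exam": 0.35,
}


def final_percentage(scores):
    """Return the weighted course percentage from per-component percent scores (0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student (illustration only).
example_scores = {
    "discussions": 100,
    "assignments": 90,
    "midterm_exam": 80,
    "final_exam": 85,
}
print(round(final_percentage(example_scores), 2))
```

Note that quiz scores do not appear in the dictionary, since quizzes do not count toward the final grade.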

Final course grade will be based on a percentage performance basis for the course using the following plus/minus grading scale that is recommended for all College of Medicine graduate courses:

Letter Grade    Percentage
A               92-100
A-              89-91
B+              87-88
B               82-86
B-              79-81
C+              77-78
C               72-76
C-              69-71
D+              67-68
D               62-66
D-              59-61
F               <59

"I" (Incomplete grade) Policy: Students sometimes fail to progress in technology-centered courses because they lack adequate prerequisite technical skills or do not exercise adequate time management and study skills. These are NOT appropriate bases for the issuance of an "Incomplete" grade. No "I" grades will be awarded in this course without extenuating, documented circumstances, such as death in the family or extended illness. If you should happen to arrive in such unfortunate circumstances, be sure to provide the instructor with suitable documentation. Don't ask the instructor what form the documentation should take or what is acceptable. If the instructor finds any problem with it, he will let you know. "I" requests must be made and valid documentation provided before the course is over and grades have been issued. Your "I" will buy you one more semester in which to finish your work. If you haven't earned a higher grade by that time, your grade will convert permanently to an "F" and there will be no way to complete the course. If you wish at that point to continue, you will have to start anew by re-registering (and re-paying) for the course.

Course Overview: This course is designed for future healthcare professionals who are interested in managing, performing, evaluating, and validating healthcare research that involves analysis of data. Students in healthcare analytics will greatly benefit from the data analysis methodologies presented. The course navigates data mining through a series of seven distinct areas that span from introductory concepts to advanced applications. It introduces the major principles and techniques used in data mining from an algorithmic perspective to enable students to better understand how data mining technology can be applied to various kinds of data. It discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Summary statistics and visualization techniques are discussed. Next, the course covers various classifiers such as decision trees, rule-based systems, nearest neighbor techniques, Bayesian techniques, artificial neural networks, support vector machines, and ensemble classifiers. Furthermore, the course describes the basics of association analysis with emphasis on frequent itemsets and association rules. A variety of more advanced topics are presented, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. Next, various types of clusters and clustering techniques are discussed. Finally, the special topic of anomaly detection is described. “Healthcare data mining and predictive analytics” is developed in a “distance” format that will cater to students who are currently employed and cannot accommodate the schedules of the regular didactic courses that are offered during the traditional College of Medicine academic schedule.

This course is part of the Medical Sciences Master’s concentration in Health Analytics, which provides a valuable opportunity for students to gain a deeper understanding of the principles of information technology as applied to the modern healthcare research enterprise and healthcare delivery. The course content follows a traditional introductory curriculum in data mining and includes topics in various types of mining. The course also includes the discussion of a number of case studies designed to reinforce the theories and concepts concerning healthcare data mining developed during the course. The distance mode of delivery also gives geographically dispersed students, or those currently engaged in full-time employment, convenient access to the courses and the program. The course material is presented in a “modular” format, which presents the essential information in an integrated approach. The course requires extensive “on-line” participation plus additional hours of reading, writing, and research. Course participants will be introduced to the modern principles of health informatics and their application to research and clinical care. All course work can be done on a participant’s individual computer. There will be extensive online discussions with other course participants.

COURSE TEXTBOOK(S)

The following textbook is required for this course. In order to appropriately address the teaching objectives of the course, students will be responsible for subject material from the assigned readings that is not covered in the lecture modules. Moreover, maximum benefit will be obtained by reading the appropriate textbook material (Assigned Reading) before viewing each module. Students are responsible for purchasing their textbook before the term starts.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison-Wesley, 2006.

Additional suggested readings and resources may be found within the following textbooks:

Dunham, Margaret H. Data Mining: Introductory and Advanced Topics. Pearson Education, 2006.
Delen, Dursun. Real-World Data Mining: Applied Business Analytics and Decision Making. Pearson Education, 2015.
Chen, Hsinchun, et al., eds. Medical Informatics: Knowledge Management and Data Mining in Biomedicine. Vol. 8. Springer Science & Business Media, 2006.

Disability accommodation: Information regarding qualifications for student disabilities through the Disabled Student Academic Services Office (DSA) at the University of South Florida can be found online at: http://download.grad.usf.edu/PDF/section14.pdf. Students can also directly contact the DSA for arrangement of academic accommodations and assistance at (813) 974-4309, SVC 2043, Coordinator of Disabled Student Academic Services.

Holidays and Religious Observances: Students who anticipate that they will be unable to complete any aspect of this course due to a major religious observance must provide written notice to the instructor by the end of the second week of the course.

Plagiarism Detection: The University of South Florida has an account with an automated plagiarism detection service which allows instructors to submit student assignments to be checked for plagiarism. The instructor reserves the right to 1) request that assignments be submitted as electronic files and 2) electronically submit assignments to Canvas/Turnitin. Assignments are compared automatically with a huge database of journal articles, web articles, and previously submitted papers. The instructor receives a report showing exactly how a student’s paper was plagiarized.

Copyright infringement: All course material including, but not limited to, presentations, quiz/exam questions and answers, assignments and their solutions, and discussion topics is the intellectual property of the instructor. Students are prohibited from sharing, posting, or by any means disseminating this material outside the course without prior approval from the instructor. Copyright violations are considered cheating under the graduate student code of conduct.

Online Proctoring: Examinations in HIM 6665 may utilize online proctoring in Canvas to ensure exam/quiz integrity, which has special technological requirements. As stated in the appropriate USF announcement: “All students must review the syllabus and the requirements including the online terms and video testing requirements to determine if they wish to remain in the course. Enrollment in the course is an agreement to abide by and accept all terms. Any student may elect to drop or withdraw from this course before the end of the drop/add period. Online exams and quizzes within this course may require online proctoring. Therefore, students will be required to have a webcam (USB or internal) with a microphone when taking an exam or quiz. Students understand that this remote recording device is purchased and controlled by the student and that recordings from any private residence must be done with the permission of any person residing in the residence. To avoid any concerns in this regard, students should select private spaces for the testing. The University library and other academic sites at the University offer secure private settings for recordings and students with concerns may discuss location of an appropriate space for the recordings with their instructor or advisor. Students must ensure that any recordings do not invade any third party privacy rights and accept all responsibility and liability for violations of any third party privacy concerns. Setup information will be provided prior to taking the proctored exam. For additional information about online proctoring you can visit the online proctoring student FAQ.”


HEALTHCARE DATA MINING & PREDICTIVE ANALYTICS LEARNING OBJECTIVES

Topic 1: Introduction

After this topic, the student should be able to:
• Explain what data mining is
• Report the origins of data mining
• Identify data mining tasks

Topic 2: Healthcare Data

After this topic, the student should be able to:
• Describe different types of data
• Explain the importance of data quality
• Use methods for data preprocessing
• Discuss measures of similarity and dissimilarity

Topic 3: Exploring healthcare data

After this topic, the student should be able to:
• Present the iris dataset
• Perform summary statistics
• Perform basic visualization
• Explain multidimensional data analysis

Topic 4: Classification of healthcare data

After this topic, the student should be able to:
• Use basic concepts, decision trees, and model evaluation methods
• Describe alternative techniques such as rule-based, nearest-neighbor, Bayesian, artificial neural networks, support vector machines, and ensemble methods
• Explain the class imbalance problem
• Present the multiclass problem

Topic 5: Association analysis

After this topic, the student should be able to:
• Describe basic concepts and algorithms
• Explain generation of frequent itemsets
• Describe the rule generation process
• Explain the FP-growth algorithm
• Perform evaluation of association patterns
• Discuss the effect of skewed support distribution
• Discuss methods for handling categorical and continuous attributes

Topic 6: Cluster analysis

After this topic, the student should be able to:
• Use methods such as k-means, agglomerative hierarchical clustering, DBSCAN, and cluster evaluation
• Discuss advanced methods such as prototype-based clustering, density-based clustering, and graph-based clustering

Topic 7: Anomaly detection

After this topic, the student should be able to:
• Describe statistical approaches for anomaly detection
• Review proximity-, density-, and clustering-based techniques for anomaly detection


Appendix A

University of South Florida Student Conduct Policies: http://www.sa.usf.edu/

Online Conduct/Academic Dishonesty:

All members of this course shall foster an environment that encourages adherence to the principles of honesty and integrity. All parties shall protect the integrity of academic materials including test materials, copyrighted documents, and all related course work.

Students are expected to represent themselves honestly in all work submitted. The presence of a student’s name on any material submitted in completion of an assignment is considered to be an assurance that both the work and ideas are the result of the student’s own intellectual effort, and produced independently. Collaboration is not allowed unless specifically permitted by the instructors. All course participants are expected to respect others’ personal feelings; have the right of freedom to hear and participate in dialogue and to examine diverse ideas; and have the right to a learning environment free from harassment and discrimination; and the responsibility that free discussion represents the scholarly nature of the learning community.

Cheating (the unauthorized giving, receiving, or use of material or information in quizzes, assignments or other course work or the attempt to do so) or plagiarism (the use of ideas, data or specific passages of another person’s published or unpublished work that is either unacknowledged or falsely acknowledged) is not acceptable in this course.

The use of Internet resources when writing your paper should be kept to a minimum. It is not acceptable to use on-line abstracts or resources of questionable authority in your paper. The web is acceptable for certain data sources e.g. CDC or census data. It is acceptable to use full text journal articles that are on-line.

Academic Dishonesty & Disruption of Academic Process Policy: See http://www.grad.usf.edu/policies.asp

Plagiarism & Punishment Guidelines for Plagiarism: See http://www.grad.usf.edu/policies.asp

Plagiarism is defined as ‘literary theft’ and consists of the unattributed quotation of the exact words of a published text, or the unattributed borrowing of original ideas by paraphrase from a published text. On written papers for which the student employs information gathered from books, articles, or oral sources, each direct quotation, as well as ideas and facts that are not generally known to the public at large, or the form, structure, style of a secondary source must be attributed to its author by means of the appropriate citation procedure. Only widely known facts and thoughts and observations original to the student do not require citations. Citations may be made in footnotes or within the body of the text. Plagiarism, also, consists of passing off as one’s own, segments or the total of another person’s work.

Cheating is defined as follows:

(a) The unauthorized granting or receiving of aid during the prescribed period of a course-graded exercise: students may not consult written materials such as notes or books, may not look at the paper of another student, nor consult orally with any other student taking the same test;
(b) Asking another person to take an examination in his/her place;
(c) Taking an examination for or in place of another student;
(d) Stealing visual concepts, such as drawings, sketches, diagrams, musical programs and scores, graphs, maps, etc., and presenting them as one's own;
(e) Stealing, borrowing, buying, or disseminating tests, answer keys or other examination material except as officially authorized, research papers, creative papers, speeches, etc.;
(f) Stealing or copying of computer programs and presenting them as one's own. Such stealing includes the use of another student's program, as obtained from the magnetic media or interactive terminals or from cards, print-out paper, etc.

Punishment for such Academic Dishonesties will depend on the seriousness of the offense and may include receipt of an ‘F’ or ‘0’ grade on the subject paper, lab report, etc., an ‘F’ in the course, suspension or expulsion from the University. The University drop policies and forgiveness policies shall be suspended for a student accused of plagiarism or cheating or both.


Healthcare Data Mining Glossary¹

Accuracy. A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data.
Application Programming Interface (API). The formally defined programming language interface between a program (system control program, licensed program) and its user.
Artificial Intelligence. The scientific field concerned with the creation of intelligent behavior in a machine.
Artificial Neural Network (ANN). See Neural Network.
Association Rule. A rule in the form of “if this then that” that associates events in a database. For example, the association between purchased items at a supermarket.
Back Propagation. One of the most common learning algorithms for training neural networks.
Binning. The process of breaking up continuous values into bins. Usually done as a preprocessing step for some data mining algorithms. For example, breaking up age into bins for every ten years.
Brute Force Algorithm. A computer technique that exhaustively repeats very simple steps in order to find an optimal solution. Such techniques stand in contrast to complex techniques that are less wasteful in moving toward an optimal solution but are harder to construct and are more computationally expensive to execute.
Cardinality. The number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes); low cardinality fields have few different values (e.g. eye color).
CART. Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID. Chi-Square Automatic Interaction Detector. A decision tree that uses contingency tables and the chi-square test to create the tree.
Classification. The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm. Classification is the process of determining that a record belongs to a group.
Clustering. The technique of grouping records (similar input patterns) together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.
Collinearity. The property of two predictors showing significant correlation without a causal relationship between them.
Conditional Probability. The probability of an event happening given that some event has already occurred. For example, the chance of a person committing fraud is much greater given that the person had previously committed fraud.
Coverage. A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied.
CRM. See Customer Relationship Management.
Cross Validation (and Test Set Validation). The process of holding aside some training data that is not used to build a predictive model and later using that data to estimate the accuracy of the model on unseen data, simulating the real-world deployment of the model.
Customer Relationship Management. The process by which companies manage their interactions with customers.
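The Accuracy and Cross Validation entries can be made concrete with a small sketch. It assumes a deliberately trivial "model" that always predicts the most common training label, so that the hold-aside procedure itself is what is being shown. The function names and the ten labels are hypothetical and are not taken from the course materials.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def k_fold_accuracy(labels, k):
    """Estimate the accuracy of a majority-class model with k-fold cross validation."""
    fold_scores = []
    for fold in range(k):
        # Every k-th record (offset by `fold`) is held aside for testing.
        test = [y for i, y in enumerate(labels) if i % k == fold]
        train = [y for i, y in enumerate(labels) if i % k != fold]
        # "Training" is just finding the most common label in the training part.
        majority = Counter(train).most_common(1)[0][0]
        fold_scores.append(accuracy([majority] * len(test), test))
    return sum(fold_scores) / k


# Hypothetical outcome labels for ten records (not real patient data).
labels = ["no", "no", "yes", "no", "no", "yes", "no", "no", "no", "yes"]
estimate = k_fold_accuracy(labels, k=5)
print(f"estimated accuracy: {estimate:.2f}, error rate: {1 - estimate:.2f}")
```

A real analysis would replace the majority-class rule with an actual classifier and would usually shuffle the records before splitting; the interleaved split here is only one simple way to form the folds.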


Data mining. The efficient discovery of nonobvious, valuable patterns from a large collection of data.
Database Management System (DBMS). A software system that controls and manages the data to eliminate data redundancy and to ensure data integrity, consistency and availability, among other features.
Decision Trees. A class of data mining and statistical methods that form tree-like predictive models.
Embedded Data Mining. An implementation of data mining where the data mining algorithms are embedded into existing data stores and information delivery processes rather than requiring data extraction and new data stores.
Entropy. A measure of the disorder of a set of data, often used in data mining algorithms.
Error Rate. A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy.
Expert System. A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory.
Exploratory Data Analysis. The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data.
Factor Analysis. A statistical technique that seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Field. The structural component of a database that is common to all records in the database. Fields have values. Also called features, attributes, variables, table columns, dimensions.
Front Office. The part of a company's computer system that is responsible for keeping track of relationships with customers.
Fuzzy Logic. A system of logic based on fuzzy set theory.
Fuzzy Set. A set of items whose degree of membership in the set may range from 0 to 1.
Fuzzy system. A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations.
Genetic algorithm. A method of solving optimization problems using parallel search, based on Darwin's biological model of natural selection and survival of the fittest.
Genetic operator. An operation on the population member strings in a genetic algorithm that is used to produce new strings.
Gini Metric. A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selecting predictors in the CART decision tree algorithm.
Hebbian Learning. One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
Hill Climbing Search. A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and can get caught in local optima.
Hypothesis Testing. The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation.
ID3. One of the earliest decision tree algorithms.
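The Entropy and Gini Metric entries above can be made concrete with a small Python sketch. It uses the standard textbook formulas (Shannon entropy in bits, and Gini impurity as one minus the sum of squared class proportions) and measures how much a candidate split reduces disorder. The function names and the example labels are hypothetical, and this is not a description of any particular CART implementation.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of records falling in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure set, higher for mixed sets."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def gini_impurity(labels):
    """Gini impurity: 0 for a pure set, higher for mixed sets."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def impurity_reduction(parent, left, right, measure):
    """Disorder of the parent minus the size-weighted disorder of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - weighted


# Hypothetical outcome labels before and after a candidate split.
parent = ["benign", "benign", "malignant", "malignant"]
left, right = parent[:2], parent[2:]
print(impurity_reduction(parent, left, right, entropy))
print(impurity_reduction(parent, left, right, gini_impurity))
```

A decision tree learner would compute such a reduction for every candidate split and keep the split with the largest value.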


Independence (statistical). The property of two events displaying no causality or relationship of any kind. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of both events occurring, that is, P(A and B) = P(A) × P(B).
Intelligent Agent. A software application that assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Kohonen Networks. A type of neural network in which nodes learn as local neighborhoods, so the locality of the nodes is important in the training process. They are often used for clustering.
Knowledge Discovery. A term often used interchangeably with data mining.
Lift. A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used.
Machine Learning. A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence.
Memory-Based Reasoning (MBR). A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle. The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model. A description that adequately explains and predicts relevant data but is generally much smaller than the data itself.
Nearest Neighbor. A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted.
Neural Network. A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights.
Nominal Categorical Predictor. A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Occam’s Razor. A rule of thumb used by many scientists that advocates favoring the simplest theory that adequately explains (or predicts) an event. This is more formally captured for machine learning and data mining as the minimum description length principle.
On-Line Analytical Processing (OLAP). Computer-based techniques used to analyze trends and perform business analysis using multidimensional views of business data.
Ordinal Categorical Predictor. A categorical predictor (i.e., one with a finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example, the values high, middle and low for the income predictor.
Outlier Analysis. A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers.
Overfitting. The effect in data analysis, data mining and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. At the limit, overfitting is synonymous with rote memorization where no generalized model of future situations is built.
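The Nearest Neighbor and Memory-Based Reasoning entries above can be sketched in a few lines of Python. The sketch classifies a new record by majority vote among its k closest already-classified records, using Euclidean distance. The function name `knn_predict`, the value of k, and the two-feature records are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(training, query, k=3):
    """Predict a label for `query` from a list of (feature_tuple, label) records."""
    # Sort the classified records by distance to the query and keep the k nearest.
    nearest = sorted(training, key=lambda record: dist(record[0], query))[:k]
    # The predicted label is the most common label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up records: two numeric features and a class label.
training = [
    ((1.0, 1.0), "low risk"), ((1.2, 0.9), "low risk"), ((0.8, 1.1), "low risk"),
    ((5.0, 5.0), "high risk"), ((5.2, 4.8), "high risk"), ((4.9, 5.1), "high risk"),
]
print(knn_predict(training, (1.1, 1.0)))
```

Because the method depends on distances, features measured on very different scales would normally be rescaled first; that step is left out of this sketch.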


Predictor. The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Prediction. 1. The column or field in a database that currently has an unknown value that will be assigned when a predictive model is run over other predictor values in the record. Also called dependent variable, target, classification. 2. The process of applying a predictive model to a record. Generally prediction implies the generation of unknown values within time series, though in this glossary prediction is used to mean any process for assigning values to previously unassigned fields, including classification and regression.
Predictive Model. A model created or used to perform prediction, in contrast to models created solely for pattern detection, exploration or general organization of the data.
Principal Components Analysis. A data analysis technique that seeks to weight the importance of a variety of predictors so that they optimally discriminate between various possible predicted outcomes.
Prior Probability. The probability of an event occurring without dependence on (conditional to) some other event. In contrast to conditional probability.
Radial Basis Function Networks. Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records.
Record. The fundamental data structure used for performing data analysis. Also called a table row or example. A typical record would be the structure that contains all relevant information pertinent to one particular customer or account.
Regression. A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data.
Reinforcement learning. A training model where an intelligence engine (e.g., neural network) is presented with a sequence of input data followed by a reinforcement signal.
Relational Database (RDB). A database built to conform to the relational data model; includes the catalog and all the data described therein.
Response. A binary prediction field that indicates response or nonresponse to a variety of marketing interventions. The term is generally used when referring to models that predict response or to the response field itself.
Sampling. The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database.
Segmentation. The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field.
Sensitivity Analysis. The process that determines the sensitivity of a predictive model to small fluctuations in predictor values. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model.
Simulated Annealing. An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling.
Structured Query Language (SQL). A standard language for the access of data in a relational database.
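The Regression entry above describes finding an equation that minimizes an error measure. A minimal Python sketch of the simplest case follows: a straight line with one predictor, fitted by ordinary least squares (minimizing the sum of squared errors). The function name `fit_line` and the example numbers are hypothetical; other error measures and model forms are equally valid instances of regression.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of the predictor from its mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Sum of cross-deviations between the predictor and the prediction field.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data for illustration only.
slope, intercept = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
print(f"prediction = {slope:.2f} * x + {intercept:.2f}")
```

The closed-form solution is used here because it keeps the sketch dependency-free; statistical packages generalize the same idea to many predictors.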


Supervised learning. A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well-defined prediction field. This is in contrast to unsupervised learning, where there is no particular goal aside from pattern detection.
Support. The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support, the better the chance of the rule capturing a statistically significant pattern.
Targeted Marketing. The marketing of products to select groups of consumers that are more likely than average to be interested in the offer.
Time-series forecasting. The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value.
Unsupervised learning. A data analysis technique whereby a model is built without a well-defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system.
Visualization. Graphical display of data and models that helps the user understand the structure and meaning of the information contained in them.
1 http://www.thearling.com/glossary.htm
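The Support entry in the glossary can be illustrated with a brief Python sketch. It assumes each database record is a collection of items and computes support as a relative frequency: the fraction of records that contain every item named in a rule. The function name `support` and the example records are hypothetical.

```python
def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    if not records:
        return 0.0
    wanted = set(itemset)
    # A record supports the rule when the rule's items are a subset of the record.
    hits = sum(1 for record in records if wanted <= set(record))
    return hits / len(records)


# Made-up records, each listing the findings noted for one patient.
records = [
    {"hypertension", "diabetes"},
    {"hypertension"},
    {"hypertension", "diabetes", "obesity"},
    {"obesity"},
]
print(support(records, {"hypertension", "diabetes"}))
```

For a rule such as "hypertension implies diabetes", the itemset passed in would hold the items from both sides of the rule.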
