Data Mining Project-Predicting Injury or Fatality in case of an accident

FATAL OR INJURY- A CASE OF DECIDING ON PRIORITIZING RESPONDER RESOURCES

By Piyush Lohana

In 2007, motor vehicles accounted for the largest share of accidents.

WHY THIS PROJECT

• “Every 12 minutes someone dies in a car crash in the United States due to a car accident or a collision between two motor vehicles.” (-NCIPC)

• Accidents are often fatal or involve serious injuries, and by the time help arrives at the crash site, much of the harm has already been done.

• We attempt to build a model that can predict the seriousness of an accident (i.e. whether it is fatal or results in injury) based on predictors such as rush hour, work zone, weather conditions, speed limit, and whether the road is an interstate.

• This helps to prioritize situations and allocate resources in scenarios where there is a high possibility of an accident resulting in fatalities or serious injury.

• This will enable emergency care providers to focus on the measures and resources needed when they arrive at the scene. The accuracy of pre-hospital crash scene details and crash victim assessment has important implications for the care that can be provided at the crash scene.

WHAT ARE WE CONSIDERING

• We will be looking at the characteristics of the environment in which the accident occurred (weather, road condition, type of road, time of day, the day of the week, and month of the year) and the characteristics of the crash (direction of accident, speed limit on the road, work zone area, and how many vehicles were involved).

• All of these variables can affect what kind of accident occurs (no injury, injury, or fatal). This can help the medical team arrive prepared for the actions that need to be taken at the scene.

DATA SOURCE

• http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=1158

• It has 24 different attributes and 42,183 records

• Identified Predictor and Outcome Variables

CLEAR DESCRIPTION OF DATA SET

Sl. No  Variables      Description
1       HOUR_I_R       1=rush hour, 0=not (rush = 6-9 am, 4-7 pm)
2       ALIGN_I        1=straight, 2=curve
3       STRATUM_R      1=NASS crashes involving at least one passenger vehicle towed from the crash scene due to damage, with no medium or heavy trucks involved, 0=not
4       WRK_ZONE       1=yes, 0=no
5       WKDY_I_R       1=weekday, 0=weekend
6       INT_HWY        Interstate? 1=yes, 0=no
7       LGTCON_I_R     Light conditions: 1=day, 2=dark (including dawn/dusk), 3=dark, but lighted, 4=dawn or dusk
8       MAN_COL_I      0=no collision, 1=head-on, 2=other form of collision
9       PED_ACC_R      1=pedestrian/cyclist involved, 0=not
10      REL_JCT_I_R    1=accident at intersection/interchange, 0=not at intersection
11      SPD_LIM        Speed limit, miles per hour
12      SUR_CON        Surface conditions (1=dry, 2=wet, 3=snow/slush, 4=ice, 5=sand/dirt/oil, 8=other, 9=unknown)
13      TRAF_WAY       1=two-way traffic, 2=divided hwy, 3=one-way road
14      VEH_INVL       Number of vehicles involved
15      WEATHER_R      1=no adverse conditions, 2=rain, snow or other adverse condition
16      INJURY_CRASH   1=yes, 0=no
17      NO_INJ_I       Number of injuries
18      FATALITIES     1=yes, 0=no
19      MAX_SEV_IR     0=no injury, 1=non-fatal inj., 2=fatal inj.

FILTERING DATA

• The filtering method used is "Standard Deviations from the Mean".

• This eliminates observations that lie more than three standard deviations from the mean.
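
A minimal sketch of this filter in Python, assuming the population standard deviation and a made-up list of records; the function name `filter_outliers` and the toy values are illustrative, not taken from the project.

```python
from statistics import mean, pstdev

def filter_outliers(rows, columns, k=3.0):
    """Keep rows whose value in every listed column lies within
    k standard deviations of that column's mean."""
    bounds = {}
    for col in columns:
        values = [row[col] for row in rows]
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        bounds[col] = (mu - k * sigma, mu + k * sigma)
    return [
        row for row in rows
        if all(bounds[c][0] <= row[c] <= bounds[c][1] for c in columns)
    ]

# Toy records (invented for illustration): 19 ordinary values and one extreme value.
records = [{"SPD_LIM": 55} for _ in range(19)] + [{"SPD_LIM": 500}]
kept = filter_outliers(records, ["SPD_LIM"])
print(len(records), "->", len(kept))
```

Only numeric columns such as SPD_LIM or VEH_INVL are natural candidates here; applying the rule to coded categorical columns would need separate thought.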

DATA PARTITIONING

• We build the model with Training Data

• Test its correctness with Test Data

• Validate it with Validation Data
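
The partitioning step could be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions for illustration; the slides do not state the split that was actually used.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows and cut them into training, validation and test sets.
    Whatever is left after the first two cuts becomes the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )

# Stand-in record ids; a real run would pass the accident records instead.
train_set, valid_set, test_set = partition(range(1000))
print(len(train_set), len(valid_set), len(test_set))
```

Because the three sets are slices of one shuffled list, no record can appear in more than one of them.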

PREDICT, CLASSIFY OR CLUSTER ?

As we are trying to predict the categorical class label MAX_SEV_IR, our analysis is supervised classification.

Our model intends to discover relationships between the attributes that would make it possible to predict the outcome variable.

MODEL

The following three models are used for our analysis:

• Memory Based Reasoning (MBR)

• Decision Trees

• Logistic Regression

FINAL MODEL

RESULTS AND DISCUSSION

BASELINE MISCLASSIFICATION

• MAX_SEV_IR - 0=no injury, 1=non-fatal inj., 2=fatal inj.

• Class 0 (No injury): 4949

• Class 1 (Non-fatal injury): 4900

• Class 2 (Fatal injury): 150

• The majority class is 0 (No injury)

• The percentage of majority class in the dataset is: 49.49 % (4949/9999)

• The baseline misclassification rate: 50.51 %

• This is the baseline: the model we build is useful only if its misclassification rate is lower than the baseline misclassification rate.
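
The baseline arithmetic can be reproduced with a few lines of Python. The class counts are the ones listed above; the function name `baseline_misclassification` is only illustrative.

```python
# Class counts for MAX_SEV_IR as listed on this slide.
class_counts = {0: 4949, 1: 4900, 2: 150}

def baseline_misclassification(counts):
    """Return the majority class and the error rate of always predicting it."""
    total = sum(counts.values())
    majority_class = max(counts, key=counts.get)
    return majority_class, 1 - counts[majority_class] / total

majority, baseline = baseline_misclassification(class_counts)
print(majority, round(baseline * 100, 2))  # prints: 0 50.51
```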

OUR DEFINITION OF BEST MODEL AS PER BUSINESS REQUIREMENT

• Decision Tree: a supervised, data-driven learning method for classification

• It is based on separating observations into more homogeneous subgroups by creating splits on predictors.

• As per our business requirement, this model is best at classifying an accident into the three cases used to prioritize resources.
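
To illustrate what "more homogeneous subgroups" means, here is a small sketch that scores candidate splits. It assumes Gini impurity as the homogeneity measure, which is one common choice; the slides do not say which criterion the tool used, and the four toy rows are invented purely to show the mechanics.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when all labels agree, higher when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, predictor, target):
    """Weighted Gini impurity of the subgroups formed by each value of predictor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[predictor], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Invented toy rows: they say nothing about real accident patterns.
toy = [
    {"WRK_ZONE": 0, "INT_HWY": 1, "MAX_SEV_IR": 1},
    {"WRK_ZONE": 1, "INT_HWY": 1, "MAX_SEV_IR": 1},
    {"WRK_ZONE": 0, "INT_HWY": 0, "MAX_SEV_IR": 0},
    {"WRK_ZONE": 1, "INT_HWY": 0, "MAX_SEV_IR": 0},
]
for predictor in ("WRK_ZONE", "INT_HWY"):
    print(predictor, split_impurity(toy, predictor, "MAX_SEV_IR"))
```

In this toy data the split with the lower weighted impurity is the one a tree would prefer, since its subgroups are purer than the unsplit data.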

RESULTS

Misclassification rate (_MISC_):

• Training: 0.40945

• Validation: 0.4113

• Test: 0.42305
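
As a quick check, the reported rates can be compared with the 50.51 % baseline misclassification rate. This sketch only restates the figures from the slides; the variable names are illustrative.

```python
baseline = 0.5051  # baseline misclassification rate reported earlier
rates = {"Training": 0.40945, "Validation": 0.4113, "Test": 0.42305}

for name, rate in rates.items():
    # Relative reduction in error compared with always predicting the majority class.
    reduction = (baseline - rate) / baseline
    print(f"{name}: {rate:.5f} ({reduction:.1%} below baseline)")

beats_baseline = all(rate < baseline for rate in rates.values())
print("All partitions beat the baseline:", beats_baseline)
```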

NODE RULES

INTERPRETATION AND IMPLEMENTATION

• Based on these rules, an application or website can be created that, given the five most important factors (predictors), estimates the percentage chance of an accident resulting in fatality, injury, or no injury.

• The emergency service provider can then take a decision and send the response team to the site of an accident accordingly.

BLUEPRINT OF IMPLEMENTATION

OUTCOME

• Depending on the Node Rule, it will predict the outcome

• Red Cross predicts an 80% chance of injury

• Red Cross predicts a 10% chance of fatality

• Red Cross predicts a 10% chance of no injury
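
A rough sketch of how such a lookup might work in an application. The rule condition below is hypothetical and is not one of the project's actual node rules; only the 80/10/10 split mirrors the example outcome above. The names `NODE_RULES` and `predict` are likewise illustrative.

```python
# Hypothetical node rules: each pairs a condition on the predictors with the
# class probabilities of the matching tree node. The condition is invented.
NODE_RULES = [
    (
        lambda a: a["SPD_LIM"] >= 55 and a["MAN_COL_I"] == 1,
        {"Injury": 0.80, "Fatality": 0.10, "No injury": 0.10},
    ),
]

def predict(accident):
    """Return the class probabilities of the first matching rule,
    or None when no rule covers the reported accident."""
    for condition, probabilities in NODE_RULES:
        if condition(accident):
            return probabilities
    return None

report = {"SPD_LIM": 65, "MAN_COL_I": 1}  # made-up accident report
outcome = predict(report)
if outcome is not None:
    for label, chance in outcome.items():
        print(f"{label}: {chance:.0%}")
```

Returning None for an uncovered case keeps the application from guessing; a real tree would have a leaf for every combination of inputs.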

SCOPE FOR IMPROVEMENT

• To build a more focused and rigorous model, we are working on identifying more predictors that can help determine the outcome of an accident, and on a cleaner model with a lower misclassification rate.

• To achieve this, we intend to try a neural network algorithm.

THANK YOU
