REPORT ON ADVANCED ANALYTICAL TOOLS FOR PEST AND DISEASE PREDICTION MODEL USING R
8 DAYS ONLINE HANDS-ON TRAINING PROGRAMME HELD FROM 5TH TO 12TH JUNE 2020
University of Agricultural Sciences, GKVK, Bangalore
Importance of Advanced Analytical Tools:
Advanced analytics is an umbrella term for several sub-fields of analytics that work together
to provide predictive capabilities. It uses high-level methods and tools to project future trends,
events, and behaviours. Advanced analytics also includes newer technologies such as machine learning
and artificial intelligence, semantic analysis, visualization, and neural networks.
The online training programme was one of the first of its kind to use R software for predicting
and forecasting the occurrence of pests and diseases, so that crop losses can be reduced and
productivity increased. Apart from covering the theory, the training focused on the practical use
of this tool. It benefits research activities at the organizational level and helps research
scholars and students analyse research data and publish high-quality research papers.
To impart these skills to students working on relevant subjects, this 8-day training programme
titled “Advanced Analytical Tools for Pest and Disease Prediction Model” was held from 5th to
12th June 2020.
Resource persons: Professors and guest faculty well versed in analytics were invited as guest
lecturers for the short course, viz.
Dr. Lalith Achoth, Professor (Retd.), Dairy Science College, Hebbal, Bengaluru,
Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru,
Dr. E. Edwin Raj, SRF, Plant Pathology Division, ICAR-NRC for Banana, Tamil Nadu,
Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions Pvt. Ltd., Bengaluru,
Dr. Mohammed Ahamed, J., Scientist, RRSC-South, ISRO, Bengaluru.
Participants: A total of 211 applications were received, of which 20 were selected based on their
prior knowledge of R programming and modelling techniques. The 20 participants came from
different disciplines, viz., Agronomy, Genetics and Plant Breeding, Plant Pathology and Agricultural
Entomology, Nematology, Agrometeorology, and Agricultural Microbiology. As this was an online
training programme, it drew participants from different parts of India, including Uttar Pradesh,
Madhya Pradesh, Tamil Nadu, Andhra Pradesh and Karnataka.
Schedule of the training programme:
Date and time | Speaker/trainer | Topic
05.06.2020, 4:00 pm to 6:00 pm | Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru; Email ID: [email protected]; 8073070288 (M) | Logistic regression for pest and disease prediction in R
06.06.2020, 4:00 pm to 6:00 pm | Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru; Email ID: [email protected]; 8073070288 (M) | Introduction to machine learning techniques: Decision Tree for pest and disease prediction
07.06.2020, 4:00 pm to 6:00 pm | Dr. E. Edwin Raj, Senior Research Fellow, Plant Pathology Division, ICAR-NRC for Banana, Tiruchirappalli-620102, Tamil Nadu; 9486646720 (M); [email protected] | Weather based rice blast disease prediction using Machine learning approach in R program "caret package"
08.06.2020, 10:30 am to 12:00 pm | Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions Pvt. Ltd., Bengaluru; Email ID: [email protected]; 9901399939 (M) | Pest prediction: the importance of sustained generation of ground data
09.06.2020, 3:00 pm to 5:00 pm | Dr. Lalith Achoth, Professor (Retd.), Dairy Science College, Hebbal, Bengaluru; Email ID: [email protected]; 8088746133 (M) | Machine learning approach: Random Forest technique for pest and disease prediction
10.06.2020, 4:00 pm to 6:00 pm | Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru; Email ID: [email protected]; 8073070288 (M) | Machine learning approach: Neural Network technique for pest and disease prediction
11.06.2020, 11:00 am to 12:00 pm | Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions Pvt. Ltd., Bengaluru; Email ID: [email protected]; 9901399939 (M) | YAKSHA mobile application for pest and disease prediction and advisories to farmers
11.06.2020, 2:00 pm to 4:00 pm | Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru; Email ID: [email protected]; 8073070288 (M) | Machine learning approach: Support Vector Machine (SVM) technique for pest and disease prediction
12.06.2020, 09:00 am to 11:00 am | Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru; Email ID: [email protected]; 8073070288 (M) | Visualisation: different types of graphs, heat maps, geo-spatial maps and introduction to text mining
12.06.2020, 11:00 am to 12:00 pm | Dr. E. Edwin Raj, Senior Research Fellow, Plant Pathology Division, ICAR-NRC for Banana, Tiruchirappalli-620102, Tamil Nadu; 9486646720 (M); [email protected] | Weather based stem borer pest prediction using Machine learning approach in R program "caret package"
12.06.2020, 3:00 pm to 4:00 pm | Dr. Mohammed Ahamed, J., Scientist, RRSC-South, ISRO, Bengaluru; Email ID: [email protected]; Mob. No.: 9000567378 | Familiarization of satellite data for pest and disease prediction
12.06.2020, 5:00 pm to 6:00 pm | Dr. S. Rajendra Prasad, Vice-Chancellor, Training Director and PI of CAAST-NGT_AA Project | Valedictory / Concluding session
Over the course of the training, Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College,
Hebbal, Bengaluru, guided us through the logistic regression model for pest and disease prediction.
He also introduced us to the decision tree technique using R software and held an elaborate session
on the Support Vector Machine (SVM) technique for pest and disease prediction models.
Under his guidance we also learnt different visualization techniques, including the generation of
different types of graphs, heat maps, and geo-spatial maps, and he introduced us to a few text mining
techniques. Paddy leaf blast disease and grape downy mildew disease data were used for learning
these advanced analytical skills for building forecast models.
Dr. Lalith Achoth, Professor (Retd.), Dairy Science College, Hebbal, Bengaluru, taught us the
machine learning technique called Random Forest. Rice leaf blast disease data were used to explain
the model building technique in detail.
Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions Pvt. Ltd., Bengaluru, explained
the importance of sustained generation of ground data and took us through the mobile app
YAKSHA, which can be used for data collection and report generation at field level.
Dr. E. Edwin Raj, Senior Research Fellow, Plant Pathology Division, ICAR-NRC for Banana,
Tiruchirappalli, introduced us to a very useful R package called ‘caret’ (Classification and
Regression Training), which contains functions to streamline the model training process for
complex regression and classification problems. He used weather-based rice blast disease data
and stem borer pest incidence data as examples in explaining the package.
Dr. Mohammed Ahamed, J., Scientist, RRSC-South, ISRO, Bengaluru, introduced us to various
open data sources such as Karnataka GIS, Giovanni, VEDAS (vedas.sac.gov.in) and Google Earth Engine
for collecting primary data on weather parameters.
At the valedictory programme held on the last day of the training, Dr. S. Rajendra Prasad,
Vice-Chancellor, Training Director and PI of CAAST-NGT_AA Project, addressed the participants
and the resource persons and congratulated our PI, Dr. M. K. Prasanna Kumar, on the successful
completion of the training programme. He also distributed the certificates to the participants
virtually.
Participants’ Feedback
Practical utilization of the Advanced Analytical Tools using R:
After practical exposure to the different advanced analytical tools, the participants were asked
to provide feedback on the training programme, describing how they are using these tools in their
own research domains. Their accounts follow.
Application of Logistic Regression for Disease Prediction
Introduction
Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No,
True / False) given a set of independent variables. To represent binary/categorical outcome, we use
dummy variables.
• Logistic Regression is part of a larger class of algorithms known as Generalized Linear Models
(GLM)
• A GLM does not assume a linear relationship between the dependent and independent variables.
However, in the logit model it assumes a linear relationship between the link function and the
independent variables.
• The dependent variable need not be normally distributed
• It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses
maximum likelihood estimation (MLE)
• Errors need to be independent but need not be normally distributed
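The points above can be illustrated with a minimal sketch on simulated data. The variable names (rh, tmin, disease) and the coefficients are hypothetical and are not taken from the training data sets. The sketch fits a logit model with glm(), which estimates the coefficients by maximum likelihood, and checks that the reported AIC equals 2k minus twice the log-likelihood.

```r
# Simulated example: all variable names and coefficients are hypothetical
set.seed(1)
n    <- 200
rh   <- runif(n, 60, 100)          # relative humidity (%)
tmin <- runif(n, 15, 28)           # minimum temperature (deg C)

# The true model is linear on the logit (log-odds) scale, not on the probability scale
eta     <- -20 + 0.2 * rh + 0.1 * tmin
disease <- rbinom(n, size = 1, prob = plogis(eta))
sim     <- data.frame(disease, rh, tmin)

# glm() with family = binomial fits the logit model by maximum likelihood
fit <- glm(disease ~ rh + tmin, data = sim, family = binomial)

coef(fit)                                   # estimates on the log-odds scale
p_hat <- predict(fit, type = "response")    # fitted probabilities

# AIC = 2k - 2 * log-likelihood, where k is the number of estimated coefficients
k          <- length(coef(fit))
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(aic_manual, AIC(fit))
```

The response is never assumed to be normal here; only the log-odds are modelled as a linear function of the predictors.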
OBJECTIVE
To predict the occurrence of disease under different weather scenarios
PERFORMANCE OF LOGISTIC REGRESSION MODEL
• To evaluate the performance of a logistic regression model, we must consider a few metrics:
• AIC (Akaike Information Criterion): the analogue of adjusted R² in logistic
regression is AIC. AIC is a measure of fit that penalizes the model for the number of model
coefficients. Therefore, we prefer the model with the minimum AIC value.
• Confusion Matrix: a tabular representation of actual vs predicted values.
It helps us assess the accuracy of the model and, when computed on test data, detect overfitting.
• The confusion matrix looks like this (rows are actual values, columns are predicted values):
                                PREDICTED
                          1: Positives   0: Negatives
ACTUALS   1: Positives         TP             FN
          0: Negatives         FP             TN
From the confusion matrix, calculate:
Accuracy = (TP+TN) / (TP+FP+FN+TN)
Precision = TP / (TP+FP)
Specificity = TN / (TN + FP)
TPR / Recall / Sensitivity = TP / (TP + FN)
FPR = 1 - Specificity = FP / (TN + FP)
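As a worked check of these formulas, the short sketch below wraps them in a helper function (classification_metrics is a name introduced here for illustration only) and applies it to the cell counts of the rice blast confusion matrix reported in the RESULTS of this report (TP = 26, FN = 3, FP = 1, TN = 66).

```r
# Metrics from the four cells of a 2 x 2 confusion matrix
classification_metrics <- function(TP, FN, FP, TN) {
  c(accuracy    = (TP + TN) / (TP + FP + FN + TN),
    precision   = TP / (TP + FP),
    recall      = TP / (TP + FN),      # TPR / sensitivity
    specificity = TN / (TN + FP),
    fpr         = FP / (TN + FP))      # 1 - specificity
}

# Counts taken from the rice blast confusion matrix reported in this document
m <- classification_metrics(TP = 26, FN = 3, FP = 1, TN = 66)
round(100 * m, 2)
```

The accuracy, recall and precision obtained this way agree with the values printed by the R session in the report (95.83, 89.66 and 96.30 per cent).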
Receiver Operating Characteristic (ROC) Curve
The ROC curve summarizes the model’s performance by evaluating the trade-off between the true
positive rate (sensitivity) and the false positive rate (1 - specificity) across classification thresholds.
The ROC curve of a perfect predictive model has a true positive rate of 1 and a false positive
rate of 0, so it touches the top left corner of the graph.
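How such a curve is built can be sketched in base R by sweeping the classification threshold over the predicted scores. The helper names (roc_points, auc_trapezoid) and the toy scores are assumptions made for illustration; the training itself used the ROCR package.

```r
# Build ROC points by sweeping the classification threshold over the scores
roc_points <- function(score, actual) {
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(score >= t & actual == 1) / sum(actual == 1))
  fpr <- sapply(thresholds, function(t) sum(score >= t & actual == 0) / sum(actual == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Hypothetical scores: the perfect model ranks every positive above every negative
actual  <- c(1, 1, 1, 0, 0, 0)
perfect <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)
noisy   <- c(0.9, 0.4, 0.7, 0.8, 0.2, 0.1)

roc_perfect <- roc_points(perfect, actual)
roc_noisy   <- roc_points(noisy, actual)

auc_trapezoid(roc_perfect)  # this curve passes through (FPR = 0, TPR = 1)
auc_trapezoid(roc_noisy)    # a lower area, because one negative outranks two positives
```

Plotting roc_perfect$tpr against roc_perfect$fpr with plot(..., type = "l") shows the curve reaching the top left corner.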
PROCEDURE
R CODES FOR RUNNING BINOMIAL LOGISTIC REGRESSION (EX: RICE BLAST DISEASE)
# Import the data
setwd("G:\\mango\\epimango")
b= read.csv("blast.csv")
View(b)
names(b)
str(b)
dim(b)
cor(b)
# To apply logistic regression, the dependent variable has to be binary (1, 0)
# Get the mean value of the disease variable (LBD)
mean(b$LBD)
summary(b)
# Categorise as above (1) or below (0) the mean (a threshold of 11 is used here)
b$LBD = ifelse(b$LBD >= 11, 1, 0)
View(b)
# Convert dependent variable into factor
b$LBD= as.factor(b$LBD)
#How many categories in Disease
levels(b$LBD)
#How many 0 and 1 in Disease_Cat
table(b$LBD)
#Any na in data
any(is.na(b))
names(b)
#install.packages("caTools")  # run once if the package is not installed
library(caTools)
#Create training and test samples
set.seed(100)
split = sample.split(b$LBD, SplitRatio = 0.70)
train = subset(b, split==TRUE)
test = subset(b, split==FALSE)
dim(train)
dim(test)
#testing for multicollinearity
library(car) #vif
model_1 = glm(LBD ~ ., data=train, family='binomial')
vif(model_1)
names(test)
prob = predict(model_1, test, type='response')
pred = ifelse(prob >= 0.5, 1,0)
cm = table(test$LBD,pred)
cm
table(test$LBD)
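# Performance metrics
# table(actual, predicted) is stored column-wise, so:
# cm[1] = TN, cm[2] = FN, cm[3] = FP, cm[4] = TP
cm[1]
cm[2]
cm[3]
cm[4]
accuracy = (cm[1]+cm[4])/dim(test)[1]*100
recall = cm[4]/(cm[2]+cm[4])*100 # TPR, sensitivity
precision = cm[4]/(cm[3]+cm[4])*100
accuracy
recall
precision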
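#install.packages("ROCR")  # run once if the package is not installed
library(ROCR)
# Use the predicted probabilities (not the 0/1 classes) so the ROC curve covers all thresholds
ROCRpred <- prediction(prob, test$LBD)
ROCRperf <- performance(ROCRpred, 'tpr', 'fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7))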
RESULTS
> vif(model_1) # testing for multicollinearity
     SMW     Tmax     Tmin      RHI     RHII       RF      SSH
2.272553 3.412431 3.459593 1.225233 1.502263 1.250933 1.875297
> cm
   pred
     0  1
  0 66  1
  1  3 26
> accuracy
[1] 95.83333
> recall
[1] 89.65517
> precision
[1] 96.2963
ROC CURVE
MULTINOMIAL LOGISTIC REGRESSION (EX: RICE BLAST DISEASE)
R CODES FOR RUNNING MULTINOMIAL LOGISTIC REGRESSION
setwd("G:\\mango\\epimango")
df <- read.csv("blast.csv", header = TRUE)
# Load the libraries
library(nnet)    # multinom()
library(caret)   # createDataPartition()
names(df)
# Convert the continuous disease variable into three categories
df$LBD = ifelse(df$LBD <= 10, 'Low', ifelse(df$LBD >= 50, 'High', 'Medium'))
View(df)
df$LBD = as.factor(df$LBD)
# Use 'Low' as the reference category
df$LBD = relevel(df$LBD, ref = "Low")
levels(df$LBD)
table(df$LBD)
# Define an 80%/20% train/test split for the dataset
set.seed(100)
trainIndex <- createDataPartition(df$LBD, p = 0.80, list = FALSE)
data_train <- df[ trainIndex,]
data_test <- df[-trainIndex,]
# Model LBD directly. Keeping a copy of the response (group1) next to LBD, as in the
# output under RESULTS, lets the model reproduce the response from itself.
test.train <- step(multinom(LBD ~ ., data = data_train))
summary(test.train)
# Predicted class probabilities for the test sample
df.predict <- predict(test.train, newdata = data_test, "probs")
View(df.predict)
View(data_test)
RESULTS
Table predicting raw data, where the dependent variable is continuous
Conversion of the continuous variable into a categorical variable as Low, Medium, High
Accuracy
multinom(formula = group1 ~ LBD, data = data_train)
Coefficients:
       (Intercept)    LBDLow LBDMedium
High      13.03020 -28.34113 -11.42825
Medium    -2.64831 -19.83891  19.74321
Std. Errors:
       (Intercept)    LBDLow LBDMedium
High      159.1612  225.1221  723.8411
Medium    619.0964 5791.3033  893.4606
Residual Deviance: 0.0001922378
AIC: 12.00019
Note: the selected formula, group1 ~ LBD, uses the categorised response itself as the predictor,
which explains the near-zero residual deviance and the very large standard errors.
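How a multinomial logit model with "Low" as the reference category turns linear predictors into class probabilities can be sketched as follows. The function name multinom_probs and the two linear predictor values are hypothetical; they are not estimates from the blast data.

```r
# Class probabilities for a multinomial logit with "Low" as the reference category
multinom_probs <- function(eta_medium, eta_high) {
  # The reference category has a linear predictor fixed at 0
  scores <- c(Low = 0, Medium = eta_medium, High = eta_high)
  exp(scores) / sum(exp(scores))
}

# Hypothetical linear predictors for one observation
probs <- multinom_probs(eta_medium = 0.5, eta_high = -1.0)
probs
names(which.max(probs))   # predicted category
```

Each non-reference coefficient is therefore read as a change in the log-odds of that category relative to "Low".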
Prediction models developed during the training programme for downy mildew of grapes
Decision tree:
A decision tree is a decision support tool that uses a tree-like model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility. It is one way to
display an algorithm that only contains conditional control statements.
This algorithm was used during the training programme to develop the classification tree for downy
mildew of grapes with the help of R program.
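To show how a classification tree arrives at a conditional rule, the sketch below searches one numeric predictor for the threshold that minimises the weighted Gini impurity of the two child nodes. This is the CART-style criterion and is used here only as an illustration; ctree() from the party package chooses its splits with permutation-based significance tests instead. The data and the helper names (gini, best_split) are hypothetical.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Best threshold on one numeric predictor: minimise the weighted impurity
# of the two child nodes produced by the rule x > threshold
best_split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between observed values
  impurity <- sapply(candidates, function(t) {
    left  <- y[x <= t]
    right <- y[x >  t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = candidates[which.min(impurity)], impurity = min(impurity))
}

# Hypothetical data: disease appears only at high relative humidity
rh      <- c(60, 65, 70, 75, 80, 85, 90, 95)
disease <- c(0, 0, 0, 0, 0, 1, 1, 1)
split_rh <- best_split(rh, disease)
split_rh
```

A full tree repeats this search over all predictors and then recurses into each child node.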
R codes used for developing a tree
setwd("G:\\RStudio\\New folder")
getwd()
grape = read.csv("kiran.csv")
names(grape)
summary(grape$PDI)
# Code the disease as present (1) when PDI exceeds 13, otherwise absent (0)
grape$disease = ifelse(grape$PDI > 13, 1, 0)
#grape$RF=ifelse(grape$RF <= 0, 0.00001, grape$RF)
names(grape)
# Keep columns 3 to 11 for modelling
grape = grape[, c(3:11)]
names(grape)
grape$disease = as.factor(grape$disease)
class(grape$disease)
levels(grape$disease)
# Splitting the data into train and test datasets
#install.packages("caTools")
library(caTools)
set.seed(100)
split = sample.split(grape$disease, SplitRatio = 0.70)
train = subset(grape, split==TRUE)
test = subset(grape, split==FALSE)
dim(train)
dim(test)
library(party)   # ctree()
# Fit, print and plot the tree
tree.grape = ctree(disease ~ ., data=train)
print(tree.grape)
plot(tree.grape)
The fitted tree indicates that when soil moisture exceeds 42.24 and relative humidity
exceeds 84.17, the predicted chance of downy mildew is 100 per cent.
Application of Neural networks for Disease Prediction
Out of several types of machine learning techniques, a neural network is a series of algorithms that
endeavors to recognize underlying relationships in a set of data through a process that mimics the way
the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or
artificial in nature. Neural networks can adapt to changing input; so the network generates the best
possible result without needing to redesign the output criteria.
For the analysis below, data on patient survival after breast cancer surgery were taken from the
UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival).
The dependent variable for the neural network model was survival, while the independent variables
were age, operation year, and number of positive axillary nodes detected. One hidden layer with one
node was used. Since the dependent variable was categorical, the logistic function was used as the
activation function. The R codes for executing the analysis are given below.
Data import and naming columns:
library(readr)   # read_csv()
library(dplyr)   # %>%, mutate()
url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases//haberman/haberman.data'
Hab_Data <- read_csv(file = url,
                     col_names = c('Age', 'Operation_Year', 'Number_Pos_Nodes', 'Survival')) %>%
  na.omit() %>%
  mutate(Survival = ifelse(Survival == 2, 0, 1),
         Survival = factor(Survival))
Classification of the dependent variable:
Hab_Data <- Hab_Data %>%
  mutate(Survival = as.integer(Survival) - 1,
         Survival = ifelse(Survival == 1, TRUE, FALSE))
Execution of the model:
library(neuralnet)
set.seed(123)
Hab_NN1 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes,
                     data = Hab_Data,
                     linear.output = FALSE,
                     err.fct = 'ce',
                     likelihood = TRUE)
Plotting the neural network:
plot(Hab_NN1, rep = 'best')
Results:
Based on AIC and BIC, the best model for representing and predicting survival was the one with
one hidden layer and one node.
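The architecture described in this section (three inputs, one hidden layer with one node, logistic activation) can be written out as a forward pass in a few lines of base R. The weights, biases and the scaled input record below are hypothetical values chosen for illustration; they are not the weights estimated by neuralnet().

```r
# Logistic (sigmoid) activation function
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass for a network with 3 inputs, 1 hidden node and 1 output
forward <- function(x, w_hidden, b_hidden, w_out, b_out) {
  h <- sigmoid(sum(w_hidden * x) + b_hidden)   # activation of the single hidden node
  sigmoid(w_out * h + b_out)                   # predicted probability of the positive class
}

# Hypothetical weights and a hypothetical (already scaled) input record
x        <- c(age = 0.5, operation_year = 0.3, positive_nodes = 0.1)
w_hidden <- c(0.4, -0.2, -1.5)
p <- forward(x, w_hidden, b_hidden = 0.1, w_out = 2.0, b_out = -0.5)
p
```

Training consists of adjusting these weights and biases so that the predicted probabilities minimise the cross-entropy error (err.fct = 'ce' in the call used in this section).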
Application of Random Forest for Blast Disease Prediction in Paddy
Random forests are based on a simple idea: 'the wisdom of the crowd'. Aggregating the results
of multiple predictors generally gives a better prediction than the best individual predictor. A group
of predictors is called an ensemble. Thus, this technique is called Ensemble Learning.
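The 'wisdom of the crowd' effect can be checked with a small calculation: if each of n independent classifiers is correct with probability p, the accuracy of their majority vote follows from the binomial distribution. The function name majority_accuracy and the value p = 0.6 are assumptions made for this illustration.

```r
# Probability that a majority of n independent voters is correct,
# when each voter is correct with probability p (n is assumed odd)
majority_accuracy <- function(n, p) {
  1 - pbinom(floor(n / 2), size = n, prob = p)
}

p_single <- 0.6   # hypothetical accuracy of one weak predictor
acc <- sapply(c(1, 11, 101, 501), majority_accuracy, p = p_single)
acc               # accuracy grows with the number of voters
```

Trees grown on the same data are correlated, so the gain in practice is smaller than under this independence assumption. A random forest reduces the correlation by growing each tree on a bootstrap sample and trying only a random subset of variables at each split (the mtry argument of randomForest()).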
Codes for Random Forest
#install.packages("randomForest")  # run once if the package is not installed
library(randomForest)
Import the data
setwd("G:\\mango\\epimango")
df = read.csv("blast.csv")
Train the model
# Split into Train and Validation sets
# Training Set: Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(df), 0.7*nrow(df), replace = FALSE)
TrainSet <- df[train,]
ValidSet <- df[-train,]
summary(TrainSet)
summary(ValidSet)
# Create a Random Forest model with default parameters
model1 <- randomForest(LBD ~ ., data = TrainSet, importance = TRUE)
model1
# Fine tuning parameters of Random Forest model
model2 <- randomForest(LBD ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
# Predicting on train set
predTrain <- predict(model1, TrainSet)
# LBD is continuous here (regression forest), so check the fit with RMSE
sqrt(mean((predTrain - TrainSet$LBD)^2))
# Predicting on Validation set
predValid <- predict(model1, ValidSet)
# Checking prediction error and squared correlation on the validation set
sqrt(mean((predValid - ValidSet$LBD)^2))
cor(predValid, ValidSet$LBD)^2
# To check important variables
importance(model1)
varImpPlot(model1)
varImpPlot(model2)
Result:
View(df)  # to view the data
summary(TrainSet)
summary(ValidSet)
# error rate of random forest
rf <- randomForest(LBD ~ ., data = TrainSet)
print(rf)
plot(rf)
#Create a Random Forest model with default parameters
model1 <- randomForest(LBD ~ ., data = TrainSet, importance = TRUE)
model1
Call:
randomForest(formula = LBD ~ ., data = TrainSet, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 70.12236
% Var explained: 73.9
# Fine tuning parameters of Random Forest model
model2 <- randomForest(LBD ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
Call:
randomForest(formula = LBD ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 54.79684
% Var explained: 79.6
# To check important variables
importance(model1)
       %IncMSE IncNodePurity
SMW  10.892757      4827.686
Tmax 28.761607     13338.626
Tmin 36.652896     19544.818
RHI   8.506638      6938.999
RHII 10.454625      5863.694
RF    9.129968      3105.288
SSH   3.785933      3589.118
varImpPlot(model1)
varImpPlot(model2)
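The idea behind a permutation-based importance measure such as %IncMSE can be sketched as follows: shuffle one predictor at a time and record how much the prediction error increases. The sketch uses simulated data and a linear model as a stand-in, so it is not the randomForest computation itself, which works on the out-of-bag samples of each tree and scales the result. All names and coefficients are hypothetical.

```r
set.seed(42)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 3 * x1 + 1 * x2 + rnorm(n)          # x3 carries no signal
sim <- data.frame(y, x1, x2, x3)

train <- sim[1:200, ]
test  <- sim[201:300, ]
fit   <- lm(y ~ ., data = train)           # stand-in model, not a random forest

mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
base_mse <- mse(fit, test)

# Percentage increase in test MSE after shuffling one predictor at a time
perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  100 * (mse(fit, shuffled) - base_mse) / base_mse
})
round(perm_importance, 1)
```

A predictor whose shuffling barely changes the error (x3 here) contributes little to the prediction. In the importance table above, Tmin and Tmax have the largest %IncMSE values and SSH the smallest.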