REPORT ON ADVANCED ANALYTICAL TOOLS FOR PEST AND … · Bangalore “Advanced Analytical Tools for Pest and Disease Prediction Model” ... covering the theoretical part of the subjects

REPORT ON ADVANCED ANALYTICAL TOOLS FOR PEST AND DISEASE PREDICTION MODEL USING R

8 DAYS ONLINE HANDS ON TRAINING PROGRAMME HELD FROM 5TH TO 12TH JUNE 2020

University of Agricultural Sciences, GKVK, Bangalore

Importance of Advanced Analytical Tools:

Advanced analytics is an umbrella term for several sub-fields of analytics that work together to provide predictive capabilities. It uses high-level methods and tools to project future trends, events, and behaviors. Advanced analytics also includes newer technologies such as machine learning and artificial intelligence, semantic analysis, visualization, and neural networks.

The online training programme was one of the first of its kind to use R software for predicting and forecasting the occurrence of pests and diseases, so that crop losses can be reduced and productivity increased. Apart from covering the theoretical part of the subjects, the training also focused on the practical use of this tool. It benefits research at the organizational level as well as research scholars and students, both in analyzing research data and in publishing high-quality research papers.

To inculcate these skills in students working on the relevant subjects, this 8 days training programme titled “Advanced Analytical Tools for Pest and Disease Prediction Model” was held from 5th to 12th June 2020.

Resource persons: Professors and guest faculty well-versed in analytics were invited as guest lecturers for the short course, viz.

Dr. Lalith Achoth, Professor (Retd.), Dairy Science College, Hebbal, Bengaluru,

Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru,

Dr. E. Edwin Raj, SRF, Plant Pathology Division, ICAR-NRC for Banana, Tamil Nadu,

Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions, Pvt. Ltd., Bengaluru,

Dr. Mohammed Ahamed, J., Scientist RRSC-South ISRO, Bengaluru.

Participants: A total of 211 applications were received, of which 20 were selected based on their prior knowledge of R programming and modeling techniques. These 20 participants came from different disciplines, viz., Agronomy, Genetics and Plant Breeding, Plant Pathology, Agricultural Entomology, Nematology, Agrometeorology, and Agricultural Microbiology. As this was an online training programme, it drew participants from different parts of India, such as Uttar Pradesh, Madhya Pradesh, Tamil Nadu, Andhra Pradesh and Karnataka.

Schedule of the training programme:

Date and time | Speaker/trainer | Topic
05.06.2020, 4:00 pm to 6:00 pm | Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru; 8073070288 (M) | Logistic regression for pest and disease prediction in R
06.06.2020, 4:00 pm to 6:00 pm | Dr. K. B. Vedamurthy | Introduction to machine learning techniques: Decision Tree for pest and disease prediction
07.06.2020, 4:00 pm to 6:00 pm | Dr. E. Edwin Raj, Senior Research Fellow, Plant Pathology Division, ICAR-NRC for Banana, Tiruchirappalli-620102, Tamil Nadu; 9486646720 (M) | Weather based rice blast disease prediction using Machine learning approach in R program "caret package"
08.06.2020, 10:30 am to 12:00 pm | Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions, Pvt. Ltd., Bengaluru; 9901399939 (M) | Pest prediction: the importance of sustained generation of ground data
09.06.2020, 3:00 pm to 5:00 pm | Dr. Lalith Achoth, Professor (Retd.), Dairy Science College, Hebbal, Bengaluru; 8088746133 (M) | Machine learning approach: Random Forest technique for pest and disease prediction
10.06.2020, 4:00 pm to 6:00 pm | | Machine learning approach: Neural Network technique for pest and disease prediction
11.06.2020, 11:00 am to 12:00 pm | Dr. Y. B. Srinivasa | YAKSHA mobile application for pest and disease prediction and advisories to farmers
11.06.2020, 2:00 pm to 4:00 pm | Dr. K. B. Vedamurthy | Machine learning approach: Support Vector Machine (SVM) technique for pest and disease prediction
12.06.2020, 09:00 am to 11:00 am | Dr. K. B. Vedamurthy | Visualisation: different types of graphs, heat maps, geo-spatial maps and introduction to text mining
12.06.2020, 11:00 am to 12:00 pm | Dr. E. Edwin Raj | Weather based stem borer pest prediction using Machine learning approach in R program "caret package"
12.06.2020, 3:00 pm to 4:00 pm | Dr. Mohammed Ahamed, J., Scientist, RRSC-South, ISRO, Bengaluru; 9000567378 (M) | Familiarization of satellite data for pest and disease prediction
12.06.2020, 5:00 pm to 6:00 pm | Dr. S. Rajendra Prasad, Vice-Chancellor, Training Director and PI of CAAST-NGT_AA Project | Valedictory / Concluding session


Over the course of the training, Dr. K. B. Vedamurthy, Assistant Professor, Dairy Science College, Hebbal, Bengaluru, guided us through the Logistic Regression model for pest and disease prediction. He also introduced us to the Decision Tree technique using the R software and held an elaborate session on the Support Vector Machine (SVM) technique for pest and disease prediction models. Under his guidance we also learnt different visualization techniques, including the generation of different types of graphs, heat maps and geo-spatial maps, and he introduced us to a few text mining techniques. Paddy leaf blast disease and grape downy mildew disease data were used for learning these advanced analytical skills for forecast model building.

Dr. Lalith Achoth, Professor (Retd.), Dairy Science College, Hebbal, Bengaluru, enlightened us on the machine learning tool called the Random Forest technique. Rice leaf blast disease data was used to explain the model building technique in detail.

Dr. Y. B. Srinivasa, Managing Director, Tene Agril. Solutions, Pvt. Ltd., Bengaluru, explained to us the importance of sustained generation of ground data and also took us through the mobile app called YAKSHA, which can be used for data collection and report generation at field level.

Dr. E. Edwin Raj, Senior Research Fellow, Plant Pathology Division, ICAR-NRC for Banana, Tiruchirappalli, introduced us to a very useful R package called ‘caret’ (Classification and Regression Training), which contains functions to streamline the model training process for complex regression and classification problems. He took up weather based rice blast disease data and stem borer pest incidence data as examples in explaining the package.
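One of the steps that caret streamlines is resampling, for example through train() combined with trainControl(method = "cv"). The sketch below shows, in base R only, what a 5-fold cross-validation of a logistic model involves. The data are simulated and the variable names (tmax, rh, disease) are hypothetical; it is an illustration of the idea, not the code used in the session.

```r
# Hypothetical weather data: maximum temperature and relative humidity
set.seed(42)
n <- 200
dat <- data.frame(tmax = rnorm(n, 30, 3), rh = rnorm(n, 80, 8))
dat$disease <- rbinom(n, 1, plogis(-0.5 * (dat$tmax - 30) + 0.2 * (dat$rh - 80)))

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label for every row

# Fit on k-1 folds, score on the held-out fold, repeat for every fold
acc <- sapply(1:k, function(i) {
  fit  <- glm(disease ~ tmax + rh, data = dat[fold != i, ], family = binomial)
  prob <- predict(fit, dat[fold == i, ], type = "response")
  mean((prob >= 0.5) == dat$disease[fold == i])
})
acc        # accuracy of each fold
mean(acc)  # cross-validated accuracy
```

Averaging over the folds gives a less optimistic estimate of accuracy than scoring the model on the data it was trained on.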

Dr. Mohammed Ahamed, J., Scientist, RRSC-South, ISRO, Bengaluru, introduced us to various open sources such as Karnataka GIS, Giovanni, VEDAS (vedas.sac.gov.in) and Google Earth Engine for collecting the primary data on the weather parameters.

At the valedictory programme held on the last day of the training, Dr. S. Rajendra Prasad, Vice-Chancellor, Training Director and PI of CAAST-NGT_AA Project, addressed the participants and the resource persons and congratulated our PI, Dr. M. K. Prasanna Kumar, on the successful completion of the training programme. He also distributed the certificates to the participants virtually.

Participants’ Feedback

Practical utilization of the Advanced Analytical Tools using R:

Based on their practical exposure to the different advanced analytical tools, the participants were asked to provide feedback on the training programme, describing how they are using these tools in their own research domains. Their descriptions are given below:

Application of Logistic Regression for Disease Prediction

Introduction

Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent a binary/categorical outcome, we use dummy variables.

• Logistic Regression is part of a larger class of algorithms known as Generalized Linear Models (GLM)

• GLM does not assume a linear relationship between the dependent and independent variables. However, in the logit model it assumes a linear relationship between the link function and the independent variables.

• The dependent variable need not be normally distributed

• It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE)

• Errors need to be independent but not normally distributed
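The points above can be illustrated with a minimal sketch in base R. The data are simulated (the humidity variable rh and the coefficients -17 and 0.22 are assumptions made only for this example): glm() estimates the coefficients by maximum likelihood, and the predictions are linear on the link (log-odds) scale but not on the probability scale.

```r
# Simulated data (hypothetical): disease occurrence driven by relative humidity
set.seed(1)
n  <- 500
rh <- runif(n, 50, 100)                 # relative humidity (%)
p  <- plogis(-17 + 0.22 * rh)           # true probability via the logistic function
disease <- rbinom(n, 1, p)              # binary outcome (1 = disease, 0 = no disease)

# Fit by maximum likelihood; family = binomial uses the logit link by default
fit <- glm(disease ~ rh, family = binomial)
coef(fit)

# The link scale is linear in rh, the probability scale is not
new <- data.frame(rh = c(60, 75, 90))
log_odds <- predict(fit, new, type = "link")
prob     <- predict(fit, new, type = "response")
cbind(new, log_odds, prob)
```

Equal steps in rh change the log-odds by equal amounts, while the predicted probability follows an S-shaped curve bounded between 0 and 1.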

OBJECTIVE

To predict the occurrence of disease under different weather scenarios

PERFORMANCE OF LOGISTIC REGRESSION MODEL

• To evaluate the performance of a logistic regression model, we must consider a few metrics

• AIC (Akaike Information Criterion): AIC is the analogue in logistic regression of adjusted R². It is a measure of fit that penalizes the model for the number of model coefficients. Therefore, among candidate models we prefer the one with the minimum AIC value.

• Confusion Matrix: It is a tabular representation of actual vs predicted values. It helps us to find the accuracy of the model and to check for overfitting.

• This is how a confusion matrix looks:

ACTUALS \ PREDICTED | 1: Positives | 0: Negatives
1: Positives | TP | FN
0: Negatives | FP | TN

From the confusion matrix, calculate:

Accuracy = (TP+TN) / (TP+FP+FN+TN)

Precision = TP / (TP+FP)

Specificity = TN / (TN + FP)

TPR / Recall / Sensitivity = TP / (TP + FN)

FPR = 1 - Specificity = FP / (TN + FP)
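As a worked example of these formulas, the sketch below plugs in the counts from the rice blast confusion matrix reported in the RESULTS of this section (66 true negatives, 1 false positive, 3 false negatives, 26 true positives). The variable names are chosen only for this illustration.

```r
# Counts taken from the confusion matrix reported in the RESULTS of this section
TP <- 26; FN <- 3; FP <- 1; TN <- 66

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)      # TPR / sensitivity
specificity <- TN / (TN + FP)
fpr         <- 1 - specificity

round(100 * c(accuracy = accuracy, precision = precision, recall = recall,
              specificity = specificity, fpr = fpr), 2)
```

The accuracy, recall and precision obtained this way agree with the values printed in the RESULTS (95.83, 89.66 and 96.30 per cent after rounding).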

Receiver Operating Characteristic (ROC) Curve

Summarizes the model’s performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity).

The ROC curve of a perfect predictive model passes through the point where the true positive rate equals 1 and the false positive rate equals 0, so the curve touches the top left corner of the graph.
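To make the construction of the curve concrete, the following base R sketch sweeps the classification cut-off over a set of predicted probabilities and records the resulting (FPR, TPR) pairs. The labels and probabilities are simulated for illustration only; the area under the curve (AUC) is obtained with the trapezoidal rule.

```r
# Hypothetical predicted probabilities and true labels (1 = disease)
set.seed(7)
actual <- rbinom(200, 1, 0.4)
prob   <- ifelse(actual == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

# Sweep the cut-off from high to low and record (FPR, TPR) at each step
cutoffs <- sort(unique(c(Inf, prob)), decreasing = TRUE)
tpr <- sapply(cutoffs, function(th) mean(prob[actual == 1] >= th))
fpr <- sapply(cutoffs, function(th) mean(prob[actual == 0] >= th))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Calling plot(fpr, tpr, type = "l") draws the curve. An AUC of 0.5 corresponds to random guessing and an AUC of 1 to the perfect model described above.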

PROCEDURE

R CODES FOR RUNNING BINOMIAL LOGISTIC REGRESSION (EX: RICE BLAST DISEASE)

# Import the data

setwd("G:\\mango\\epimango")

b= read.csv("blast.csv")

View(b)

names(b)

str(b)

dim(b)

cor(b)

#To apply logistic regression dependent variable has to be binomial (1,0)

#get the mean values of PDI

mean(b$LBD )

summary(b)

#Categorise as above (1) or below (0) the mean; 11 is the cut-off used here

b$LBD = ifelse(b$LBD >= 11, 1,0)

View(b)

levels(b$LBD)

# Convert dependent variable into factor

b$LBD= as.factor(b$LBD)

#How many categories in Disease

levels(b$LBD)

#How many 0 and 1 in Disease_Cat

table(b$LBD)

#Any na in data

any(is.na(b))

names(b)

install.packages("caTools")

library(caTools)

#Create training and test samples

set.seed(100)

split = sample.split(b$LBD, SplitRatio = 0.70)

train = subset(b, split==TRUE)

test = subset(b, split==FALSE)

dim(train)

dim(test)

#testing for multicollinearity

library(car) #vif

model_1 = glm(LBD ~ ., data=train, family='binomial')

vif(model_1)

names(test)

prob = predict(model_1, test, type='response')

pred = ifelse(prob >= 0.5, 1,0)

cm = table(test$LBD,pred)

cm

table(test$LBD)

#performance metrics

cm[1]

cm[2]

cm[3]

cm[4]

accuracy = (cm[1]+cm[4])/dim(test)[1]*100

recall = cm[4]/(cm[2]+cm[4])*100 #tpr,sensitivity

precision = cm[4]/(cm[3]+cm[4])*100

accuracy

recall

precision

install.packages("ROCR")

library(ROCR)

# Use the predicted probabilities (not the 0/1 classes) so that the curve covers all cut-offs

ROCRpred <- prediction(prob, test$LBD)

ROCRperf <- performance(ROCRpred, 'tpr','fpr')

plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))

RESULTS

> vif(model_1) #testing for multicollinearity
     SMW     Tmax     Tmin      RHI     RHII       RF      SSH
2.272553 3.412431 3.459593 1.225233 1.502263 1.250933 1.875297
> cm
   pred
     0  1
  0 66  1
  1  3 26
> accuracy
[1] 95.83333
> recall
[1] 89.65517
> precision
[1] 96.2963
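For readers unfamiliar with the variance inflation factor, the sketch below computes the classical VIF by hand: each predictor is regressed on the others and VIF = 1 / (1 - R²). The data are simulated and the variable names are hypothetical. The values that car::vif() returns for a glm may not match this linear version exactly, so this is only an illustration of the idea.

```r
# Hypothetical weather predictors; tmin is deliberately correlated with tmax
set.seed(3)
n <- 150
tmax <- rnorm(n, 32, 3)
tmin <- 0.8 * tmax + rnorm(n, 0, 1.5)
rf   <- rexp(n, 1 / 10)
X <- data.frame(tmax, tmin, rf)

# Classical VIF of predictor j: 1 / (1 - R^2) from regressing it on the others
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others. A common rule of thumb treats values above about 5 to 10 as a sign of problematic multicollinearity, and all the values reported above are below that range.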

ROC CURVE

R CODES FOR RUNNING MULTINOMIAL LOGISTIC REGRESSION (EX: RICE BLAST DISEASE)


df <- read.csv("blast.csv", header=TRUE)

require(foreign)

require(nnet)

require(ggplot2)

require(reshape2)

names(df)

levels(df$LBD)

# splitting data into training and testing sample another way

# load the libraries

library(caret)

library(klaR)

library(MASS)

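#Convert LBD into three classes: Low (<= 10), High (>= 50), otherwise Medium

df$LBD = ifelse(df$LBD <= 10, 'Low',ifelse(df$LBD >=50, 'High','Medium'))

View(df)

df$LBD = as.factor(df$LBD)

table(df$LBD)

df$group1 = relevel(df$LBD, ref = "Low")

#Define a 80%/20% train/test split for the dataset

set.seed(100)

trainIndex <- createDataPartition(df$LBD, p=0.80, list=FALSE)

data_train <- df[ trainIndex,]

data_test <- df[-trainIndex,]

#Note: group1 is a copy of LBD, so group1 ~ . keeps LBD itself as a predictor and the fit is trivially perfect;

#use group1 ~ . - LBD to model the class from the weather variables only

test.train <- step(multinom(group1 ~ ., data = data_train ))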

summary(test.train)

df.predict <- predict(test.train, newdata = data_test, type = "probs")

View(df.predict)

View(data_test)
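The listing above stops at the predicted probabilities. A minimal sketch of how predicted classes and an accuracy figure could be derived from a multinomial model is given below. It uses the built-in iris data as a stand-in for the three-class (Low / Medium / High) problem, so the data and object names are assumptions and not those of the blast dataset.

```r
library(nnet)   # multinom(); ships with standard R installations

# Built-in iris data stands in for the three-class (Low / Medium / High) problem
set.seed(100)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- multinom(Species ~ ., data = train, trace = FALSE)

# Predicted class = category with the highest predicted probability
pred_class <- predict(fit, newdata = test, type = "class")
cm <- table(Actual = test$Species, Predicted = pred_class)
cm
accuracy <- sum(diag(cm)) / sum(cm) * 100
accuracy
```

The diagonal of the table holds the correctly classified cases, so the accuracy is the diagonal sum divided by the total number of test cases.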

RESULTS

Table predicting raw data, where dependent variable is continuous

Conversion of continuous variable into categorical variable as Low, Medium, High

Accuracy

multinom(formula = group1 ~ LBD, data = data_train)

Coefficients:
       (Intercept)    LBDLow LBDMedium
High      13.03020 -28.34113 -11.42825
Medium    -2.64831 -19.83891  19.74321

Std. Errors:
       (Intercept)    LBDLow LBDMedium
High      159.1612  225.1221  723.8411
Medium    619.0964 5791.3033  893.4606

Residual Deviance: 0.0001922378
AIC: 12.00019

Prediction models developed during training programme for downy mildew of grapes

Decision tree:

A decision tree is a decision support tool that uses a tree-like model of decisions and their

possible consequences, including chance event outcomes, resource costs, and utility. It is one way to

display an algorithm that only contains conditional control statements.

This algorithm was used during the training programme to develop the classification tree for downy

mildew of grapes with the help of R program.

R codes used for developing a tree

setwd("G:\\RStudio\\New folder")

getwd()

grape=read.csv("kiran.csv")

names(grape)

summary(grape$PDI)

grape$disease = ifelse(grape$PDI >13,1,0)

#grape$RF=ifelse(grape$RF <= 0, 0.00001, grape$RF)

names(grape)

grape = grape[,c(3:11)]

names(grape)

grape$disease = as.factor(grape$disease)

class(grape$disease)

levels(grape$disease)

#Splitting the data into train and test datasets

#install.packages("caTools")

library(caTools)

set.seed(100)

split = sample.split(grape$disease, SplitRatio = 0.70)

train = subset(grape, split==TRUE)

test = subset(grape, split==FALSE)

dim(train)

dim(test)

library(tree)

library(party)

tree.grape = ctree(disease ~., data=train)

summary(tree.grape)

plot(tree.grape)

This tree predicts that if the soil moisture is > 42.24 and the relative humidity is > 84.17, the chance of downy mildew occurring is 100 per cent.
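A rule of this kind can be reproduced on simulated data. The sketch below uses rpart (a different tree package from the party::ctree used in the training, swapped in here because it ships with R) and data generated so that disease occurs only when both soil moisture and relative humidity are high. The thresholds 42 and 84 and the variable names are assumptions made for the illustration.

```r
library(rpart)   # recursive partitioning; ships with standard R installations

# Simulated data: disease occurs only when both soil moisture and humidity are high
set.seed(1)
n  <- 400
sm <- runif(n, 20, 60)     # soil moisture
rh <- runif(n, 60, 100)    # relative humidity
dat <- data.frame(sm, rh,
                  disease = factor(ifelse(sm > 42 & rh > 84, 1, 0)))

fit <- rpart(disease ~ sm + rh, data = dat, method = "class")
print(fit)   # one line per node: split rule, n, misclassified, predicted class

# Classify new conditions by walking down the tree
new <- data.frame(sm = c(50, 50, 30), rh = c(90, 70, 90))
pred_new <- predict(fit, new, type = "class")
pred_new
```

Each path from the root to a leaf reads as an if-then rule, which is how the statement about soil moisture and relative humidity above is obtained from the plotted tree.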

Application of Neural networks for Disease Prediction

Out of several types of machine learning techniques, a neural network is a series of algorithms that

endeavors to recognize underlying relationships in a set of data through a process that mimics the way

the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or

artificial in nature. Neural networks can adapt to changing input; so the network generates the best

possible result without needing to redesign the output criteria.

For the analysis below, data on survival after breast cancer surgery from the UCI Machine Learning Repository was used (https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival). The dependent variable for the neural network model was survival, while the independent variables were age, operation year and number of positive axillary nodes detected. In the algorithm, one hidden layer with one node was used. Since the dependent variable was categorical, the logistic function was used as the activation function. The R codes for execution of the analysis are given below.

Data import and naming columns:

library(readr)   # read_csv()
library(dplyr)   # %>%, mutate()

url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases//haberman/haberman.data'

Hab_Data <- read_csv(file = url, col_names = c('Age', 'Operation_Year', 'Number_Pos_Nodes','Survival')) %>%
  na.omit() %>%
  mutate(Survival = ifelse(Survival == 2, 0, 1),
         Survival = factor(Survival))

Classification of the dependent variable:

Hab_Data <- Hab_Data %>%
  mutate(Survival = as.integer(Survival) - 1,
         Survival = ifelse(Survival == 1, TRUE, FALSE))

Execution of the model:

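library(neuralnet)   # neuralnet()

set.seed(123)

# hidden = 1 (the neuralnet default) gives one hidden layer with one node
Hab_NN1 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes,
                     data = Hab_Data, hidden = 1, linear.output = FALSE,
                     err.fct = 'ce', likelihood = TRUE)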

Plotting the neural network:

plot(Hab_NN1, rep = 'best')

The best model for representing and predicting survival was selected on the basis of AIC and BIC.

Results:

Based on AIC and BIC, the best model is that with one hidden layer and one node.
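What a network with one hidden layer and one node computes can be written out in a few lines of base R. In the sketch below the weights and the patient record are hypothetical values chosen for illustration; they are not the weights fitted by neuralnet on the Haberman data.

```r
# Logistic (sigmoid) activation
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical weights for a 3-input network with one hidden node
w_hidden <- c(Age = -0.02, Operation_Year = 0.01, Number_Pos_Nodes = -0.15)
b_hidden <- 0.5
w_out    <- 2.0
b_out    <- -0.3

predict_survival <- function(x) {
  h <- sigmoid(sum(w_hidden * x) + b_hidden)   # hidden-node activation
  sigmoid(w_out * h + b_out)                   # output: probability of survival
}

patient <- c(Age = 45, Operation_Year = 62, Number_Pos_Nodes = 3)
predict_survival(patient)
```

Training the network amounts to choosing the weights and biases so that these outputs match the observed survival as closely as possible under the chosen error function.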


Application of Random Forest for Blast Disease Prediction in Paddy

Random forests are based on a simple idea: 'the wisdom of the crowd'. The aggregate of the results of multiple predictors gives a better prediction than the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called ensemble learning.
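The 'wisdom of the crowd' effect can be seen in a small simulation. The sketch below assumes 25 predictors whose errors are independent of each other, which is an idealisation: the trees of a real random forest are correlated, so the gain is smaller in practice. All numbers are simulated.

```r
set.seed(10)
n_obs  <- 1000
n_pred <- 25
truth  <- rnorm(n_obs, mean = 20, sd = 5)          # quantity to be predicted

# Each column is one predictor: the truth plus its own independent error
preds <- truth + matrix(rnorm(n_obs * n_pred, sd = 3), nrow = n_obs)

mse_individual <- colMeans((preds - truth)^2)       # error of each predictor alone
mse_ensemble   <- mean((rowMeans(preds) - truth)^2) # error of the averaged prediction

min(mse_individual)   # best single predictor
mse_ensemble          # ensemble of all predictors
```

Averaging cancels much of the independent error, so the ensemble error is well below that of the best single predictor.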

Codes for Random Forest

install.packages("randomForest")

library(randomForest)

Import the data


df = read.csv("blast.csv")

Train the model

# Split into Train and Validation sets

# Training Set: Validation Set = 70 : 30 (random)

set.seed(100)

train <- sample(nrow(df), 0.7*nrow(df), replace = FALSE)

TrainSet <- df[train,]

ValidSet <- df[-train,]

summary(TrainSet)

summary(ValidSet)

# Create a Random Forest model with default parameters

model1 <- randomForest(LBD ~ ., data = TrainSet, importance = TRUE)

model1

# Fine tuning parameters of Random Forest model

model2 <- randomForest(LBD ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)

model2

# LBD is continuous here, so randomForest fits a regression forest;

# the fit is therefore checked with RMSE rather than classification accuracy

# Predicting on train set

predTrain <- predict(model1, TrainSet)

# RMSE on the train set

sqrt(mean((predTrain - TrainSet$LBD)^2))

# Predicting on Validation set

predValid <- predict(model1, ValidSet)

# RMSE and squared correlation on the Validation set

sqrt(mean((predValid - ValidSet$LBD)^2))

cor(predValid, ValidSet$LBD)^2

# To check important variables

importance(model1)

varImpPlot(model1)

varImpPlot(model2)


Result:

View(df) # to view the data

summary(TrainSet)

summary(ValidSet)

# error rate of random forest

rf <- randomForest(LBD ~ ., data = TrainSet)

print(rf)

plot(rf)

#Create a Random Forest model with default parameters

model1 <- randomForest(LBD ~ ., data = TrainSet, importance = TRUE)

model1

Call:

randomForest(formula = LBD ~ ., data = TrainSet, importance = TRUE)

Type of random forest: regression

Number of trees: 500

No. of variables tried at each split: 2

Mean of squared residuals: 70.12236

% Var explained: 73.9

# Fine tuning parameters of Random Forest model

model2 <- randomForest(LBD ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)

model2

Call:

randomForest(formula = LBD ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)

Type of random forest: regression

Number of trees: 500

No. of variables tried at each split: 6

Mean of squared residuals: 54.79684

% Var explained: 79.6

# To check important variables

importance(model1)

       %IncMSE IncNodePurity
SMW  10.892757      4827.686
Tmax 28.761607     13338.626
Tmin 36.652896     19544.818
RHI   8.506638      6938.999
RHII 10.454625      5863.694
RF    9.129968      3105.288
SSH   3.785933      3589.118

varImpPlot(model1)

varImpPlot(model2)
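The %IncMSE column is a permutation-based measure: it reflects how much the prediction error grows when the values of one variable are shuffled. The base R sketch below illustrates that idea with a linear model on simulated data. It is a simplification, since randomForest works with out-of-bag samples and scales the result, and the variable names and coefficients are hypothetical.

```r
set.seed(5)
n <- 300
dat <- data.frame(tmin = rnorm(n, 20, 3), rh = rnorm(n, 80, 8), ssh = rnorm(n, 6, 2))
dat$lbd <- 2 * dat$tmin + 0.2 * dat$rh + rnorm(n, sd = 2)   # ssh has no effect by construction

fit <- lm(lbd ~ ., data = dat)
mse <- function(d) mean((d$lbd - predict(fit, d))^2)
base_mse <- mse(dat)

# Shuffle one predictor at a time and measure how much the error grows
perm_increase <- sapply(c("tmin", "rh", "ssh"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - base_mse
})
round(perm_increase, 2)
```

A variable whose shuffling barely changes the error contributes little to the predictions. Read the same way, the table above points to Tmin and Tmax as the most influential variables in model1.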
