Aspiring Minds | Automata

Aspiring Minds

www.aspiringminds.com

Grading Programs using Machine Learning

Varun Aggarwal

Presented at KDD, 2014

Programming Assessments: Existing solutions

• Manual evaluation: Can’t scale; not standardized

• Test-case based evaluation:

• High false-positives – hard code, inadvertent errors

• High false-negatives – correct code but not efficient

• Similarity metric between control flow graphs, syntax trees:

• Need to handle multiple correct implementations – theoretically doesn’t fit in

• No mapping of metric to an objective feedback

Automatic grading of programs– Why?

• Widely performed - will help professors and TAs save a lot of time.

• Companies can recruit efficiently

• MOOCs - need automated open response assessments to really make it effective. True scaling of such system currently not achieved.

A model to predict the logical correctness of a program, given the control and data

dependencies it possesses.

Our Approach

Automata – Automatic program evaluation engine

Machine Learning based scoring engine

Evaluation of programming best practices

Asymptotic complexity evaluation

Lint-styled rule-based system to detect programs not following programming best

practices.

Measures the run-time of the code for various input sizes

and empirically derives the complexity.

Why programming modules give a better test-shortlist rate ?

• Programming has more predictive power in identifying good performers than Logical ability.

• Due to lower predictive power of Logical, a higher cut-off has to be applied to it as compared to Programming to get the same organizational efficiency.

• Higher the Programming capability of the person, requirement on Logical score is lesser.

• Given the person is lower than a given score on Programming, even having a higher logical ability does not help.

Evaluation Rubric

ML based scoring

Understanding the human process

Problem and Language independent

Features

Machine learning model

Ungraded programs

Graded programs

Predicted grades

5

Evaluation Rubric

Our Approach



Features


Ungraded programs

Graded programs

Predicted grades

1

Evaluation Rubric

Our Approach



Features


Ungraded programs

Graded programs

Predicted grades

2

Evaluation Rubric

Our Approach



Features


Ungraded programs

Graded programs

Predicted grades

3

void print_1(int N){ for(i =1 ; i<=N; i++){ print newline;

count = i; for(j=0; j<i; j++) print count;

count++; }}

12 33 4 54 5 6 7

OBJECTIVE To print the pattern of integers

An implementation

1. Are there loops? Are there print statements?

3. Is the conditional in the inner loop dependent on - a variable modified in the outer loop? - a variable used in the conditional of the outer

loop?

What does a grader look for?

2. Is there a nested-loop structure?

Grammar for expressing features• Simple features

• Keywords and Tokens (Counts):

• Tokens like for, if, return, break; function calls like printf, strrev, strcat; declarations like int, char• Operators like various arithmetic, logical, relational operators used• Character constants like ‘\0’, ‘ ’, ‘65’, ‘96’

• Capturing logical constructs (Interactions)

• Control flow structure

• Data-dependencies

• Data-dependencies in context of control-flow

CONTROL FEATURES – COUNTS

Counts of control-related keywords/tokens Ex. count(for) = 2

count(for-in-for) = 1 count(while) = 0

Control-context of these keywords- The Print command as loop(loop(print)))

for(i =1 ; i<=N; i++){

print newline;

count = i;

CONTROL FLOW GRAPH

i = 1

i <= N

i++

j < i

count = ij = 0

print(count)count++

j++

END

Loop 1

Loop 2

Parent scope

for(j=0; j<i; j++)

print count; count++;

void print(int N){

}

}

TARGET PROGRAM

DATA OPERATION FEATURES IN CONTROL-CONTEXT

Counts of data-related tokens in context of the control structureEx. count(block1 :loop(loop(++))) = 2

count(block1 :loop(loop_cond(<))) = 1

Capture control-context of data-dependencies in groups of expressions

i++ j < i : var (i) related to var (j) : appearing in a loop(loop_cond) previously incremented : appearing in a loop The relation and the increment happen in the same block

Loop 1

Loop 2

Loop 1

Loop 1

Loop 2

Loop 1

Loop 2

Loop 1

i = 1

i <= N

print(count)

count = ii++

j < i

count = 0

count++

j= 0

j++

Parent scope Parent scope

Loop 1

Loop 1

Loop 2

CONTROL FLOW INFORMATION ANNOTATED IN A-D-D GRAPH

Loop 1

• Deployed Automata in a major product-based company’s recruitment

• Analyzed the performance improvement in using Automata over test-case pass based selection criterion

• 22.6% candidates who were not being shortlisted through test-case pass were now shortlisted using Automata.

Case study

Experimental Results

Sort Problem

Doing it the one-class way!

PROBLEM All features Basic featuresMean Min25 Mean Min25

1 0.57 0.61 0.52 0.562 0.80 0.83 0.72 0.75

3 0.75 0.81 0.59 0.734 0.81 0.81 0.75 0.755 0.68 0.69 0.55 0.61

Betters test-case in all, but one

How good is the final ML-based score?

Validation Correlation >= 0.79Matches Inter-rater Correlation between two human raters

PROBLEM # of features Cross-val correl Train correl Validation correl Test Case Score

1 80 0.61 0.85 0.79 0.54

2 68 0.77 0.93 0.91 0.80

3 193 0.91 0.98 0.90 0.64

4 66 0.90 0.94 0.90 0.80

5 87 0.81 0.92 0.84 0.84

Can we get insight? • The most contributing feature for Find Digit problem -

int findDigit(int N, int digit){

…

…

LOOP (N != <constant value>){

…

N = N / <constant value>

…

}

}

Features for FindDigit problem analyzed. Given a multi-digit number and a digit, one has to find the number of times the digit appears in the number

Yes, we can! • The most contributing feature for Find Digit problem -


…

LOOP (N != <constant value>){

…

N = N / <constant value>

…

}

…

}


...

while(N != 0){

d = N%10;

if(d == digit)

...

N = N / 10;

}

}

Evaluation Rubric

Score Interpretation5 Completely correct and efficient

An efficient implementation of the problem using right control structures and data-dependencies.

4 Correct with some silly errorsCorrect control structures and closely matching data-dependencies. Some silly mistakes fail the code to pass test-cases.

3 Inconsistent logical structuresRight control structures start exist with few correct data dependencies

2 Emerging basic structuresAppropriate keywords and tokens present, showing some understanding of the problem

1 Gibberish codeSeemingly unrelated to problem at hand.

Automata – Sample report

Candidate’s source codeFeedback on programming best practices

Asymptotic complexity of the candidate’s solution

Test case pass/fail information

Problem summary

Do our fancy features help?

Control and Data dependency features add around 0.15 correlation points above token information.

PROBLEM Type of feature # of features Cross-val correl Train correl Validation correl

1All, w/o test case 35 0.57 0.72 0.56

Basic 60 0.62 0.87 0.41

2All, w/o test case 80 0.81 0.99 0.80

Basic 26 0.59 0.72 0.67

3All, w/o test case 190 0.87 0.97 0.90

Basic 26 0.74 0.89 0.74

4All, w/o test case 134 0.85 0.91 0.82

Basic 35 0.83 0.88 0.69

5All, w/o test case 166 0.66 0.81 0.64

Basic 40 0.61 0.78 0.61

Conclusion

• We propose the first machine learning based approach to automatically grade programs

• An innovative feature grammar is proposed which matches human intuition of grading programs.

• Models built for sample problems show promising results.

• We propose and demonstrate machine learning techniques to lower the need of human-graded data to build models.

Technology

Aspiring Minds | Automata