Towards the New Czech Grammar-checker
RASLAN 2018

Vojtěch Mrkývka
mrkyvka@phil.muni.cz

December 7, 2018

Introduction Goal

Goal

A new grammar-checker for Czech
Web-based application
Using new and existing tools developed at MU

V.Mrkývka · Towards the New Czech Grammar-checker · December 7, 2018 2 / 13

Introduction Motivation

Motivation

Tools already exist or are in development at MU
The current best Czech grammar-checker is part of a proprietary system
Create an alternative to applications like Grammarly, but for Czech

Current version Interface

The current interface

Based on the on-line text processor tinyMCE
Mostly in JavaScript, as tinyMCE modules
Asynchronous process
Communication with the backend via AJAX

Current version Processing diagram

Processing diagram

[Diagram with the boxes: tokenization; lemmatization & tagging; four boxes labelled "some module"; correction displaying]
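
The exact wiring of the diagram is not recoverable from the slide text. As a minimal sketch, assume every correction module receives the token list and returns corrections in the index/original/replacement form shown on the "Correction displaying" slide. Both modules and the run_modules helper are hypothetical names used only for illustration, and the lemmatization & tagging step is left out.

```python
from typing import Callable, List

# A correction is a string "index/original/replacement" (format taken from the slides).
Correction = str
Module = Callable[[List[str]], List[Correction]]


def multiple_whitespace_module(tokens: List[str]) -> List[Correction]:
    """Toy module: flag whitespace tokens longer than one character."""
    return [
        f"{i}/{tok}/ "
        for i, tok in enumerate(tokens)
        if tok.isspace() and len(tok) > 1
    ]


def toy_spelling_module(tokens: List[str]) -> List[Correction]:
    """Toy module: look tokens up in a tiny, hypothetical table of fixes."""
    known_fixes = {"runing": "running"}
    return [
        f"{i}/{tok}/{known_fixes[tok]}"
        for i, tok in enumerate(tokens)
        if tok in known_fixes
    ]


def run_modules(tokens: List[str], modules: List[Module]) -> List[Correction]:
    """Run every module on the same token list and collect all corrections."""
    corrections: List[Correction] = []
    for module in modules:
        corrections.extend(module(tokens))
    return corrections


# Tokens as a tokenizer might produce them (whitespace kept as tokens).
tokens = ["The", "  ", "dog", " ", "is", " ", "runing", "."]
corrections = run_modules(tokens, [multiple_whitespace_module, toy_spelling_module])
print(corrections)  # ['1/  / ', '6/runing/running']
```

Keeping each module behind the same small interface means a module can be added or removed without touching the others, which fits the "some module" boxes in the diagram.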

Current version Correction displaying

Correction displaying

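Example sentence: The dog is runing.
Token indices: 0 1 2 3 4 5 6 7 (whitespace counts as a token, so "runing" is token 6 and the full stop is token 7)

Tokens to display the mistake at: 6
Correction: 6/runing/running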
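
The indexing above can be reproduced with a short sketch. The regular expression and the apply_correction helper are assumptions made for illustration; only the index/original/replacement format comes from the slide.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into words, whitespace runs, and single punctuation marks."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def apply_correction(tokens: List[str], correction: str) -> List[str]:
    """Apply one correction given as "index/original/replacement"."""
    index_text, original, replacement = correction.split("/", 2)
    index = int(index_text)
    if tokens[index] != original:
        raise ValueError(f"token {index} is {tokens[index]!r}, not {original!r}")
    fixed = list(tokens)  # leave the input list untouched
    fixed[index] = replacement
    return fixed


tokens = tokenize("The dog is runing.")
corrected = apply_correction(tokens, "6/runing/running")
print(tokens[6])           # runing
print("".join(corrected))  # The dog is running.
```

Because whitespace is kept as tokens, joining the corrected tokens restores the original spacing exactly.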

Current version Implemented modules

Implemented modules

Correction                           TP  FP   TN  FN  Precision  Recall
Misspellings (excl. proper nouns)    24   0  487  16      1.000   0.600
Misspellings (incl. proper nouns)     7  17  497   6      0.292   0.538
Vocalisation of prepositions          4   0    8   0      1.000   1.000
Multiple whitespaces                  4   0  515   0      1.000   1.000
Whitespace near punctuation           7   0  119   0      1.000   1.000
Conditionals                          2   0    1   0      1.000   1.000
Commas in a sentence                  3   0    0   4      1.000   0.429
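
The precision and recall columns follow from the counts by the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch that recomputes three rows of the table (the function names are illustrative):

```python
from typing import Optional


def precision(tp: int, fp: int) -> Optional[float]:
    """Share of reported corrections that were real mistakes."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp: int, fn: int) -> Optional[float]:
    """Share of real mistakes that were reported."""
    return tp / (tp + fn) if (tp + fn) else None


# (TP, FP, FN) counts copied from the table.
rows = {
    "Misspellings (excl. proper nouns)": (24, 0, 16),
    "Misspellings (incl. proper nouns)": (7, 17, 6),
    "Commas in a sentence": (3, 0, 4),
}

results = {
    name: (round(precision(tp, fp), 3), round(recall(tp, fn), 3))
    for name, (tp, fp, fn) in rows.items()
}
for name, (pre, rec) in results.items():
    print(f"{name}: precision={pre:.3f} recall={rec:.3f}")
```

TN does not enter either measure, which is why the large TN counts do not soften the low precision of the spell-checker when proper nouns are included.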

Proximate issues Genuine testing

A problem with testing

The test data set was too small
The mistakes were artificial
⇒ Need for an API and a collection of correctly annotated genuine texts

Proximate issues Spell-checking

A problem with spell-checking

Precision is low
The dictionary is not updated often
⇒ A method for adding new words, using a different lexicon, ...
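
One way to picture "a method for adding new words" is a lexicon with two layers: a fixed base dictionary and a set of words added later. This is only a sketch under that assumption; the class, its methods, and the sample words are hypothetical and do not describe the actual spell-checking module.

```python
from typing import Iterable, Set


class SpellChecker:
    """Toy spell-checker: a fixed base lexicon plus words added afterwards."""

    def __init__(self, base_lexicon: Iterable[str]) -> None:
        self.base_lexicon: Set[str] = {w.lower() for w in base_lexicon}
        self.added_words: Set[str] = set()

    def add_word(self, word: str) -> None:
        """Register a word missing from the base lexicon (e.g. a proper noun)."""
        self.added_words.add(word.lower())

    def is_known(self, word: str) -> bool:
        """A word is accepted if either layer contains it (case-insensitive)."""
        lowered = word.lower()
        return lowered in self.base_lexicon or lowered in self.added_words


checker = SpellChecker(["pes", "běží"])
before = checker.is_known("Brno")  # not in the base lexicon: would be flagged
checker.add_word("Brno")
after = checker.is_known("Brno")   # accepted once it has been added
print(before, after)  # False True
```

Each word added this way removes one source of false positives, which is what drags precision down in the row that includes proper nouns.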

Proximate issues Error reporting

A problem with error reporting

Allow users to flag miscorrections
How can a flagged miscorrection be kept from being displayed again?
⇒ Probably module-dependent
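
Since the answer is probably module-dependent, one simple possibility is to remember flagged miscorrections per module and filter them out before display. The sketch below assumes a flag is identified by the module name plus the original and replacement text; the slides do not specify this, and all names are hypothetical.

```python
from typing import List, Set, Tuple


class MiscorrectionFilter:
    """Remember flagged miscorrections and hide them on later runs."""

    def __init__(self) -> None:
        # Each entry is (module name, original text, suggested replacement).
        self._flagged: Set[Tuple[str, str, str]] = set()

    def flag(self, module: str, original: str, replacement: str) -> None:
        """Record that a user rejected this suggestion from this module."""
        self._flagged.add((module, original, replacement))

    def visible(self, module: str, corrections: List[str]) -> List[str]:
        """Keep only corrections ("index/original/replacement") not flagged before."""
        kept = []
        for correction in corrections:
            _index, original, replacement = correction.split("/", 2)
            if (module, original, replacement) not in self._flagged:
                kept.append(correction)
        return kept


flt = MiscorrectionFilter()
flt.flag("spelling", "Brno", "Brna")
shown = flt.visible("spelling", ["2/Brno/Brna", "6/runing/running"])
print(shown)  # ['6/runing/running']
```

Keying on the text rather than the token index lets a flag survive edits elsewhere in the document. A context-sensitive module, such as the comma checker, would likely need a richer key.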

Thank you for your attention!

This work was supported by the specific research project Čeština v jednotě synchronie a diachronie (Czech language in unity of synchrony and diachrony; project no. MUNI/A/0862/2017).