
Exploring Natural Language Processing in Ruby

Kevin Dias
Tokyo Rubyist Meetup - April 9th, 2015

Let's explore the world of natural language processing in Ruby (Rubyで自然言語処理の世界を探求してみよう)

Developer at

Twitter: @diasks2
GitHub: diasks2

Pragmatic Segmenter

Chat Correct

Word Count Analyzer

? ? ?

Pragmatic Segmenter

A rule-based sentence boundary detection gem that works out-of-the-box across many languages.

What is segmentation?

Segmentation is the process of splitting a text into segments or sentences. In other words, deciding where sentences begin and end.

Pragmatic Segmenter

text = "Hello Tokyo Rubyists. Let's try segmentation."

segment #1: Hello Tokyo Rubyists.
segment #2: Let's try segmentation.

Why care about segmentation?

Pragmatic Segmenter

Sentence segmentation is the foundation of many common NLP tasks:

• Translation
• Machine translation
• Bitext alignment
• Summarization
• Part-of-speech tagging
• Grammar parsing

Errors in segmentation compound into errors in these other NLP tasks

Why reinvent the wheel?

Pragmatic Segmenter

• Most segmentation libraries are built to support only English (or English plus a few other languages)

• Current solutions do not handle ill-formatted content well

• Some libraries perform really well when trained on data from a specific language and a specific domain, but what happens when your data could come from any language and/or domain?

Sentence segmentation methods

Pragmatic Segmenter

• Machine learning
• Rule-based
• Tokenize-first, group-later (e.g. Stanford CoreNLP)

How can we achieve the following in Ruby¹?

string = "Hello world. Let's try segmentation."

Desired output: ["Hello world.", "Let's try segmentation."]

Pragmatic Segmenter

¹ Using the core or standard library (no gems)

Time to check your solutions

Pragmatic Segmenter

Some potential answers

• string.scan(/[^\.]+[\.]/).map(&:strip)
• string.scan(/(?<=\s|\A)[^\.]+[\.]/)
• string.split(/(?<=\.)\s*/)
• string.split(/(?<=\.)/).map(&:strip)
• string.split('.').map { |segment| segment.strip.insert(-1, '.') }
• … your answer
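On the simple two-sentence string, these candidates do produce the desired output. A quick sanity check of three of them (using straight quotes in place of the slide's typographic quotes):

```ruby
string = "Hello world. Let's try segmentation."
expected = ["Hello world.", "Let's try segmentation."]

# Three of the candidate one-liners, checked against the desired output:
p string.scan(/[^\.]+[\.]/).map(&:strip) == expected                # => true
p string.split(/(?<=\.)\s*/) == expected                            # => true
p string.split('.').map { |s| s.strip.insert(-1, '.') } == expected # => true
```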

Pragmatic Segmenter

Let’s change the original string

string = "Hello from Mt. Fuji. Let's try segmentation."

Desired output: ["Hello from Mt. Fuji.", "Let's try segmentation."]

Pragmatic Segmenter

Uh oh…

string = "Hello from Mt. Fuji. Let's try segmentation."

string.scan(/[^\.]+[\.]/).map(&:strip)

=> ["Hello from Mt.", "Fuji.", "Let's try segmentation."]

Pragmatic Segmenter

Let's brainstorm other edge cases that will make our first solution fail

• abbreviations
• …
• …
• …
• …
• …
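One common rule-based fix for the abbreviation case is to protect known abbreviations before splitting and restore them afterwards. A toy sketch (the abbreviation list and placeholder scheme here are illustrative, not Pragmatic Segmenter's actual rules):

```ruby
# Illustrative abbreviation list; a real one has hundreds of entries per language.
ABBREVIATIONS = ["Mt.", "Mr.", "Dr."]

def naive_segment(text)
  guarded = text.dup
  # Replace each abbreviation with a placeholder so its period won't split.
  ABBREVIATIONS.each_with_index do |abbr, i|
    guarded = guarded.gsub(abbr, "<abbr#{i}>")
  end
  # Split on periods followed by whitespace, then restore the abbreviations.
  guarded.split(/(?<=\.)\s+/).map do |segment|
    ABBREVIATIONS.each_with_index do |abbr, i|
      segment = segment.gsub("<abbr#{i}>", abbr)
    end
    segment
  end
end

p naive_segment("Hello from Mt. Fuji. Let's try segmentation.")
# => ["Hello from Mt. Fuji.", "Let's try segmentation."]
```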

Pragmatic Segmenter

Golden Rules

Pragmatic Segmenter

Currently 52 English Golden Rules covering edge cases such as:

• abbreviations
• abbreviations at the end of a sentence
• numbers
• parentheticals
• email addresses
• web addresses
• quotations
• lists
• geo coordinates
• ellipses

Rubyists like to keep it DRY

Pragmatic Segmenter

Most researchers use either the WSJ corpus or the Brown corpus from the Penn Treebank to test their segmentation algorithms. There are limits to using these corpora:

1. The corpora may be too expensive for some people ($1,700)
2. The majority of the sentences in the corpora end with a regular word followed by a period, thus testing the same thing over and over again

In the Brown Corpus, 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer in abbreviations, and only 83% of its sentences end with a regular word followed by a period.

Source: Andrei Mikheev, "Periods, Capitalized Words, etc."

A comparison of segmentation libraries

Pragmatic Segmenter

Name                 Language  License    Golden Rule Score  Golden Rule Score   Speed†
                                          (English)          (Other Languages)

Pragmatic Segmenter  Ruby      MIT        98.08%             100.00%             3.84 s
TactfulTokenizer     Ruby      GNU GPLv3  65.38%             48.57%              46.32 s
Open NLP             Java      APLv2      59.62%             45.71%              1.27 s
Stanford CoreNLP     Java      GNU GPLv3  59.62%             31.43%              0.92 s
Splitta              Python    APLv2      55.77%             37.14%              N/A
Punkt                Python    APLv2      46.15%             48.57%              1.79 s
SRX English          Ruby      GNU GPLv3  30.77%             28.57%              6.19 s
Scapel               Ruby      GNU GPLv3  28.85%             20.00%              0.13 s

† The performance test takes the 50 English Golden Rules combined into one string and runs it 100 times through each library. The number is an average of 10 runs.

The Holy Grail

Pragmatic Segmenter

A.M./P.M. as both a non-sentence boundary and a sentence boundary

At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.

Golden Rule #18

All tested segmentation libraries failed this spec.

Expected output:
["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
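A Golden Rule like this can be checked mechanically: run a candidate segmenter on the input and compare against the expected segments. The naive regex from earlier fails it, since it splits on every period:

```ruby
text = "At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. " \
       "Mr. Smith then went to the store."
expected = [
  "At 5 a.m. Mr. Smith went to the bank.",
  "He left the bank at 6 P.M.",
  "Mr. Smith then went to the store."
]

naive = text.scan(/[^\.]+[\.]/).map(&:strip)
p naive == expected  # => false (the naive approach breaks at "a.m." and "Mr.")
```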

Chat Correct

A Ruby gem that shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence.

The problem

Chat Correct

I was giving a weekly Skype English lesson, and the student was focusing on writing practice for the TOEFL test.

I would correct the student's sentences, but it would often seem as if he was missing some of my corrections, even if I read them with a

LOT OF STRESS!!

The idea

Chat Correct

A color-coded way to PoInT OuT a student's mistake(s)

The solution

Chat Correct
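A minimal word-level diff sketch of the core idea (a hypothetical helper, not Chat Correct's actual API; the gem's real output also classifies each error by type):

```ruby
# Hypothetical word-level diff: which words were removed and which were
# added when the incorrect sentence was corrected.
def word_diff(incorrect, correct)
  wrong = incorrect.split
  right = correct.split
  { deletions: wrong - right, insertions: right - wrong }
end

diff = word_diff("He go to school yesterday.", "He went to school yesterday.")
puts "removed:  #{diff[:deletions].join(' ')}"   # removed:  go
puts "inserted: #{diff[:insertions].join(' ')}"  # inserted: went
```

A real diff would also track word order and alignment; this set-difference version only shows the changed words themselves.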

Word Count Analyzer

Analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used.

The problem

Word Count Analyzer

• Translation is typically billed on a per-word basis

• Different tools often report different word counts

I wanted to understand what was causing these differences in word count

Word count gray areas

Word Count Analyzer

Common word count gray areas include:

• Ellipses
• Hyperlinks
• Contractions
• Hyphenated words
• Dates
• Numbers
• Numbered lists
• XML and HTML tags
• Forward slashes and backslashes
• Punctuation
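A small illustration of why tools disagree: the same string counted under three different rules for hyphens and forward slashes (illustrative only; Word Count Analyzer lets you configure how each gray area is handled):

```ruby
text = "state-of-the-art results as of 2015/04/09"

hyphens_joined = text.split(/\s+/).size                      # hyphenated word counts as 1 => 5
hyphens_split  = text.gsub("-", " ").split(/\s+/).size       # hyphenated word counts as 4 => 8
slashes_too    = text.gsub(%r{[-/]}, " ").split(/\s+/).size  # the date also splits in 3  => 10

p [hyphens_joined, hyphens_split, slashes_too]  # => [5, 8, 10]
```

Three defensible rules, three different invoices for the same text.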

Visualize the gray areas

Word Count Analyzer

? ? ?

A bitext alignment (aka parallel text alignment) tool with a focus on high accuracy.

What's it used for?

• Translation memory
• Machine translation

? ? ?

Bitext alignment

Current commercial state-of-the-art:

• Gale-Church sentence-length information plus a dictionary if available (e.g. hunalign)

? ? ?

Areas for improvement

? ? ?

• Early misalignment compounds into errors throughout

• Accuracy may suffer for non-Roman languages unless the algorithm is properly tuned

• Does not handle cross alignments or uneven alignments

A method for higher accuracy

• Machine translate A - B and B - A
• Relative sentence length
• Order or position in the document
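Two of the cheaper cues can be sketched as scoring functions (hypothetical helpers; the machine-translation cue is omitted here, since it would compare MT output of each source sentence against each target candidate):

```ruby
# Length similarity: 1.0 when the two sentences are the same length,
# approaching 0.0 as their lengths diverge.
def length_ratio_score(src, tgt)
  a, b = src.length.to_f, tgt.length.to_f
  [a, b].min / [a, b].max
end

# Position similarity: 1.0 when both sentences sit at the same relative
# position in their respective documents.
def position_score(i, j, src_total, tgt_total)
  1.0 - (i.to_f / src_total - j.to_f / tgt_total).abs
end

src = ["Hello.", "Segmentation is fun."]
tgt = ["Bonjour.", "La segmentation est amusante."]

# Combined score for pairing the second source sentence with the second target:
score = length_ratio_score(src[1], tgt[1]) * position_score(1, 1, src.size, tgt.size)
puts score.round(2)  # => 0.69
```

An aligner would compute such scores for every candidate pair and pick the highest-scoring assignment, which is what allows crossing and uneven alignments.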

? ? ?

[Alignment matrix: source segments 0-5 versus target segments 0-5, with X marks indicating aligned pairs, including a crossing alignment]

The trade-offs

Pros:
• better accuracy
• can handle crossing alignments
• can handle uneven segment matches (1 to 2, 2 to 1, 1 to 3, 3 to 1, 2 to 3, and 3 to 2)

? ? ?

Cons:
• slower
• potential data privacy issues (depending on the method used to obtain machine translations)

Small framework for thinking about new problems

Step 1
Use your ignorance as a weapon to think about a problem from first principles (you aren't yet weighed down with any bias).

Step 2
Do your research.

Step 3
Diff your conceptual framework and your research. Look at where it diverges and try to understand why.

Has tech changed/advanced? Were you missing something?

Ruby NLP Resources

https://github.com/diasks2/ruby-nlp