Is that a duplicate quora question

Is That A Duplicate Quora Question?

Abhishek Thakur@abhi1thakur

About Me
➢ I’m a data scientist
➢ I like:
○ scikit-learn
○ keras
○ xgboost
○ python
➢ I don’t like:
○ errrR
○ excel

I like big data and I cannot lie

The Problem
➢ ~13 million questions (as of March, 2017)
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
➢ First public data release: 24th January, 2017

Duplicate Questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…

➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?

➢ Why did Trump win the Presidency?
➢ How did Donald Trump win the 2016 Presidential Election?

Non-Duplicate Questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view?"swift or grand i10".My first priority is safety?

➢ Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
➢ What mistakes are made when depicting hacking in "Mr. Robot" compared to real-life cybersecurity breaches or just a regular use of technologies?

➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?

The Data
➢ 400,000+ pairs of questions
➢ Initially data was very skewed
➢ Negative samples from related questions
➢ Not real distribution on Quora’s website
➢ Noise exists (as usual)

https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

The Data
➢ 255045 negative samples (non-duplicates)
➢ 149306 positive samples (duplicates)
➢ ~37% positive samples (149306 of 404351 pairs)

The Data
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623

➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169

Basic Feature Engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces (the fs-1 code counts distinct non-space characters)
➢ Character length of question2 without spaces (the fs-1 code counts distinct non-space characters)
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2

Basic Feature Engineering

➢ Basic feature set: fs-1

# `data` is a pandas DataFrame with question1 and question2 columns

# Character length of each question and the difference between the two
data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
data['diff_len'] = data.len_q1 - data.len_q2

# Number of distinct non-space characters in each question
data['len_char_q1'] = data.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
data['len_char_q2'] = data.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))

# Number of whitespace-separated words in each question
data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))
data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))

# Number of distinct lowercased words shared by both questions
data['common_words'] = data.apply(
    lambda x: len(set(str(x['question1']).lower().split())
                  .intersection(set(str(x['question2']).lower().split()))),
    axis=1)

Fuzzy Features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into an exact match of the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
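The count of primitive operations described above is the Levenshtein (edit) distance. A minimal pure-Python sketch of the standard dynamic-programming computation follows; the function name `levenshtein` is illustrative and is not part of fuzzywuzzy.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of one string rather than with the product of both lengths.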

Fuzzy Features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
➢ etc.

https://github.com/seatgeek/fuzzywuzzy

Fuzzy Features

➢ Fuzzy feature set: fs-2

from fuzzywuzzy import fuzz

# Each feature is one fuzzywuzzy similarity score computed on the question pair
data['fuzz_qratio'] = data.apply(lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_sort_ratio'] = data.apply(
    lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)

TF-IDF
➢ TF(t) = Number of times a term t appears in a document / Total number of terms in the document
➢ IDF(t) = log(Total number of documents / Number of documents with term t in it)
➢ TF-IDF(t) = TF(t) * IDF(t)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=3, max_features=None,
                        strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                        ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
                        stop_words='english')
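As a worked example of the TF, IDF and TF-IDF definitions on this slide, here is a small pure-Python sketch. It follows the plain formulas as written (natural log, no smoothing); scikit-learn's `TfidfVectorizer` with `smooth_idf` and `sublinear_tf` uses adjusted variants, so its numbers will differ. The helper name `tf_idf` and the two toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using the plain TF * IDF formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["why did trump win", "how did trump win the election"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", w)
```

Terms that occur in every document ("did", "trump", "win") get IDF = log(1) = 0, so only the words that distinguish the two questions keep a non-zero weight.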

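SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components

from sklearn import decomposition

# Fit on the training TF-IDF matrix, then project train and test into 120 dimensions
svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)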

A Combination of TF-IDF & SVD➢ TF-IDF features: fs3-1

A Combination of TF-IDF & SVD➢ TF-IDF features: fs3-2

A Combination of TF-IDF & SVD➢ TF-IDF + SVD features: fs3-3



Word2Vec Features
➢ Multi-dimensional vector for all the words in any dictionary
➢ Always great insights
➢ Very popular in natural language processing tasks
➢ Google news vectors 300d

Word2Vec Features
➢ Representing words
➢ Representing sentences

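import numpy as np
from nltk import word_tokenize

# `model` is the loaded word2vec model and `stop_words` a collection of stop words
def sent2vec(s):
    # Lowercase and tokenize, then drop stop words and non-alphabetic tokens
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    # Keep only words the word2vec model has a vector for
    M = np.array([model[w] for w in words if w in model])
    # Sum the word vectors and scale the result to unit length
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())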

W2V Features: Cosine Distance

W2V Features: Manhattan Distance
➢ Also known as cityblock distance

W2V Features: Canberra Distance

W2V Features: Minkowski Distance

W2V Features: Braycurtis Distance

W2V Features: WMD

Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.

W2V Features: Skew
➢ Skew = 0 for normal distribution
➢ Skew > 0: more weight in right tail

W2V Features: Kurtosis
➢ 4th central moment over the square of variance
➢ Types:
○ Pearson
○ Fisher: subtract 3.0 from result such that result is 0 for normal distribution
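The skew and kurtosis definitions can be written out directly from central moments. The sketch below is pure Python and uses the biased (population) moments; the helper names are illustrative, and in practice `scipy.stats.skew` and `scipy.stats.kurtosis` would be applied to each sentence vector.

```python
def central_moment(values, k):
    """k-th central moment: mean of (v - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skew(values):
    """3rd central moment over the variance raised to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values, fisher=True):
    """4th central moment over the squared variance; Fisher's version subtracts 3."""
    pearson = central_moment(values, 4) / central_moment(values, 2) ** 2
    return pearson - 3.0 if fisher else pearson


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]
print(skew(symmetric), kurtosis(symmetric))
print(skew(right_tailed), kurtosis(right_tailed, fisher=False))
```

The symmetric sample has zero skew, while the sample with one large value far to the right of the rest has a positive skew.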

W2V Features
➢ Word2Vec feature set: fs-4
➢ scipy.spatial.distance: minkowski, jaccard, manhattan (cityblock), braycurtis, euclidean, cosine, canberra
➢ scipy.stats: kurtosis, skew
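For reference, several of the distances used in fs-4 can be written in a few lines each. This is a pure-Python sketch of the usual definitions, applied to two toy vectors standing in for the sentence vectors of question1 and question2; the toy values and the choice `p=3` for Minkowski are assumptions for illustration, and the slides use the `scipy.spatial.distance` implementations.

```python
import math


def cosine(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cityblock(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def canberra(u, v):
    """Sum of |a - b| / (|a| + |b|); coordinates where both are 0 contribute 0."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)


def minkowski(u, v, p=2):
    """p-norm of the difference; p=1 is cityblock, p=2 is euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def braycurtis(u, v):
    """Sum of |a - b| over sum of |a + b|."""
    return cityblock(u, v) / sum(abs(a + b) for a, b in zip(u, v))


q1_vec = [1.0, 0.0, 2.0]  # toy stand-in for sent2vec(question1)
q2_vec = [0.0, 1.0, 2.0]  # toy stand-in for sent2vec(question2)
for name, fn in [("cosine", cosine), ("cityblock", cityblock),
                 ("canberra", canberra), ("braycurtis", braycurtis)]:
    print(name, fn(q1_vec, q2_vec))
print("minkowski (p=3)", minkowski(q1_vec, q2_vec, p=3))
```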

Raw Word2Vec Vectors

https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

➢ Raw W2V feature set: fs-5

Features Snapshot

Feature Snapshot

Machine Learning Models
➢ Logistic regression
➢ Xgboost
➢ 5 fold cross-validation
➢ Accuracy as a comparison metric (also, precision + recall)
➢ Why accuracy?
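A minimal sketch of how the comparison metrics and the 5 fold split can be computed, in pure Python. The labels and predictions are made-up toy values, and the helper names are illustrative; in practice scikit-learn's metrics and cross-validation utilities would normally be used.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = duplicate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) for k contiguous, non-overlapping folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]  # toy labels
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]  # toy predictions
print(classification_metrics(y_true, y_pred))
for train_idx, valid_idx in k_fold_indices(len(y_true), k=5):
    print(valid_idx)
```

Every sample lands in exactly one validation fold, so each model is scored on data it was not trained on.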

Results

Deep Learning

LSTM
➢ Long short term memory
➢ A type of RNN
➢ Learn long term dependencies
➢ Used two LSTM layers

1D CNN
➢ One dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:

for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        if i - j >= 0:  # skip taps that would fall before the start of x
            y[i] += x[i - j] * h[j]
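The loop above can be turned into a small runnable function. This is a pure-Python sketch of the same temporal convolution, treating `x` as zero before its first sample; the function name `conv1d` and the toy signal are illustrative, and this is not how the Keras layer is implemented internally.

```python
def conv1d(x, h):
    """y[i] = sum over j of x[i - j] * h[j], with x taken as 0 before index 0."""
    y = []
    for i in range(len(x)):
        total = 0.0
        for j in range(len(h)):
            if i - j >= 0:
                total += x[i - j] * h[j]
        y.append(total)
    return y


# A [1, -1] kernel computes the difference between consecutive samples.
print(conv1d([1, 2, 3, 4], [1, -1]))  # [1.0, 1.0, 1.0, 1.0]
```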

Embedding Layers
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
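An embedding layer is essentially a table lookup. The sketch below reproduces the slide's example in pure Python; the table size of 21 rows and the zero rows for all other indexes are assumptions made only so that indexes 4 and 20 exist.

```python
vocab_size, dim = 21, 2  # assumed sizes, just large enough for the example
table = [[0.0] * dim for _ in range(vocab_size)]
table[4] = [0.25, 0.1]
table[20] = [0.6, -0.2]


def embed(batch, table):
    """Replace every index in a batch of index sequences with its row of the table."""
    return [[table[i] for i in seq] for seq in batch]


print(embed([[4], [20]], table))
```

The result has one more level of nesting than the slide shows, because each input sequence (here of length 1) becomes a sequence of vectors.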

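Time Distributed Dense Layer
➢ TimeDistributed wrapper around dense layer
➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements “translation” layer used by Stephen Merity (keras snli model)

from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda, TimeDistributed

model1 = Sequential()
# Frozen embeddings initialised from embedding_matrix
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
# Same dense layer applied to each of the 40 timesteps
model1.add(TimeDistributed(Dense(300, activation='relu')))
# Sum over the time axis to get one 300-d vector per question
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))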
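To make the TimeDistributed dense layer and the summing Lambda concrete, here is a pure-Python sketch of the same computation for a single question. The tiny 2-d vectors, the identity weights and the helper names are illustrative assumptions; the real layer works on 40 timesteps of 300-d embeddings with learned weights.

```python
def dense_relu(vec, weights, bias):
    """Dense layer with ReLU for one timestep; weights holds one weight vector per output unit."""
    return [max(0.0, sum(x * w for x, w in zip(vec, unit_weights)) + b)
            for unit_weights, b in zip(weights, bias)]


def time_distributed_sum(sequence, weights, bias):
    # TimeDistributed: the same dense layer (shared weights) is applied to every timestep.
    projected = [dense_relu(step, weights, bias) for step in sequence]
    # Lambda(K.sum(x, axis=1)): sum over the time axis, leaving one vector per question.
    return [sum(column) for column in zip(*projected)]


sequence = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.0]]  # 3 timesteps of 2-d embeddings
weights = [[1.0, 0.0], [0.0, 1.0]]                # identity projection
bias = [0.0, 0.0]
print(time_distributed_sum(sequence, weights, bias))  # [4.0, 2.0]
```

The -1.0 in the second timestep is clipped to 0 by the ReLU before the sum, which is why the second output is 2.0 rather than 1.0.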

GloVe Embeddings
➢ Count based model
➢ Dimensionality reduction on co-occurrence counts matrix
➢ word-context matrix -> word-feature matrix
➢ Common Crawl
○ 840B tokens, 2.2M vocab, 300d vectors

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation

Basis of Deep Learning Model
➢ Keras-snli model: https://github.com/Smerity/keras_snli

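Before Training DeepNets
➢ Tokenize data
➢ Convert text data to sequences

from keras.preprocessing import sequence, text

tk = text.Tokenizer(nb_words=200000)  # Keras 1.x argument; renamed num_words in Keras 2
max_len = 40

# Cast both columns to str so missing values do not break the tokenizer
q1 = data.question1.values.astype(str)
q2 = data.question2.values.astype(str)
tk.fit_on_texts(list(q1) + list(q2))

# Map words to integer indexes, then pad or truncate every question to max_len
x1 = tk.texts_to_sequences(q1)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(q2)
x2 = sequence.pad_sequences(x2, maxlen=max_len)

word_index = tk.word_index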
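To show what these two steps produce, here is a simplified pure-Python sketch of tokenizing and padding. It only splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation; the helper names and toy questions are illustrative. Padding and truncation happen at the front of each sequence, which mirrors the default `'pre'` behaviour of Keras' `pad_sequences`.

```python
def fit_word_index(texts):
    """Assign indexes 1..N by descending word frequency; 0 is reserved for padding."""
    counts = {}
    for t in texts:
        for w in t.lower().split():
            counts[w] = counts.get(w, 0) + 1
    ordered = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {w: i for i, w in enumerate(ordered, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]


def pad_sequences(seqs, maxlen):
    """Left-pad with zeros and keep only the last maxlen items of longer sequences."""
    return [[0] * max(0, maxlen - len(s)) + s[-maxlen:] for s in seqs]


texts = ["how did trump win", "why did trump win the election"]
word_index = fit_word_index(texts)
seqs = texts_to_sequences(texts, word_index)
print(word_index)
print(pad_sequences(seqs, maxlen=5))  # [[0, 4, 1, 2, 3], [1, 2, 3, 6, 7]]
```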

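Before Training DeepNets
➢ Initialize GloVe embeddings

import numpy as np
from tqdm import tqdm

embeddings_index = {}
with open('data/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        # The last 300 fields are the vector; this guards against tokens that contain spaces
        word = ' '.join(values[:-300])
        coefs = np.asarray(values[-300:], dtype='float32')
        embeddings_index[word] = coefs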

Before Training DeepNets
➢ Create the embedding matrix

# Row i holds the GloVe vector of the word with index i; unknown words stay all-zero
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Final Deep Learning Model


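Model 1 and Model 2

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1),
                  output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1),
                  output_shape=(300,)))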


Model 3 and Model 4

model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())


Model 5 and Model 6

model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40,
                     dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40,
                     dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))


Merged Model

Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X

Combined Results

The deep network was trained on an NVIDIA Titan X. Each epoch took approximately 300 seconds, and the full training run took 10-15 hours. This network achieved an accuracy of 0.848 (~0.85).

Improving Further
➢ Cleaning the text data, e.g., correcting misspellings
➢ POS tagging
➢ Entity recognition
➢ Combining deepnet with traditional ML models

Timeline
➢ 24 Jan, 2017: Quora Dataset Release
➢ 27 Feb, 2017: My Model + Writeup Release
➢ 16 Mar, 2017: Kaggle Competition On The Same Data Begins
➢ 7 Jun, 2017: Competition Ends

Frustration Plot
[Figure: frustration (y-axis) against time in competition (x-axis)]

Conclusion & References
➢ The deepnet gives a near state-of-the-art result
➢ BiMPM model accuracy: 88%

Some references:
➢ Zhiguo Wang, Wael Hamza and Radu Florian. "Bilateral Multi-Perspective Matching for Natural Language Sentences" (BiMPM)
➢ Matthew Honnibal. "Deep text-pair classification with Quora's 2017 question dataset," 13 February 2017. Retrieved from https://explosion.ai/blog/quora-deep-text-pair-classification
➢ Bradley Pallen’s work: https://github.com/bradleypallen/keras-quora-question-pairs


Thank you!
Questions / Comments?

Code: bit.ly/quoraduplicates

Get in touch:
➢ E-mail: [email protected]
➢ LinkedIn: bit.ly/thakurabhishek
➢ Kaggle: kaggle.com/abhishek
➢ Twitter: @abhi1thakur

If everything fails, use Xgboost
