Duplicated Questions Detection

29 Sep 2017

Reading time ~3 minutes

Background

In Quora, there are too many questions meaning the same. For example, the two sentence “What is the most populous state in the USA?” and “Which state in the United States has the most people?” are asking the same thing. If duplicated questions are detected, it would be more efficient for user searching answers.

Traditional NLP method can be used for solving this problem, such as cosine similarity and Jaccard similarity [1]. In this post a recurrent neural network is described for detecting duplicated questions.

The Task

Quora issues the dataset [2] for duplicated questions detection. Part of the dataset is:

We need a model that can tell whether two questions are duplicated.

The Model

The architecture of the model is:

The remaining part of this section will describle the model layer by layer.

Embedding Layer

Giving a sentence \( S=(w_1, w_2,…, w_t) \), each word will represented as a row vector \( x_i \).

Encoding Layer

The bi-LSTM takes the sentence in forward and backward direction, and output the features of each time step. This layer encodes the sentence to a vector \( r \).

\[ r=[F_{h_1}, B_{h_1}, F_{h_2}, B_{h_2},…, F_{h_t}, B_{h_t}] \]

The following picture shows the LSTM cell.

The output of each cell \( h_i \) are computed by the following equation,

\[ f_i=\sigma([h_{i-1},x_i]W_f+b_f) \] \[ i_i=\sigma([h_{i-1},x_i]W_i+b_i) \] \[ g_i={\textrm{tanh}}([h_{i-1},x_i]W_g+b_g) \] \[ c_i=c_{i-1}\otimes{f_i}+i_i\otimes{g_i} \] \[ o_i=\sigma([h_{i-1},x_i]W_o+b_o)\] \[ h_i={\textrm{tanh}}(c_i)\otimes{o_i} \]

Multiple Layer Perceptron

The input of MLP is

\[ p=[r_1,r_2,r_1-r_2,r_1\cdot{r_2}] \]

and MLP classifies feature vector \( p \) as a class that indicate whether the two questions are the same. Class 0 means the two questions are not duplicate, while class 1 means they are duplicate. Let \( y \) denotes the class. MLP and \( y \) are calculated as follows,

\[ H_3={\textrm{relu}}(pW_3+b_3) \] \[ H_4={\textrm{relu}}(H_3W_4+b_4) \] \[ s={\textrm{softmax}}(H_4W_5+b_5) \]

\[ y={\textrm{argmax}}_i{s_i} \]

Training the Model

For the given two sentence, the loss function is:

\[ l_i=-\sum_{c\in C}{y_{ic}{\textrm{ln}}(s_{ic})} \] \[ l=\frac{1}{N}\sum_{i=1}^N{l_i}=-\frac{1}{N}\sum_{i=1}^N{\sum_{c\in C}{y_{ic}{\textrm{ln}}(s_{ic})}} \]

\( y_{ic} \) means that whether instance \( i \) belongs in class \( c \). The parameters of this model are as follows.

\[ \theta={W_f,b_f,W_i,b_i,W_g,b_g,W_o,b_o,W_3,b_3,W_4,b_4,W_5,b_5} \]

Use optimization method “Adam”[3] to optimaze the loss function. After training, the model can tell whether two given sentences are asking the same thing.

Experiment

Dataset is split to training and testing set during preprocessing.

Next step train this model, please check the hyperparameters.

Then evaluate this model after training.

I modify the model to CNN version and evaluate it again. I also use Jaccard similarity as the baseline, the evaluation are shown in the following table.

Model	Accuracy	Precision	Recall	F1
baseline	0.6657	0.5151	0.7297	0.6039
CNN based model	0.7911	0.6894	0.7316	0.7098
LSTM based model	0.8126	0.7085	0.7874	0.7459

Conclusions

LSTM based model has a better performance in this task.

References

[1] Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la SociétéVaudoise des Sciences Naturelles 37, 547-579.

[2] https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs, Quora Duplicate Questions Dataset.

[3] Kingma D, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.