DSSM (Deep Semantic Similarity Model) - Building in TensorFlow

September 15, 2017

DSSM is a Deep Neural Network (DNN) used to model semantic similarity between a pair of strings. In simple terms, the semantic similarity of two sentences is their similarity based on meaning (i.e. semantics), and DSSM helps us capture that.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content, as opposed to similarity which can be estimated from their syntactical representation (e.g. their string format).

The model can be extended to any number of pairs of strings. Here, we will take two strings as input - a query and a phrase.

The figure below depicts the architecture of the model.

DSSM architecture (Source: DSSM)

The Model

Well how do we build this model?

  1. Build a bag-of-words (BOW) representation for each string (query/phrases). This is referred to as Term Vector in the figure.
  2. Convert the BOW to bag-of-CharTriGrams, executed in Word Hashing layer in the figure.
  3. Perform three non-linear transformations, shown as Multi-layer non-linear projection in the figure.
  4. The above steps are applied to each string (query/phrases) and cosine similarity is calculated for each query-phrase pair. This is the layer Relevance measured by cosine similarity in the figure.
  5. Apply softmax to the outputs - Posterior probability captured by softmax in figure. The original paper uses softmax with a smoothing factor. We ignore the smoothing factor here.

A note on Word Hashing

Bag of Words representation is a poor representation of a sentence. Some of its problems are:

  • It does not capture context or semantic meaning of a sentence. For instance “will this work” and “this will work” have the same BOW representation though they represent different meaning.
  • It does not scale well. The size of a BOW vector is proportional to the size of vocabulary. As the vocabulary grows, the BOW becomes larger. For example, English has more than 170,000 words and new words are continuously added.
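
The first problem can be made concrete with a quick sketch (using Python's `collections.Counter` as a stand-in for a BOW representation; this toy snippet is illustrative and not part of the model code):

```python
from collections import Counter

# Two sentences with different meanings produce identical bag-of-words counts,
# because BOW discards word order entirely.
bow_a = Counter("will this work".split())
bow_b = Counter("this will work".split())
print(bow_a == bow_b)  # True, even though the sentences mean different things
```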

Word Hashing was introduced in the original work as a solution to the scaling problem faced by BOW. English has more than 170,000 words, and new words are continuously added, so a vocabulary-sized vector is large and keeps growing. This can be solved if we break down each word into a bag of char-trigrams: the number of possible char-trigrams is fixed and comparatively small, so a bag-of-CharTriGrams representation does not grow with the vocabulary. To get the char-trigrams for a word, we first append ‘#’ to both ends of the word and then split it into tri-grams.

For the word fruit ( #fruit# ), the char-trigrams are [#fr, fru, rui, uit, it#]
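
The splitting step can be sketched in a few lines (`char_trigrams` is an illustrative helper, not part of the original code):

```python
def char_trigrams(word):
    # Wrap the word in '#' boundary markers, then slide a window of length 3
    word = '#' + word + '#'
    return [word[i:i + 3] for i in range(len(word) - 2)]

print(char_trigrams('fruit'))  # ['#fr', 'fru', 'rui', 'uit', 'it#']
```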


Now we will look into the code required for each of the steps.

  1. Steps 1 and 2

    The following functions convert sentences to bag-of-CharTriGrams.

    import itertools
    import re
    import numpy as np

    trigram_chars="abcdefghijklmnopqrstuvwxyz0123456789" # characters allowed in trigrams

    def gen_trigrams():
        """Generates all trigrams for characters from `trigram_chars`"""
        t3=[''.join(x) for x in itertools.product(trigram_chars,repeat=3)] # len(word)>=3
        t2_start=['#'+''.join(x) for x in itertools.product(trigram_chars,repeat=2)] # len(word)==2
        t2_end=[''.join(x)+'#' for x in itertools.product(trigram_chars,repeat=2)] # len(word)==2
        t1=['#'+c+'#' for c in trigram_chars] # len(word)==1
        trigrams=t3+t2_start+t2_end+t1
        trigram_map=dict(zip(trigrams,range(1,len(trigrams)+1))) # trigram to index mapping, indices starting from 1 (0 is reserved for unknown trigrams)
        return trigram_map

    trigram_map=gen_trigrams()

    def sentences_to_bag_of_trigrams(sentences):
        """
        Converts sentences to bag-of-trigrams
        `sentences`: list of strings
        `trigram_BOW`: return value, (len(sentences),len(trigram_map)+1) size array
        """
        trigram_BOW=np.zeros((len(sentences),len(trigram_map)+1)) # one row for each sentence; column 0 counts unknown trigrams
        filter_pat=r'[\!"#&\(\)\*\+,-\./:;<=>\?\[\\\]\^_`\{\|\}~\t\n]' # characters to filter out from the input
        for j,sent in enumerate(sentences):
            sent=re.sub(filter_pat, '', sent).lower() # filter out special characters from input
            sent=re.sub(r"(\s)\s+", r"\1", sent) # reduce multiple whitespaces to single whitespace
            for word in sent.split(' '):
                word='#'+word+'#' # add word-boundary markers
                for k in range(len(word)-2): # generate all trigrams for `word`
                    trig=word[k:k+3]
                    idx=trigram_map.get(trig, 0) # unknown trigrams map to index 0
                    trigram_BOW[j,idx]+=1
        return trigram_BOW
  2. Step 3

    We construct three Fully Connected (FC) layers of sizes 300, 300 and 128 respectively.

     def FC_layer(X,INP_NEURONS,OUT_NEURONS, X_is_sparse=False):
         """
         Create a Fully Connected layer
         `X`: input array/activations of previous layer
         `INP_NEURONS`: number of neurons in previous layer
         `OUT_NEURONS`: number of neurons in this layer
         `X_is_sparse`: bool value to indicate if input `X` is sparse or not.
                        Default value is False
         """
         limit=np.sqrt(6.0/(INP_NEURONS+OUT_NEURONS)) # Xavier/Glorot uniform initialization range
         W=tf.Variable(tf.random_uniform((INP_NEURONS,OUT_NEURONS), -limit, limit), name="weights") # weight init
         b=tf.Variable(tf.random_uniform((OUT_NEURONS,), -limit, limit), name="bias") # bias
         prod=tf.sparse_tensor_dense_matmul(X,W) if X_is_sparse else tf.matmul(X,W) # linear transformation
         return tf.nn.tanh(prod+b) # non-linear (tanh) transformation
  3. Step 4

    Calculate the cosine similarity of each pair.

     def cosine_similarity(A,B):
         """
         Calculate cosine similarity between two (batches of) vectors
         `A` and `B`: inputs (shape => [batch_size,vector_size])
         `sim`: return value, cosine similarity of A and B
         """
         Anorm=tf.nn.l2_normalize(A, dim=1) # normalize A
         Bnorm=tf.nn.l2_normalize(B, dim=1) # normalize B
         sim=tf.reduce_sum(Anorm*Bnorm, axis=1) # dot product of normalized A and normalized B
         return sim
  4. Step 5

    Softmax combined with the cross-entropy loss. Note that `tf.nn.softmax_cross_entropy_with_logits` applies softmax internally and returns the per-pair cross-entropy, so the cosine similarities are passed in as logits.

     # softmax and cross-entropy in a single op
     loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=batch_cosine_similarities, labels=Y))
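
To sanity-check the arithmetic of steps 3-5 without a TensorFlow session, here is a NumPy sketch of the forward pass. The dimensions, random weights, and the choice to share weights between the query and phrase towers are illustrative assumptions, not prescribed by the original code:

```python
import numpy as np

rng = np.random.RandomState(0)

def fc_tanh(X, W, b):
    # One non-linear projection layer: linear transform followed by tanh
    return np.tanh(X @ W + b)

def cosine_sim(A, B):
    # Row-wise cosine similarity between two batches of vectors
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(An * Bn, axis=1)

# Toy setup: 4 query-phrase pairs, 50-dim trigram BOW, layers of 300, 300, 128
batch, vocab = 4, 50
sizes = [vocab, 300, 300, 128]
params = [(rng.randn(m, n) * 0.1, np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def project(X):
    # Multi-layer non-linear projection (here both towers share weights)
    for W, b in params:
        X = fc_tanh(X, W, b)
    return X

Q = project(rng.rand(batch, vocab))   # query tower
P = project(rng.rand(batch, vocab))   # phrase tower
sims = cosine_sim(Q, P)               # one relevance score per pair

# Posterior via softmax over the batch of cosine similarities
exp = np.exp(sims - sims.max())
probs = exp / exp.sum()
```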

Finally, we train the model and test it on a held-out dataset.