Semantic search in Python

Shashank Shanu

9 months ago

Hello all,
Today, I am going to explain one of the most interesting and widely used topics in Natural Language Processing (NLP), known as “semantic search”. I will also show how we can implement it in Python. That is the agenda for this article, so let’s start.
To understand it better, let’s break it into two parts: “Semantic” and “Search”.
First, let’s try to understand

What is Semantics?

“Semantics is a subfield of linguistics that studies the meanings of words, their symbolic use, and their multiple meanings.”
Let’s take an example
Sentence 1: “Tom and Mary are married.”
Sentence 2: “Tom kissed his wife, and so did Jam”.
In sentence 1, it is not clear whether Tom and Mary are married to each other or each to a different person.
Similarly, in sentence 2, it is not clear whether Jam kissed Tom’s wife or his own wife.
So, it is very important to understand the context and the different possible meanings of a sentence.

What is semantic search?

Semantic search has become very important, and most NLP-based applications now go beyond the ‘static’ dictionary meaning of a query to better understand the searcher’s intent within a specific context.
It tries to learn from past results and create links between entities. A search engine uses the contextual meaning of terms as they appear in the searchable database to generate results that are more relevant to what the user needs.
Semantic search also lets the user ask a question in natural language.
Let’s compare the searches ‘how do I start a career in data science?’ and ‘data science career steps and tips’. The second query contains no verbs or unnecessary words, only the keywords that the user believes a search engine will find relevant.
Semantic search uses information from different sources to answer a query satisfactorily.
In the search example (a ‘who is X’ query about the ‘dancer in chandelier video’), the main result that Google gives is a YouTube video about Jimmy Kimmel and Guillermo regarding Maddie Ziegler, the ‘star of Sia’s Chandelier music video’.

How does it actually work?

  • Google understands that for a query of the form ‘who is X’, the result for ‘X’ must be a person’s name.
  • Note that both ‘Maddie Ziegler’ and ‘Guillermo’ are highlighted, and highlighting ‘Guillermo’ is an incorrect result from Google. On the other hand, ‘Jimmy’ is not highlighted, probably because ‘Guillermo’ is closer to the verb ‘dance’ in the sentence than ‘Jimmy’ is. More advanced readers might notice that the pronoun ‘he’ in the third line refers to Jimmy, so both men are in the same category and equally close to the verb ‘dance’. However, successfully linking ‘he’ to ‘Jimmy’ is a separate linguistic problem called coreference resolution, and it is not resolved well in this example (see Wikipedia and the Stanford NLP Group’s implementation).
  • There is no literal match for ‘dancer in chandelier video’ with ‘star of chandelier music video … who is a phenomenal dancer’. The words don’t appear next to each other, yet the search engine makes the connection between ‘star in a music video’ and ‘dancer’.
Now let’s say we search for “Where does the homer Simpson work”.
Even though other search engines have implemented semantic search functionality in recent years, Google was the first to do so, with the Hummingbird update in 2013, and it remains the most accurate one to date.
Now that we understand what semantic search is and have an overview of how it works, let me show how it can be implemented in Python.

Semantic Search example

Note: There are two points that matter in any kind of search:
1. Accuracy of the search results
2. Speed of the search
To implement semantic search, we first need to understand how ordinary search is implemented.
Any kind of search is generally done using a reverse index (more commonly called an inverted index). Let me explain.
In the first step, we create a vocabulary: we take, say, all the unigrams, bigrams and trigrams in all the documents and make a unique list of them. Then we remove the stop words, etc.
After these steps, let’s say we end up with a vocabulary of 30K keywords. Then we create a bag of words for each document, where only words from the chosen vocabulary are counted.
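The vocabulary-building step can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: the STOP_WORDS set is a tiny hypothetical list, the helper names (tokenize, ngrams, build_vocabulary) are made up for this example, and stop words are removed before the n-grams are formed, which is just one of several possible orders.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"my", "is", "i", "a", "the", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, max_n=3):
    """Collect the unique unigrams, bigrams and trigrams across all documents."""
    vocabulary = set()
    for doc in documents:
        # Drop stop words first, then form n-grams from what is left.
        tokens = [t for t in tokenize(doc) if t not in STOP_WORDS]
        for n in range(1, max_n + 1):
            vocabulary.update(ngrams(tokens, n))
    return sorted(vocabulary)


vocab = build_vocabulary(["InsideAIML Discuss portal is great"])
print(vocab)
```

A real system would usually also filter this list, for example by frequency, to arrive at a fixed-size vocabulary.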
Let me give you an example so that you get a better understanding.
Example:
Sentence 1: My name is Sumit. I like InsideAIML discuss portal
Sentence 2: InsideAIML Discuss portal is great
all_keywords = (my, name, is, sumit, insideaiml, discuss, portal, great, i, like)
vocabulary = (sumit, insideaiml, discuss, portal)
Then the bags of words for the above sentences are:
Sentence 1: bag of words [1,1,1,1]
Sentence 2: bag of words [0,1,1,1]
To visualize this, each keyword in the vocabulary can be taken as a dimension, so our documents become vectors in this higher-dimensional space (a 4-dimensional space in this case). This is exactly what bag of words means, and this is our index for the given sentences.
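The two bag-of-words vectors can be reproduced with a short Python sketch. The function name bag_of_words and the simple regular-expression tokenizer are assumptions made for this illustration, not part of any particular library.

```python
import re

# The chosen vocabulary from the example; each keyword is one dimension.
vocabulary = ["sumit", "insideaiml", "discuss", "portal"]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary keyword occurs in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tokens.count(word) for word in vocabulary]


sentence_1 = "My name is Sumit. I like InsideAIML discuss portal"
sentence_2 = "InsideAIML Discuss portal is great"

print(bag_of_words(sentence_1, vocabulary))  # [1, 1, 1, 1]
print(bag_of_words(sentence_2, vocabulary))  # [0, 1, 1, 1]
```

Words outside the vocabulary (such as “name” or “great”) are simply ignored, which is why each vector has exactly four entries.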
Now that we have the index of the sentences, we create a reverse index: a dictionary that maps each keyword to the list of documents containing it, sorted by matching score (the simplest score being the number of times the keyword occurs in the document).
Reverse Index
sumit -> [Sentence 1]
insideaiml -> [Sentence 2, Sentence 1]
discuss -> [Sentence 2, Sentence 1]
portal -> [Sentence 2, Sentence 1]
Now, when a query keyword comes in, we just return the top matching documents.
Let’s say someone queries “sumit” as the search term. It will return just Sentence 1. And if someone queries “portal”, both sentences are returned, because “portal” occurs once in each of them and their scores tie.
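A reverse index for the two example sentences can be sketched as follows. The names build_reverse_index and search are hypothetical, the score is the plain occurrence count, and ties are broken alphabetically by document name, which is an arbitrary choice made only for this sketch.

```python
import re
from collections import defaultdict


def build_reverse_index(documents, vocabulary):
    """Map each keyword to the names of the documents containing it, best match first."""
    postings = defaultdict(list)
    for name, text in documents.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for word in vocabulary:
            count = tokens.count(word)
            if count > 0:
                postings[word].append((count, name))
    # Highest count first; ties broken alphabetically by name (an assumption).
    return {
        word: [name for _, name in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for word, entries in postings.items()
    }


def search(index, keyword):
    """Return the documents listed for a keyword, or an empty list if it is unknown."""
    return index.get(keyword.lower(), [])


documents = {
    "Sentence 1": "My name is Sumit. I like InsideAIML discuss portal",
    "Sentence 2": "InsideAIML Discuss portal is great",
}
vocabulary = ["sumit", "insideaiml", "discuss", "portal"]

index = build_reverse_index(documents, vocabulary)
print(search(index, "sumit"))   # ['Sentence 1']
print(search(index, "portal"))  # ['Sentence 1', 'Sentence 2']
```

Note that “portal” occurs in both sentences, so this sketch returns both of them for that query. Because the lookup is a single dictionary access, answering a query stays fast even when the number of documents grows.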

Now, let’s see how to turn this into a semantic search.
For example, even if someone searches for “ML”, we want to return results for ML as well as Machine Learning, and also for some algorithms related to ML.
So, in that case, we need to group the vocabulary keywords together at a higher conceptual level. This is done by reducing the dimensions of the vocabulary using techniques such as LSI (Latent Semantic Indexing).
LSI is nothing but a combination of tf-idf (which increases the weight of relevant words and decreases the weight of common English words) and dimensionality reduction using SVD, keeping only the largest singular values. This creates a model that maps our original vector in vocabulary space to a lower-dimensional conceptual space.
In our final step, we map both the query keyword and all the documents into the lower-dimensional conceptual space and compute the similarity of the keyword with each document, so that the most similar results are returned when a user searches for something.
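The LSI steps described above can be sketched with NumPy. This is a minimal illustration, not a production implementation: the use of NumPy, the example documents, the function names, the particular tf-idf formula (raw count times log(N / df) + 1), the choice of k=2 concepts and cosine similarity as the similarity measure are all assumptions made for this example.

```python
import re
import numpy as np


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf(documents):
    """Return the tf-idf matrix (documents x terms), the vocabulary and the idf weights."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    counts = np.zeros((len(documents), len(vocabulary)))
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            counts[i, column[t]] += 1
    doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term
    idf = np.log(len(documents) / doc_freq) + 1.0
    return counts * idf, vocabulary, idf


def vectorize_query(query, vocabulary, idf):
    """Turn a query into a tf-idf vector in the same vocabulary space."""
    column = {term: j for j, term in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for t in tokenize(query):
        if t in column:  # words outside the vocabulary are ignored
            vec[column[t]] += 1
    return vec * idf


def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    return doc_matrix @ query_vec / norms


def semantic_search(query, documents, k=2):
    """Rank documents by similarity to the query in a k-dimensional concept space."""
    tfidf, vocabulary, idf = build_tfidf(documents)
    # SVD: the first k rows of vt are the strongest "concept" directions.
    _, _, vt = np.linalg.svd(tfidf, full_matrices=False)
    projection = vt[:k].T  # maps vocabulary space -> concept space
    doc_concepts = tfidf @ projection
    query_concepts = vectorize_query(query, vocabulary, idf) @ projection
    scores = cosine_scores(query_concepts, doc_concepts)
    order = np.argsort(-scores)  # best match first
    return [(documents[i], float(scores[i])) for i in order]


documents = [
    "ML models learn patterns from data",
    "Machine learning models learn from data",
    "Machine learning is also called ML",
    "The discuss portal has a forum",
    "Users ask questions on the discuss portal",
]

for doc, score in semantic_search("ml", documents, k=2):
    print(f"{score:+.3f}  {doc}")
```

Because the comparison happens in concept space rather than keyword space, a document that never contains the literal query term can still receive a non-zero score when it shares other terms with documents that do. How well this works depends heavily on the corpus and on the number of concepts k.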
I hope you enjoyed reading this article and now have a clearer idea of semantic search in Python.
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.
Thanks for reading…
