How to Implement a Search Engine Part 2: Query Index

Overview
This is the second part of our project to implement a search engine. The first part covered building the inverted index. Now, we will use the index to answer actual search queries.

Query Types
Let’s first remember the query types. Our search engine is going to answer 3 types of queries that we generally use while searching.
1) One Word Queries (OWQ): An OWQ consists of a single term, such as computer or university. The matching documents are the ones containing that term.
2) Free Text Queries (FTQ): An FTQ is a sequence of words separated by spaces, like an ordinary sentence, such as computer science or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): A PQ is also a sequence of words, but typed within double quotes. The meaning is that we want all query terms to appear in the matching documents, in exactly the order specified. Examples are “Turing Award” or “information retrieval and web search”.

Implementation
The create index program of the previous part builds the inverted index and saves it to disk. Our query index program first reads the index file from disk and reconstructs the index in memory, in the same format used by create index. As described in the previous post, each line in the index file corresponds to a term and its postings list, in the format: term|docID1:pos1,pos2;docID2:pos3,pos4,pos5;… We build the index by reading the file line by line. For each line, we split on the character “|”: the first part is the term, and the second part is its postings list. We further split the postings list as follows. First we split it on “;”, which gives us the per-document entries for the term. Then we split each entry first on “:” and then on “,” to get the document ID and the list of positions where the term occurs in that document. Performing these operations on every line of the index file reconstructs the same inverted index that the create index program produced: a dictionary where each term is a key and its value is the postings list.
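The parsing step can be sketched as follows. This is a minimal sketch that parses an in-memory sample string in the format above instead of reading the actual index file; the variable names are illustrative:

```python
# two sample lines in the format term|docID1:pos1,pos2;docID2:pos3,...
sample = "brown|1:0;2:3;3:2\nuniversity|1:1;2:4"

index = {}
for line in sample.splitlines():
    # split into the term and its postings list
    term, postingsPart = line.split('|')
    postings = []
    for docEntry in postingsPart.split(';'):
        # each entry is docID:pos1,pos2,...
        docID, positions = docEntry.split(':')
        postings.append([int(docID), [int(p) for p in positions.split(',')]])
    index[term] = postings

print(index['brown'])  # [[1, [0]], [2, [3]], [3, [2]]]
```

The real program would run the same loop over the lines of the index file.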

After constructing the index from the file that create index outputted, we are ready to answer search queries. Each query type has its own implementation. OWQ and FTQ are easier to implement, while PQ are a little more difficult. Let's go through the implementations one by one. Here are the documents we will use as examples:
Doc1: Brown University computer science department, computer department
Doc2: department of computer science Brown University – science department computer
Doc3: computer science at Brown & science computer

The inverted index of this collection of documents is as follows (to obtain dictionary terms from words, stemming is omitted for easier demonstration; but lowercasing, filtering out the stop words – the words “at” and “of” in this case – and eliminating non-alphanumeric characters are still performed):
{'brown': [ [1, [0]], [2, [3]], [3, [2]] ], 'university': [ [1, [1]], [2, [4]] ], 'computer': [ [1, [2, 5]], [2, [1, 7]], [3, [0, 4]] ], 'science': [ [1, [3]], [2, [2, 5]], [3, [1, 3]] ], 'department': [ [1, [4, 6]], [2, [0, 6]] ] }

The transformations performed on words of the collection, such as stemming, lowercasing, removing stopwords, and eliminating non-alphanumeric characters will be performed on the query as well. So, querying for computer or Computer is basically the same.
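These transformations can be sketched as follows. This is a minimal sketch assuming only the stop words “at” and “of”, with stemming omitted as in the example above; `normalize` is an illustrative name, not a function from the source code:

```python
import re

STOP_WORDS = {'at', 'of'}  # the stop words used in this example

def normalize(text):
    # lowercase, replace non-alphanumeric characters with spaces,
    # and drop stop words (stemming omitted, as in the example above)
    words = re.sub(r'[^a-z0-9]+', ' ', text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

print(normalize('department of Computer Science'))  # ['department', 'computer', 'science']
```

Applying the same function to both the documents and the query guarantees that they are compared in the same normalized form.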

One Word Queries
The input in OWQ is a single term, and the output is the list of documents containing that term. If the term doesn't appear in the collection (and hence has no entry in our index), the result is an empty list. Let's say we search for Brown. The output should be [1, 2, 3] because the term brown appears in documents 1, 2, and 3. If we query for university, the result is [1, 2] because that term appears in documents 1 and 2. We simply get the postings list of the query term and retrieve the document IDs, which are the first elements of the document lists in the postings list. So, the Python code for OWQ would be:

term=getQueryFromUser()
if term in index:
    docs=[posting[0] for posting in index[term]]
else:
    #term is not in index
    docs=[]
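With the example index above, the lookup behaves like this (a self-contained sketch; in the real program the query comes from the prompt, and `oneWordQuery` is an illustrative name):

```python
# a slice of the example index from the text
index = {
    'brown': [[1, [0]], [2, [3]], [3, [2]]],
    'university': [[1, [1]], [2, [4]]],
}

def oneWordQuery(term, index):
    # return the IDs of all documents containing the term
    if term not in index:
        return []
    return [posting[0] for posting in index[term]]

print(oneWordQuery('brown', index))       # [1, 2, 3]
print(oneWordQuery('university', index))  # [1, 2]
print(oneWordQuery('missing', index))     # []
```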

Free Text Queries
The input in FTQ is a sequence of words, and the output is the list of documents that contain any of the query terms. So, we will get the list of documents for each query term and take the union. It's like evaluating an OWQ for every query term and taking the union of the results. So, for the query Brown University, the output would be [1, 2, 3]. Note that even though university doesn't appear in the 3rd document, that document still matches because it contains the term brown. The result for the query computer science department is [1, 2, 3] because each document contains at least one of the query terms. Again, a document need not contain all of the query terms to match; containing one or more of them is sufficient. So, the code for FTQ would be:

#now the query is a list of terms
terms=getQueryFromUser()
docs=set()
for term in terms:
    if term not in index:
        #term is not in index, skip it
        continue
    termDocs=[posting[0] for posting in index[term]]
    docs|=set(termDocs)
docs=list(docs)
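The union behavior can be seen on the example index (a self-contained sketch; `freeTextQuery` is an illustrative name):

```python
# a slice of the example index from the text
index = {
    'brown': [[1, [0]], [2, [3]], [3, [2]]],
    'university': [[1, [1]], [2, [4]]],
}

def freeTextQuery(terms, index):
    # union of the document lists of every query term
    docs = set()
    for term in terms:
        if term in index:
            docs |= set(posting[0] for posting in index[term])
    return sorted(docs)

print(freeTextQuery(['brown', 'university'], index))  # [1, 2, 3]
```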

Phrase Queries
The input in PQ is again a sequence of words, and the matching documents are the ones that contain all query terms in the specified order. We will now use the positional information about the terms, which we didn't need in OWQ and FTQ. The implementation is as follows. First we need the documents in which all query terms appear. We again get the list of documents for each query term as we did in FTQ, but now we take the intersection instead of the union, because we want the documents that contain every query term rather than any of them. Then we should check whether the terms appear in the correct order. This is the tricky part. For each document that contains all query terms, we do the following:
1) Get the positions of each query term in the current document and put them in separate lists. So, if there are n query terms, there will be n lists, where each list contains the positions of the corresponding query term in the current document.
2) Leave the position list of the 1st query term as it is, subtract 1 from each element of the 2nd position list, subtract 2 from each element of the 3rd position list, …, subtract n-1 from each element of the nth position list.
3) Intersect all the position lists. If the intersection is non-empty, then all query terms appear in the current document in the correct order, meaning this is a matching document.
We perform these operations on every document that all query terms appear in; the matching documents are the ones with a non-empty intersection. Why does this algorithm work? Because for each query term, we check whether it appears exactly one position after the previous query term in the document. And we do this in an elegant way. Additionally, there is an optimization when intersecting the position lists: we can start from the smaller lists, because the size of the final intersection is always less than or equal to the size of the smallest list. Therefore, we can short-circuit whenever an intermediate intersection result becomes empty, obtaining an efficient algorithm.
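The subtract-and-intersect check can be sketched as follows (`positionsMatch` is an illustrative helper name; it takes one position list per query term, in query order):

```python
def positionsMatch(positionLists):
    # subtract i from every element of the (i+1)-th list, then intersect;
    # a non-empty intersection means the terms appear consecutively
    shifted = [set(p - i for p in positions)
               for i, positions in enumerate(positionLists)]
    return sorted(set.intersection(*shifted))

# positions of computer, science, department in document 1
print(positionsMatch([[2, 5], [3], [4, 6]]))     # [2] -> match
# positions of computer, science, department in document 2
print(positionsMatch([[1, 7], [2, 5], [0, 6]]))  # [] -> no match
```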

An example will make everything clear. Let's say we search for “computer science department”. We first get the document list of each query term, as we did in FTQ: computer: [1, 2, 3], science: [1, 2, 3], and department: [1, 2]. Then we intersect these lists to get the documents that contain all query terms: [1, 2]. Next, we check whether the order is correct. First, we get the postings of the query terms in document 1: computer: [1, [2, 5]], science: [1, [3]], and department: [1, [4, 6]]. Then we extract the positions of each query term and put them in separate lists, resulting in [ [2, 5], [3], [4, 6] ]. Each list holds the positions of one query term. We don't touch the first list, but subtract i-1 from the elements of the ith list, resulting in [ [2, 5], [2], [2, 4] ]. Finally, we take the intersection of the lists, which is [2]. Since the intersection is not empty, we conclude that document 1 is a matching document. Next, we check document 2. We put the positions of the query terms in separate lists as before: [ [1, 7], [2, 5], [0, 6] ]. Perform the subtractions: [ [1, 7], [1, 4], [-2, 4] ]. And take the intersection: []. The intersection is empty, meaning the query terms don't appear in the correct order, so this is not a matching document. No other document contains all query terms, so the result of the phrase query is document 1 only: [1]. Here is the high level overview of the implementation (you can find all the details in the source code):

#get query terms
terms=getQueryFromUser()
for term in terms:
    if term not in index:
        #if a query term is not in the index,
        #there can't be any document containing it
        #so there is no match for the query
        return []
 
postings=getPostings(terms)
docs=getDocsFromPostings(postings)
docs=intersectLists(docs)
if docs==[]:
    #no document contains all query terms
    return []
 
#get the posting list of terms in docs
postings=getPostingsOfDocs(docs)
 
result=[]
#check whether terms are in correct order
#perform the necessary subtractions
postings=performSubtractions(postings)
#intersect the locations
for doc,posting in zip(docs,postings):
    if intersectPositions(posting)!=[]:
        #current document is a match
        result.append(doc)
 
return result

Source Code
Here is the source code. Note that you first have to create the index in order to query it. Or you can directly download and use the workspace (size: 10MB). You can type your queries at the command prompt and the program will display the document IDs that match the query. The results are not ranked; we will discuss how to rank search results in the next post.

This entry was posted in Information Retrieval, Search Engines, Web Search.

  • hero

    Hi Arden, why don’t you compare the next blocks one by one instead of subtracting numbers from the list and then check whether the list contains a common number?
    I mean, for this list: [ [2, 5], [3], [4,6] ] you could have 2 in mind and check if the next block contains 2+1=3 and the next block contains 2+2=4.

    • Arden

      You can do like that as well, it’s essentially the same thing. However, it’ll be less efficient. Because now we have to check for all elements in first list, whether the corresponding numbers are present in all other lists, one by one. We can use binary search to perform that check, since the lists are already sorted, but then it becomes more complicated and the code is much longer. In the subtraction approach, we’re basically doing the same thing but in a more efficient way. Because we’re taking advantage of python’s built in set data structure to perform intersections (the details are in the source code, I didn’t include it in the above code snippet to keep it simple).

      Let's say we have N lists and there are K elements per list on average. The complexity of the subtraction approach is O(NK):
      O(NK) for the subtractions and also O(NK) for the set intersections (they can be done in linear time using hashtables).
      The complexity of the alternative approach would be O(KNlogK):
      for each of the K elements in the first list, check whether the correct number appears in all other lists (NlogK: logK to binary search a list, and we do this for all N lists). It may not seem like a big difference, but note that generally K>>N, because there will be just a couple of terms in the query (N), but those terms will appear many times in the documents (K).

      In my toy example the difference isn’t noticed, but imagine those 3 lists to be longer (let’s say they contain 50 elements). And suppose that the first match occurs at 40th element in the first list. Then, for 39 previous elements, we have to check whether +1 is present in the second list, and +2 is present in the 3rd list, repetitively. We can instead perform the subtractions just once, and check for a common element by simply intersecting the lists.

      Here is the intersectLists function for clarification:

      from functools import reduce #needed in Python 3; reduce was a builtin in Python 2

      def intersectLists(lists):
          if len(lists)==0:
              return []
          #start intersecting from the smaller lists
          lists.sort(key=len)
          return list(reduce(lambda x,y: set(x) & set(y),lists))
  • olga

    Well, what if we had a “queries.txt” file given to us by default (the user does not input any queries)? Another thing: if we had the wikipedia corpus in a directory including the xml files, what would change in your former code about testCollection? How would we test the code then? I'm a newbie in python and your posts are the most useful I have found about creating search engines.

    • Arden

      You can input queries from a file simply by using file redirection, i.e. python queryIndex.py < queries.txt. The source code gets the search query via readline in a while loop, so it'll execute queries provided by the user or from file. About your second question, changing how you represent the corpus will only require modifications in the createIndex program. You'll need to parse xml and get document content accordingly, and then create the inverted index. As long as you create the index in the same format, queryIndex stays the same. Sorry for the late response by the way..


  • Haja

    Hi Arden, what about the response time of the logic you suggested for Phrase Queries? If the index contains millions of words, will this logic be efficient enough to answer a phrase query quickly? Is this the same logic used by Bing and Google? Or is there a faster and more efficient way to do it?

    What if we build an index for the phrases themselves, instead of each single word… will it help?

    • Atul

      It is not uncommon to have bigram and trigram index where a search can be done on the phrase directly. If the query phrase is longer than 2-3 terms, then the technique described in this post is followed in order to combine multiple phrases.

  • Cheng Pan

    Hi,

    In PQ, what if I want to search for queries that include “stop word” such as “department of computer science”?

  • tommy

    hello i tried running thw code in python by following the steps in te read file but i getting errors in the port stemmer…can u troubleshoot yhos w/ me ?/..thxz great series