Pumping Up Your Applications with Xapian Full-Text Search

What good is an application, no matter how much information it contains, if you can't easily search it?

Xapian to the Rescue

Xapian is an excellent open source (GPL) search engine library. It is written in C++ and comes with bindings for Python as well as many other languages, and it supports everything you’d expect from a modern search engine:

  • Ranked probabilistic search – The results that are returned are ranked according to their relevancy, with the most relevant occurring first.
  • Boolean search operators – You can use AND, OR, NOT, XOR in your searches.
  • Phrase and proximity searching – For example, “used books” will look for occurrences of these words as an exact phrase, but you can also search for “used NEAR books” to find occurrences of the words “used” and “books” within 10 words of each other. You can even write “used NEAR/3 books” to change the proximity threshold to three.
  • Stemming of search terms – If, for example, you search for “programmer,” you can find articles that mention “programmers” or “programming.” Xapian currently supports stemming in Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
  • Stopwords – Xapian already knows which common words to ignore (like ‘are,’ ‘is,’ and ‘being’).
  • Simultaneous update and searching – Xapian lets you index new articles while the database is being searched. New articles can be searched right away.
  • Relevant Suggestions – Xapian can automatically suggest documents that are relevant to a given document. As such, you can add a list of “similar items” to each page.
  • Value-associated results – You can associate values like word-count, date, page views, diggs, and so on with each document. Xapian can return results that are sorted by any of these criteria.
  • Document metadata – You can add metadata tags to each document (Xapian calls these terms). These tags can be anything you desire, like author, title, date, tags, and so on. Users can then search within the metadata by typing “author:john”.
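These query forms can be combined freely. A few example query strings illustrating the syntax above (the field values, like “john,” are made up for illustration):

```python
# Example Xapian query strings, one per feature described above.
queries = [
    "python AND turbogears",       # boolean operators
    '"used books"',                # exact phrase
    "used NEAR books",             # words within 10 words of each other
    "used NEAR/3 books",           # proximity threshold lowered to three
    "author:john subject:xapian",  # metadata (term-prefix) search
    "python -snake",               # require "python", exclude "snake"
]
for q in queries:
    print(q)
```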

[Diagram: the SQL database, the indexer, the Xapian library and database, the search server, and the application]

The above diagram shows the main participants in a typical search-enabled use case. We assume that the data to be searched is stored in a relational database (the blue SQL server jar), but it doesn’t really matter where the data comes from. The indexer is a Python program that is executed periodically (as a cron job). Its function is to retrieve new or changed documents from the database and to index them. The Xapian library handles the actual read/write operations on the Xapian database (in the purple jar).

Since the Xapian library is not thread-safe and Web applications are usually multithreaded, you need to implement a locking mechanism if you want access to a Xapian database to be safe. My preferred way of accomplishing this is to use a separate process (the orange Search Server box): a single-threaded RPC server that handles all searches. The benefit of this strategy is that you can move the search server process (together with the Xapian database) to a different machine, freeing up resources on the server that runs your application. That makes the system very scalable. In general, you can expect any bottlenecks to be IO- and memory-related rather than CPU-related.

Alternatively, your search operations can be directly initiated from your application process. This alternative will work as long as you use a mutex to govern access to your database. However, I wouldn’t recommend doing this in a production environment. Why? Because you’ll never be sure what’s consuming so much memory—the Xapian library or your application.

The application (red box) gets a very clean search API. It simply connects to the XML-RPC server (one line of code) and obtains access to a search() method, which takes the search query and the number of results needed as arguments. It then returns a dictionary with the total number of available results and the results themselves.

In this tutorial, we’ll create a searchable article database. We assume that you already have Xapian and the Xapian bindings installed. We’ll start with the indexer.

The Indexer: The Golden Retriever

Following is the indexer code. It is tailored to TurboGears and SQLAlchemy, but it can easily be adapted to suit any ORM. It accepts three command line arguments: the configuration file, which helps it find the database (either dev.cfg or prod.cfg); the Xapian database location, which is simply a directory name; and a number of hours (all documents that were created or modified within this number of hours will be indexed). If you run the indexer every hour, you can safely give 2 as the third argument. If you’d like to re-index all articles, pass in 0 as the third argument.
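For example, a crontab entry that runs the indexer every hour might look like this (the paths here are hypothetical; adjust them to your own project layout):

```shell
# Hypothetical crontab entry: every hour on the hour, re-index all
# articles modified in the last 2 hours.
0 * * * * /usr/bin/python /srv/myproject/indexer.py /srv/myproject/prod.cfg /srv/myproject/xapian_db 2
```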

#!/usr/bin/env python
import re
import sys
from datetime import datetime, timedelta

import turbogears
import xapian

WORD_RE = re.compile(r"\w{1,32}", re.U)
ARTICLE_ID = 0

stemmer = xapian.Stem("en")  # English stemmer

def create_document(article):
    """Get an article object and return a Xapian
    document representing it and a unique article
    identifier."""
    doc = xapian.Document()
    text = article.text.encode("utf8")

    # Go word by word, stem each one, and add it
    # to the document.
    for index, term in enumerate(WORD_RE.finditer(text)):
        doc.add_posting(
            stemmer.stem_word(term.group()),
            index)
    doc.add_term("A" + article.submitted_by.user_name)
    doc.add_term("S" + article.subject.encode("utf8"))
    article_id_term = "I" + str(article.article_id)
    doc.add_term(article_id_term)
    doc.add_value(ARTICLE_ID, str(article.article_id))
    return doc, article_id_term

def index(db, period):
    """Index all articles that were modified
    in the last <period> hours into the given
    Xapian database."""
    if period:
        start = datetime.now() - timedelta(hours=period)
        query = (Article.q.last_modified > start)
    else:
        query = None
    articles = session.query(Article).select(query)

    for article in articles:
        doc, id_term = create_document(article)
        db.replace_document(id_term, doc)

if __name__ == "__main__":
    configfile, dbpath, period = sys.argv[1:]
    turbogears.update_config(configfile=configfile,
        modulename="myproject.config")
    from myproject.model import *
    turbogears.database.bind_meta_data()
    db = xapian.WritableDatabase(dbpath,
        xapian.DB_CREATE_OR_OPEN)
    index(db, int(period))

All strings that are passed to Xapian functions must be encoded in UTF-8. The create_document() function splits the article’s text into words, stems them, and then adds them one by one to the Xapian document. Next, two terms are added: the username of the article’s author, prefixed by the letter ‘A,’ and the article’s subject, prefixed by the letter ‘S.’ Xapian gives special treatment to the first character of each term (i.e., it gives the term its meaning). You’ll see how we use these terms when we code the search server.

Another term, this one prefixed by the letter ‘I,’ is added to hold a unique article ID. The article ID is also assigned to the document as a value. This number relates a Xapian document to its authentic source in the SQL server.
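To make the prefix scheme concrete, here is a pure-Python sketch of the terms create_document() would generate for a hypothetical article (the field values are made up; this is just the string construction, without Xapian itself):

```python
# Sketch of the term-prefix scheme used by the indexer.
def make_terms(user_name, subject, article_id):
    return [
        "A" + user_name,        # author term, searchable as author:...
        "S" + subject,          # subject term, searchable as subject:...
        "I" + str(article_id),  # unique ID term, used with replace_document()
    ]

print(make_terms("john", "Searching with Xapian", 15))
# → ['Ajohn', 'SSearching with Xapian', 'I15']
```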

The index() function simply selects the relevant articles and builds a Xapian document object for them. Instead of using add_document(), which can cause an article to be indexed multiple times in the database, we use replace_document(), which is given a unique term. If a document is already indexed by the given term, it will be replaced with the given document; otherwise, a new document will be added to the database.
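The upsert behaviour of replace_document() can be sketched with a plain dictionary keyed by the unique ID term (a simplified model for illustration, not Xapian’s actual storage):

```python
# Simplified model of replace_document() semantics: the database
# holds at most one document per unique ID term.
class FakeDatabase:
    def __init__(self):
        self.docs = {}

    def replace_document(self, id_term, doc):
        # Overwrites any existing document with the same ID term,
        # otherwise adds a new one -- so re-indexing is idempotent.
        self.docs[id_term] = doc

db = FakeDatabase()
db.replace_document("I15", "first version")
db.replace_document("I15", "updated version")  # re-indexing the same article
print(len(db.docs))    # → 1
print(db.docs["I15"])  # → updated version
```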

After the data is indexed, it is time to make it searchable.

The Search Server: Seeing Results

The role of the search server is to obtain queries from the application and to then return results. As we strive for a single-threaded implementation, the Twisted framework makes it extremely easy to write the code for this server. If you are not familiar with Twisted or XML RPC, don’t worry; just imagine we’re writing a controller with only one method exposed: xmlrpc_search().

import sys

import xapian
from twisted.web import xmlrpc, server
from twisted.internet import reactor, task

from indexer import ARTICLE_ID

DEFAULT_SEARCH_FLAGS = (
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_PHRASE |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE
        )

class SearchServer(xmlrpc.XMLRPC):
    def __init__(self, db):
        xmlrpc.XMLRPC.__init__(self)
        self.db = xapian.Database(db)
        self.parser = xapian.QueryParser()
        self.parser.add_prefix("author", "A")
        self.parser.add_prefix("subject", "S")
        self.parser.set_stemmer(xapian.Stem("en"))
        self.parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

        # Make sure the database is reloaded every 10 minutes.
        lc = task.LoopingCall(self.reopen_db)
        lc.start(600)

    def reopen_db(self):
        try:
            self.db.reopen()
        except IOError:
            print "Unable to open database"

    def xmlrpc_search(self, query, offset, count):
        """Search the database for <query> and return
        results[offset:offset+count], sorted by relevancy."""
        query = self.parser.parse_query(query.encode("utf8"),
                DEFAULT_SEARCH_FLAGS)
        enquire = xapian.Enquire(self.db)
        enquire.set_query(query)
        try:
            mset = enquire.get_mset(offset, count)
        except IOError, e:
            if "DatabaseModifiedError" in str(e):
                self.reopen_db()
            raise

        results = []
        for m in mset:
            results.append(
                {"percent": m[xapian.MSET_PERCENT],
                 "article_id": m[xapian.MSET_DOCUMENT].get_value(ARTICLE_ID)})
        return {"count": mset.get_matches_upper_bound(), "results": results}

if len(sys.argv) != 3:
    print "Usage: search.py <port> <db>"
    sys.exit(1)

reactor.listenTCP(int(sys.argv[1]), server.Site(SearchServer(sys.argv[2])))
reactor.run()

The search server constructor opens a database and initializes a query parser. We tell the query parser that the keyword ‘author’ refers to terms that are prefixed with the letter ‘A’ and that the keyword ‘subject’ refers to terms that are prefixed with the letter ‘S.’ This specification makes it possible to search for “author:john” or “subject:xapian.”

We then instruct Twisted to call reopen_db() every ten minutes. This reopening renders the latest changes in the database available to the search server. Each time Xapian’s library opens the database, it works against a fixed revision of it.

xmlrpc_search() is the only method that is exposed (since its name is prefixed by xmlrpc_). The offset (zero-based) and count arguments allow for an efficient way to split the search results into several pages. The call returns a dictionary with the total number of available results and a list with the selected subset of results. Each item in the list contains a unique article_id and a percent, which indicates the document’s relevancy score.
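For example, paging through results ten at a time only requires computing the offset from the page number (page numbers here are one-based, an assumption of this sketch rather than part of the server API):

```python
# Compute the (offset, count) pair expected by
# xmlrpc_search(query, offset, count) for a one-based page number.
def page_args(page, per_page=10):
    return (page - 1) * per_page, per_page

print(page_args(1))  # → (0, 10)   first page
print(page_args(3))  # → (20, 10)  third page
```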

The program receives the port number to listen on and the Xapian database path as command line arguments. Unless you’d like to expose your search functionality to the world, you should block outside access to this port.

By now, you’re probably eager to try searching your own database. Here’s a quick way to do so. First, start the search server:

$ python search.py 3000 ./my_database

From another terminal, start a Python shell:

>>> import xmlrpclib
>>> s = xmlrpclib.Server('http://localhost:3000')
>>> s.search('python -snake', 0, 10)
{'count': 2, 'results': [{'percent': 94, 'article_id': 15}, {'percent': 79, 'article_id': 6}]}

In the same manner, you can use the search server from your application.

Working with Smaller Databases

Search engines are optimized to return results that are sorted according to their relevancy. If you need your results sorted by another criterion, such as date or diggs, it might be useful to run the query over a smaller database. For example, you might try running it over a database that contains only articles from the previous month. This strategy can significantly increase your overall performance.
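One way to implement this, sketched here under the assumption that you maintain one Xapian database directory per month (the layout and path scheme are hypothetical), is to derive the database path from the current date:

```python
from datetime import date

# Hypothetical layout: one Xapian database directory per month,
# e.g. ./db/2007-06, each populated by its own indexing job. The
# search server is then pointed at the most recent directory.
def monthly_db_path(base="./db", today=None):
    today = today or date.today()
    return "%s/%04d-%02d" % (base, today.year, today.month)

print(monthly_db_path(today=date(2007, 6, 15)))  # → ./db/2007-06
```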

Alternatives to Xapian

While I haven’t tried working with search engine libraries other than Xapian, you can try the Java-based Lucene, which can be accessed from Python using PyLucene. The TurboLucene library makes it easy to use PyLucene from TurboGears.

This entry was posted in howto, python, turbogears.

23 Responses to Pumping Up Your Applications with Xapian Full-Text Search

  1. Nice article! Just what I needed for next week’s project :-)

    BTW: How do you generate your colorized listings and the code divs? They look great!

    Chris

  2. thesamet says:

    Hi Chris,

    Thanks! Let me know how this worked for you in your project.

    The lists settings come from the Misty wordpress theme (which sets the appropriate CSS for UL and LI elements).

    To highlight source code I use Dean Lee’s wordpress plugin, with a little CSS tweaking.

  3. Pingback: krys.ca

  4. Steve says:

    I’m not sure if it’s WordPress or the source code highlighter but it’s making your single quotes non-ascii (‘I’ (‘/’) instead of ‘I’ (')), which will cause the code to fail if you copy and paste.

  5. Nadav Samet says:

    I think that WordPress is doing that. I’ll look into it.

  6. Zoom.Quiet says:

    sooo great idea!
    we are ChinaPythonUserGroup,just litter days ago ;
    know abt. Xapian
    http://wiki.woodpecker.org.cn/moin/Xapian

    now can plugin it into TurboGears!
    thnax for all!

  7. Rooju says:

    Thanks Nadav for this very informative writeup and the sample code. I’d started using Xapian, but I was indexing and searching in a very naive way. Your sample code was very helpful. I’m doing it in a much better way now! :)

  8. thesamet says:

    Rooju, I’m glad to hear it is working well for you!

  9. Jade says:

    I am starting a turbogears project and would like to try xapian as the search engine.

    How would you use this to index content served out of the turbogears ‘static’ directory (pdf, html, xml etc.)?

    Is there any additional documentation for the use of the python bindings to Xapian?

    Thanks for a great article,

    Jade

  10. Nadav Samet says:

    Hi Jade,

    You could use the same indexer outlined above to index the static content as well as the dynamic.

    The C API has one-to-one correspondence with the Python bindings, so the API documentation at xapian.org should be sufficient.

  11. Eduardo says:

    Hi thesamet,

    I´ve read your article because Im looking for information about how to use it in JAVA.

    I and my team-mates are working on web application for indexing files similar to youtube… more or less like that. We are using LUCENE at the moment, but we heard from our teacher that XAPIAN is much better (or only newer) than LUCENE.

    I´ve been googling for a JAVA compiled PACKAGE so we can include this framework using netbeans. I’d appreciate if you or any of the guys who read this know WHERE I can find such package.

    I know it can be compiled with the bindings but cant find information about it either.

    Thanks for your time reading this.

    Eduardo

  12. Chris Lu says:

    Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service, return results as HTML/XML/JSON. It uses the Lucene directly in java, but can be easily used with Ruby, PHP, or any existing database web applications.

    You can easily index, re-index, incremental-index. It’s also highly scalable and easily customizable.

    The best thing is, it’s super easy. You can create a production-level search in 3 minutes.

  13. Peter says:

    When you say: “single-threaded implementation, the Twisted framework makes it extremely”, do you mean only one search query can be handle at any one time? So there is no concurrency, how does this scalable in internet world? Also, what plugin are you using for this blog, that allows “subscribe to comments via email”?

    Thank you

  14. thesamet says:

    Hi Peter,

    Great question. The short answer is: Yes, only one search query is handled in one time, but sometimes multiple threads don’t help too much. Here is the long answer.

    Let’s start with a word about Twisted servers. In general, single-threaded servers can handle many requests at the same time. The basic idea is that their execution flow is event-driven, instead of thread-driven (or system scheduler driven). All events are handled on the same thread, and the assumption is that handling an event is a short operation.

    As Xapian is not thread-safe by design, there is no way to interleave accesses to the database, and each query is handled from start to finish, regardless of the server’s implementation.

    So if needed, this solution can be scaled by starting several instances of the search server (possibly even on several machines) and load-balancing.

    As experience had taught me, you never become Google overnight. It’s best to stick with the simplest thing as long as it works.

    I use Subscribe to Comments plug-in.

    Nadav

  15. Mark says:

    Is there any reason why you didn’t implement the indexer on Twisted?
    The way you implemented it, there is no incremental update to the index. I am trying to index as objects are created/updated, thinking of using Twisted. Any thoughts on your part about how I may approach this.

    Thank you.

  16. Nadav Samet says:

    Hi Mark,

    Great question. The index() gets a period (say 1 hour) argument and then operates on all documents changed in that period. So it is meant to run as a cron job every hour.

    I believe that on a high load system, indexing documents in batches is faster than indexing on every change.

    If we would run the indexer in the same process of the search server, it will block searches while it indexes (and vice versa). But setting the indexer on its own process, can be an alternative to cron, but I’m not sure what the benefits are.

  17. sally says:

    Hello Nadav,

    Thanks for doing a great job of Xapian explanation.

    Can you please tell me what is the merit of using Xapian as against say lucene or some other full text search engine.

    I’m in the process of looking for a robust full text serach engine, that is very scalable and my product will be using the engine using apis.

    Do you have any opinions? I was leaning towards lucene, but may be I should look further (My concern for Lucene is that since it is Java based it may not be fast enought?)

    Thanks

    Sally

  18. Chris Lu says:

    I would argue Lucene is faster, because it is thread-safe. The concurrent searching threads Xapian can handle is limited by the number of process you can create before-hand, while Lucene was not.

    And the mis-conception that Java == slow is so last century…

    Chris Lu
    http://www.dbsight.net

  19. Richard Boulton says:

    Just to add to the comments about thread safety – while a Xapian database can indeed only safely be accessed from one thread at a time, we (the Xapian developers) have ensured that the cost of creating a Xapian database object is very low (it basically involves opening a few files, and no complex processing), and there are no global variables used or anything like that, so it is perfectly reasonable to open the database for each search request and handle each request in a separate thread. I’ve built many multi-threaded applications in Python which search Xapian databases.

    You are restricted to having only one instance of a database open for writing at a time, but as many instances of a database for reading as you like can be open (from a single process, or from many).

    We intentionally removed thread handling from the innards of Xapian because it was error-prone and imposed an overhead on all searches, whether single or multi threaded.

  20. stem_word is deprecated. Replace that line with stemmer(term.group()),.

  21. Pingback: david petar novakovic: attempted axiomatisation

  22. dperales says:

    hey, i got a problem, it seems that when i do “author:john wayne”, “john wayne” is not treated as a whole term,
    i tried ” author:’john wayne’ “, but nothing. what could be the possible problems? .

    i have to add that when i retrieve all terms, “john wayne” is showed as a whole,and “author:john” is not returning anything.

  23. Great article, and great code samples!

    I think the Xapian project needs a lot more documentation and examples just like in this post.

    Cristian
