« My Slides at YAPC::Asia 2010 (Writing Prefork Servers / Workers, Unix Programming with Perl) | Main | XSSに強いウェブサイトを作る – テンプレートエンジンの選定基準とスニペットの生成手法 (第1回神泉セキュリティ勉強会発表資料) »

October 21, 2010

Compressing URLs in your Webapp, for size and speed

Last year I had a chance to talk about the internals of our service: Pathtraq at Percona Performance Conference (slides), in which I described the methods we use to compress the URLs in our database to below 40% of the original size, however had not released the source code since then.  I am sorry for the delay, but have finally uploaded the code to github.com/kazuho/url_compress.

It is generally considered difficult to achieve high ratio for compressing short texts.  This is due to the fact that most compression algorithms are adaptive, i.e., short texts reach their end before the compressors learn how to encode them efficiently.

Our approach uses an algorithm known as "static PPM (prediction by partial matching)", that uses a pre-built table for compressing tables.

By using static PPM it is possible to achieve high compression ratio against hostnames or domains (e.g. "www" or "com") or English words that occasionally appear in paths or query parameters (e.g. "search").  And even in case of compressing unknown words in URLs the method works fairly well by utilizing the lower-order probability prediction tables for syllable-level compression.

The repository includes the source code of the compressor / decompressor (the core of the compressor; an optimized range coder using SSE, based on the work by Daisuke Okanohara, is actually git submoduleed from github.com/kazuho/rangecoder), Perl scripts for building compression tables, a small URL corpus for building and evaluating compression tables, and source code of a MySQL plugin that can be used to compress / decompress URLs using SQL.

By following the instructions in the README you will be able to build the compressor that compresses the URLs in the corpus to 30.5% in average, and use it in your MySQL database to store URLs more efficiently.  And one more thing, the benefit is not only for space efficiency.  Since the compressor is designed so that prefix search can be performed without actually decompressing them, in some cases your query becomes faster if you store the URLs in compressed form :-)


TrackBack URL for this entry:

Listed below are links to weblogs that reference Compressing URLs in your Webapp, for size and speed:


Post a comment