C / C++

October 21, 2010

Compressing URLs in your Webapp, for size and speed

Last year I had a chance to talk about the internals of our service: Pathtraq at Percona Performance Conference (slides), in which I described the methods we use to compress the URLs in our database to below 40% of the original size, however had not released the source code since then.  I am sorry for the delay, but have finally uploaded the code to github.com/kazuho/url_compress.

It is generally considered difficult to achieve high ratio for compressing short texts.  This is due to the fact that most compression algorithms are adaptive, i.e., short texts reach their end before the compressors learn how to encode them efficiently.

Our approach uses an algorithm known as "static PPM (prediction by partial matching)", that uses a pre-built table for compressing tables.

By using static PPM it is possible to achieve high compression ratio against hostnames or domains (e.g. "www" or "com") or English words that occasionally appear in paths or query parameters (e.g. "search").  And even in case of compressing unknown words in URLs the method works fairly well by utilizing the lower-order probability prediction tables for syllable-level compression.

The repository includes the source code of the compressor / decompressor (the core of the compressor; an optimized range coder using SSE, based on the work by Daisuke Okanohara, is actually git submoduleed from github.com/kazuho/rangecoder), Perl scripts for building compression tables, a small URL corpus for building and evaluating compression tables, and source code of a MySQL plugin that can be used to compress / decompress URLs using SQL.

By following the instructions in the README you will be able to build the compressor that compresses the URLs in the corpus to 30.5% in average, and use it in your MySQL database to store URLs more efficiently.  And one more thing, the benefit is not only for space efficiency.  Since the compressor is designed so that prefix search can be performed without actually decompressing them, in some cases your query becomes faster if you store the URLs in compressed form :-)

July 02, 2010

Writing a Fast and Secure HTTP Parser (in the case of HTTP::Parser::XS)

In May, I had a chance to give a talk at Tsukuba.xs Beer Talks #1.  I forgot to upload the slide since then, but finally, here it comes :-p

The slide titled "HTTP::Parser::XS - writing a fast & secure XS module" covers the design and implementation of HTTP::Parser::XS: an XS HTTP Parser that is used by a number of Plack-compatible web servers, and picohttpparser: its underlying HTTP parser written in C.

My intention was to introduce to programmers who mostly use Perl the fun of writing tight (fast and beautiful) code using the C programming language.  It would be grateful if you could feel some of my addiction from the slides.  Enjoy!

PS. And if you find any security holes or bugs, please let me know. Thank you in advance.

October 08, 2009

Cppref: reading cppreference.com docs offline, like man or info or perldoc

Today I created a tiny script called cppref, a wrapper for documents on cppreference.com.  Blurbs are:

  • docs are bundled with the interface, no network access required
  • works like man(1) or info(2) or perldoc(2)

It looks like follows.

$ cppref
You are here: C++ Reference

C++ Reference

                                  C++

General Topics                      * C++ Strings
                                    * C++ I/O
  * FAQ                                 + C++ String Streams
  * Pre-processor commands          * C++ Exceptions
  * Operator Precedence
  * Escape Sequences              C++ Standard Template Library (STL)
  * ASCII Chart
  * Data Types                      * Overview
  * Keywords                        * Iterators
                                    * C++ Algorithms
Standard C Library                  * C++ Vectors
                                    * C++ Double-Ended Queues
  * Overview                        * C++ Lists
  * Standard C I/O                  * C++ Priority Queues
  * Standard C String & Character   * C++ Queues
  * Standard C Math                 * C++ Stacks
  * Standard C Time & Date          * C++ Sets
  * Standard C Memory               * C++ Multisets
  * Other standard C functions      * C++ Maps
                                    * C++ Multimaps
                                    * C++ Bitsets

$ cppref vector
You are here: C++ Reference >> C++ Standard Template Library >> C++ Vectors

C++ Vectors

Vectors contain contiguous elements stored as an array.

Accessing members of a vector can be done in constant time, appending elements
to a vector can be done in amortized constant time, whereas locating a specific
value or inserting elements into the vector takes linear time.

Constructors create vectors and initialize them with some data
Operators    compare, assign, and access elements of a vector
assign       assign elements to a vector
at           returns an element at a specific location
back         returns a reference to last element of a vector
(snip)

Or if the specified term maps to multiple files,

$ cppref push_back        
multiple choices:
  stl::deque::push_back
  stl::list::push_back
  stl::vector::push_back
  string::push_back

By default, cppref uses w3m as its viewer, so you can follow the links to read the documents.

Cppref is available from search.cpan.org/dist/cppref or github.com/kazuho/cppref.  Hove fun!

August 26, 2009

Picoev: a tiny event loop for network applications, faster than libevent or libev

I am sure many programmers writing network applications have their own abstracting layers hiding the differences between various I/O multiplex APIs, like select(2), poll(2), epoll(2), ... And of course, I am one among them.  While writing mycached (see Mycached: memcached protocol support for MySQL for more information), I was at first considering of using libev for multiplexing socket I/Os.  Libevent was not an option since it does not (yet) provide multithreading support.

But it was a great pain for me to learn how to use libev.  I do not mean that its is an ugly product.  In fact, I think that it is a very well written, excellent library.  However, for me it was too much a boring task to learn how the things are abstracted, already being familiar with the problems it tries to hide.

So instead I thought it might be a good occasion to write my own library that could be used in any programs I may write in the future.  The result is picoev, and it is faster than libevent or libev!  The benchmark used is a re-modified version taken from libev.schmorp.de/bench.html and can be found here.

Why is it faster?  It is because it uses an array and a ring buffer of bit vectors as its internal structure.  Libevent and libev seem to use some kind of sorted tree to represent file descriptors.  However, if we concentrate on Un*x systems, there is a guarantee that the descriptors will a small positive integer.  Picoev utilizes the fact and stores information related to file descriptors (such as pointers to callback functions or callback arguments) in an array, resulting in a faster manipulation of socket states.

Another optimization technique used by picoev is not to use an ordered tree for keeping timeout information.  Generally speaking, most network applications do not require accurate timeouts.  Thus it is possible to use a ring buffer (a sliding array) of bit vectors for the purpose.  Each bit vector represents a set of file descriptors that time-outs at a given time.  Picoev uses 128 of bit vectors to represent timeouts, for example, the first bit vector represents the sockets that timeout a second after, the second bit vector representing them of two seconds after..., and the bit vectors slide every second.  If the maximum timeout required by the web application is greater than 128, the minimum granurality of timeout becomes two seconds.

I would like to reiterate that both libevent and libev are great libraries.  Picoev is not at comparable to them especially in maturity and the number of features.  It only supports select(2), epoll(2), and kqueue(2) for the time being.  However the design is simple than the other two, and I think it will be a good starting point to write network applications, or to use as a basis for writing one's own network libraries.

If you are interested in using picoev, the source code can be found at coderepos.org/share/browser/lang/c/picoev/.

July 02, 2009

今更 C++ で JSON パーサ「picojson」を書いたわけ

 既に mattn さんが、「Big Sky :: ヘッダファイルだけでC++から使えるJSONパーサ「picojson」が凄い!」で紹介してくださっています (mattn さん、アドバイス&バグ情報ありがとうございます!) が、いまさら C++ で JSON パーサを作りました。それは、以下の3点を満たすものがなかったから。

  • ヘッダファイル only
  • boost 等、他の重たいライブラリに依存しない
  • array や object が STL にマッピングされる

 コードは、coderepos に置いてありますので、よろしければお使いください (picojson.h)。

 なお、現時点での制限事項として、

  • \n や \r, \uXXXX といったエスケープの処理が未実装rev. 34232 で対応しました (含サロゲートペア)
  • 空白文字の判断基準が RFC と異なるrev. 34277 で空白と文字列定数内で許容する文字を RFC4627 に一致させました

といった点がありますので、解釈のゆれによってセキュリティ上問題が発生しうるケースでのご使用は注意していただければと思います。

 あとは、operator<< と >> を定義して、iostream から読み書きできるようにするくらいかなrev. 32477 の時点で operator<< と >> に対応済です。それでは have fun!

June 23, 2009

スレッド間で共有する変数のアクセス権制御を C++ コンパイラで強制する方法

 マルチスレッドなプログラムを書いていると、スレッド間で共有する変数へのアクセスを正しく直列化できているか、という点が常に問題になります。どうせなら、正しく書けているかコンパイル時に確認したいよね、ということで、以下のような C++ テンプレートを書いてみました。

template <typename T> class cac_mutex_t {
public:
 
  class lockref {
  protected:
    cac_mutex_t<T>* m_;
  public:
    lockref(cac_mutex_t<T>& m) : m_(&m) {
      pthread_mutex_lock(&m_->mutex_);
    }
    ~lockref() {
      pthread_mutex_unlock(&m_->mutex_);
    }
    T& operator*() { return m_->t_; }
    T* operator->() { return &operator*(); }
  private:
    lockref(const lockref&);
    lockref& operator=(const lockref&);
  };

protected:
  friend class cac_mutex_t<T>::lockref;
  T t_;
  pthread_mutex_t mutex_;
public:
  cac_mutex_t(pthread_mutexattr_t* attr) : t_() {
    pthread_mutex_init(&mutex_, attr);
  }
  ~cac_mutex_t() {
    pthread_mutex_destroy(&mutex_);
  }
  const T* unsafe_ref() const { return &t_; }
  T* unsafe_ref() { return &t_; }
private:
  cac_mutex_t(const cac_mutex_t&);
  cac_mutex_t& operator=(const cac_mutex_t&);
};

 使用例は以下のとおり。

// mutex lock が必須な Foo 型の変数を宣言
cac_mutex_t<Foo> foo_serialized(NULL);

{ // RAII で mutex lock してアクセス
  cac_mutex_t<Foo>::lockref foo(foo_serialized);
  foo->x = ...
  ...
}

 要は、RAII かつスマートポインタなクラスを使って、ロックをコントロールしつつ、アクセスの直列化を保証できるよ、ということです。

 テクニックとしては既存だと思いますが、ちょっとググった範囲では見つからなかったので、紹介させていただきました。

6月24日追記: 設計を整理、rwlock を追加して Q4M の svn に追加しました。(link)