in English

January 22, 2010

Q4M 0.9.2 prerelease avaiable fixing data corruption on 32bit systems

Thanks to a user of Q4M, I have found a bug that would likely lead to data corruption on 32bit versions of Q4M.  64bit versions are unaffected.

Q4M by default uses mmap(2) to read from data files.  On 32bit systems, it tries to map max. 1GB per each table into memory using mmap.  When mmap fails to map memory due to low memory, Q4M falls back to file I/O to read the data.

However there was a bug in handling the response from mmap, that led to reading corrupt data from database files when mmap(2) failed after the size of the underlying file was grown / shrunk by Q4M.  And since Q4M writes back the corrupt data into the database file when rows are being consumed, the bug will likely destroy the database files.

I have fixed the bug and have uploaded Q4M 0.9.2, into the prerelease directory at q4m.31tools.com/dist/pre.  Source tarball and prebuilt binaries for MySQL 5.1.42 for 32bit linux are available.

If you are using 32bit versions of Q4M, I highly recommend either to update to 0.9.2 or switch to 64bit versions if possible.

BTW in 0.9.2 release, I also changed the maximum mmap size per table from 1GB to 256MB, to lower the possiblity of low memory.  However this countermeasure might not be sufficient in some cases, i.e. databases having many Q4M tables or if other storage engines also used a lot of memory.  I am considering of just disabling the use of mmap(2) on 32bit systems in future releases.  If you have any comments on this, please let me know.

January 20, 2010

Building a highly configurable, easy-to-maintain backup solution for LVM-based VMs and MySQL databases

Motives and the Features

For the servers running in our new network, I was in need for a highly configurable, but easy-to-use backup solution that can take online backups of VMs and MySQL databases running multiple storage engines.

Since my colleagues are all researchers or programmers but there are no dedicated engineers for managing our system, I decided to write a set of command line scripts to accomplish the task instead of using an existing, highly-configurable but time-taking-to-learn backup solutions, like Amanda.

And what I have come up with now is a backup solution with following characteristics, let me introduce them.

  • a central backup server able to take backup of other servers over SSH using public-key authentication
  • no need to install backup agents into each server
  • LVM snapshot-based online, incremental backups (capable of taking online backups of LVM-based VMs)
  • taking online backups of MySQL databases running mulitple storage engines with sophisticated lock control
  • no configuration files, only use crontab and shell-scripts

The solution consists of two tools, blockdiff (kazuho's blockdiff at master - GitHub), and cronlog script of kaztools (kazuho's kaztools at master - GitHub).

Blockdiff

Blockdiff is a set of scripts for taking block-based diffs of files or volumes on a local machine or on remote machines over the network using SSH.  The script below takes online backup of three LVM volumes from three servers.  In the form below, a full backup will be taken once a month, and incremetal backups will be taken during every month.

backup.sh
#! /bin/bash

export YEARMONTH=`date '+%Y%m'`

# backup a LVM volume (using snapshot) at /dev/pv/lv on srv00
blockdiff_backup /var/backup/srv00-pv-lv-$YEARMONTH ssh_lvm_dump --gzip \
  root@srv00 /dev/pv/lv \
  || exit $?

# backup another LVM volume on an another server
blockdiff_backup /var/backup/srv01-pv-lv-$YEARMONTH ssh_lvm_dump --gzip \
  root@srv01 /dev/pv/lv \
  || exit $?


# backup a MySQL database stored on volume /dev/pv/lv on server db00
BLOCKSIZE=16384 \
  LVCREATE_PREFIX='mysqllock --host=db00 --user=root --password=XXXX' \
  blockdiff_backup /var/backup/db00-pv-lv-$YEARMONTH ssh_lvm_dump --gzip \
  root@db00 /dev/pv/lv \
  || exit $?

The backup command of the last volume uses mysqllock command included in blockdiff to keep "FLUSH TABLES WITH WRITE LOCK" running while taking a snapshot of the LVM volume on which the database files exist.  It is also possible to implement other kinds of locks so as not to issue the "FLUSH TABLES WITH WRITE LOCK" while long-running queries are in execution.  Since the flush statement blocks other queries until all of the already running queries complete, issuing the flush query when long-running queries exist will lead to the database not responding to other queries for a certain amount of time.

Crontab and the cronlog script

The backup script is invoked by cron via cronlog, a script that logs the output of the executed task, as well as controlling the output passed to cron so that an alert mail will be sent when the backup script fails.  It uses setlock command of daemontools for holding an exclusive lock while running the backup script (and to alert the administrator on when failing to acquire the lock).

Crontab script
MAILTO=admin@example.com
5 3 * * * cd /var/backup && exec setlock -nX /tmp/backup.lock cronlog -l /var/backup/backup.log -t -- ./backup.sh 2>&1

This is all that needs to be set up to backup LVM volumes including MySQL databases.  Output of the log will be like the following.

backup.log
------------------------------------------------------------------------------
[Sat Jan  9 03:05:02 2010] backup-srv starting: ./backup.sh
[Sat Jan  9 03:05:02 2010] creating snapshot...
[Sat Jan  9 03:05:07 2010]   Logical volume "lvm_dump" created
[Sat Jan  9 03:05:07 2010] running: ssh_blockdiff_dump --gzip "root@srv00" "/dev/pv/lv"...
[Sat Jan  9 03:19:22 2010] removing snapshot /dev/pv/lvm_dump...
[Sat Jan  9 03:19:23 2010]   Logical volume "lvm_dump" successfully removed
[Sat Jan  9 03:19:23 2010] backup completed successfully
(snip)
[Sat Jan  9 03:35:56 2010] creating snapshot...
[Sat Jan  9 03:35:56 2010] issuing lock statement: FLUSH TABLES WITH READ LOCK
[Sat Jan  9 03:36:00 2010]   Logical volume "lvm_dump" created
[Sat Jan  9 03:36:00 2010] issuing unlock statement: UNLOCK TABLES
[Sat Jan  9 03:36:00 2010] running: bin/ssh_blockdiff_dump --gzip "root@db00" "/dev/pv/lv"...
[Sat Jan  9 04:18:44 2010] removing snapshot /dev/pv/lvm_dump...
[Sat Jan  9 04:18:46 2010]   Logical volume "lvm_dump" successfully removed
[Sat Jan  9 04:18:46 2010] backup completed successfully
[Sat Jan  9 04:18:46 2010] command exited with code:0

The files in the backup directory will be like below.  The .gz files contain the backup data, and .md5 files contain per-block checksums used for taking incremental or differential backups.

The backup files
% ls -l db00-pv-lv-201001*
-rw-r--r-- 1 backup backup 50289166539 2010-01-01 05:35 db00-pv-lv-201001.1.gz
-rw-r--r-- 1 backup backup   131072004 2010-01-01 05:35 db00-pv-lv-201001.1.md5
-rw-r--r-- 1 backup backup 10914423057 2010-01-02 04:32 db00-pv-lv-201001.2.gz
-rw-r--r-- 1 backup backup   131072004 2010-01-02 04:32 db00-pv-lv-201001.2.md5
-rw-r--r-- 1 backup backup 13648250036 2010-01-03 04:33 db00-pv-lv-201001.3.gz
-rw-r--r-- 1 backup backup   131072004 2010-01-03 04:34 db00-pv-lv-201001.3.md5
(snip)
-rw-r--r-- 1 backup backup           3 2010-01-18 04:34 db00-pv-lv-201001.ver

For more information, please read the source code and the accompanying documentation.

Conclusion

As can be seen, this is a powerful backup solution that can be built up with minimum setup. It will work well if you work in a small number of experienced engineers, while it might not be suitable for large-scale deployments with many admins.  If you are interested, please give it a try.  I am looking forward to your ideas and / or suggestions.

PS. The blockdiff_merge command can be used to restore the backups.

January 06, 2010

Q4M 0.9 released

Q4M 0.9 is now available from q4m.31tools.com, along with prebuilt binaries for MySQL 5.1.42 (running on linux and Mac OS X 10.5).

The release fixes a deadlock issue under high load.  Thank you to the users for sending the necessary information to resolve the issue.

If you still find any deadlocks or stability problems of Q4M, please send stacktraces of all threads (by attaching gdb to mysqld and executing "thread apply all bt").

December 15, 2009

[ANN] CGI::Application::Emulate::PSGI

From: Kazuho Oku
To: cgiapp-ml

Hi,

I have just uploaded CGI::Application::Emulate::PSGI onto CPAN.

It's a yet-another adapter for running CGI::Application on top of PSGI. The differences from the older one: CGI::App::PSGI is that CGI::App::Emulate::PSGI is more friendly to existing code in two aspects. From the POD,

uses CGI.pm instead of CGI::PSGI
CGI::Application::PSGI (that uses CGI::PSGI) does not support programs calling CGI.pm in func-style (like CGI::virtual_host()). CGI::Application::Emulate::PSGI sets up environment variables so that codes using CGI.pm will work.
compatible with CGI::Application::Dispatch
The interface of CGI::Application::Emulate::PSGI is different from CGI::Application::PSGI, and is compatible with CGI::Application::Dispatch.

Thanks to MARKSTOS and miyagawa for answering my question (https://rt.cpan.org/Public/Bug/Display.html?id=52756), I agree that the ideal way would be to run CGI::App::Dispatch would be to make it somehow run on CGI::Application::PSGI (since it would be cleaner in the fact that it doesn't modify %ENV for each HTTP request), CGI::App::Emulate::PSGI would be a good solution IMO for running existing CGI::App-based applications on top of PSGI.

December 14, 2009

Comparing InnoDB performance on HDD, SSD, in-memory

The chart shows benchmark results taken using sysbench.  Rough understanding would be that (for this scenario) the performance ratio is HDD:SSD:in-memory = 1:10:50.

transactions/sec. read/write reqs./sec.
buffer_pool=8M, HDD 19.93 378.59
buffer_pool=8M, SSD (Intel X25-M) 207.70 3946.29
buffer_pool=2048M, HDD 998.82 18977.51

Details:

The benchmark was taken using MySQL 5.1.41 using innodb_plugin running on linux 2.6.31/x86_64 (Ubuntu 9.10 server).  Options passed to sysbench were: --test=oltp --db-driver=mysql --mysql-table-engine=innodb --oltp-table-size=1000000 --mysql-user=root --mysql-db=test --oltp-table-name=test_t --num-threads=20 .  My.cnf was set as follows.

max_allowed_packet=16777216
query_cache_size=0
default-storage-engine=INNODB
innodb_buffer_pool_size=2048M
innodb_read_io_threads=4           # set to 10 for X25-M
innodb_log_file_size=64M
innodb_flush_method=O_DIRECT
innodb_file_per_table
innodb_flush_log_at_trx_commit=2   # sufficient for our task
key_buffer_size=64M
myisam_sort_buffer_size=32M
slow-query-log
long_query_time=1

December 09, 2009

Hot-deploying Plack applications using Server::Starter

It's time to eat my own dog food.

During YAPC::Asia 2009, I wrote a tiny module called Server::Starter, a superdaemon for hot-deploying TCP servers (see Writing Hot-deployable servers (introduction of Server::Starter) for the benefits of using the module).  On the other hand, I have been involved in developing the http servers in Plack since then.

And time has come.  Having decided to use the two in my new web application server, I wrote a module to glue them together, Kazuho Oku / Plack-Server-Standalone-Prefork-Server-Starter - search.cpan.org.

Although the name is pretty long, it is easy to use :-p

// start server
# start_server --port=80 -- plackup -s Standalone::Prefork::Server::Starter myapp.psgi

// or use setuidgid to drop privileges
# start_server --port=80 -- setuidgid www plackup -s Standalone::Prefork::Server::Starter myapp.psgi

// graceful update
# killall -HUP start_server

Or use svc -h if you are going to manage Server::Starter using daemontools.

October 08, 2009

Cppref: reading cppreference.com docs offline, like man or info or perldoc

Today I created a tiny script called cppref, a wrapper for documents on cppreference.com.  Blurbs are:

  • docs are bundled with the interface, no network access required
  • works like man(1) or info(2) or perldoc(2)

It looks like follows.

$ cppref
You are here: C++ Reference

C++ Reference

                                  C++

General Topics                      * C++ Strings
                                    * C++ I/O
  * FAQ                                 + C++ String Streams
  * Pre-processor commands          * C++ Exceptions
  * Operator Precedence
  * Escape Sequences              C++ Standard Template Library (STL)
  * ASCII Chart
  * Data Types                      * Overview
  * Keywords                        * Iterators
                                    * C++ Algorithms
Standard C Library                  * C++ Vectors
                                    * C++ Double-Ended Queues
  * Overview                        * C++ Lists
  * Standard C I/O                  * C++ Priority Queues
  * Standard C String & Character   * C++ Queues
  * Standard C Math                 * C++ Stacks
  * Standard C Time & Date          * C++ Sets
  * Standard C Memory               * C++ Multisets
  * Other standard C functions      * C++ Maps
                                    * C++ Multimaps
                                    * C++ Bitsets

$ cppref vector
You are here: C++ Reference >> C++ Standard Template Library >> C++ Vectors

C++ Vectors

Vectors contain contiguous elements stored as an array.

Accessing members of a vector can be done in constant time, appending elements
to a vector can be done in amortized constant time, whereas locating a specific
value or inserting elements into the vector takes linear time.

Constructors create vectors and initialize them with some data
Operators    compare, assign, and access elements of a vector
assign       assign elements to a vector
at           returns an element at a specific location
back         returns a reference to last element of a vector
(snip)

Or if the specified term maps to multiple files,

$ cppref push_back        
multiple choices:
  stl::deque::push_back
  stl::list::push_back
  stl::vector::push_back
  string::push_back

By default, cppref uses w3m as its viewer, so you can follow the links to read the documents.

Cppref is available from search.cpan.org/dist/cppref or github.com/kazuho/cppref.  Hove fun!

October 07, 2009

Uploading an autotools-based distribution onto CPAN

Background

It is a pain to create binary packages.  But installing a program from source tarball is a tedious task.  You need to run ./configure & make && make install.  Sometimes you need to resolve the dependencies by hand as well.  That's where source-code-based package distribution systems come in, and the largest system is, IMHO, CPAN.  If you could upload a autotools-based distribution onto CPAN, then the users of the software can install them with the cpan command (or cpanp or cpanf or whatever), with the dependencies automatically resolved.

And for my case, it was considered especially benefitial, since the program I am now working on (it's called incline, a replicator for RDB shards using MySQL or PostgreSQL), uses perl scripts for running tests.  By distributing incline through CPAN, the perl modules required for running the test suite could be installed automatically.  Besides that, the expected users of the software outside my company are mostly perl users.

So I adjusted the files of incline so that it can be uploaded to CPAN (it continues to work as a normal autotools-based distribution as well).  The steps were as follows (for a working example, please see incline - search.cpan.org).

Step 1. write META.yml generator

CPAN packages should contain a file call META.yml which describes the metainformation of the package like dependencies, etc.  Since it would be a good idea to fill-in some fields of the file by reading other files of the distribution, I wrote a script file (META.yml.pl) that generates META.yml.  As you can see, it is pretty straightforward.  One point to note is that it has a --fix-makefile option that inserts dependency information into Makefile, which is mandatory for supporting old versions of the CPAN installer.

META.yml.pl
#! /usr/bin/perl

use strict;
use warnings;

use YAML;

my $meta = {
    name               => 'incline',
    abstract           => 'a replicator for RDB shards',
    version            => do {
        my $s = `echo VERSION | cpp -include src/incline_config.h`;
        $s =~ s/^.*\n\"([0-9_\.]+)\".*?$/$1/s
            or die "failed to obtain version number";
        $1;
    },
    author             => do {
        open my $fh, '<', 'AUTHORS'
            or die "failed to open AUTHORS:$!";
        my @authors = map {
            chomp $_;
            $_;
        } <$fh>;
        close $fh;
        \@authors;
    },
    license            => 'bsd',
    distribution_type  => 'script',
    dynamic_config     => 0,
    configure_requires => {
        perl          => 5.008,
        YAML          => 0,
    },
    requires           => {},
    build_requires     => {
        DBI                => 0,
        'List::MoreUtils'  => 0,
        'Scope::Guard'     => 0,
        'Test::mysqld'     => 0,
        'Test::postgresql' => 0,
    },
    resources          => {
        license => 'http://www.opensource.org/licenses/bsd-license.php',
    },
    no_index           => {
        directory => [ qw/example src t/ ],
        file      => [ qw/README.html/ ],
    },
    'meta-spec'        => {
        version => 1.4,
        url => 'http://module-build.sourceforge.net/META-spec-v1.4.html',
    },
    generated_by       => 'META.yml.pl',
};

if (@ARGV && $ARGV[0] eq '--fix-makefile') {

    my $prereq_expr = do {
        my %req = (
            %{$meta->{requires}},
            %{$meta->{build_requires}},
        );
        join ", ", map { "$_=>q[$req{$_}]" } sort keys %req;
    };
    print <<"EOT";
# MakeMaker Parameters:

#  PREREQ_PM => { $prereq_expr }

# --- MakeMaker post_initialize section:

EOT
    while (my $l = <STDIN>) {
        print $l;
    }

} else {

    print Dump($meta);

}

Step 2. call META.yml.pl from configure

As explained, older versions of the CPAN installer requires the dependencies to be written in Makefile (this is still the case for newer versions if your configuration script dynamically determines the dependencies).  In order to meet the requirement, I added a call to META.yml.pl at the end of my configure.ac.

configure.ac
AC_OUTPUT(Makefile src/Makefile)

cp Makefile Makefile.orig && perl META.yml.pl --fix-makefile < Makefile.orig > Makefile
if test "$?" -ne "0"
then
AC_MSG_ERROR([failed to add metainfo to Makefile])
fi

Step 3. adjust Makefile

There are a couple of things to adjust in the Maefile.  First, make test should be supported.  In my case, I already had one, that executes test scripts written in perl using Test::Harness.  The next thing is to include more than one POD (plain old document).  I already had one as well, since I generate man pages and readmes from POD.  The last thing is to generate and include two files in the distribution, META.yml and MANIFEST.  The Makefile uses a dependency and dist-hook to generate the files.  The files adde should be included in the EXTRA_DIST section of Makefile.am, of course.

Makefile.am
META.yml: META.yml.pl
perl $< > $@

dist-hook:
(cd $(distdir) && find * -type f) > $(distdir)/MANIFEST

test:
perl -MTest::Harness -we 'runtests(@ARGV)' t/*.t

Step 4. write Makefile.PL

Create Makefile.PL that calls ./configure.  Do not write any configuration logic in Makefile.PL, all logic should go into configure.ac so that the distribution could be built by using either Makefile.PL or configure.

Makefile.PL
exec './configure', @ARGV;
die "failed to execute configure:$!";

Conclusion

As described, it is not (so) difficult to add CPAN compatibility to existing autotools-based distributions, or to create a new distribution that works as both.  But to tell the truth it was not as easy as it seems to figure out the right way.  It would have been impossible without help from charbar, migayawa, tokuhirom, and walf443.  I would like to thank them all.

September 11, 2009

A Clever way to scale-out a web application (YAPC::Asia 2009 Presentation)

For couple of months I have been writing middlewares for database shards, and today I made a presentation covering them.  It includes the following.

  • Incline - a trigger and queue based distributed materialized view manager
  • Pacific - a set of perl scripts to manage MySQL shards, a MySQL shard can be split into two in less than 10 seconds of write blocking (and no read blocks)
  • DBIx::ShardManager - a client API for accessing database shards using Incline and Pacific

With these middlewares I think it is no more difficult to write web applications that runs on database shards.  In fact IMHO it is as easy as writing a webapp that runs on a standalone database.

The presentation slides are available from slideshare.  If you have any question or suggestions, please leave a comment.  Thank you.

Writing Hot-deployable servers (introduction of Server::Starter)

Yesterday at YAPC::Asia 2009, I did a LT introducing verious techniques to write hot-deployable servers, and introduced a perl module called Server::Starter that encapsulates the burden of developing support for hot-deployment within each TCP server program. The presentation slides are on Slideshare.

August 26, 2009

Picoev: a tiny event loop for network applications, faster than libevent or libev

I am sure many programmers writing network applications have their own abstracting layers hiding the differences between various I/O multiplex APIs, like select(2), poll(2), epoll(2), ... And of course, I am one among them.  While writing mycached (see Mycached: memcached protocol support for MySQL for more information), I was at first considering of using libev for multiplexing socket I/Os.  Libevent was not an option since it does not (yet) provide multithreading support.

But it was a great pain for me to learn how to use libev.  I do not mean that its is an ugly product.  In fact, I think that it is a very well written, excellent library.  However, for me it was too much a boring task to learn how the things are abstracted, already being familiar with the problems it tries to hide.

So instead I thought it might be a good occasion to write my own library that could be used in any programs I may write in the future.  The result is picoev, and it is faster than libevent or libev!  The benchmark used is a re-modified version taken from libev.schmorp.de/bench.html and can be found here.

Why is it faster?  It is because it uses an array and a ring buffer of bit vectors as its internal structure.  Libevent and libev seem to use some kind of sorted tree to represent file descriptors.  However, if we concentrate on Un*x systems, there is a guarantee that the descriptors will a small positive integer.  Picoev utilizes the fact and stores information related to file descriptors (such as pointers to callback functions or callback arguments) in an array, resulting in a faster manipulation of socket states.

Another optimization technique used by picoev is not to use an ordered tree for keeping timeout information.  Generally speaking, most network applications do not require accurate timeouts.  Thus it is possible to use a ring buffer (a sliding array) of bit vectors for the purpose.  Each bit vector represents a set of file descriptors that time-outs at a given time.  Picoev uses 128 of bit vectors to represent timeouts, for example, the first bit vector represents the sockets that timeout a second after, the second bit vector representing them of two seconds after..., and the bit vectors slide every second.  If the maximum timeout required by the web application is greater than 128, the minimum granurality of timeout becomes two seconds.

I would like to reiterate that both libevent and libev are great libraries.  Picoev is not at comparable to them especially in maturity and the number of features.  It only supports select(2), epoll(2), and kqueue(2) for the time being.  However the design is simple than the other two, and I think it will be a good starting point to write network applications, or to use as a basis for writing one's own network libraries.

If you are interested in using picoev, the source code can be found at coderepos.org/share/browser/lang/c/picoev/.

Mycached: memcached protocol support for MySQL

It is a well-known fact that the bottlenecks of MySQL does not exist in its storage engines, but rather in the core, for example, its parser and execution planner.  Last weekend I started to wonder how fast MySQL could be if those bottlenecks were skipped.  Not being able to stop my curiousity, I started adding memcached proctol support to MySQL as a UDF.  And that is Mycached.

From what I understand, there are two advantages of using mycached (or the memcached protocol, in general) over using SQL.  One is faster access.  The QPS (queries per second) of mycached is roughly 2x compared to using SQL.  The other is higher concurrency.  As can be seen in the chart below, mycached can handle thousands of connections simultaneously.

And the lack of the described advantages has been the reasons for large-scale web applications to use memcached, even though it requires application developers to take care of keeping consistency between mysql and memcached.  In other words, mycached could become a good alternative here.

Mycached is still in very early stage of development.  I would never recommend using current version in a production environment, but if you are interested, the details of how to use it is as follows.

Source code of mycached is available at coderepos.org/share/browser/platform/mysql/mycached.  There is no Makefile or configure script yet, use GCC directly to create the UDF.  Source code of mysql is needed and should be set as include directories (you need to run make on the source code of mysql prior to compiling mycached).

% g++ -DMYCACHED_USE_EPOLL=1 -shared -fPIC -Wall -g -O2 -I ~/test/mysql-5.1.37/src/include -I ~/test/mysql-5.1.37/src/sql -I ~/test/mysql-5.1.37/src/regex mycached_as_udf.cc -o mycached_as_udf.so

Once the compilation succeeds, copy the shared library into the plugins directory of the lib directory of mysql, and use the CREATE FUNCTION statements to activate the necessary UDFs.  And to start mycached, call mycached_start.  The function takes three arguments: host, port, and number of worker threads to be used for handling requests.  To stop the mycached, call mycached_stop().

mysql> CREATE FUNCTION mycached_start RETURNS INT SONAME 'mycached_as_udf.so';
mysql> CREATE FUNCTION mycached_stop RETURNS INT SONAME 'mycached_as_udf.so';
mysql> SELECT mycached_start(0, 11211, 4);

The only memcached command currently supported by mycached is get (but you can do other things using SQL :-p).  As the following example shows, specify more than one db_name.table_name.primary_key to fetch a row of a table, separated by a whitespace.  The returned value will be a repetition of columns, each of them in the format of column_name:value_length:value.  It is also possible to specify JSON as the response format by adding ":JSON", however it will be slower, since the server needs to convert the values of the columns to UTF8 and to escape them.

$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
get test.t1.1
VALUE test.t1.1 0 68
id:1:1message:51:hello world! hello world! hello world! hello world!
END
get test.t1.1:json
VALUE test.t1.1 0 72
{"id":1,"message":"hello world! hello world! hello world! hello world!"}
END

Thank you for reading.  If you have any ideas or suggestions, please leave a comment.

August 06, 2009

Deployment of MySQL using daemontools, XtraBackup

I am sure many people have already done similar things, but to ease my pain of setting up mysqld on a large-scale environment (I am trying to create a set of database nodes, each node consists of a MySQL failover cluster using semi-sync replication, that can be administered easily), I have just finished writing a deployment script called mysqld_jumpstart.  The caveats are:

  • integration with daemontools (mysqld is automatically started)
  • setup of masters and slaves
  • can setup slaves from backup data generated by XtraBackup

The last feature was the one I especially needed, since thanks to the people at Percona, things have become much easier with XtraBackup (or with InnoDB Hot Backup) since there is no more need to detach a mysqld before creating a backups.

Setting up and starting a master database:

# ssh root@10.0.1.1 mysqld_jumpstart \
--mysql-install-db=/usr/local/mysql/scripts/mysql_install_db \
--mysqld=/usr/local/mysql/bin/mysqld --base-dir=/var/my_webapp/mysql \
--data-dir=/var/datadrive/my_webapp --server-id=10 \
--replication-user=repl --replication-password=replpass \
--replication-network='10.0.0.0/255.0.0.0'

To create and a start a slave from a backup by XtraBackup (specified by --data-dir):

# ssh root@10.0.1.2 mysqld_jumpstart \
--mysql-install-db=/usr/local/mysql/scripts/mysql_install_db \
--mysqld=/usr/local/mysql/bin/mysqld --base-dir=/var/my_webapp/mysql \
--data-dir=/var/datadrive/my_webapp --server-id=10 \
--replication-user=repl --replication-password=replpass \
--replication-network='10.0.0.0/255.0.0.0' \
--master-host=10.0.1.1 --from-innobackupex

Mysqld_jumpstart can be downloaded from coderepos.org/share/browser/platform/mysql/mysqld_jumpstart.  Use --help to find out how to use the script.  FYI to run the tests, you need to apply a patch to XtraBackup.

July 22, 2009

Using a statistical approach to analyze / monitor MySQL bottleneck queries

What comes up to your mind when you hear the phrase "MySQL performance tuning?"  Most of the discussions covering the topic are things like server parameter tuning and blog entries describing the use of the "EXPLAIN" command.  However, it is too time-consuming to check every single SQL query.  A better approach is at first to determine the SQL queries that are actually consuming the server resources, and then to optimize those queries.

So the question is how to find out the bottleneck queries.  At MySQL Conference & Expo 2009, Mark Callaghan explained in his keynote that Google was taking a statistical approach using the "SHOW PROCESSLIST" as a solution.  The same command is mentioned as a rival approach in the webpages of MySQL Query Analyzer as well.  Today, having a need for something alike, I wrote a set of tiny script files under the name "mprofile" that does the same thing.  It can be used as follows.

Sampling SQL queries

Use mpdump to sample running queries from mysqld periodically.  The default setting is to fetch 1,000 samples in 0.1 second interval  (in other words, it would take a little more than 100 seconds to execute).  Please use the --help option to find out the way to specify the number of samples, the interval, and connection parameters to mysqld can be specified as command line parameters.

% mpdump > mpdump.txt

Merging the queries and generating the report

Use mpfilter command to rewrite the variables in the sampled queries, then feed the result into mpreport command to sort the queries from the heaviest one to the lighter.  Each line of the report is in the format "probability:SQL query," where probability is the average percentage of if the SQL query is executed.  The example below is taken from the mysqld of Pathtraq.  It is obvious that which queries are consuming the resources (and that the heaviest query is running at the probability of 138.3%: it is run by 1.383 threads by average).

% mpfilter < mpdump.txt | mpreport | head -100
138.300:select pt_sum(hits,?) as cnt from url inner join counter on url.id=counter.url where (pt_url_compress(?)<=rurlc and rurlc<pt_url_compress(?,?))
99.100:insert into tt select t1.url,cnt_mixed&? as cnt,********** where url.status_code=?
94.200:select id,pt_url_decompress(rurlc) as rurl from url where crawl_cnt<? and crawl_at>=? and crawl_at<?-?*pow(?,crawl_cnt) limit ?
22.000:select id,status_code from url where rurlc=pt_url_compress(?)
14.300:select id,pt_url_decompress(rurlc) as rurl,cnt from (select url,cnt from counter_by_day **********
9.300:insert into counter (url,referer,hits) values (?,?,pt_cnt_add(null,?,?)) on duplicate key update hits=pt_cnt_add(hits,?,?)
7.400:insert ignore into **********
6.100:select pt_sum(hits,?) as cnt from counter where url=?
5.900:select (crawl_at<? and -crawl_at<? and pt_sum(hits,floor(-crawl_at/?))>?) as recrawl from url inner join counter on url.id=counter.url where id=?
5.700:insert into **********
5.700:select t1.id,pt_url_decompress(rurlc) as rurl,cnt,score from **********
4.600:select id,name,dispname from category_ja order by disporder,id
4.300:select addr from **********
3.200:update url set crawl_at=?,lang=?,status_code=if(status_code=?,?,?) where i
d=?
(snip)

By using the report, we can decide from which query to start optimizing, by using explain, or by letting their responses be cached by memcached, etc.  It is also possible to replace mpfilter with a house-made script to customize the report.

Conclusion

There are other ways to find out bottlneck queries, like using the slow query log or MySQL Enterprise Monitor.  Comparing with them, the merits of using the described solution using "SHOW PROCESSLIST" are

  • no need to change current environment
  • practically no overhead
  • potential to build a customized performance tuning / monitoring tool
As an example of using "SHOW PROCESSLIST" for other purposes than performance tuning, it would be possible to keep a daily sample log of SQL queries using mprofile, and use them to find out the long-term changes of the bottlenecks to build server reinforcement plans or for refactoring the design of the application.

Finally, the source code of mprofile is available from /platform/mysql/mprofile – CodeRepos::Share – Trac.  Have fun!

July 14, 2009

Intruducing Incline - a synchronization tool for RDB shards

For the last weeks, I have been writing a tool called "Incline," a program that automatically maintains consistency between sharded MySQL databases.  The aim of the software is to free application developers from hand-writing code for keeping consistency between RDB nodes, so that they can concentrate on writing the application logic.

Background

Denormalization is unavoidable in a sharded RDB environment.  For example, when a message is sent from a user to another user, the information should be stored on a database node where the sender of the message belongs to, and on another node where the receiver does.  In most cases, denormalization logic is hand-written by web application developers, and since it has been a burden for creating large-scale web services.  Incline takes off the load from developers.  By reading the definition files, Incline keeps the tables on a sharded MySQL environment in sync, by providing a trigger-generator and a replication program that, synchronously or asynchronously reflects the changes of source tables into materialized-view-like tables.

Installing Incline

Incline is written in C++ and uses autotools for building the tools.  However, since I have not yet added automatic lookup for libmysqlclient, you need to specify their location manually to build Incline.  My build procedure is as follows.

% svn co http://kazuho.31tools.com/svn/incline/trunk incline
% cd incline
% autoreconf -i
% ./configure CFLAGS=-I/usr/local/mysql/include/mysql CXXFLAGS=-I/usr/local/mysql/include/mysql LDFLAGS='-L/usr/local/mysql/lib/mysql -lmysqlclient'
% make
% make install

Defining Replication Rules

Replication rules of Incline are written using JSON files.  Consider creating a twitter-like microblog on a shared environment.  It would be consisted of four tables, each of them distributed through RDB shards by the user_id column.  The "tweet" table and "following" table are updated by user actions, while "followed_by" table and "timeline" table are denormalized tables that need to be kept synchronized to the former two.

CREATE TABLE tweet (
  tweet_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT UNSIGNED NOT NULL,
  body VARCHAR(255) NOT NULL,
  ctime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (tweet_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE following (
  user_id INT UNSIGNED NOT NULL,
  following_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id,following_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE followed_by (
  user_id INT UNSIGNED NOT NULL,
  follower_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id,follower_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE timeline (
  user_id INT UNSIGNED NOT NULL,
  tweet_user_id INT UNSIGNED NOT NULL,
  tweet_id INT UNSIGNED NOT NULL,
  ctime TIMESTAMP NOT NULL,
  PRIMARY KEY (user_id,tweet_user_id,tweet_id),
  KEY user_ctime_tweet_user_tweet (user_id,ctime,tweet_user_id,tweet_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Incline uses two JSON files to define a replication rule.  The first file, microblog.json defines the mapping of columns using directives "pk_columns" (primary key columns) and "npk_columns" (non-pk_columns) between the "source" table(s) and "destination" of the replication.  When merging more than two tables to a single destination, "merge_cond" attribute is used to define the inner join condition(s).

microblog.json
[
  {
    "source"      : [ "tweet", "followed_by" ],
    "destination" : "timeline",
    "pk_columns"  : {
      "followed_by.follower_id" : "user_id",
      "tweet.user_id"           : "tweet_user_id",
      "tweet.tweet_id"          : "tweet_id"
    },
    "npk_columns" : {
      "tweet.ctime" : "ctime"
    },
    "merge"       : {
      "tweet.user_id" : "followed_by.user_id"
    },
    "shard-key"   : "user_id"
  },
  {
    "source"      : "following",
    "destination" : "followed_by",
    "pk_columns"  : {
       "following.following_id" : "user_id",
       "following.user_id"      : "follower_id"
    },
    "shard-key"   : "user_id"
  }
]

The second file shard.json defines the mapping between user_id and RDB nodes.  Range-based sharding (on an integer column) is specified in the example.  Other algorithm currently supported are: range-str-case-sensitive and hash-int.

shard.json
{
  "algorithm" : "range-int",
  "map"       : {
    "0"    : "10.0.1.1:3306",
    "1000" : "10.0.1.2:3306",
    "2000" : "10.0.1.3:3306",
    "3000" : "10.0.1.4:3306"
  }
}

With the definition, tables will be synchorized in the direction described in the figure below (only illustrates two nodes).

Running Incline

To run Incline, queue tables should be created and database triggers need to be installed on each RDB node.  This can be done by calling "incline create-queue" and "incline create-trigger" commands for each node, and the setup is complete.

% incline --mode=shard --source=microblog.json --shard-source=shard.json --database=microblog --mysql-host=10.0.1.1 create-queue
% incline --mode=shard --source=microblog.json --shard-source=shard.json --database=microblog --mysql-host=10.0.1.1 create-trigger
% incline --mode=shard --source=microblog.json --shard-source=shard.json --database=microblog --mysql-host=10.0.1.2 create-queue
% incline --mode=shard --source=microblog.json --shard-source=shard.json --database=microblog --mysql-host=10.0.1.2 create-trigger
...

The installed triggers take care of synchronizing the denormalized tables within each node.  The next step is to run the forwarder (replicator between RDB nodes) for each node, so that the views should be kept in sync.  This can be done by calling "incline forward."The forwarding logic is defined in a way that the data stored in RDB nodes would become eventually synchronized even if either of the RDB nodes or the forwarder unexpectedly dies.  Use of daemontools or alike to automatically (re)start the forwarder is desirable.

% incline --mode=shard --source=microblog.json --shard-source=shard.json --database=microblog --mysql-host=10.0.1.1 forward &
% incline --mode=shard --source=microblog.json --shard-source=shard.json --database=microblog --mysql-host=10.0.1.2 forward &
...

And this is it.  The two tables: "followed_by" and "timeline" are updated automatically when the "tweet" or "following" table is modified.  Application developers do not need to take care of shard consistency any more, all that needs to be remembered is that changes should be written into the latter two (modification to the former two can be disabled by using different access privileges between the web application and the Incline forwarder).

Next Steps

Having started writing code late last month, Incline is still immature.  Current status is somewhere around early beta.  In the next months, I plan to polish up and optimize the code, add PostgreSQL support (currently Incline works only on MySQL), as well as extending it so that it could used together with Pacific, a framework I have been working, that provides dynamic node addition / removal of nodes on a sharded RDB environment.  Thank you for reading, please stay tuned.

July 06, 2009

PicoJSON, a header-file-only JSON parser in C++

Last week I had been looking for a tiny JSON parser written in C++, but was not able find one that met my needs (jsonxx was the most interesting, I took its is<type> and get<type> interfaces).  So instead I wrote my own, and that is PicoJSON.  The advantages of PicoJSON over other C++ JSON parsers are as follows.

  • C++ header-only implementation (just include the header file from your source code and that's it)
  • standalone (does not depend on other libraries like boost)
  • integration with STL (JSON array represented as std::vector, JSON object as std::map)
  • operator<<, operator>> and iterator based parser / serializer interface
  • provided under the New BSD License

The header file is available from coderepos (link).  Please read the examples included to find out how to use the library.

Related Articles:

June 01, 2009

My weblog URL changed (ブログ移転のおしらせ)

My weblog has moved from labs.cybozu.co.jp/blog/kazuho (in Japanese), labs.cybozu.co.jp/blog/kazuhoatwork (in English) to this new location.  The latter has been merged as in English category of the new weblog.

RSS and Atom feeds have been redirected permanently to that of the new weblog. Old weblogs will be kept open as static pages.