Tuesday, March 1, 2011

Tuesday, September 2, 2008

Database Sharding Vs. Mysql Clusters

Databases have been an integral part of information technology for a long time. They have taken various forms and shapes, relational, object oriented, hierarchical, in-memory, replicated, partitioned but mostly residing on the same machine. But the present internet party has more participants than ever before. And almost every action carried out by these netizens causes some results, which are stored in some database somewhere. Actions range from creating/deleting accounts, searching a phrase, commenting, rating to just clicking an ad, navigating a link or scrolling through a page. No single database can record the volume of information generated. Since only scaling-up is not an option anymore, more and more businesses are opting for scaling-out (may be in conjunction with scaling-up). Scaling-up is when you add more power to the existing machine or replace the existing machine with a more powerful one. In scaling-out more less expensive nodes are added to the system and the load is shared. In our case, the database is divided and stored on multiple machines. This brings us to sharding.

Broken glass in St Patricks Cathedral by féileacán


What is Sharding?

Sharding is the concept of splitting the table horizontally, and sometimes vertically, each split or shard stored on a different server. This is 'database partitioning' provided by the database providers and more. While partitioning allows only partitioning by primary key or index, whereas these restrictions do not exist in sharding. Master-Slave replication can also be done in sharding to deliver better availability. Please note that in such cases replication helps provide high availability and does not address scalability issues. For example, a database of search terms can be sharded based on the country it originated from. One server would be dedicated to store search terms of USA whereas another would store search terms from countries in Africa.

Sharding : Common Features

Since in sharding, databases are horizontally partitioned and spread across machines, joins are expensive or out-right not possible. Hence normalised tables are not optimal. Related data are put together in the same table, ensuring the queries are minimal. The nodes are less-expensive commodity machines. Blobs and large texts are moved to different servers ensuring the load is balanced. Since the shards function in the capacity of master databases, this is ideal for applications that have high write/read ratio.

See scalability.org for more on Sharding

MYSQL Clustering:

Of recent I found out about MySQL Clustering.

This is MySQL Clustering as described in mysql.com

"MySQL Cluster is a real-time open source transactional database designed for fast, always-on access to data under high throughput conditions. Cluster utilizes a 'shared nothing' architecture which does not require any additional infrastructure investment to provide 99.999% data availability with no single point of failure. "

A first time look gives an impression that it is quite similar to shards. Is it really? I tried to find out and tried to read about MySQL Cluster and Shards.

MySQL Cluster:

MySQL Cluster is a distributed database that ensures high performance using in-memory databases. It uses a distributed node-based architecture with automatic fast fail over in less than a second. While the whole database is usually loaded in-memory, it is possible in MySQL 6.0 to use disk cache for non-indexed fields. It limits Input/Output bottlenecks by asynchronously writing transactions logs to disk. The architecture is composed of three types of nodes.

Database Nodes:

All data is stored on the Database Nodes. The database transactions are handled by these nodes. The data is replicated in such a way that if a node goes down, there is always some other node that has a replication of the down node.

Application Nodes:

These nodes contain the applications that connect to the database. Failure of a database node does not affect the application.

Management Server Nodes:

These handle the system configuration. The management server nodes are used only at startup and for system reconfiguration and function independent of the other nodes.

This architecture ensures there is no single point of failure. Applications continue to run and data remains consistent, even if any one of the database, application, or management server nodes fail.

While the nodes can be distributed across geographies, the official FAQ of MySQL Clusters cautions against it, as it can greatly increase the response time. But if the cluster is geographically distributed, then the replication of data becomes asynchronous.

Sharding Vs MySQL Cluster

Differences in Intention:

Understanding this is very important as this underlies all the design and architectural decisions taken for the these two entities.

Sharding is designed for

1. Quick Scalability

2. Handling high write/read ratios

MySQL Clusters are designed for

1. Fast Responses

2. Always on 99.999% reliability

3. Handling high read/write ratios

Architectural Differences:

MySQL Clusters are geared to handle fast fail overs automatically. This is ensured by replicating the data in several database nodes to handle random node failures.

While replication is required and can possibly done through Master-Slave systems, this is not emphasised and is not the focus of Shards.

While any number of shards can be added to the system as long as the network supports, the number of nodes in a MySQL Cluster is limited. As of MySQL Cluster 5.16 , the maximum number of nodes a Cluster can have is 63 and the maximum number of supported data nodes is 48.

SQL Operations:

On Shards, joins are inherently impossible. MySQL Clusters support joins but they are much slower than on traditional databases as it has to be run against data on different nodes.

The MySQL uses the NDB storage engine which imposes some restrictions on memory consumed. The free memory that results from a row deletion is not recovered. So other tables cannot use this memory. But if another row is inserted into the same table, it can utilize the data. Only dropping the table will actually release the memory. With Shards , one has the freedom to choose the type of storage engine.

MySQL Clusters do not support a few features of the traditional database systems, owing to using the NDB storage engine. Some of them are Temporary tables, indexes on BLOB or TEXT data types, FULLTEXT indexes, partial transactions, partial rollbacks of transactions.

MySQL Cluster does not handle large transactions well; it is better to perform a number of small transactions with a few operations each than to attempt a single large transaction containing a great many operations.

The limitations of MySQL Cluster on naming is as follows: Attribute names are truncated to 31 characters, and if not unique after truncation give rise to errors. Database names and table names can total a maximum of 122 characters. The maximum number of attributes (that is, columns and indexes) per table is limited to 128.

Horizontal Partitioning:

On Shards the partitioning logic can be defined by the user. So essentially any logic can be employed. As of MySQL Cluster 5.1.6, all Cluster tables are by default partitioned by KEY using the table's primary key as the partitioning key. If no primary key exists, a hidden autoincrement column is created and used as primary key which is also used for partitioning.

On Shards, an user application ensures that the query goes to the shard where it is relevant based on the partitioning logic. This can add more work on the developer side, but reduces the no of queries executed.

For MySQL Cluster, there is a single point of contact to connect to the database, and the developer/user is screened from the internals of where the data is stored. The query may be executed on more than one database, (depending on the query).

Rebalancing Data & Backup Strategies:

In shard, These features have been custom implemented currently. There are some libraries that are coming up to assist the same.

On a MySQL Cluster these are implemented and require least work.

Wednesday, June 25, 2008

Configuring Apache

I have Ubuntu Hardy Heron installed on my Lenovo Z60m. I wanted to try installing apache 2.2.9 without using the package manager apt-get with options

./configure --enable-modules=all --enable-mods-shared=all --enable-mem-cache --enable-disk-cache --enable-example --enable-so

But it failed with

configure: WARNING: Unrecognized options

This remained true for any configuration option except --help.


If you find yourself in this situation:
1. Ensure build-essentials is installed
2. Ensure gcc and its libraries are installed
3. go to the /untarred/apache/httpd directory. run './buildconf'

Hope this helps!

Tuesday, June 24, 2008

Yahoo! Babel Fish Vs Google Translate : A Case in Point

I was planning/dreaming to create and host my website and had some grand ideas for it. I started off by trying to setting up an apache server on my old faithful Lenovo Z60m and Ubuntu Hardy Heron. I was facing a rather obscure error in Apache 2.2.9 configuration.

configure: WARNING: Unrecognized options: --enable-rewrite, --enable-speling

Now that was weird and I started sifting through tonnes and tonnes of google results to find a solution. I came upon a Spanish version of Ubuntu forum that seemed to address my solution but the google's corresponding translated page was not available.

So I decided to have it translated through one of the free translators available.
The spanish text was as below

Hola:

Lo primero que debes hacer es instalar los que se llaman Fuentes o de
Desarrollo (devel), que son los que tienen librerias de compilacion y
dependencias para casos como el tuyo, trato de ver que modulos debes
instalar, pero intenta con lo que te dije.

Ademas para instalar desde las fuentes como estas haciendo debes tener



instalado C++ y todas sus librerias.

I first tried on Yahoo! Babel Fish and this is what I got!

Hello: First that you must do it is to install those that are called Sources or of Development (devel), that is those that has bookshops of compilation and dependencies for cases like yours, treatment to see that units you must install, but tries and so I said to you. Furthermore to install from the sources since these doing you must have installed C++ and all bookshops.

It was living up true to its name. I was Babel I could not make out anything out of it.


Then I tried Google Translate. Lo and behold! What I got out of it made sense and it was actually well constructed!


Hello:

The first thing you should do is to install so called sources or
Development (devel), which are the libraries of compilation and
units for cases like yours, I try to see that module should
install, but try what you said.

In addition to installing from sources such as these you have done
C + + installed and all its bookstores.



Encouraged by these results I tried out a few more :

TranStar:

Hello: The first thing you should do is to install so called sources or Development (devel), which are the libraries of compilation and units for cases like yours, I try to see that you can install modules, but what we try I told you. In addition to installing from sources such as doing you must have C + + and all its bookstores.

Athropolis.com

Hello:

The first thing that you must do is to install those who are called Sources or of
I develop (devel), that are those who have bookstores of compilation and
dependencies for cases as yours, I try to see that you owe modules
to install, but it tries with what I said to you.

Also to install from the sources how are you doing you must have
installed C and all his bookstores.

translation2.paralink.com

Hello:

The first thing that you must do is to install those who are called Sources or of
I develop (devel), that are those who have bookstores of compilation and
dependencies for cases as yours, I try to see that you owe modules
to install, but it tries with what I said to you.

Also to install from the sources how are you doing you must have
installed C and all his bookstores.

Dictionary.com Translator

Hello:

First that you must do it is to install those that are called Sources or of Development (devel), that is those that has bookstores of compilation and dependancies for cases like yours, treatment to see that units you must install, but tries and so I said to you.

Furthermore to install from the sources since these doing you must have installed C++ and all bookstores.



It turned out that the Google Translate was more accurate and more 'english' (for want of a better description) than the rest. But one must bear in mind that the context is highly technical and specific to a rather narrow field of computer science and the tiny experiment is no true test at all. Hence any inference drawn from this is in no way representative of the general corpus of data and does not reflect the true capabilities of the translators. Disassociations apart, now I know where to go if I need a technical page to be translated ;) .

Monday, May 5, 2008

Reinstalling Cisco vpnclient on Ubuntu 8.04

I had recently upgraded to Ubuntu 8.04 on my office workstation. While this is a quite interesting move, some of my custom installed modules have stopped working or removed. This includes vim, eclipse europa etc. The latest to join the list of broken packages is vpnclient. I tried to reinstall vpnclient. While the installation itself went through fine, when I tried to start, it failed with the following error.

sharmi@sharmi:~/Downloads/vpnclient$ sudo /etc/init.d/vpnclient_init start
Starting /opt/cisco-vpnclient/bin/vpnclient: insmod: error inserting '/lib/modules/2.6.24-16-386/CiscoVPN/cisco_ipsec.ko': -1 Invalid module format
Failed (insmod)


That is when I learnt about the new patch required to run the vpnclient with in Ubuntu 8.04

So to reinstall, I will assume that the vpnclient is already downloaded and untarred.
  1. Update the linux-headers to version 2.6.24-16-386 (also called kernel-headers in some distributions)
    • Start the Synaptic Package Manager: System->Administration->Synaptic Package Manager
    • search for 'linux-headers'. Choose linux-headers-2.6.24-16-386 for installation
  2. To apply the patch and install vpnclient
This worked for me :) Hope it helps you too ..

Tuesday, April 22, 2008

Python is Cool! - xkcd.com

I'm a regular reader of xkcd.com and I have come across some really funny & thoughtful works. I was thrilled to see the comics on python as I love code in python too and believe completely in what the comics depict. And I want to share it with you.