Tuesday, September 2, 2008

Database Sharding Vs. Mysql Clusters

Databases have been an integral part of information technology for a long time. They have taken various forms and shapes, relational, object oriented, hierarchical, in-memory, replicated, partitioned but mostly residing on the same machine. But the present internet party has more participants than ever before. And almost every action carried out by these netizens causes some results, which are stored in some database somewhere. Actions range from creating/deleting accounts, searching a phrase, commenting, rating to just clicking an ad, navigating a link or scrolling through a page. No single database can record the volume of information generated. Since only scaling-up is not an option anymore, more and more businesses are opting for scaling-out (may be in conjunction with scaling-up). Scaling-up is when you add more power to the existing machine or replace the existing machine with a more powerful one. In scaling-out more less expensive nodes are added to the system and the load is shared. In our case, the database is divided and stored on multiple machines. This brings us to sharding.

Broken glass in St Patricks Cathedral by féileacán

What is Sharding?

Sharding is the concept of splitting the table horizontally, and sometimes vertically, each split or shard stored on a different server. This is 'database partitioning' provided by the database providers and more. While partitioning allows only partitioning by primary key or index, whereas these restrictions do not exist in sharding. Master-Slave replication can also be done in sharding to deliver better availability. Please note that in such cases replication helps provide high availability and does not address scalability issues. For example, a database of search terms can be sharded based on the country it originated from. One server would be dedicated to store search terms of USA whereas another would store search terms from countries in Africa.

Sharding : Common Features

Since in sharding, databases are horizontally partitioned and spread across machines, joins are expensive or out-right not possible. Hence normalised tables are not optimal. Related data are put together in the same table, ensuring the queries are minimal. The nodes are less-expensive commodity machines. Blobs and large texts are moved to different servers ensuring the load is balanced. Since the shards function in the capacity of master databases, this is ideal for applications that have high write/read ratio.

See scalability.org for more on Sharding

MYSQL Clustering:

Of recent I found out about MySQL Clustering.

This is MySQL Clustering as described in mysql.com

"MySQL Cluster is a real-time open source transactional database designed for fast, always-on access to data under high throughput conditions. Cluster utilizes a 'shared nothing' architecture which does not require any additional infrastructure investment to provide 99.999% data availability with no single point of failure. "

A first time look gives an impression that it is quite similar to shards. Is it really? I tried to find out and tried to read about MySQL Cluster and Shards.

MySQL Cluster:

MySQL Cluster is a distributed database that ensures high performance using in-memory databases. It uses a distributed node-based architecture with automatic fast fail over in less than a second. While the whole database is usually loaded in-memory, it is possible in MySQL 6.0 to use disk cache for non-indexed fields. It limits Input/Output bottlenecks by asynchronously writing transactions logs to disk. The architecture is composed of three types of nodes.

Database Nodes:

All data is stored on the Database Nodes. The database transactions are handled by these nodes. The data is replicated in such a way that if a node goes down, there is always some other node that has a replication of the down node.

Application Nodes:

These nodes contain the applications that connect to the database. Failure of a database node does not affect the application.

Management Server Nodes:

These handle the system configuration. The management server nodes are used only at startup and for system reconfiguration and function independent of the other nodes.

This architecture ensures there is no single point of failure. Applications continue to run and data remains consistent, even if any one of the database, application, or management server nodes fail.

While the nodes can be distributed across geographies, the official FAQ of MySQL Clusters cautions against it, as it can greatly increase the response time. But if the cluster is geographically distributed, then the replication of data becomes asynchronous.

Sharding Vs MySQL Cluster

Differences in Intention:

Understanding this is very important as this underlies all the design and architectural decisions taken for the these two entities.

Sharding is designed for

1. Quick Scalability

2. Handling high write/read ratios

MySQL Clusters are designed for

1. Fast Responses

2. Always on 99.999% reliability

3. Handling high read/write ratios

Architectural Differences:

MySQL Clusters are geared to handle fast fail overs automatically. This is ensured by replicating the data in several database nodes to handle random node failures.

While replication is required and can possibly done through Master-Slave systems, this is not emphasised and is not the focus of Shards.

While any number of shards can be added to the system as long as the network supports, the number of nodes in a MySQL Cluster is limited. As of MySQL Cluster 5.16 , the maximum number of nodes a Cluster can have is 63 and the maximum number of supported data nodes is 48.

SQL Operations:

On Shards, joins are inherently impossible. MySQL Clusters support joins but they are much slower than on traditional databases as it has to be run against data on different nodes.

The MySQL uses the NDB storage engine which imposes some restrictions on memory consumed. The free memory that results from a row deletion is not recovered. So other tables cannot use this memory. But if another row is inserted into the same table, it can utilize the data. Only dropping the table will actually release the memory. With Shards , one has the freedom to choose the type of storage engine.

MySQL Clusters do not support a few features of the traditional database systems, owing to using the NDB storage engine. Some of them are Temporary tables, indexes on BLOB or TEXT data types, FULLTEXT indexes, partial transactions, partial rollbacks of transactions.

MySQL Cluster does not handle large transactions well; it is better to perform a number of small transactions with a few operations each than to attempt a single large transaction containing a great many operations.

The limitations of MySQL Cluster on naming is as follows: Attribute names are truncated to 31 characters, and if not unique after truncation give rise to errors. Database names and table names can total a maximum of 122 characters. The maximum number of attributes (that is, columns and indexes) per table is limited to 128.

Horizontal Partitioning:

On Shards the partitioning logic can be defined by the user. So essentially any logic can be employed. As of MySQL Cluster 5.1.6, all Cluster tables are by default partitioned by KEY using the table's primary key as the partitioning key. If no primary key exists, a hidden autoincrement column is created and used as primary key which is also used for partitioning.

On Shards, an user application ensures that the query goes to the shard where it is relevant based on the partitioning logic. This can add more work on the developer side, but reduces the no of queries executed.

For MySQL Cluster, there is a single point of contact to connect to the database, and the developer/user is screened from the internals of where the data is stored. The query may be executed on more than one database, (depending on the query).

Rebalancing Data & Backup Strategies:

In shard, These features have been custom implemented currently. There are some libraries that are coming up to assist the same.

On a MySQL Cluster these are implemented and require least work.

Wednesday, June 25, 2008

Configuring Apache

I have Ubuntu Hardy Heron installed on my Lenovo Z60m. I wanted to try installing apache 2.2.9 without using the package manager apt-get with options

./configure --enable-modules=all --enable-mods-shared=all --enable-mem-cache --enable-disk-cache --enable-example --enable-so

But it failed with

configure: WARNING: Unrecognized options

This remained true for any configuration option except --help.

If you find yourself in this situation:
1. Ensure build-essentials is installed
2. Ensure gcc and its libraries are installed
3. go to the /untarred/apache/httpd directory. run './buildconf'

Hope this helps!

Tuesday, June 24, 2008

Yahoo! Babel Fish Vs Google Translate : A Case in Point

I was planning/dreaming to create and host my website and had some grand ideas for it. I started off by trying to setting up an apache server on my old faithful Lenovo Z60m and Ubuntu Hardy Heron. I was facing a rather obscure error in Apache 2.2.9 configuration.

configure: WARNING: Unrecognized options: --enable-rewrite, --enable-speling

Now that was weird and I started sifting through tonnes and tonnes of google results to find a solution. I came upon a Spanish version of Ubuntu forum that seemed to address my solution but the google's corresponding translated page was not available.

So I decided to have it translated through one of the free translators available.
The spanish text was as below


Lo primero que debes hacer es instalar los que se llaman Fuentes o de
Desarrollo (devel), que son los que tienen librerias de compilacion y
dependencias para casos como el tuyo, trato de ver que modulos debes
instalar, pero intenta con lo que te dije.

Ademas para instalar desde las fuentes como estas haciendo debes tener

instalado C++ y todas sus librerias.

I first tried on Yahoo! Babel Fish and this is what I got!

Hello: First that you must do it is to install those that are called Sources or of Development (devel), that is those that has bookshops of compilation and dependencies for cases like yours, treatment to see that units you must install, but tries and so I said to you. Furthermore to install from the sources since these doing you must have installed C++ and all bookshops.

It was living up true to its name. I was Babel I could not make out anything out of it.

Then I tried Google Translate. Lo and behold! What I got out of it made sense and it was actually well constructed!


The first thing you should do is to install so called sources or
Development (devel), which are the libraries of compilation and
units for cases like yours, I try to see that module should
install, but try what you said.

In addition to installing from sources such as these you have done
C + + installed and all its bookstores.

Encouraged by these results I tried out a few more :


Hello: The first thing you should do is to install so called sources or Development (devel), which are the libraries of compilation and units for cases like yours, I try to see that you can install modules, but what we try I told you. In addition to installing from sources such as doing you must have C + + and all its bookstores.



The first thing that you must do is to install those who are called Sources or of
I develop (devel), that are those who have bookstores of compilation and
dependencies for cases as yours, I try to see that you owe modules
to install, but it tries with what I said to you.

Also to install from the sources how are you doing you must have
installed C and all his bookstores.



The first thing that you must do is to install those who are called Sources or of
I develop (devel), that are those who have bookstores of compilation and
dependencies for cases as yours, I try to see that you owe modules
to install, but it tries with what I said to you.

Also to install from the sources how are you doing you must have
installed C and all his bookstores.

Dictionary.com Translator


First that you must do it is to install those that are called Sources or of Development (devel), that is those that has bookstores of compilation and dependancies for cases like yours, treatment to see that units you must install, but tries and so I said to you.

Furthermore to install from the sources since these doing you must have installed C++ and all bookstores.

It turned out that the Google Translate was more accurate and more 'english' (for want of a better description) than the rest. But one must bear in mind that the context is highly technical and specific to a rather narrow field of computer science and the tiny experiment is no true test at all. Hence any inference drawn from this is in no way representative of the general corpus of data and does not reflect the true capabilities of the translators. Disassociations apart, now I know where to go if I need a technical page to be translated ;) .

Monday, May 5, 2008

Reinstalling Cisco vpnclient on Ubuntu 8.04

I had recently upgraded to Ubuntu 8.04 on my office workstation. While this is a quite interesting move, some of my custom installed modules have stopped working or removed. This includes vim, eclipse europa etc. The latest to join the list of broken packages is vpnclient. I tried to reinstall vpnclient. While the installation itself went through fine, when I tried to start, it failed with the following error.

sharmi@sharmi:~/Downloads/vpnclient$ sudo /etc/init.d/vpnclient_init start
Starting /opt/cisco-vpnclient/bin/vpnclient: insmod: error inserting '/lib/modules/2.6.24-16-386/CiscoVPN/cisco_ipsec.ko': -1 Invalid module format
Failed (insmod)

That is when I learnt about the new patch required to run the vpnclient with in Ubuntu 8.04

So to reinstall, I will assume that the vpnclient is already downloaded and untarred.
  1. Update the linux-headers to version 2.6.24-16-386 (also called kernel-headers in some distributions)
    • Start the Synaptic Package Manager: System->Administration->Synaptic Package Manager
    • search for 'linux-headers'. Choose linux-headers-2.6.24-16-386 for installation
  2. To apply the patch and install vpnclient
This worked for me :) Hope it helps you too ..

Tuesday, April 22, 2008

Python is Cool! - xkcd.com

I'm a regular reader of xkcd.com and I have come across some really funny & thoughtful works. I was thrilled to see the comics on python as I love code in python too and believe completely in what the comics depict. And I want to share it with you.

Thursday, April 17, 2008

Solution: java.lang.NoClassDefFoundError: org/apache/ode/tools/sendsoap/cline/HttpSoapSender

Of late, I have been involved in evaluating a few workflow engines to see their usefulness for my purpose. One of the candidates was Apache Ode. Just that it was a product of Apache itself qualified it for the evaluation.

I was trying to get it up on a linux box, specifically Ubuntu 7.10. I had it deployed in tomcat and Apache Ode was up and running. I tried testing it with

bin/sendsoap http://localhost:8080/ode/processes/helloWorld examples/HelloWorld2/testRequest.soap

It came back with java.lang.NoClassDefFoundError: org/apache/ode/tools/sendsoap/cline/HttpSoapSender

Examining the sendmail code , this stood out.
# Add Ode libraries
for f in ls $LIB/*.jar

While this is intended to get all jar files in $LIB path, it fails when there is a space in $LIB path.

The path I was working out of was /home/sharmi/Work/trials/apache ode/apache-ode-war-1.1.1
The space in 'apache ode' was the spoiler.
Changing the path to /home/sharmi/Work/trials/apacheode/apache-ode-war-1.1.1 did the trick :)

Friday, February 29, 2008

Budget India 2008 - Mixed Reactions

The Union Budget for 2008-2009 was announced today by the Finance Mininster Dr. Chindambaram. It is certainly a relief for tax payers as quoted by the Hindu

The Finance Minister has raised the income tax exemption limit from Rs. 1,10,000 to Rs. 1,50,000, thus giving every assessee a relief at minimum of Rs. 4,000. The tax rate will be 10 per cent for the income slab between Rs. 1,50,001 and Rs. 3,00,000 and 20 per cent between Rs. 3,00,001 and Rs. 5,00,000. For income of Rs. 5,00,001 and above the income tax rate will be 30 per cent. The exemption limit for women assessees has been increased to Rs. 1,80,000 and in case of senior citizens to Rs. 2,25,000. The Finance Minister has not proposed any change in corporate income tax and in the rate of surcharge. A person paying medical insurance premium for his parents will be allowed an additional deduction of Rs. 15,000 under Section 80D.
While this is certainly a relief, what the action of bringing
asset management service provided under ULIP, services provided by stock/commodity exchanges and clearing houses, right to use goods in cases where VAT is not payable, and customized software under Services Tax is to be seen. This would be significant because, people tend to invest mostly in mutual funds and stocks to avail tax exemption through investment.

Rs. 60,000 crore debt relief package for farmers is a cause of concern only spreads worry and dissent. Don't get me wrong. I do understand that farmers need help direly. What I do see is that every government tends to give debt waivers at every instant possible. This actually creates a mindset that farmers need not pay nationalized banks debts as they will be waivered anyway. What is required is that farmers are able to better utilize their resources, availability of better tools for their job, proper irrigation is available, land fertility is maintained and they have all the basic necessities and facilities like schools, clean drinking water, health care. We are yet to see these reach the farmers. What progress and awareness we see among them are rarely government's efforts but are works of some dedicated individuals, NGOs or the urban progress slowly percolating to their level. But real progress is yet to reach the grass-root levels. All the development works planned by the government are sucked dry before they reach the target audience. Only when the corruption in the system is uprooted and accountability and self-discipline sets in, will we truly see 'India Shining'.

Monday, January 14, 2008

Pitfalls in handling multiple exceptions: UnboundLocalError: local: 'ValueError'

The below code raised UnboundedLocalException for 'ValueError' which is builtin. I wondered how it could be unbounded and tried to figure out what went wrong and this is what I got.
#code to do something
if (condition):
raise ValueError
except java.lang.NullPointerException, ValueError:
#code to print exception message

UnboundLocalError: local: 'ValueError'

This is when I learnt about one of the inner workings of python.

The python interpretor scans the code block once. if a name is bound to a variable with out being specified as global, it is considered local.

x = 2
def func():
x = 3
print x
print x

will print



This is because in func, x is bound locally to the variable containing the value <3>.

But the same function with the small difference will fail with 'UnBoundLocalError: local: x':
x = 2
def func():
print x
x = 3
print x

This is again because, the interpreter first scans func. It sees x is bound locally as it is assigned a value 3 within the innerscope. When it tries to execute, the print statement precedes the binding statement. At the point, x is a free variable with no binding. Hence the error.

Now in the case of the exception handling,

The construct is

#Do something
except Exception, e:
#e is bound to exception obj.
#Do something with e

e is actually bound to the instance of the class Exception and becomes a local variable.

Now considering the initial scenario, the ValueError name is bound to the instance of the 'java.lang.NullPointerException' exception. This overrides the scope of the global builtins and makes ValueError a binding to local variable and the binding happens at the except line. So when raise ValueError is called an UnboundLocalError is thrown.

So how to we handle multiple exceptions?

#code to do something
if (condition):
raise ValueError
except (java.lang.NullPointerException, ValueError):
#code to print exception message

For more interesting patterns see here.

Wednesday, January 2, 2008

Why I love Python

When I met an old friend of mine, he inquired about my work. I replied, rather gushed out, that it was wonderful 'cause I was working in a dynamic language called python. He gave me a quizzical look and asked how could you call work is good because the language is good, for after all language was just the tool. I listed out some of the features of python but somehow I felt it was not sufficient reason. Then we drifted on to to other topics.

Later, I thought back on the issue. Why I fell in love with Python is almost the story of my life. I got introduced to computers in my third grade. That was a time when computers were still rare in schools almost non-existent in homes. Then we just tried executing basic programs already created by our teachers or play games. One such program would get your name as input and print "Hello `name`" as output. I was fascinated! The next two years we spent learning who Charles Babbage was and what ENIAC was. In 5th and 6th grade I finally got to write BASIC programs. We wrote programs to calculate squares, students marks and generate multiplication tables. Normal though it seems now, I was thrilled! With a few lines of instructions, I could make the computer calculate almost anything! I took special pride in the fact that I could express my solution much better than most of my class mates. I had decided I'm going to work with computers in the future.

Then I had to change schools. My new school did not have computer classes until 11th.
When I finally got to go near computers, we started at BASIC and went on to Visual Basic. Things progressed and I got an opportunity do my B.Tech in Information Technology in a prestigious Institution. In our first sem, we had a C course which ended a disaster due to various tragicomic reasons. In our second sem, we got a wonderful Prof Mr. V. Ramachandran, who in a very short span of time, taught us C, C++ and Java, carefully ensuring we understood the underlying design, its implications and relative differences. We liked programming in C. But C++ seemed to be design filled with complications for the sake of complication. Java was relatively better. Most of our projects were done in C++ or Java. So either we spent a lot of time debugging or doing a tour of referencing to write a line of code. Some where in the process, the focus of programming shifted from understanding and solving the problem to getting constructs right and searching for the right class and method. And with that went most of the joy in programming. I still relished classes that dealt with logic like Data structures, networks, distributed OSes but I started drifting away slowly from programming.

Then I joined this cool startup that promised to bring everything under the sun to your handheld. And they used python. That was were I saw Python for the first time and It was love at first sight. The zen of Python captured my heart and all the joy of programming came back to me. Now I did not have to worry unnecessarily abt pointers or obscure constructs and factories. (Don't get me wrong, I do respect C and C++'s ability to get below the hood, but they are just an overkill for most apps ) I could spend my efforts on finding an efficient solution and as and when I reorganize my solution, my code could be reorganized with minimal effort. The coded solutions were mostly optimal but in some cases were optimization was desired, a simple change in construct would suffice. In only one situation, we had to do a C implementation for the bottleneck module.

I have been working for two years now. Python has helped me improve my coding discipline and productivity a lot. I am able to appreciate the value of Linux and vim better. It has also made me more receptive towards new languages.

Thank you, Python!