# Miscellaneous > Database Design >  Need to tap your brains....

## nicc777

Hi

I read an article on Linuxjournal, and of particular interest for me was the following statement:




> Any lingering doubts I had about Collins' geek side vanished when he explained how he managed to compress a 750 meg database. The database contained 21 pages of financial history for more than 5,000 firms. He needed it to fit on an 80 meg drive. He began by trying the Huffman compression algorithm. It didn't squeeze tight enough. It would take the air out of the text portion of the database just fine, but numbers were more difficult and didn't compress as well. One day when he was driving home, it came to him. He converted the numbers to base-256 and voila, it worked and the entire database now fit easily on the 80-meg drive.


Now I wonder - is this an often used practise? What is the effects on performance?

I would like to hear from you what you think about this, seemingly, interesting solution.

Cheers

----------


## skhanal

Compression and uncompression always take CPU cycles to process, so there is definitely some performance issues.

Compression is not a common practice for a live database. But it can be useful for archiving old data.

----------


## tombest

Compression doesn't necessarily mean things are slower.  Reading and writing fewer bytes on disk will improve performance.  So, it is a trade-off that is specific to the application and it's data attributes.

----------


## MAK

But also consider this. Instead of writing a row in one sec if the processor take 1 more sec to compress it, it will take 2 secs to write the same row.

And also reading (retrieval) take twice the time it used to take because the processor has to decompress it.

The compression logic should be written in such a way that it uses the CPU time when the CPU is idle or Off peak time.

----------


## tombest

Here is an example in Oracle, where compression results in faster writes.  I could generate the same type of example for reads.

Consider a table x with a single character column with just a bunch of "a"s in it:

SQL> select * from x where rownum < 5;

A
--------------------------------------------------------------------------------
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaa

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaa

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaa

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaa

It has 65536 rows in it.

I create table Y with compression, and table Z without, and at the same time insert the data from table X into each of these new tables.  Note the timing reported after each create.  I switched back and forth between the two just to eliminate spurious performance problems on the DB:

SQL> create table y  organization heap compress as select * from x;

Table created.

Elapsed: 00:00:00.09
SQL> create table z  organization heap nocompress as select * from x;

Table created.

Elapsed: 00:00:04.03
SQL> create table y2  organization heap compress as select * from x;

Table created.

Elapsed: 00:00:01.04
SQL> create table z2  organization heap nocompress as select * from x;

Table created.

Elapsed: 00:00:04.07

Thus, compression can give better performance (almost 4 times faster in this case) in addition to space savings.  

So, it depends on the data.  Compressing less-compressible data may cause worse performance, yes.  But, it depends on the data.

----------


## MAK

In my opinion,

It depends on lot of following. # of preocessors, RAM, Speed (rpm) of HDD, Raid level in HDD, # of connections, # of processes running in machines, WHat kind of index you use, how many file groups you have and so on etc.

Mostly production boxes are built for stability and handle volumes of processing. When you put compression logic into it, for small amount of data it works fine and faster. But for large volumes of data with large # of connections it is not advisable.

----------


## nicc777

But, if we take the compression logic away from the DB and let the client handle it, it will no longer be a problem - I thnik. 

I must admit, I still haven't decided what all the pro's and con's could be, but consider the following: 

1) Some application act as a collector of data. The collector uses a certain industry standard for data compression ( zlib as an example ). The collector handles the compression logic of the data BEFORE it's commited to the DB. I of course assume that this will be a seperate physical box ( or cluster ).

2) The DB stores only the compressed data, therefor it handles no compression logic.

3) The client requesting data will recieve the raw data - which means still in compressed form. The client can now uncompress the data for display. Of course there can be several or even hundreds of clients.

Given the above, there is now a new problem - how do I search for info in the compressed data tables? One way could be to create search and index tables. Depending on what we store in the tables, this system could ( my opinion ) eventually grow just as big as the same DB without using compression.

What I do agree with is that for now, this approach seems to be better for archiving old data. For live data, Caching with network traffic compression seems the better solution.

Once again, thank you all for your input - I look forward to hear some more opinions.

Cheers

----------


## tombest

What DBMS are you using?

----------


## skhanal

It all depends on the type of data, compression logic handle text very well but if you have image then it may not give you as much benefit.

In Tombest's example, the result is skewed because of the data, if you only have all a in the table then it is very easy for the compression program, just count the number of a and that's the compression. But in real world you have all mix of characters. 

But it proves a point for one extreme end, if you do the same for randomly generated text then that might be more useful test.

----------


## nicc777

> _Originally posted by tombest_ 
> *What DBMS are you using?*


LAMP - Linux, Apache, MySQL and Perl

In my case I make extensive use of mod_perl.

I am always open to other technologies, and have recently started to investigate SAP. I still have a looooong way to go though. 

I also played a bit with Oracle, but found it bloated. I think Oracle is the kind of thing which you should start with when you're still young  :Smilie:  BTW: I think my SAP research will end up like Oracle. It just seems to much to learn and I simply don't have the time at hand.

My worst experience so far was MS SQL Server 2000. Nice features and relative easy to manage - but very slow. At one stage I thought I was going mad, but then I made a simple test - installed MS SQL on a box and run parralel with MySQL. Did some loadrunner tests - and the results was expectedly obvious. Although, the moment MS SQL get's the speed of MySQL, I might start to look at it again ( I think part of the bottleneck was ODBC - I know to little about ODBC to be sure though ).

There you have it. Where the budget allows, I will always try to keep the DB and application on seperate boxes. I already have a pretty good skeleton plan for clustering just about anything. I know MySQL doesn't really cluster, but I tackle it with the following solution: Set up a Master MySQL DB, with as many replicated slaves as you need. Between the application cluster and the DB cluster I then use PEN ( tcp load balancer ) to balance all SELECT queries between all the slaves. I use the master only for INSERTS or UPDATES. The application server clusteres very easy with openMosix on Linux and Parallel::ForkManager. I have also used some caching mechanisms, although the apps I currently run are all real time apps that can't really use a caching mechanism.

At this stage storage is not a major hassle as my largest DB only has about 40GB of data. Modern RAID SCSI arrays are more then enough to handle this. I do play around with partitioning of the drives, and making use of symlinks to seperate the MySQL tables on different spindles ( I think thats what you call it? ) although to MySQL it looks like they are all still one the same place: /var/spool/mysql

For those interested in the technologies I mentioned, here are some links:

 *  Google Linux Directory 

 * Apache Home Page AND The Google Apache Directory 

 * The Official MySQL Home Page AND The Google MySQL Directory 

 * Perl Google Directory 

 * OpenMosix Home Page 

 * PEN Home Page ( tcp load balancer ) 

 * Parallel::ForkManager on CPAN 

 * mod_perl Home Page 

Well, that's about it.

Cheers

----------


## tombest

Regarding whether to compress or not, I wa s presenting data to disprove the earlier claim that compression will result in slower performance because it will use (twice, he said) the CPU cycles.

So, the session I showed was that in the case where the data is highly compressible, you GAIN performance with compression.

Below, is another example where you LOSE performance because of the compression.

In this example, the data in the table is actually zipped data - a big blob of ZIP data.  So, it is not very compressible.

The difference in timing is not as drastic in my particular case (32 vs 33 seconds) but it was consistent (I did it a few times):

SQL> create table y organization heap compress as select * from x;

Table created.

Elapsed: 00:00:33.07
SQL> create table z organization heap nocompress as select * from x;

Table created.
Elapsed: 00:00:32.08

So, the point is, it depends on the data.  You can't make a statement that compression is faster or slower without considering the data.

----------


## MAK

But we cant be optimisic that the data will be all 'a' s and easily compressable and it will not. 

we should always consider worst scenerios so that when it is live we can handle it better.

----------


## tombest

Yes, that is exactly my point.  Did you see my post - it says "it depends on the data".  That's why I presented a case for each side.... the extreme for each side.

You said compression takes more CPU so it will be slower.  That is a FALSE.  Just trying to dispell that myth.

----------

