Home . Services. Technology . Partners. Management .Success Stories .WhitePapers .ContactUs





 

Evolution of Smart Data Warehousing

Data warehouses typically start as a few tables in our transaction processing database to support some reporting requirements for one or two users. But before we know, the number of users and the number of tables grow and then we decide to officially call it a data warehouse. We may separate these tables out into a new schema or even create a new database for it. As the data volume grows, some queries take more time and the users complain. So we create a few indexes to speed them up. Then we notice that the data load into these tables take way too long.

As the data looks interesting some analyst may attempt a very complex query to get a meaningful business result, but that query may never come back.  We try to create some more indexes and may be even change the schema a little bit to get that valuable query to run, but it just refuses to work. Meanwhile more and more users are starting to query the data warehouse with typical analytical queries like trending over time and ranking. Soon we realize that we may not be able to tune the existing tables. When we look at the size of the indexes and find that it is many times the size of the base table, we know something is not right.

May be we should look at the schema design and see if data warehouse best practices are followed. We may want to check if the data integrity is maintained, if the data types used in the join conditions are efficient, if we can eliminate unnecessary outer joins by using data transformation rules. Did we pay attention to the impact on the data load time when we created new indexes to tune those queries? On top of all these considerations, does our “Data Warehouse” actually serve the business purpose? Ideally we would like to provide the management with cool looking dashboards with the key metrics at their finger tips. But when the queries run for minutes or hours, you cannot present them in a dashboard.

So we redesign and make sure we have a dimensional model with data integrity. We implement   transformation rules to pre-calculate values, build categories and hierarchies. After all this effort, our queries do well for some time. But the data volume keeps growing and the queries get slower once again. Now we start creating aggregate tables and start thinking about OLAP cubes. These are good to speed up summary level queries with limited dimensional attributes but pose a lot of challenges for the broader business requirements. It is painful to maintain the data in the aggregate tables and cubes in sync. Providing drill down from here to the base table data is yet another story. It would be so much easier if the queries can run fast from the base fact and dimension tables. It becomes clear to us that irrespective of aggregate tables or cubes, the base table queries need to run fast. So we take a deep breath and start a massive server infrastructure overhaul. In the non-MySQL data warehousing world, I have seen millions of dollars getting invested in such efforts and have wondered if there was enough ROI.

Fortunately, there are some innovative options in the MySQL world which address these issues. There is no need for unnecessary indexes, aggregate tables or cubes. With proper data warehouse design and best practices, column store, data compression and a large memory pool we can run complex queries effortlessly and present meaningful dashboards to the business. The right technology combined with the right design can serve the purpose. Best of all, it does not have to be expensive. If the tables get too big, we could partition or shard but we still want the number of servers to be relatively small so we can maintain the ecosystem easily. Now that to me is smart data warehousing.

 

Seetha Sarma

Co-Founder, Himmath Technologies Inc

06/25/09

 

 

 

 

 

© Copyright 2009. Himmath Technologies, Inc. All rights reserved.