|By Martin C. Brown||
|April 6, 2005 12:00 AM EDT||
Getting the design of your database right has a huge impact on a number of issues. Primarily performance, but it can also affect other areas like the flexibility of your database and how it integrates with the application-level solution, such as PHP or Perl, that you're using.
Beginning MySQL Database Design and Optimization by Jon Stephens and Chad Russell covers the whole gamut of database design from choosing the right data types to the right database type and index. I talked to the authors about the book, MySQL 5.0, and database design.
This book has a slightly different feel to it than other MySQL titles I've read recently. Can you tell us what the primary motivation for the title was?
Jon: While we've yet to meet in person, Chad and I have known each other for several years through the HiveMinds.Info forums, and we've worked together on a couple of Web site development projects. It was these that we discovered that we shared a profound sense of exasperation caused by badly designed databases, and this feeling was heightened by our mutual observation that many of the problems we encountered were the same ones over and over again. We found ourselves saying, "Gee, we could write a book..." I mentioned this to an editor friend of mine at Apress. At the time they were looking for a companion or follow up to Martin Kofler's MySQL book - a new edition was about to be published - and the response from Apress was, "You should write such a book for us." And so we did.
As for the theme and title, there were a couple of projects we'd worked on together where we had to import databases provided by vendors. These were essentially huge ungainly spreadsheets - hence "The Spreadsheet Syndrome" that we refer to in the book. The amount of bloat and redundancy was staggering. In one instance, we took a database dump, imported it into MySQL using the original vendor schema, and it took up something like 100MB-120MB. It was data on about 10,000 products. We spent some time normalising the schema and choosing more appropriate data types (all the columns in the original were TEXT columns), and when we were finished, we'd cut the size down to about 20%. And we didn't lost a bit of essential data in the process.
Early on you talk about the importance of choosing the correct data type for a particular database. Your concerns here are for size, rather than...
Jon: Size. While a lot of the concepts that we cover in the book can be used in any setting (they aren't tied to MySQL), we're conscious of the fact that MySQL is most commonly used as a back-end for Web sites. That's how we came to start using it ourselves. In many cases, the person developing the site doesn't necessarily control the server it's being hosted on, and a lot of hosts put restrictions on database size, or charge more for packages that provide more DB space. So a few hours spent optimising can potentially save you some hosting fees.
There's also the fact that, when you're working with very complex queries, especially complex joins, it's very difficult to get performance increases using indexes alone beyond a certain point. In such cases, better performance is likely to be achieved by making your data as compact as possible, maximising the use of fixed-width columns, and minimising redundancy.
Data types. However, we're not just concerned about size. Later in the book, we show some examples where not using a suitable data type can lead to problems. For instance, if you store dates as strings, you're liable to get some nasty surprises when sorting, because you assume that since you're calling something a "date" MySQL will do likewise, and this isn't always the case. This is particularly true regarding dates, but isn't restricted to them by any means. You'll also run into problems when trying to use SQL functions that act on a particular type if the data isn't actually that type. MySQL tries to be helpful and make type conversions for you, but we think the most prudent thing to do is to use the correct/appropriate data type in the first place, and avoid any surprises.
Staying on the topic of data type. Is there any performance advantage to using fixed-length CHAR compared to the variable-length VARCHAR type? If there isn't, why do we need both?
Jon: Well, in MySQL it depends in part on the storage engine. If you're using MyISAM tables, then fixed-width columns tend to be faster, so for something like a U.S. zip code where all the data is a fixed size, use CHAR(5) or CHAR(9). However, for InnoDB, VARCHAR is actually better optimised than CHAR.
Chad: VARCHAR is important here because it will help you keep the size of your database down when storing fields that aren't always a fixed size like zip codes or phone numbers, and by keeping the size down you can increase performance as well.
Joins are critical to modern relational design. How much can performance be improved using the correct join?
Jon: Simply using indexes correctly can make for incredible speed gains, by a factor of 10 or more; sometimes even a hundredfold or greater increase in performance can be seen this way. It's also important to write joins in such a way that you get only the data that you actually require. I've seen lots of applications where the developer either couldn't or didn't bother to zero in on the precise set of data needed, then threw away rows in the application code. This is very, very bad; all that extraneous data still has to travel between the server and the client. It becomes more noticeable when the server and the client reside on different physical machines so the data has to travel over a network. It increases even more - perhaps even exponentially - with the amount of data selected.
If MySQL supported server-side cursors in a production release, I might feel a bit differently about this, but - until it does - that's my story and I'm sticking to it. Which means, hopefully, that I'll be changing my story in six months or so with the official release of MySQL 5.0.
I think many would agree that joins can be the most complicated part of most SQL statements. Is it better to approach the need for joins during design rather than after the fact?
Chad: I think that it's always best to try to address these needs during the design phase; however, sometimes it's impossible to nail everything. At times there are problems that have to be addressed after the fact or deserve another look to try to improve performance further.
MySQL - and SQL in general - is a lot more than just a query language, and using these functions is something you advocate strongly. Do these functions really offer a better solution than using the Perl, Python, or PHP wrapper to achieve the same result? Surely you're just shifting the onus onto a different component in the same chain?
Jon: It's almost always faster to do as much processing as possible in the database than in the application as can be shown by doing a bit of testing. This is especially true when using an interpreted scripting language for the middleware layer. Another issue is portability. I'm not talking about portability between databases here, which, popular as it may be with some people, I regard as a largely useless concept.
Jeremy Zawodny sums up his reasons for feeling this way quite well in his blog (http://jeremy.zawodny.com/blog/archives/002194.html) and I happen to agree with him. There's an example in the book where we port a simple PHP script that uses MySQL to Python. Because the heavy lifting (calculation/manipulation) gets done in the SQL layer, porting is almost completely a matter of substituting Python commands for submitting the query, retrieving the result set, and outputting the result set data. No string-handling or other processing is required in the application code. In the book, I think we said that it took me about 30 minutes to port the example, and that's pretty much the truth. If I were a better Python programmer, I could have done it in 10 minutes, maybe even five.
Chad: I always try to keep as much of the processing in the database rather than the application when appropriate. As far as which is the better solution I think that's a highly arguable topic with no "right" answer. I think if you're building an application that will run on myriad databases, then shifting this on the application language would make the most sense. However if you're focusing on a single platform then keeping this database side would make the most sense even if at some point down the road some part of the architecture should change such as database or application language.
Thinking about the overall application, should we as developers write the application around specific SQL queries rather than see SQL as just a way of getting data?
Chad: I don't think that the application should ever be developed around any language instead of the function that it should perform. Form should always follow function in my opinion. A famous architect named Frank Lloyd Wright long ago introduced this concept to architecture. I think we, as developers, engineers, and software architects, can learn a lot from this concept and apply it to our works every day. I think the more correct approach is how much of the functionality you can use from your SQL than a simple "SELECT * FROM" and then do all the work on the application language side.
You're obviously fans of PHP and MySQL as a solution. Are there any improvements you can recommend to this Web development nirvana?
Jon: Yes. Migrate to MySQL 4.1, PHP5, and ext/mysqli as soon as possible. Preferably yesterday. Then you'll be able to exploit a fast, clean, object-oriented API with a number of nice new features, including prepared statements. Your Web apps can be more secure thanks to the new client authentication in 4.1.1 and newer versions of MySQL. (Notice I say "can," not "will" here; obviously, if you do something silly like store DB passwords in Web-accessible pages, or fail to cover yourself against SQL injection attacks, the new authentication isn't going to improve things unless and until you address those other matters.)
Chad: I would like to see more marketing for PHP and mid-market and enterprise usage of PHP. MySQL AB seems to be doing a good job in this respect with MySQL and a lot of enterprises have either been using it or are beginning to adopt it. Hopefully this will happen with PHP and it will gain more respect in the enterprise and mid-market community.
MySQL 5.0 is just round the corner; in fact you use it a lot for the examples in your book. Can you summarize your favorite features in the new release?
Jon: Stored procedures rule. I think this more than any other single feature is going to help MySQL be recognised as a serious, enterprise-ready RDBMS. Since it makes it easy to recycle SQL code, it becomes easier to do work in the database instead of the application. I'm anxious to see SP multi-language capability implemented in MySQL; it would very cool to be able write procedures in PHP, for instance. It'll happen, but it may be a little while, probably not before 5.1, I don't think.
Chad: Again, I agree with Jon here. Stored Procedures and Triggers. I think that this will definitely give MySQL more visibility in the enterprise. It seems like this has been the looming reason why developers and IT managers have hesitated to adopt MySQL instead of another solution.
You use Windows in your examples, which must be relatively rare for a database seen as the mainstay of Linux and open source. Is this personal choice?
Jon: Well, I started out as a DOS and Windows user, and until fairly recently Windows was still my primary desktop environment. Not long after the book was completed, I switched over to Linux (SuSE 9.2). I've been using it now for close to three months, and I can't say that I miss Windows a great deal. I do occasionally use WINE or VMWare to run a few Windows apps when I have to. The one thing I do miss sometimes is a good Linux analogue to Visio or SmartDraw - Dia just isn't there yet.
I've also been quite impressed with recent versions of Mandrake and Fedora, and some of the new bootable-CD distributions are quite amazing, Knoppix in particular. Drop a CD in the tray, reboot, and - hey, presto! - three minutes later, you've got a functioning Linux desktop, complete with LAN and Internet connectivity, everything just works. I've yet to see anything like that happen in the Windows world.
I still have a couple of Windows machines kicking round, but I'll likely convert one of them to Linux soon. I've also started using Solaris 10/x86 on a spare machine - that probably won't be my primary desktop OS anytime soon, but it's remarkably stable and good for doing some testing.
Basically, I didn't do all those screenshots on Windows because I wanted to push MySQL on Windows; I did it because I was - at the time - more comfortable with Windows for my desktop environment, and I could get the best-quality screenshots in the least amount of time with the Windows software I had, knew, and used for that purpose. I wanted to make the change earlier, but simply didn't have time to do it while working full-time and writing a book.
So much for the evolution of my personal habits and preferences. Let's talk about MySQL in this context. Very simply, the OS (almost) doesn't matter. One of the cool things about MySQL - and one of the reasons it became my DB of choice three or four years ago - is that it runs virtually identically on Linux, Windows, Mac, Solaris, NetWare, HP Unix, etc. All of the command-line examples you see in the book - and there, I guess, a couple hundred of them - were tested on both Windows and Linux, and in almost every case, they worked identically.
If you're concerned with creating and working with MySQL and MySQL databases that will be used on multiple platforms, we talk about this at some length in the early part of the book, but the rules are pretty simple:
- Don't depend on case-sensitivity for DB, table, or column identifiers.
- Don't use special characters or spaces in identifiers. Stick with alphanumerics and the underscore, and you should be good to go.
As for what platform people are actually using MySQL on...? Well, here are a couple of MySQL factoids for you:
- Something like 70% of all software downloads from dev.mysql.com are Win32 versions.
- According to a recent poll on the MySQL Web site, 26% of MySQL enterprise deployments are on Windows servers, with about 56% on various flavors of Linux. Given that Windows doesn't predominate as an enterprise server OS, this isn't that surprising. It suggests to me that a lot of Windows people are using MySQL for testing, development, and perhaps smaller-scale apps, but not so much for enterprise-level applications.
Chad: I didn't do any of the screen shots that came from the book. Jon handled those. To answer the question, I use and prefer Linux, however, I work in both on a day-to-day basis. Also I use databases that are usually on Linux. I rarely use them on Windows, unless doing something with a certain database that doesn't run on anything other than Windows. :)
If I have limited time to develop a complete MySQL solution, will I get more out of my time from optimizing the server (hardware, MySQL configuration) or the database design?
Jon: Obviously both these issues are important. But if I had to choose, I'd probably go with making sure the database design was optimized. Once the application is up and running and the database contains a lot of data, it's a lot less time-consuming and disruptive to upgrade hardware or tweak your MySQL configuration files than it is to make fundamental changes in the DB schema and possibly have to rewrite applications depending on it. And my experience is that throwing good hardware after a bad schema or a bad app tends not to be effective.
Chad: Again, I agree with Jon here. Heh, we should get together and write a book some time, eh? :) Definitely spending more time with the database design will give you the most bang for your buck here. It's more feasible that you upgrade your hardware than go back and fix your entire design and architecture, not to mention the problems you may have with data integrity.
You've managed to fit an incredible amount of information into a compact little book. Is there anything you wanted to cover that didn't make it into the book?
Jon: I have a growing interested in object-relational modeling, and I would have liked to include some material about it in the book, but there simply wasn't time, since we wanted to hit the shelf in time for the 4.1 production release (which we did, just barely, as it turned out). If we ever do a second edition, that's one area I'd like to address - at least cover some of the basics.
It would have been nice to have covered the MySQL C API, in the APIs chapter, and perhaps ODBC too. When we started, we intended to do it, but the chapter threatened to become a book-length project on its own. That's too bad, because Chad worked on a number of C, ODBC, and JDBC examples that we ended up having to cut. I think maybe there was supposed to be a section on Connector/.NET initially that would have been good to include. In the end, we decided to go for the "Big 3" open source scripting languages - PHP 4 and 5, Perl, and Python. I like to think we made the right choice, and that we got across the basic interaction between DB and application that would be useful to programmers in other languages too.
Chad: Yep, the code examples Jon and I originally wrote covered just about every language that a MySQL API existed for: C, C++, VB with ODBC, JDBC, and C#. It was a lot of good information but it was best to cut it.
I know you're both busy right now with other things, but can we expect to see more of your wisdom in the future?
Jon: I went to work for MySQL AB not long after the book was published. My first big project for them was to revamp all of the documentation relating to MySQL Cluster, which - in case you haven't heard - is a new distributed version of MySQL - and one of my ongoing tasks is to keep the Cluster documentation current. Cluster is *very* important to the firm's plans for 2005 and beyond, so this is a huge responsibility. By the way, they're a fabulous outfit to work for, and the calibre of the developers and others is just amazing. It blows me away to be working with folks like Paul DuBois, Mike Hillyer, Mikael Ronström (the original developer of NDB, which became MySQL Cluster), and Peter Gulutzan, just to name a few.
Looking further ahead, I'd like to resurrect a book on DOM scripting in multiple environments that I was working on for Apress in 2003 when it got set aside in favour of the MySQL DB Design book. There's no firm plans for that at the moment, however.
Chad: I was just recently brought on board as a technical reviewer for another MySQL book Apress is working on for Q2 '05 publication as well, other than that, I don't have other plans, unless Jon approaches me with the PHP 5 stuff. :)
You don't exactly live next to each other; did this pose any real problems in writing?
Jon: Not really. We're both pretty used to working remotely. Tools like e-mail, FTP, instant messaging, and IRC make it pretty easy (and cheap) to communicate in real-time, pass files back and forth, and so on. It might be relevant to mention here that both Apress and MySQL AB are largely "virtual companies" whose employees are scattered across several countries and they're both successful firms. One of the editors who was involved in the early phases of the book, Chris Mills, I've been working with for nearly four years now. He's in the U.K., I've been in the U.S.A. and Australia, and we've yet to meet in person. With regard to MySQL AB, I've met exactly one co-worker so far in real life, mostly because he also happens to live in Brisbane.
Chad and I have known each other for about five or six years through the HiveMinds community and its predecessor, that's how we came to work together on a site project or two (and why Chad hosts most of my sites), and it's all been online. You can reach HiveMinds at www.hiveminds.info or www.hiveminds.org - we'd love to have you.
By one of those quirks of fate that make life interesting, Chad lives quite close to my father so all those photos of dad's Aussie granddaughter are actually hosted on a server just a few kilometres - er, miles - away from him. It's a small world after all, isn't it?
Chad: Yeah, I can't say that it posed any problems. Jon and I have worked together in the past, as he said, and we have both worked with people around the world for a while now, so it seemed just as good as working with the guy in the next cube over. This is just one more great advantages of the Internet. Years ago it would have been unheard of to do this sort of collaboration living this far apart. It's funny the people it still catches off-guard when then realize that my co-author is Down Under.
What do you listen do/listen to/watch when writing?
Jon: I listen to a bit of everything, but my old reliables in the music department tend to be Pearl Jam and U2. I'm also a Yes fan from way back. I also like coffee, mango juice, and some amazing sugarless mints I discovered awhile back that are flavored with lemon grass and ginseng.
Chad: If I'm working from home, then either Squawk Box in the morning, or whatever's on in the evening. Other than that, I usually like it pretty quiet. I too enjoy coffee (preferably Dunkin Donuts) while working.
ABOUT CHAD RUSSELL
Chad Russell is a software engineer for a human capital management software company in Jacksonville, FL. He also works as a consultant for several start-ups and is currently launching a start-up of his own. He's worked in IT since 1999.
ABOUT JON STEPHENS
- ARM Server to Transform Cloud and Big Data to the Internet of Things
- Predictive Analytics for IT – Filling the Gaps in APM
- Ubuntu-based Open Source Linux Mint Tests KDE Version
- Donald Fischer Joins General Catalyst as Venture Partner
- Migrate Oracle Database from HP-UX to Red Hat Enterprise Linux
- Where the Internet of Things Stands
- Alice In Wonderland - Is the Network the Next Frontier for DevOps?
- Five Things Developers Need to Know About PaaS
- Linux Containers – Next-Gen Virtualization for Cloud
- Enterprise Cloud Analytics and Business Intelligence
- How Red Hat Enterprise Linux Shrinks Total Cost of Ownership (TCO)
- Emulex Optimises Performance and Scalability for Virtualisation and the Cloud with New Ethernet and Converged Network Adapters
- ARM Server to Transform Cloud and Big Data to the Internet of Things
- Revolution R Enterprise 7 now generally available
- Azure CLI Day 58 – storage account list
- Azure CLI Day 59 – storage account show
- Linux in the Mainstream: Growing Deployment of Business-Critical Workloads
- Consumer Electronics - Global Trends, Estimates and Forecasts, 2011-2018
- Red Hat Pathway to IT Modernization
- Cloud Development Using SCM 12.5
- Predictive Analytics for IT – Filling the Gaps in APM
- Ubuntu-based Open Source Linux Mint Tests KDE Version
- Federal Big Data Community Convenes 6 Feb at Apache Hadoop Forum
- UNIX Migration Alternatives: Red Hat Enterprise Linux vs. Windows Server
- The i-Technology Right Stuff
- After Ubuntu, Windows Looks Increasingly Bad, Increasingly Archaic, Increasingly Unfriendly
- Linux.SYS-CON.com Exclusive: Linus Discloses *Real* Fathers of Linux
- SCO CEO Posts Open Letter to the Open Source Community
- A Closer Look at Damn Small Linux
- Linus' Top Ten SCO Barbs
- Netscape Co-Founder's 12 Reasons for Growth of Open Source
- Where Are RIA Technologies Headed in 2008?
- Linux.SYS-CON.com Exclusive: What Would UserLinux Look Like?
- *POINT - COUNTERPOINT SPECIAL* What's Wrong with the Open Source Community?
- Why Recovering a Deleted Ext3 File Is Difficult . . .
- Introducing "Cooperative Linux" - Linux for Windows, No Less