Database management is not as sexy as a lot of the machine learning content on medium, but it is important to understand at any data science position, from the most entry-level to the most senior. Today’s blog will be about database normalization. Database normalization at its core is about organizing a database and reducing redundancies to make it more efficient. In relational database management, this concept is incredibly important. If related tables contain duplicate entries, much of their purpose of gaining efficiency is lost.
The ideal table is limited to one purpose and each row is a unique entry. Say for example you are the administrator of a college database. You have students for each row then credits, admission year, and age as your columns. It would be better to have three tables with student and credits as one, student and years another and student and age as the third one. This way, when you need that data it can be easily accessed with less work by the computer. With only three columns it's probably not hugely important, but what if you have hundreds or thousands? Furthermore, how are you going to keep track of what columns you do have in your database easily if they are all crammed into one place?
A poorly constructed database is a self-fulfilling prophecy: if your database is confusing or inefficient, making it hard to work with, more shortcuts will be taken leading to further redundancies. Even worse the thought of properly normalizing it becomes more and more daunting. I have had my email since eighth grade, It was not until 18 that I started using it for something professional and realized it was not normal to have 6000 spam emails I never dealt with. Now I am up to 20,000, I have just accepted it as a minor inconvenience in my life. Doing this with something you use every day seems like it only steals a few seconds, but all those seconds add up. Your database queries will take longer and could even be inaccurate when using aggregate functions.
There are a couple of different ways to normalize your database, the first is …. the first normal form (FNF). The FNF is simply about making sure that each entry in a database is unique. Say for example you had a table with individual dogs but that their information was grouped by breed. In the FNF this information would instead be stored by each dog and each one would have a primary key. This one is pretty intuitive, this table is objectively hard to read:
Under FNF it should be something like this which is far more legible:
Now each specific order is a row rather than going by customer_id, this makes it much easy to query the data since no column contains information on multiple orders. In the first iteration, how would you call an aggregate function?
The second and third normal forms add some complexity to the first. The second form involves making it so that each column is dependent on the primary key. In the earlier table, the address is not dependent on the order but the customer. To be in 2NF we would need two tables like this:
Now if a customer gets a new address it can be updated easily since this information is only stored in one place.
For 3NF we must eliminate something called transitive dependency. This is when two columns are both dependent on the primary key, but one is actually dependent on the other. Querying the table will be faster since you are not pulling up data about say the type of something and that attributes, but just the type. Which can quickly be used as the primary key in another table to get its attributes. Let's imagine that we want to add the type of shipping they chose and how long that shipping type takes:
Now shipping kind and speed are both dependent on the order but the table can be simplified since really the speed is dependent on the kind of shipping
These are some of the very basics of relational database management. Sometimes these concepts seem intuitive but in making this blog I realized that they can be difficult to understand. I highly recommend trying to set up your own tables and see what the differences between 2NF and 3NF are.
Also I am not entirely sure I understand everything I researched for this blog, especially the nuances between 2NF and 3NF, so if anyone has any feedback, I would love to hear it!