TRANSFORMING SINGLE SPREADSHEETS INTO NORMALIZED TABLES USING EXCEL - Kaylee P. Alexander
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
TRANSFORMING SINGLE SPREADSHEETS INTO NORMALIZED TABLES USING EXCEL Kaylee P. Alexander askdata@duke.edu
Variable Data Type Description artist string name of the primary creator of the object sold DESCRIPTION OF WORKSHOP DATA artist_nationality string the nationality of the artist object_title string the title of the object sold category1 categorical the primary subject category of the object sold In this workshop we will be using a sample category2 categorical the secondary subject category of the object sold dataset that contains information about art type categorical the type of object sold sales that took place at the Knoedler Gallery in New York during the year 1946. dimensions_in string the dimensions of the object sold in inches The master datasheet consists of 228 rows and sale_year string year in which sale took place 14 columns (see table below). sale_month string month in which sale took place sale_day string day on which sale took place sales_total numerical price object sold for sales_currency categorical currency object was sold in seller string name of the seller buyer string name of the buyer Data Source: http://www.getty.edu/research/tools/provenance/
WHAT IS A RELATIONAL DATA MODEL? A relational data model organizes data into a series of tables containing columns (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table (or, ‘relation’) represents a single entity type and its attributes. The primary benefits of a relational data model include ensuring consistency as well as performing combinations of queries to understand various relationships that exist among the information contained in the various tables that would be otherwise difficult to determine from a single spreadsheet. An additional benefit is the ability to add records and edit information without the risk of compromising other information contained in the database. For more on relational data models, see: https://www.oracle.com/database/what-is-a-relational- database/
SINGLE RELATIONAL DATA MODEL DATASHEET Knoedler Main Artists Sales_1946 artist_id Subject Categories artist sale_id category_id artist_nationality artist_name artist_id category object_title artist_nationality object_title category1 category1_id category2 category2_id type type_id dimensions_in dimensions_in sale_year sale_year Object Types sale_month Collectors sale_month type_id sale_day collector_id sale_day type sales_total collector_name sales_total sales_currency sales_currency seller seller_id buyer buyer_id
WHAT IS DATA NORMALIZATION? Data normalization is the process of organizing and restructuring data attributes within a relational data model in order to reduce redundancy in the data set, increase consistency, and facilitate querying. For more on database normalization, see: http://agiledata.org/essays/dataNormalization.html.
GOALS • Demonstrate simple steps in Excel that can be used to transform a single spreadsheet into a series of normalized tables for a relational data model • Identify data entities and attributes • Create unique value lists using UNIQUE( ) • Assign unique keys to new tables • Populate columns in new tables using VLOOKUP( ) • Relate tables using foreign keys
EXCEL FUNCTIONS USED IN THIS WORKSHOP • UNIQUE(array, [by_col], [exactly_once]) – returns a list of unique values in a list or range • array – range or array from which to extract unique values • by_col – [optional] FALSE = sort by row (default). TRUE = sort by column • exactly_once – [optional] FALSE = all unique values (default); TRUE = values that occur once • VLOOKUP(value, table, col_index, [range_lookup]) – looks up a value in a table by matching on the first column and returns the matched value • value – value to look for in the first column of a table • table – table from which to retrieve a value • col_index – The column in the table from which to retrieve a value. • range_lookup – [optional] TRUE = approximate match (default); FALSE = exact match
Transforming Single Spreadsheets into Normalized Tables Using Excel Prepared by Kaylee P. Alexander | CDVS Graduate Assistant, Summer 2020 I. PURPOSE AND GOALS The purpose of this workshop is to demonstrate simple steps in Excel that you can take to transform a single spreadsheet (such as a master copy of your data that you used to facilitate the gathering process) into a series of normalized tables that can be used to populate a relational database model using, for example, MySQL. In order to accomplish this, we first identify entities and corresponding attributes of those entities within the master datasheet, and then create separate tables for each entity that can be connected to one another by foreign keys (columns that reference, by means of an identification code, columns present in other tables). This workshop does not require any coding experience, but it is recommended that users are familiar with Excel basics. Over the course of the workshop we will transform a single datasheet into five tables, each representing a set of unique entities (and their related attributes) from the original datasheet as well as a unique key that can be used to create relationships between the different tables. II. WHAT IS A RELATIONAL DATA MODEL? A relational data model organizes data into a series of tables containing columns (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table (or, ‘relation’) represents a single entity type and its attributes. The primary benefits of a relational data model include ensuring consistency as well as performing combinations of queries to understand various relationships that exist among the information contained in the various tables that would be otherwise difficult to determine from a single spreadsheet. An additional benefit is the ability to add records and edit information without the risk of compromising other information contained in the database. For more on relational data models, see: https://www.oracle.com/database/what-is-a-relational-database/. III. WHAT IS DATA NORMALIZATION? Data normalization is the process of organizing and restructuring data attributes within a relational data model in order to reduce redundancy in the data set, increase consistency, and facilitate querying. For more on database normalization, see: http://agiledata.org/essays/dataNormalization.html. 1
IV. EXCEL FUNCTIONS USED IN THIS WORKSHOP • UNIQUE(array, [by_col], [exactly_once]) – returns a list of unique values in a list or range o array – range or array from which to extract unique values o by_col – [optional] FALSE = sort by row (default). TRUE = sort by column o exactly_once – [optional] FALSE = all unique values (default); TRUE = values that occur once • VLOOKUP(value, table, col_index, [range_lookup]) – looks up a value in a table by matching on the first column and returns the matched value o value – value to look for in the first column of a table o table – table from which to retrieve a value o col_index – The column in the table from which to retrieve a value. o range_lookup – [optional] TRUE = approximate match (default); FALSE = exact match V. DESCRIPTION OF WORKSHOP DATA In this workshop we will be using a sample dataset that contains information about art sales that took place at the Knoedler Gallery in New York during the year 1946. The master datasheet consists of 228 rows and 14 columns (see table below). Data Source: Getty Provenance Index. Variable Data Type Description artist string name of the primary creator of the object sold artist_nationality string the nationality of the artist object_title string the title of the object sold category1 categorical the primary subject category of the object sold category2 categorical the secondary subject category of the object sold type categorical the type of object sold dimensions_in string the dimensions of the object sold in inches sale_year string year in which sale took place sale_month string month in which sale took place sale_day string day on which sale took place sales_total numerical price object sold for sales_currency categorical currency object was sold in seller string name of the seller buyer string name of the buyer 2
WORKSHOP INSTRUCTIONS I. IDENTIFYING ENTITIES & ATTRIBUTES The first step in beginning to break down a master spreadsheet into a series of tables is to first identify the various entities in your dataset and being to map out your data model. A data entity is an object in a data model. In deciding which variables to consider entities, think about columns that may have repeated information (e.g. names, object categories, etc.). In considering which variables should be attributes of these entities, consider which columns contain information about the specific entity. For example, if you have a dataset containing information on sales, that contains also information about the buyer such as an address, then sales and buyers would represent two different entities and be split into two tables (sales and buyers), while buyer_address would become an attribute within the new table for buyers. The newly created tables for sales and buyers would then be related by a common key column, buyer_id. For the purposes of this tutorial, our data entities have been identified as the following: 1. Sales_1946 – table of all sale events that took place in the year 1946 2. Artists – all unique values for artists in the dataset 3. Object Types – all unique values for object types 4. Subject Categories – all unique values for subject categories 5. Collectors – all unique values for sellers and buyers These five entities will become the names of the tables in our relational data model. The attributes for each of these tables will be as follows: 1. Sales_1946 – sale_id, artist_id, object_title, category1_id, category2_id, type_id, dimensions_in, sale_year, sale_month, sale_day, sales_total, sales_currency, seller_id, buyer_id 2. Artists – artist_id, artist_name, artist_nationality 3. Object Types – type_id, type 4. Subject Categories – category_id, category 5. Collectors – collector_id, collector_name N.B. Over the course of the following steps, we will create the ID (key) columns that are listed as attributes above. 3
II. CREATING UNIQUE VALUE LISTS FOR EACH ENTITY Now that we have identified our data entities, we are ready to begin splitting up our master datasheet. To begin, we need to make sure that we create unique value lists for the following columns in the original dataset: artist, category, subcategory, type, seller, and buyer. These variables will form the basis of our artists, subject categories, object types, and collectors entity tables. Since the information in these columns is likely to be repeated in our dataset, we want to make sure that we have just one record per possible value in each of our entity tables. Let’s begin by making the new table for object types, since this will only contain two columns—type_id and type—and only contains values from one of the original columns. 1. Open a new sheet in your excel workbook. 2. Rename this sheet ‘Object Types.’ 3. Title the first two columns in this new sheet ‘type’ and ‘type_id,’ respectively. 4. Go to cell A2 and enter the following formula: =UNIQUE('Knoedler Main'!F:F) This identifies all of the unique values included in the master datasheet’s column F (type) and lists them in your new sheet. You should now have the following listed under type in the Object Types sheet: • type • Painting • Sculpture • Pastel • Watercolor • 0 Since we have called the unique values for all of column F, ‘type’ and ‘0’ have been included as additional unique values. Obviously, we don’t want to include the column header or the zero value in our list, but we can’t just delete the items we 4
don’t want because this is a formula and doing so would erase all of the other values. To fix this: 5. Highlight the whole column and right click to cut it. 6. Then, right click on an empty column and select Paste special à values. (This will paste only the values from the copied cells rather than the formula. 7. Now you have all of the unique values from the original spreadsheet listed here and can delete the first and last rows containing ‘type’ and ‘0,’ respectively. 8. Cut and paste the table to return it to columns A and B of the spreadsheet. III. ASSIGNING UNIQUE KEYS Now we want to assign primary key values for the column type_id. The ID numbers—unique keys—that we generate here will later be used to create relationships among our tables. 9. Enter the value ‘1’ in cell B2. 10. Double-click on the lower right corner of the cell to fill down, then hover over the drop-down list and select Fill Series. This will assign values 1 through 4 to your types. Although you can begin your ID numbers with 1 for every table, it is advisable to use different numbering systems for each entity to avoid confusion. For a small value list such as this single digits work fine, but for lists of hundreds or more values, you may want to begin with, for example, a 5-digit number such as 5000. For the purposes of this tutorial you may choose any numbering format you’d like for each table. IV. TRY IT YOURSELF – Creating an Entity Table 11. Repeat steps 1 through 10 for the column artist in the master datasheet. (Remember to adjust your =UNIQUE( ) function to match the column for artist, and try filling down with different starting ID numbers) For our Artists entity table we want to include an additional attribute for each record, so we will need to add an additional column to our table. 5
12. Title the third column in the table Artists ‘artist_nationality.’ In the next section we will look at how to populate this column with data from our master datasheet. V. USING THE VLOOKUP( ) FUNCTION TO FILL ATTRIBUTE COLUMNS Now that we have our Artists table, we want to fill in the attribute artist_nationality with information from our master datasheet. Instead of going through this list and manually adding the nationalities for each artist, we can use the VLOOKUP( ) function in excel to look up an artist’s name in our Artists table, match it to a value in the column artist from the master datasheet, and return the matching value from the column artist_nationality for that row. 13. Enter the following formula into cell C2 of the table Artists: =VLOOKUP(B2,'Knoedler Main'!A:B,2,FALSE) This looks up the artist name in cell B2 of the Artists table, matches the value to the same name in column A of the master datasheet, then returns the value in the second column of the table (column B), containing the nationality for the matched artists name in column A, to cell C2 of the Artists table. The FALSE indicates that you only want exact matches rather than approximate matches. 14. Double-click on the lower right corner of the cell to fill down. It should default to Fill series, but you can always double check by hovering over the drop- down to see which fill down form is selected. 15. As we did after using the UNIQUE( ) function, select the whole table (all three columns), and right click to cut the table, and then Paste special à values into an empty column to remove the formula and save only the returned values. 16. Cut and paste the table to return it to columns A, B and C of the spreadsheet. VI. USING DATA FROM TWO COLUMNS TO CREATE AN ENTITY TABLE Since some of the buyers in this dataset might also be sellers and vice versa, it would be useful to create just one entity table for all “collectors” present in the master datasheet. In order to do this we will create a unique value list for all of the names that appear in the buyer and seller columns of our master datasheet. 17. Open a new sheet in the excel workbook. 6
18. Rename this sheet ‘Collectors.’ 19. Copy and paste all of the data from the column buyers into column A this new sheet. 20. Then, copy and paste all of the data from the column sellers to the end of the list in column A of the new sheet. 21. In column B of the Collectors sheet, enter the following formula in cell B2: =UNIQUE(A:A). This grabs all of the unique values from the combined list of buyers and sellers to create one unique value list for all collectors. 22. Cut and Paste special à values to place only the text values into cell C2. 23. Delete columns A and B, as well as any rows with ‘0.’ 24. Name this column ‘collector_name.’ 25. Name what is now column B ‘collector_id.’ 26. Assign primary key values to column B, as we did in Part III. IV. TRY IT YOURSELF – Using Two Columns to Create an Entity Table 27. Repeat steps 17 through 26 for the columns category1 and category2 in the master datasheet to create the entity table, Subject Categories, with a column for category and category_id. VI. RELATING TABLES USING FOREIGN KEYS Now that we have created four of our five entity tables, we need to create our final table for our sales entities. This table will have each sale event as a record (row), populated with the data pertaining to the specifics of the sale (object_title, dimensions_in, sale_year, sale_month, sale_day, sales_total, and sales_curr) as well as the foreign keys (ID numbers) for the following variables in our other tables: artist_id, category1_id, category2_id, type_id, seller_id, buyer_id. We will also create a sale_id column to act as a primary key in this new table, Sales_1946. Here, we will again make use of the VLOOKUP( ) function in Excel. 7
N.B. For the VLOOKUP ( ) to work for the following steps, the primary key column must be the right- most column in each of four new datasheets. The column containing the entity name (i.e. artist_name, type, category, and collector_name) must be in the left-most column of the table. If this is not already the case, move the key columns to this position. 28. Copy all of the data from the original file into a new sheet. 29. Insert a new column between artist and artist_nationality. Name this column artist_id. We will now use the VLOOKUP() function to populate this new column with the keys for artist names that we created for the Artists table: 30. Enter the following formula into cell B2 of the new table: =VLOOKUP(A2,Artists!A:C,3,FALSE) 31. Then, fill down (fill series) for all other rows. 32. Repeat this process to create the columns category1_id, category2_id, type_id, seller_id and buyer_id. Remember that category1_id and category2_id will both populate from the Subject Categories table, and seller_id and buyer_id will both populate from the Collectors table. N.B. If no exact matching value can be found (i.e. there is a NULL value for that particular variable within that record), the formula will return #N/A to the cell. Leave this for now, we will return to this later on. 33. Copy all of the data from the sheet you’ve just populated and Paste special à values into a new sheet. 34. Title this new sheet Sales_1946. You may now delete the sheet you populated in steps 28 through 32. Now that we’ve used the keys from our other four tables to create relationships to Sales_1946, we can delete the columns containing data from our other tables. 35. Delete the following columns from Sales_1946: • artist_nationality • category1 • category2 8
• type • seller • buyer 36. Finally, add a sale_id column for Sales_1946, and fill series to create a primary key column for the table. VII. FINISHING UP (optional) For programs such as MySQL you will want to make sure that any blank cells read as nulls, you want to make sure that NULL appears in all of these null-value cells. You can use search and replace in Excel to make sure that this is the case. You will also want to do this for any values that appear as #N/A, as this will not read as a null in MySQL. Finally, now that you have created these five normalized tables based on the original master datasheet, you can save each of these as a .csv file that can be uploaded into programs such as phpMyAdmin or MySQL Workbench to populate a relational database model with existing data. 9
You can also read