TRANSFORMING SINGLE SPREADSHEETS INTO NORMALIZED TABLES USING EXCEL - Kaylee P. Alexander
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
TRANSFORMING SINGLE
SPREADSHEETS INTO
NORMALIZED TABLES USING EXCEL
Kaylee P. Alexander askdata@duke.eduVariable Data Type Description
artist string name of the primary creator of the object sold
DESCRIPTION OF
WORKSHOP DATA artist_nationality string the nationality of the artist
object_title string the title of the object sold
category1 categorical the primary subject category of the object sold
In this workshop we will be using a sample category2 categorical the secondary subject category of the object sold
dataset that contains information about art
type categorical the type of object sold
sales that took place at the Knoedler Gallery in
New York during the year 1946. dimensions_in string the dimensions of the object sold in inches
The master datasheet consists of 228 rows and sale_year string year in which sale took place
14 columns (see table below).
sale_month string month in which sale took place
sale_day string day on which sale took place
sales_total numerical price object sold for
sales_currency categorical currency object was sold in
seller string name of the seller
buyer string name of the buyer
Data Source: http://www.getty.edu/research/tools/provenance/WHAT IS A RELATIONAL DATA MODEL? A relational data model organizes data into a series of tables containing columns (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table (or, ‘relation’) represents a single entity type and its attributes. The primary benefits of a relational data model include ensuring consistency as well as performing combinations of queries to understand various relationships that exist among the information contained in the various tables that would be otherwise difficult to determine from a single spreadsheet. An additional benefit is the ability to add records and edit information without the risk of compromising other information contained in the database. For more on relational data models, see: https://www.oracle.com/database/what-is-a-relational- database/
SINGLE RELATIONAL DATA MODEL
DATASHEET
Knoedler Main Artists Sales_1946
artist_id Subject Categories
artist sale_id
category_id
artist_nationality artist_name artist_id
category
object_title artist_nationality object_title
category1 category1_id
category2 category2_id
type type_id
dimensions_in dimensions_in
sale_year sale_year
Object Types
sale_month Collectors sale_month
type_id
sale_day collector_id sale_day
type
sales_total collector_name sales_total
sales_currency sales_currency
seller seller_id
buyer buyer_idWHAT IS DATA NORMALIZATION? Data normalization is the process of organizing and restructuring data attributes within a relational data model in order to reduce redundancy in the data set, increase consistency, and facilitate querying. For more on database normalization, see: http://agiledata.org/essays/dataNormalization.html.
GOALS
• Demonstrate simple steps in Excel that can be used to transform a
single spreadsheet into a series of normalized tables for a relational
data model
• Identify data entities and attributes
• Create unique value lists using UNIQUE( )
• Assign unique keys to new tables
• Populate columns in new tables using VLOOKUP( )
• Relate tables using foreign keysEXCEL FUNCTIONS USED IN THIS
WORKSHOP
• UNIQUE(array, [by_col], [exactly_once]) – returns a list of unique values in a list or range
• array – range or array from which to extract unique values
• by_col – [optional] FALSE = sort by row (default). TRUE = sort by column
• exactly_once – [optional] FALSE = all unique values (default); TRUE = values that occur once
• VLOOKUP(value, table, col_index, [range_lookup]) – looks up a value in a table by matching
on the first column and returns the matched value
• value – value to look for in the first column of a table
• table – table from which to retrieve a value
• col_index – The column in the table from which to retrieve a value.
• range_lookup – [optional] TRUE = approximate match (default); FALSE = exact matchTransforming Single Spreadsheets into
Normalized Tables Using Excel
Prepared by Kaylee P. Alexander | CDVS Graduate Assistant, Summer 2020
I. PURPOSE AND GOALS
The purpose of this workshop is to demonstrate simple steps in Excel that you can
take to transform a single spreadsheet (such as a master copy of your data that you
used to facilitate the gathering process) into a series of normalized tables that can
be used to populate a relational database model using, for example, MySQL. In
order to accomplish this, we first identify entities and corresponding attributes of
those entities within the master datasheet, and then create separate tables for each
entity that can be connected to one another by foreign keys (columns that reference,
by means of an identification code, columns present in other tables). This workshop
does not require any coding experience, but it is recommended that users are
familiar with Excel basics.
Over the course of the workshop we will transform a single datasheet into five
tables, each representing a set of unique entities (and their related attributes) from
the original datasheet as well as a unique key that can be used to create
relationships between the different tables.
II. WHAT IS A RELATIONAL DATA MODEL?
A relational data model organizes data into a series of tables containing columns
(‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table
(or, ‘relation’) represents a single entity type and its attributes. The primary benefits
of a relational data model include ensuring consistency as well as performing
combinations of queries to understand various relationships that exist among the
information contained in the various tables that would be otherwise difficult to
determine from a single spreadsheet. An additional benefit is the ability to add
records and edit information without the risk of compromising other information
contained in the database. For more on relational data models, see:
https://www.oracle.com/database/what-is-a-relational-database/.
III. WHAT IS DATA NORMALIZATION?
Data normalization is the process of organizing and restructuring data attributes
within a relational data model in order to reduce redundancy in the data set,
increase consistency, and facilitate querying. For more on database normalization,
see: http://agiledata.org/essays/dataNormalization.html.
1IV. EXCEL FUNCTIONS USED IN THIS WORKSHOP
• UNIQUE(array, [by_col], [exactly_once]) – returns a list of unique values in a list
or range
o array – range or array from which to extract unique values
o by_col – [optional] FALSE = sort by row (default). TRUE = sort by column
o exactly_once – [optional] FALSE = all unique values (default); TRUE =
values that occur once
• VLOOKUP(value, table, col_index, [range_lookup]) – looks up a value in a table
by matching on the first column and returns the matched value
o value – value to look for in the first column of a table
o table – table from which to retrieve a value
o col_index – The column in the table from which to retrieve a value.
o range_lookup – [optional] TRUE = approximate match (default); FALSE =
exact match
V. DESCRIPTION OF WORKSHOP DATA
In this workshop we will be using a sample dataset that contains information about
art sales that took place at the Knoedler Gallery in New York during the year 1946.
The master datasheet consists of 228 rows and 14 columns (see table below). Data
Source: Getty Provenance Index.
Variable Data Type Description
artist string name of the primary creator of the object sold
artist_nationality string the nationality of the artist
object_title string the title of the object sold
category1 categorical the primary subject category of the object sold
category2 categorical the secondary subject category of the object sold
type categorical the type of object sold
dimensions_in string the dimensions of the object sold in inches
sale_year string year in which sale took place
sale_month string month in which sale took place
sale_day string day on which sale took place
sales_total numerical price object sold for
sales_currency categorical currency object was sold in
seller string name of the seller
buyer string name of the buyer
2WORKSHOP INSTRUCTIONS
I. IDENTIFYING ENTITIES & ATTRIBUTES
The first step in beginning to break down a master spreadsheet into a series of
tables is to first identify the various entities in your dataset and being to map out
your data model. A data entity is an object in a data model. In deciding which
variables to consider entities, think about columns that may have repeated
information (e.g. names, object categories, etc.). In considering which variables
should be attributes of these entities, consider which columns contain information
about the specific entity. For example, if you have a dataset containing information
on sales, that contains also information about the buyer such as an address, then
sales and buyers would represent two different entities and be split into two tables
(sales and buyers), while buyer_address would become an attribute within the new
table for buyers. The newly created tables for sales and buyers would then be
related by a common key column, buyer_id.
For the purposes of this tutorial, our data entities have been identified as the
following:
1. Sales_1946 – table of all sale events that took place in the year 1946
2. Artists – all unique values for artists in the dataset
3. Object Types – all unique values for object types
4. Subject Categories – all unique values for subject categories
5. Collectors – all unique values for sellers and buyers
These five entities will become the names of the tables in our relational data model.
The attributes for each of these tables will be as follows:
1. Sales_1946 – sale_id, artist_id, object_title, category1_id, category2_id,
type_id, dimensions_in, sale_year, sale_month, sale_day, sales_total,
sales_currency, seller_id, buyer_id
2. Artists – artist_id, artist_name, artist_nationality
3. Object Types – type_id, type
4. Subject Categories – category_id, category
5. Collectors – collector_id, collector_name
N.B. Over the course of the following steps, we will create the ID (key) columns that are listed as
attributes above.
3II. CREATING UNIQUE VALUE LISTS FOR EACH ENTITY
Now that we have identified our data entities, we are ready to begin splitting up
our master datasheet.
To begin, we need to make sure that we create unique value lists for the following
columns in the original dataset: artist, category, subcategory, type, seller, and
buyer. These variables will form the basis of our artists, subject categories, object
types, and collectors entity tables. Since the information in these columns is likely to
be repeated in our dataset, we want to make sure that we have just one record per
possible value in each of our entity tables.
Let’s begin by making the new table for object types, since this will only contain two
columns—type_id and type—and only contains values from one of the original
columns.
1. Open a new sheet in your excel workbook.
2. Rename this sheet ‘Object Types.’
3. Title the first two columns in this new sheet ‘type’ and ‘type_id,’ respectively.
4. Go to cell A2 and enter the following formula: =UNIQUE('Knoedler Main'!F:F)
This identifies all of the unique values included in the master datasheet’s column F
(type) and lists them in your new sheet. You should now have the following listed
under type in the Object Types sheet:
• type
• Painting
• Sculpture
• Pastel
• Watercolor
• 0
Since we have called the unique values for all of column F, ‘type’ and ‘0’ have been
included as additional unique values. Obviously, we don’t want to include the
column header or the zero value in our list, but we can’t just delete the items we
4don’t want because this is a formula and doing so would erase all of the other
values. To fix this:
5. Highlight the whole column and right click to cut it.
6. Then, right click on an empty column and select Paste special à values. (This
will paste only the values from the copied cells rather than the formula.
7. Now you have all of the unique values from the original spreadsheet listed
here and can delete the first and last rows containing ‘type’ and ‘0,’
respectively.
8. Cut and paste the table to return it to columns A and B of the spreadsheet.
III. ASSIGNING UNIQUE KEYS
Now we want to assign primary key values for the column type_id. The ID
numbers—unique keys—that we generate here will later be used to create
relationships among our tables.
9. Enter the value ‘1’ in cell B2.
10. Double-click on the lower right corner of the cell to fill down, then hover over
the drop-down list and select Fill Series.
This will assign values 1 through 4 to your types. Although you can begin your ID
numbers with 1 for every table, it is advisable to use different numbering systems
for each entity to avoid confusion. For a small value list such as this single digits
work fine, but for lists of hundreds or more values, you may want to begin with, for
example, a 5-digit number such as 5000. For the purposes of this tutorial you may
choose any numbering format you’d like for each table.
IV. TRY IT YOURSELF – Creating an Entity Table
11. Repeat steps 1 through 10 for the column artist in the master datasheet.
(Remember to adjust your =UNIQUE( ) function to match the column for artist,
and try filling down with different starting ID numbers)
For our Artists entity table we want to include an additional attribute for each
record, so we will need to add an additional column to our table.
512. Title the third column in the table Artists ‘artist_nationality.’
In the next section we will look at how to populate this column with data from our
master datasheet.
V. USING THE VLOOKUP( ) FUNCTION TO FILL ATTRIBUTE COLUMNS
Now that we have our Artists table, we want to fill in the attribute artist_nationality
with information from our master datasheet. Instead of going through this list and
manually adding the nationalities for each artist, we can use the VLOOKUP( )
function in excel to look up an artist’s name in our Artists table, match it to a value
in the column artist from the master datasheet, and return the matching value from
the column artist_nationality for that row.
13. Enter the following formula into cell C2 of the table Artists:
=VLOOKUP(B2,'Knoedler Main'!A:B,2,FALSE)
This looks up the artist name in cell B2 of the Artists table, matches the value to the
same name in column A of the master datasheet, then returns the value in the
second column of the table (column B), containing the nationality for the matched
artists name in column A, to cell C2 of the Artists table. The FALSE indicates that
you only want exact matches rather than approximate matches.
14. Double-click on the lower right corner of the cell to fill down. It should default
to Fill series, but you can always double check by hovering over the drop-
down to see which fill down form is selected.
15. As we did after using the UNIQUE( ) function, select the whole table (all three
columns), and right click to cut the table, and then Paste special à values into
an empty column to remove the formula and save only the returned values.
16. Cut and paste the table to return it to columns A, B and C of the spreadsheet.
VI. USING DATA FROM TWO COLUMNS TO CREATE AN ENTITY TABLE
Since some of the buyers in this dataset might also be sellers and vice versa, it
would be useful to create just one entity table for all “collectors” present in the
master datasheet. In order to do this we will create a unique value list for all of the
names that appear in the buyer and seller columns of our master datasheet.
17. Open a new sheet in the excel workbook.
618. Rename this sheet ‘Collectors.’
19. Copy and paste all of the data from the column buyers into column A this new
sheet.
20. Then, copy and paste all of the data from the column sellers to the end of the
list in column A of the new sheet.
21. In column B of the Collectors sheet, enter the following formula in cell B2:
=UNIQUE(A:A).
This grabs all of the unique values from the combined list of buyers and sellers to
create one unique value list for all collectors.
22. Cut and Paste special à values to place only the text values into cell C2.
23. Delete columns A and B, as well as any rows with ‘0.’
24. Name this column ‘collector_name.’
25. Name what is now column B ‘collector_id.’
26. Assign primary key values to column B, as we did in Part III.
IV. TRY IT YOURSELF – Using Two Columns to Create an Entity Table
27. Repeat steps 17 through 26 for the columns category1 and category2 in the
master datasheet to create the entity table, Subject Categories, with a column
for category and category_id.
VI. RELATING TABLES USING FOREIGN KEYS
Now that we have created four of our five entity tables, we need to create our final
table for our sales entities. This table will have each sale event as a record (row),
populated with the data pertaining to the specifics of the sale (object_title,
dimensions_in, sale_year, sale_month, sale_day, sales_total, and sales_curr) as well
as the foreign keys (ID numbers) for the following variables in our other tables:
artist_id, category1_id, category2_id, type_id, seller_id, buyer_id. We will also
create a sale_id column to act as a primary key in this new table, Sales_1946. Here,
we will again make use of the VLOOKUP( ) function in Excel.
7N.B. For the VLOOKUP ( ) to work for the following steps, the primary key column must be the right-
most column in each of four new datasheets. The column containing the entity name (i.e.
artist_name, type, category, and collector_name) must be in the left-most column of the table. If this
is not already the case, move the key columns to this position.
28. Copy all of the data from the original file into a new sheet.
29. Insert a new column between artist and artist_nationality. Name this column
artist_id.
We will now use the VLOOKUP() function to populate this new column with the keys
for artist names that we created for the Artists table:
30. Enter the following formula into cell B2 of the new table:
=VLOOKUP(A2,Artists!A:C,3,FALSE)
31. Then, fill down (fill series) for all other rows.
32. Repeat this process to create the columns category1_id, category2_id,
type_id, seller_id and buyer_id. Remember that category1_id and
category2_id will both populate from the Subject Categories table, and
seller_id and buyer_id will both populate from the Collectors table.
N.B. If no exact matching value can be found (i.e. there is a NULL value for that particular variable
within that record), the formula will return #N/A to the cell. Leave this for now, we will return to this
later on.
33. Copy all of the data from the sheet you’ve just populated and Paste special à
values into a new sheet.
34. Title this new sheet Sales_1946. You may now delete the sheet you populated
in steps 28 through 32.
Now that we’ve used the keys from our other four tables to create relationships to
Sales_1946, we can delete the columns containing data from our other tables.
35. Delete the following columns from Sales_1946:
• artist_nationality
• category1
• category2
8• type
• seller
• buyer
36. Finally, add a sale_id column for Sales_1946, and fill series to create a primary
key column for the table.
VII. FINISHING UP (optional)
For programs such as MySQL you will want to make sure that any blank cells read as
nulls, you want to make sure that NULL appears in all of these null-value cells. You
can use search and replace in Excel to make sure that this is the case. You will also
want to do this for any values that appear as #N/A, as this will not read as a null in
MySQL.
Finally, now that you have created these five normalized tables based on the
original master datasheet, you can save each of these as a .csv file that can be
uploaded into programs such as phpMyAdmin or MySQL Workbench to populate a
relational database model with existing data.
9You can also read