DATA SCIENCE
MEI/1
University of Beira Interior,
Department of Informatics
Hugo Pedro Proença,
hugomcp@di.ubi.pt, 2020/2021
Key Data Structures in Data Science

• Data structures are used to store data in an organized way, in
 order to make data manipulation efficient.
 • Typically, using ETL processes, data are imported from one (or
 several) databases into this kind of structure.
• Vectors
 • They are one of the most efficient and simple data structures, due to
 their homogeneous nature.
 • In Python, the "NumPy" library is typically used for creating vectors:
 • vec_row = np.array([1, 2, 3])
• Matrices
 • Matrices are two-dimensional data structures, also homogeneous
 (i.e., all elements are of the same type).
 • The "NumPy" library is also typically used to create matrices:
 • matrix = np.array([[1, 2], [3, 4], [5, 6]])
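Putting both structures together, a minimal runnable sketch (the variable names are ours, for illustration only):

 import numpy as np

 # A vector: one-dimensional and homogeneous (all elements share one dtype)
 vec_row = np.array([1, 2, 3])
 print(vec_row.shape)      # (3,)

 # A matrix: two-dimensional, also homogeneous
 matrix = np.array([[1, 2], [3, 4], [5, 6]])
 print(matrix.shape)       # (3, 2)
 print(matrix[0, 1])       # element access by (row, column): 2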
Key Data Structures in Data Science

• Arrays
 • Arrays are the general form of vectors and matrices, and have a multi-
 dimensional shape.
 • Typically, they do not have the homogeneity constraint, i.e., different data
 types can be included in each dimension of the array.
 • In Python, "lists" are the closest semantic data structure to the concept of
 array:
 • A = [[1, 'Volvo'], [2, 'BMW']]
• Data Frames
 • Data frames are 2-dimensional arrays that resemble database tables. Each
 column contains one variable and each row contains one instance.
 • In Python, they are typically created using the "pandas" library:
 • import pandas as pd
 cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus'],
 'Price': [22000, 25000, 27000]
 }
 df = pd.DataFrame(cars, columns=['Brand', 'Price'])
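A brief usage sketch (ours) of the resulting data frame; the mean shown assumes the three-row example above:

 print(df)                        # tabular view: one row per instance
 print(df['Price'].mean())        # column-wise operation: ~24666.67
 print(df[df['Price'] > 23000])   # row filtering, much like a database query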
Key Data Structures in Data Science

• Dictionaries
 • Also known as "hash maps", they support arbitrary keys and values.
 Keys are unique identifiers of instances in the data structure.
 • They are mutable and indexed by key (since Python 3.7, they also
 preserve insertion order).
 • In Python, they are created using curly brackets:
 • D = {1: [1, 2, 3, 4], 'Name': 'Bill'}
• Tuples
 • Tuples regard one instance, where elements are ordered and
 immutable. A tuple can have any number of items of different types.
 • In Python, we simply create a variable with parentheses:
 • tuple1 = ("apple", 1, False)
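A minimal sketch of the practical difference between the two (reusing the examples above):

 D = {1: [1, 2, 3, 4], 'Name': 'Bill'}
 D['Age'] = 30             # dictionaries are mutable: new keys can be added
 print(D['Name'])          # values are retrieved by key: Bill

 tuple1 = ("apple", 1, False)
 print(tuple1[0])          # elements are ordered and indexable: apple
 # tuple1[0] = "pear"      # would raise TypeError: tuples are immutable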
Amortized Analysis and Computational Performance

• Algorithmic complexity is a crucial concept in Data Science.
 Knowing the complexity of algorithms allows us to answer various
 questions:
 • Is the problem solvable?
 • How long will my processing chain take to run?
 • How much space will it take?
• The concept of amortized analysis is closely related to
 Asymptotic Analysis.
• The classical asymptotic analysis aims at analyzing the
 performance of an individual operation asymptotically, as a
 function of the size of the problem.
• The goal is to understand how the performance of a given
 operation will scale to a large data set.
Amortized Analysis and Computational Performance

• The key difference between Asymptotic and Amortized
 Analysis is that the former depends on the input itself,
 while the latter depends on the sequence of operations
 the algorithm will execute.
• In summary:
 • Asymptotic analysis allows us to assert that the complexity of
 the algorithm, when it is given a worst/average-case input of
 size n, is bounded by some function f(n)
 • Amortized analysis allows us to assert that the complexity of
 the algorithm, when it is given an input of unknown
 characteristics but known size n, is no worse than the value
 of a function f(n)
Asymptotic Analysis

• Typically, there are two modes for performing the asymptotic analysis
 of an algorithm (processing chain):
• The worst-case mode considers the cost of each single operation.
• To find the overall cost of the algorithm, we need to find the worst-case
 cost of every single operation and then count the number of their
 executions.
• If an algorithm runs in time T(n), it means that T(n) is an upper bound for
 any input of size n.
 • Even if the algorithm may take less time on some inputs of that size, because
 particular operations may be cheaper for them, the idea is to always count the
 worst cost of every operation in the algorithm.
• The average-case mode aims at obtaining the running time for randomly
 chosen inputs. It is considered harder to obtain, because it requires
 probabilistic arguments and assumptions about the distribution of the
 inputs.
• Despite being harder, it may be a lot more useful, since the worst-case
 analysis is often misleading. For example, the worst-case temporal
 complexity of the quick-sort algorithm is n², while the average case is
 n*log(n).
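To make the difference concrete, here is a small sketch of ours (not the course's code) that counts comparisons in a naive first-pivot quicksort; already-sorted input triggers the n² worst case, while shuffled input stays near n*log(n):

 import random

 def quicksort(a, counter):
     # Naive quicksort using the first element as pivot.
     if len(a) <= 1:
         return a
     pivot, rest = a[0], a[1:]
     counter[0] += len(rest)          # ~one comparison per remaining element
     left = [x for x in rest if x < pivot]
     right = [x for x in rest if x >= pivot]
     return quicksort(left, counter) + [pivot] + quicksort(right, counter)

 n = 500
 for data in (list(range(n)), random.sample(range(n), n)):
     c = [0]
     quicksort(data, c)
     print(c[0])   # ~n²/2 for the sorted input, ~n*log(n) for the shuffled one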
Asymptotic Analysis: Big-O Notation

• The order of growth describes how the time and space
 complexity of an algorithm/processing chain will increase
 with respect to the size of the input.
• There are various notations to measure the order of
 growth, but the most popular is the Big-O notation,
 which gives the worst-case time complexity. For
 instance, f(x) = O(g(x)) means that the growth of the
 function f() will never surpass that of the function g()
 (up to a constant factor).
 • In this setting, g() is the asymptotic upper bound on the time
 complexity of f().
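For example (a worked instance of the definition; the constants are chosen by us for illustration): f(n) = 3n² + 5n + 2 is O(n²), since 3n² + 5n + 2 ≤ 4n² for every n ≥ 6. Here g(n) = n² is the asymptotic upper bound once constant factors are ignored.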
Asymptotic Analysis: Big-O Notation
• For instance, consider a simple nested for loop, where the body
 executes n times for each of the n outer iterations, so the time
 complexity is O(n²):

 for (i = 1; i <= n; i++)
     for (j = 1; j <= n; j++)
         // constant-time work
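A quick Python check of the same idea (our own sketch) confirms the quadratic growth by counting how many times the inner body runs:

 def count_iterations(n):
     # Mirrors the nested loop above: the body executes n * n times.
     count = 0
     for i in range(1, n + 1):
         for j in range(1, n + 1):
             count += 1
     return count

 for n in (10, 100, 1000):
     print(n, count_iterations(n))  # 100, 10000, 1000000: grows as n²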
Asymptotic Analysis: Big-O Notation
• The asymptotic analysis has two major weaknesses:
 • As it ignores constants in the g() function, two algorithms that
 in practice have very different performance will get the same
 asymptotic bound.
 • For example, if one algorithm takes 999*n*log(n) steps, and
 another one takes 2*n*log(n), their asymptotic bound will be
 the same: O(n*log(n))
 • Another weakness is that the worst-case scenario (input) might
 never happen, or have extremely low probability. In practice,
 this means that an algorithm asymptotically slower than another
 might actually perform better, because of the input
 distribution.
Amortized Analysis

• Considering the weaknesses of Asymptotic Analysis, the
 concept of Amortized Analysis can be seen as more reliable,
 particularly for complex processing chains.
• Amortized Analysis aims at understanding how the average
 performance of all the operations on a large data set scales.
• Compared to the average-case mode of Asymptotic Analysis,
 amortized analysis gives an upper bound on the actual cost of
 an algorithm, which the average case doesn't guarantee.
• In summary, it gives the average performance (over time) of
 each operation in the worst case.
Amortized Analysis

• Considering a particular sequence of operations, the worst case
 is not expected to occur very often in each operation.
• In practice, the operations vary in their costs: some may be cheap
 and some may be expensive.
• For example, consider a dynamic array data structure.
• In this kind of data structure, the insertion of elements can take
 different times: constant (while there is free space) or linear
 (when the array has to be resized).
• In this case, if the operations have different costs, how can we
 correctly obtain the total time?
• This is where Amortized Analysis comes into play. It assigns an
 artificial cost to each operation in the sequence, which is called the
 Amortized Cost.
• This way, the total cost of the algorithm is bounded by the sum
 of the amortized costs of all operations.
Amortized Analysis

• There are three methods for obtaining the Amortized Cost:
 • Aggregate Method (brute force);
 • Accounting Method (the banker's method);
 • Potential Method (the physicist's method).
• Aggregate Method
 • Considering the dynamic array as an example, suppose that when
 the array has space available, we simply insert the new item in the first
 available space. Otherwise, the following steps are performed:
 • Allocate memory for a larger array, twice the size of the old one
 • Copy the contents of the old array to the new one
 • Let's assume first that an insertion costs 1 unit and that resizing
 an array costs 1 unit per element in the array.
 • The cost of inserting the i-th element is given by (see the sketch below):
 • Cost(i):
     if (i - 1) is a power of 2
         i;   (copy the existing i - 1 elements, plus 1 insertion)
     else
         1;   (free space available: simple insertion)
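A direct Python translation of this cost function (a sketch of ours; the bitwise power-of-two test and the special case for the first insertion, which also creates the initial space, follow the worked example that comes next):

 def insertion_cost(i):
     # Cost of inserting the i-th element (1-indexed) into a doubling
     # dynamic array: a resize copies the i - 1 existing elements before
     # the insertion itself; otherwise insertion costs 1 unit.
     if i == 1:
         return 2                     # create the first slot, then insert
     if (i - 1) & (i - 2) == 0:       # i - 1 is a power of two: resize
         return i                     # copy i - 1 elements + 1 insertion
     return 1                         # free slot: simple insertion

 print([insertion_cost(i) for i in range(1, 8)])  # [2, 2, 3, 1, 5, 1, 1]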
Amortized Analysis

• The cost of inserting the first element is 2 (create first space and
 insert)
• The cost of inserting the second element is also 2
• The cost of inserting the third element is 3
• The cost of inserting the fourth element is 1
• The cost of inserting the fifth element is 5
• The cost of inserting the sixth element is 1
• Over the first seven insertions, the average cost is
 (2 + 2 + 3 + 1 + 5 + 1 + 1) / 7 ≈ 2.14
• Considering that we omit constants, the amortized cost is O(1)
• Aggregate Analysis determines the upper bound T(n) on the cost
 of “n” operations, and then obtains the amortized cost, given by
 T(n)/n
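Continuing the sketch, the aggregate analysis can be checked empirically with the insertion_cost function defined earlier: T(n)/n stays bounded by a small constant as n grows:

 def amortized_cost(n):
     # Aggregate method: T(n) is the total cost of n insertions,
     # and the amortized cost per operation is T(n) / n.
     total = sum(insertion_cost(i) for i in range(1, n + 1))
     return total / n

 for n in (10, 100, 1000, 10000):
     print(n, amortized_cost(n))  # stays bounded (< 3), i.e., O(1)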
Amortized Analysis
• The Accounting method has a simple rationale. There is an account where
 we can save up time and every operation is allowed to take some time from
 the account.
• The cheap operations help to pay for the most expensive ones. By
 distributing the costs this way, we get some kind of average.
• The assigned cost is too low if some operation drives the balance below
 zero.
• Suppose that we define a cost of "3":
• The cost of inserting the first element is 2 (create first space and insert)
 (Balance = 1, i.e., 3-2)
• The cost of inserting the second element is also 2 (Balance = 2, i.e., 1+3-2)
• The cost of inserting the third element is 3 (Balance = 2)
• The cost of inserting the fourth element is 1 (Balance = 4)
• The cost of inserting the fifth element is 5 (Balance = 2, i.e., 4+3-5)
• The balance never goes negative in this sequence, so "3" is a valid
 amortized cost here. Had the balance gone negative, we would repeat the
 experiment with cost "4".
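The banker's reasoning is easy to test with a short sketch (reusing insertion_cost from before; the charge value is the parameter under test):

 def balance_stays_nonnegative(charge, n):
     # Accounting method: every operation deposits `charge` units and
     # withdraws its real cost; the charge is valid if the balance
     # never drops below zero over the whole sequence.
     balance = 0
     for i in range(1, n + 1):
         balance += charge - insertion_cost(i)
         if balance < 0:
             return False
     return True

 print(balance_stays_nonnegative(3, 10000))  # True: a charge of 3 suffices
 print(balance_stays_nonnegative(2, 10000))  # False: 2 is too low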
Amortized Analysis
• The Potential Method is based on a potential function Φ that should have two
 properties: Φ(h_0) = 0, where h_0 is the initial state of the structure, and
 Φ(h) ≥ 0 for every state h.
• The amortized time of an operation is then given by: c + Φ(h_i) - Φ(h_{i-1}),
 where c is the real cost of the operation, h_i the state of the structure after
 the operation, and h_{i-1} the corresponding state before the operation.
• Ideally, Φ should be defined such that the amortized time of each operation is
 small: the change in potential should be positive for cheap operations and
 negative for expensive operations.
• Considering the previous case of dynamic arrays, if:
 • Φ(h) = 2n - m, where n is the number of elements in the array and m is the
 array length (capacity)
• We have two cases:
 • n < m: the actual cost is 1, n increases by 1, and m does not change. The
 potential increases by 2, so the amortized time is 1 + 2 = 3.
 • n = m: the array is doubled, so the actual time is n + 1. But the potential drops
 from n to 2, so the amortized time is n + 1 + (2 - n) = 3.
• In both the aforementioned cases, the amortized time is O(1)
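A final sketch (ours) verifying this numerically: with Φ = 2n - m, every insertion into a doubling dynamic array has constant amortized time (3, apart from the initial allocation):

 def amortized_times(num_inserts):
     # Potential method: amortized time = real cost + Φ(after) - Φ(before),
     # with Φ = 2*n - m (n = stored elements, m = current capacity).
     n, m = 0, 0
     times = []
     for _ in range(num_inserts):
         phi_before = 2 * n - m
         if n == m:                # array full: double capacity, copy n items
             m = max(1, 2 * m)
             real_cost = n + 1     # copy n elements, then insert one
         else:
             real_cost = 1         # free slot: simple insertion
         n += 1
         times.append(real_cost + (2 * n - m) - phi_before)
     return times

 print(amortized_times(8))  # [2, 3, 3, 3, 3, 3, 3, 3]: constant, i.e., O(1)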