Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware David Broneske, Gabriel Campero, Bala Gurumurthy, Marcus Pinnecke, Gunter Saake , October 18, 2019 1 / 37
Overview • Concepts of this course • Course of action (milestones, presentations) • Overview of project topics & forming project teams • How to perform literature research? 2 / 37 Scientific Project David Broneske et al.
Overview • Concepts of this course • Course of action (milestones, presentations) • Overview of project topics & forming project teams • How to perform literature research? • Further lectures: Academic writing (2 lectures) 3 / 37 Scientific Project David Broneske et al.
Organization
Scientific Project: Modules Bachelor • Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen) • 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous work Master • Module: Scientific Team Project (Inf, IngInf, WIF, CV) DKE: Methods 2 or Applications DE: Interdisciplinary Team Project, Specialization • 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous work Grade at the end of the course for the whole project team 5 / 37 Scientific Project David Broneske et al.
Scientific Project: Prerequisite • Successful programming test in C++/Java/Python • 1h theoretical test in a seminar room (data and place to be discussed) • Half of the team members have to pass the test • Topics: Some language specifics General program understanding Control flow understanding • You can take all tests and have to pass at least one! 6 / 37 Scientific Project David Broneske et al.
Scientific Project: Semester Plan Monday Tuesday Wednesday Thursday Friday Introduction 14.10.2019 15.10.2019 16.10.2019 17.10.2019 18.10.2019 Team Formation 21.10.2019 22.10.2019 23.10.2019 24.10.2019 25.10.2019 Final Teams 28.10.2019 29.10.2019 30.10.2019 31.10.2019 01.11.2019 04.11.2019 05.11.2019 06.11.2019 07.11.2019 08.11.2019 MS-I 11.11.2019 12.11.2019 13.11.2019 14.11.2019 15.11.2019 Aca-I 18.11.2019 19.11.2019 20.11.2019 21.11.2019 22.11.2019 Aca-II 25.11.2019 26.11.2019 27.11.2019 28.11.2019 29.11.2019 MS II 02.12.2019 03.12.2019 04.12.2019 05.12.2019 06.12.2019 09.12.2019 10.12.2019 11.12.2019 12.12.2019 13.12.2019 16.12.2019 17.12.2019 18.12.2019 19.12.2019 20.12.2019 Winter Break ... ... ... ... ... MS III 06.01.2020 07.01.2020 08.01.2020 09.01.2020 10.01.2020 13.01.2020 14.01.2020 15.01.2020 16.01.2020 17.01.2020 20.01.2020 21.01.2020 22.01.2020 23.01.2020 24.01.2020 MS Final 27.01.2020 28.01.2020 29.01.2020 30.01.2020 31.01.2020 7 / 37 Scientific Project David Broneske et al.
Scientific Project: Milestones • Milestone I - Topic, schedule, and team presentation & first results of literature research • Milestone II - Concept & additional literature research • Milestone III - Implementation & evaluation setup • Milestone IV - Final presentation (wrap-up + evaluation results) 8 / 37 Scientific Project David Broneske et al.
Concepts & Content
Lecture, Meetings & Presentation Lecture & Presentation • Time/Place: Friday, 13:00-15:00, G22A - 208 • Lectures with content of course → all • Presentation of main milestones (see time table) → each project team Meetings (Exercise) • Individual for each project team • Time and room to be agreed in project teams! • Presentation of all intermediate results/milestones (informal) • Discussion, discussion, discussion . . . 10 / 37 Scientific Project David Broneske et al.
Progress of Course Deliveries • 4 milestone presentations (main milestones) • Each team member has to present at least once • Reporting of (sub) milestones in exercises/meetings • Written paper about literature research (technical report) • Prototypical implementation 11 / 37 Scientific Project David Broneske et al.
Deliveries and Grading (I) Technical Report • Delivery of report at a given time (deadline) • Quality/Quantity of literature research • Number of pages • Quality of paper structure and evaluation • Own contribution 12 / 37 Scientific Project David Broneske et al.
Deliveries and Grading (II) Presentation & Discussion • Quality of scientific presentation (structure, references, time) • Assessment regarding the content (e.g., results of particular milestones) • Participation of discussion Organization • Strictness • Communication (just-in-time answers, satisfying time constraints) • Self-organization (Sharing tasks, internal reporting of current state-of-work, dealing with problems) • Autonomous working 13 / 37 Scientific Project David Broneske et al.
Deliveries and Grading (III) • Grade consists of: Presentations: 30%, Implementation: 30%, Paper: 30%, Soft Skills: 10% • Binding registration: Second Milestone 14 / 37 Scientific Project David Broneske et al.
Objectives & Qualification (I) Acquired skills, specific to research • Performing literature research • Understanding and structured reviewing of scientific work • Autonomous, solution-based reasoning on research task (e.g., finding alternative solutions) • How to ask? How to adapt a task (extend/reduce)? • Academic writing 15 / 37 Scientific Project David Broneske et al.
Objectives & Qualification (II) Acquired skills, always needed • Team management • Project and time scheduling • Presentation of results • Flexibility regarding changing conditions • Reasoning about solutions (”Why is this the best/not adequate. . . ”) 16 / 37 Scientific Project David Broneske et al.
Task & Time Management Task Management • Main milestones have to be finished in time • (Sub) milestones are less strict (but don’t be sloppy) • Pre-defined work packages ⇒ each project team . . . defines sub work packages . . . determines responsibilities for these packages (divide&conquer) Time Management • Planning of periods • Regarding capacities and resources • Considering other tasks and activities • Reporting of delays immediately to project members ! 17 / 37 Scientific Project David Broneske et al.
Role Management • Possible roles: team leader, developer, researcher, . . . • work together vs. responsibilities: design, implementation, testing, writing, . . . • Delegate for important roles/work packages • Assignment of (sub) tasks to role for each milestone 18 / 37 Scientific Project David Broneske et al.
Topic & Project Teams • Teams with 4 to 6 students • Most tasks can be chosen once • Projects Theoretical part • State of the art • New ideas Practical part • Usually in C++, Java, or Python • Prototypical implementation • Evaluation part 19 / 37 Scientific Project David Broneske et al.
Topic 1 - Fragment Skipper (Gridformation) Intro: Data skipping in Hadoop • A good data fragmentation is necessary for scalable processing in Big Data. • Aggressive data skipping is the SOTA ML approach, using hierarchical agglomerative clustering with Ward’s method. • We developed a competitor based on deep reinforcement learning. How good is our solution? (can we make it better?) Your Task • Literature research: aggressive data skipping, basics on deep reinforcement learning. • Prototypical implementation of aggressive data skipping using Spark. Experimental evaluation and analysis comparing with our DRL solution. • Long-term goal: A production-ready open source AI-based 20 / 37 partitioning Scientific Project tool & a paper for consideration in SysML. David Broneske et al.
Topic 2 - Similarity Skipper (Sim-Skip) Intro: Learning to hash for high dimensional data management • Top-k search in dense high-dimensional vectors requires specialized solutions. This is a very relevant application. • When data is large, even parallel scans will be inefficient. • Learned hashing is the current SOTA for managing such kind of data in image domains. How good does this technique perform on structured relational data? Your Task • Literature research: Hadoop file formats, deep hashing, triplet training of neural networks. • Prototypical implementation of a deep hashing process using Tensorflow and Spark. Experimental evaluation and analysis. • Long-term goal: A workshop paper in DEEM@SIGMOD. 21 / 37 Scientific Project David Broneske et al.
Topic 3 - Graphs processing w/ML (MLGQE:Malko) Intro: Applications of deep learning for graph data processing • Graphs are everywhere & graph technologies are a moving target. To date still very heuristic-driven. ML is barely tested. • Differential neural computers are already able to act like primitive graph query engines. • How good do models like these fare nowadays for graph processing, and what needs to be improved? Your Task • Literature research: differential neural computers, graph nets, RDF3x. • Prototypical implementation using existing libraries from Deepmind, selected datasets. Experimental evaluation and analysis. • Long-term goal: A workshop paper in GRADES/AIDM@SIGMOD, or AIDB@VLDB. 22 / 37 Scientific Project David Broneske et al.
Topic 4 - Cross device data parallel query processing Intro • Heterogeneous hardware used for improving performance • Work partitioning is problematic and requires training data for best device selection • Further, data parallel execution cannot be decided in prior We’ve got • Dispatcher implementation • Iterative functional parallel executor Your Task • Literature Research: data-parallel, iterative and cross-device execution systems • Understanding of morsel driven parallelism • Invention of a clever concept to perform data as well as functional parallel execution across devices • Implementation of your concept for missing database operators - hash join • Benchmarking the execution system 23 / 37 Scientific Project David Broneske et al.
Topic 5 - Evaluating GPU-based Execution Models Intro • Query processing models for GPU: block-at-a-time & compiled execution • Block-at-a-time : execute data bulk/function after next in query pipeline • Compiled: generates execution code in runtime and execute them together • Not sure which execution model is most suitable for GPUs We’ve got • GPU accelerated DBMS - CoGaDB • Functionalities for b-a-a-t and compiled execution • Concepts for evaluating of models in stand-alone CPU Your Task • Literature Research: Execution models for heterogeneous hardware • Understanding execution of CoGaDB • Developing evaluation suite for GPUs • Implementation missing operations and the evaluation suite constructs • Critical analysis on the results and investigation on the resultant values 24 / 37 Scientific Project David Broneske et al.
Topic 6 - Indepth Semi-Structured Data Storage Intro • Management of semi-structured data (e.g., JSON) has become daily business • A widespread of formats for physical storage of JSON-like data exists today (e.g., PlainText, BSON, CBor, Avro, Parquet, FlatBuffers, MessagePack, Smile, UBJSON, Ion, Carbon) • A comprehensive comparison about design goals, concrete data model, limits and benefits, as well as operators, on an abstract, theoretical level is missing, though We’ve got • Carbon (Columnar Binary Json) implementation and specification as starting point • Running theses on evaluation of Carbon vs BSON Your Task • Literature Research: Semi-Structured Data Model (math model for JSON-like data), and existing evaluations for formats mentioned above → understanding on an abstract level • Research for existing comparisons, design goals, and applications of models above • Creation of common theoretical model and common taxonomy for formats above • List missing evaluations, top 5 formats, and come up with at least 5 further advanced and reasonable metrics not considered so far • Implementation of an evaluation framework with new metrics and missing evaluations as proof of concepts for at least 5 formats (PlainText, BSON, UBJSON, and Carbon excluded) 25 / 37 Scientific Project David Broneske et al.
Topic 7 - Order-By Queries in Elf Intro • Elf: multi-dimensional main memory index structure for efficient selections • Stores data sorting in a multi-dimensional order • Common data-intensive operator: Sorting We’ve got • Elf implementation in C++ Your Task • Literature Research: Related index structures and sorting algorithms • Understanding of the Elf and its optimization concepts • Implementation Sorting Operator for Elf • Performance evaluation against sequential scans 26 / 37 Scientific Project David Broneske et al.
ceptual design In summary, our index s nhepredicates following, Topic in 7 -weOrder-By explain theQueries basic design with the in Elf of a fixed height resulting i for efficient multi-column s the example data in Table II. The data set shows four proach s to be indexed and a tuple identifier (TID) that uniquely conceptual level. To further es each row (e.g., the row id in a column store). need to optimize the memor Beispieldaten C C C3 C4 ... TID Reo Martin 1Schäler2 rgan 0 1 0 1 ... T1 isati Karlsruhe Institute of Technology 0 2 0 0 ... T2 on B. Improving Elf’s memory martin.schaeler@kit.edu 1 0 1 0 ... T3 The straight-forward impl TABLE II. RUNNING EXAMPLE DATA and l shipdate < [DATE] + ’1 year’ structures used in other tree [DISCOUNT] 0.01 and [DISCOUNT] + 0.01 ANTITY] this creates an OLTP-optimi ig. 2, we depict the resulting Elf forElf-Datenstruktur Performanzgewinne the four indexed call InsertElf. To enhance O s of the example data from Table II. The Elf tree struc- an explicit memory layout, m Response Time in ms (1) 0 1 ps distinct 150 values of one column to DimensionLists Column C1 an array of integer values. F ecific level100in the tree. In the first column, there are 1 2 0 1 0 T3 (2) (3) assume that column values Column C2 inct values, 0 and 1. Thus, the first DimensionList, integer values. However, our Column + C3 ontains two50entries and one pointer for each entry. The 1 T1 0 T2 (4) 0 (5) 0 data type. Thus, we can also Column C4 ve pointer points to the beginning of the respective for values, which is the mos 3 q of the second column, L(2) and L(3) . Note, f sionList 6. El Se Q (c) D M rst two points share the same value in the first column, SI ) selectivity, and (c) response time of TPC-H 1) Mapping DimensionL erve .1 - Q6.3 a prefix table on Lineitem redundancy scale factor 100 elimination. In the second entries – in the following n , we cannot eliminate any prefix redundancy, as all of Elf, we use two intege e fact combinations neglected by thisinapproach 27 / 37 Scientific Project this column is David Broneske et al. are unique. As a result, performance impact for s ectivity of the multi-column selection
Topic 8 - Parallel Build of Elf Intro • Elf: multi-dimensional main memory index structure for efficient selections • Stores data sorting in a multi-dimensional order • Building involves a multi-dimensional sort → expensive We’ve got • Elf implementation in C++ Your Task • Literature Research: Parallelization strategies for building index structures • Understanding of the Elf and its optimization concepts • Implementation different parallelization strategies for Elf • Performance evaluation against standard implementation 28 / 37 Scientific Project David Broneske et al.
Topic 9 - GPU-Accelerated Selections in Elf Intro • Elf: multi-dimensional main memory index structure for efficient selections • Stores data sorting in a multi-dimensional order • Multi-core architectures demand for a clever parallelization strategy for GPUs We’ve got • Elf implementation in C++ • Parallel search and insert for CPU Your Task • Literature Research: Related parallelization strategies for index structures • Understanding of the Elf and its optimization concepts • Implementation GPU traversal variants for Elf • Performance evaluation against serial/CPU-parallel implementation 29 / 37 Scientific Project David Broneske et al.
Finding your Team Topics: • Topic 1 - Fragment Skipper (Gridformation) • Topic 2 - Deep hashing (Sim-Skip) • Topic 3 - Graph processing in a neural network (Malko) • Topic 4 - Cross-device data parallel query processing • Topic 5 - Evaluating GPU-based Execution Models • Topic 6 - Indepth Semi-Structured Data Storage • Topic 7 - Order-By Queries on Elf • Topic 8 - Parallel Build of Elf • Topic 9 - GPU-Accelerated Selections in Elf When do we meet for the programming test? 30 / 37 Scientific Project David Broneske et al.
Literature Research
How to Perform Literature Research • Efficient literature research requires Knowledge of Where to search Knowledge of How to search Finding adequate search terms Structured review of papers Knowledge of how to find information in papers 32 / 37 Scientific Project David Broneske et al.
Where to Search (I) • Different websites available that provide large literature databases 1. Google Scholar: http://scholar.google.de/ Key word and conrete paper search Often, PDFs are provided 2. DBLP: http://www.informatik.uni-trier.de/~ley/db/ Search for keyword, conferences, journals, author(s) BibTex and references to other websites 3. Citeseer: http://citeseerx.ist.psu.edu/about/site keyword, fulltext, author, and title search BibTex and (partially) PDFs are provided 33 / 37 Scientific Project David Broneske et al.
Where to Search (II) • Publisher sites are also a suitable target • ACM Digital Library: http://portal.acm.org/dl.cfm Keyword, author, conference/literature (proceedings), and title search Bibtex, mostly PDFs and other information are provided • IEEE Xplore: http: //ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true Similar to ACM, but only few PDFs Extended access within university network • Springer: http://www.springerlink.de/ Similar to previous Extended access within university Network • Further search possibilities: on author, research group or university 34 / 37 sites David Broneske et al. Scientific Project
How to Search Some hints to not get lost in the jungle • Use distinct keywords (fingerprint vs. fingerprint data) • Keep keywords simple (at most three words) • Otherwise, search for whole title • Read abstract (and maybe introduction) ⇒ decision for relevance First insights • Read abstract, introduction and background/related work (coarse-grained) to . . . get a first idea of the approach . . . find other relevant papers 35 / 37 Scientific Project David Broneske et al.
Information Retrieval Finding the required information • Read the paper carefully • Omit formal parts/sections • Try to classify (core idea, main characteristics) ⇒ develop classification/evaluation in mind • Understand the big picture • Make notes • Do NOT translate each sentence 36 / 37 Scientific Project David Broneske et al.
Finding your Team Topics: • Topic 1 - Fragment Skipper (Gridformation) • Topic 2 - Deep hashing (Sim-Skip) • Topic 3 - Graph processing in a neural network (Malko) • Topic 4 - Cross-device data parallel query processing • Topic 5 - Evaluating GPU-based Execution Models • Topic 6 - Indepth Semi-Structured Data Storage • Topic 7 - Order-By Queries on Elf • Topic 8 - Parallel Build of Elf • Topic 9 - GPU-Accelerated Selections in Elf 37 / 37 Scientific Project David Broneske et al.
You can also read