Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware - David Broneske, Gabriel Campero, Bala Gurumurthy, Andreas ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware David Broneske, Gabriel Campero, Bala Gurumurthy, Andreas Meister, Marcus Pinnecke, Roman Zoun, Gunter Saake October 13, 2017
Overview I Concepts of this course I Overview of project topics & forming project teams I Course of action (milestones, presentations) I How to perform literature research? David Broneske et al. Scientific Project 2
Overview I Concepts of this course I Overview of project topics & forming project teams I Course of action (milestones, presentations) I How to perform literature research? I Further lectures: I Academic writing (2-3 lectures) David Broneske et al. Scientific Project 2
Organization
Scientific Project: Course Grading Bachelor I Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen) I 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous work Master I Module: Scientific Team Project I 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous work Grade at the end of the course for the whole project team David Broneske et al. Scientific Project 4
Scientific Project: Course Grading II I Weighting the Grade: I Presentations: 30%, I Implementation: 30%, I Paper: 30%, I Soft Skills: 10% I Binding registration: Second Milestone David Broneske et al. Scientific Project 5
Scientific Project: Prerequisite I Successful programming test in C++/Java I 1h theoretical test in a seminar room (data and place to be discussed) I Half of the team members have to pass the test I Topics: I David Broneske et al. Scientific Project 6
Scientific Project: Semester Plan Monday Tuesday Wednesday Thursday Friday Introduction 09.10.2017 10.10.2017 11.10.2017 12.10.2017 13.10.2017 16.10.2017 17.10.2017 18.10.2017 19.10.2017 20.10.2017 23.10.2017 24.10.2017 25.10.2017 26.10.2017 27.10.2017 MS-I 30.10.2017 31.10.2017 01.11.2017 02.11.2017 03.11.2017 Aca-I 06.11.2017 07.11.2017 08.11.2017 09.11.2017 10.11.2017 Aca-II 13.11.2017 14.11.2017 15.11.2017 16.11.2017 17.11.2017 MS II 20.11.2017 21.11.2017 22.11.2017 23.11.2017 24.11.2017 27.11.2017 28.11.2017 29.11.2017 30.11.2017 01.12.2017 04.12.2017 05.12.2017 06.12.2017 07.12.2017 08.12.2017 MS III 11.12.2017 12.12.2017 13.12.2017 14.12.2017 15.12.2017 18.12.2017 19.12.2017 20.12.2017 21.12.2017 22.12.2017 Winter Break ... ... ... ... ... 08.01.2018 09.01.2018 10.01.2018 11.01.2018 12.01.2018 15.01.2018 16.01.2018 17.01.2018 18.01.2018 19.01.2018 MS Final 22.01.2018 23.01.2018 24.01.2018 25.01.2018 26.01.2018 David Broneske et al. Scientific Project 7
Scientific Project: Milestones I Milestone I - Topic, schedule, and team presentation & first results of literature research I Milestone II - Concept & additional literature research I Milestone III - Implementation & evaluation setup I Milestone IV - Final presentation (wrap-up + evaluation results) David Broneske et al. Scientific Project 8
Concepts & Content
Lecture, Meetings & Presentation Lecture & Presentation I Time/Place: Friday, 9:00-11:00, G29 - room K058 I Lectures with content of course → all I Presentation of main milestones (see time table) → each project team Meetings (Exercise) I Individual for each project team I Time and room to be agreed in project teams! I Presentation of all intermediate results/milestones (informal) I Discussion, discussion, discussion . . . David Broneske et al. Scientific Project 10
Objectives & Qualification (I) Acquired skills, specific to research I Performing literature research I Understanding and structured reviewing of scientific work I Autonomous, solution-based reasoning on research task (e.g., finding alternative solutions) I How to ask? How to adapt a task (extend/reduce)? I Academic writing David Broneske et al. Scientific Project 11
Objectives & Qualification (II) Acquired skills, always needed I Team management I Project and time scheduling I Presentation of results I Flexibility regarding changing conditions I Reasoning about solutions (”Why is this the best/not adequate. . . ”) David Broneske et al. Scientific Project 12
Progress of Course Deliveries I 4 milestone presentations (main milestones) I Each team member has to present at least once I Reporting of (sub) milestones in exercises/meetings I Written paper about literature research (technical report) I Management report I Prototypical implementation David Broneske et al. Scientific Project 13
Deliveries and Grading (I) Technical Report I Delivery of report at a given time (deadline) I Quality/Quantity of literature research I Number of pages I Quality of paper structure and evaluation I Own contribution David Broneske et al. Scientific Project 14
Deliveries and Grading (II) Management Report I Description of project realization (timeline, milestones) I Separation of roles and contributions of single team members I Meeting protocols I Self-evaluation of member and group work (strengths, weaknesses) David Broneske et al. Scientific Project 15
Deliveries and Grading (III) Presentation & Discussion I Quality of scientific presentation (structure, references, time) I Assessment regarding the content (e.g., results of particular milestones) I Participation of discussion Organization I Strictness I Communication (just-in-time answers, satisfying time constraints) I Self-organization (Sharing tasks, internal reporting of current state-of-work, dealing with problems) I Autonomous working David Broneske et al. Scientific Project 16
Task & Time Management Task Management I Main milestones have to be finished in time I (Sub) milestones are less strict (but don’t be sloppy) I Pre-defined work packages ⇒ each project team I . . . defines sub work packages I . . . determines responsibilities for these packages (divide&conquer) Time Management I Planning of periods I Regarding capacities and resources I Considering other tasks and activities I Reporting of delays immediately to project members ! David Broneske et al. Scientific Project 17
Role Management I Possible roles: team leader, developer, researcher, . . . I work together vs. responsibilities: design, implementation, testing, writing, . . . I Delegate for important roles/work packages I Assignment of (sub) tasks to role for each milestone David Broneske et al. Scientific Project 18
Topic & Project Teams I Teams with 3 to 5 students (depends on the task) I Most tasks can be chosen once I Projects I Theoretical part I State of the art I New ideas I Practical part I Usually in C++ or Java I Prototypical implementation I Evaluation part David Broneske et al. Scientific Project 19
Topic 1 - Join-Order Optimization Intro I Join Order Optimization needed for efficient query processing → NP-hard problem I For 3-5 Bachelor/Master Students Task I What are common techniques? (Top-Down-Approaches ...) I What are used optimization within Top-Down-Approaches? I Prototypical implementation using C++ I Tune algorithms to performance (e.g., using a profiler) I Evaluate their performance and draw conclusions I Compare to other algorithms David Broneske et al. Scientific Project 20
Topic 2 - GridFormation.FPM Intro In the GridStore (our novel, in-house storage engine) a table is I partitioned into grids. Grids can include replicated & non-consecutive parts of the table. They can also overlap. I Collaborate with us in applying frequent pattern mining (FPM) to optimize data partitioning for the GridStore. I For 3-5 Bachelor/Master Students Your Task I Literature Research: FPM, Data Partitioning, Flexible Storage Engines. I Prototypical implementation and tuning of an FPM strategy for data partitioning. This prototype should incorporate novel features relevant to the use case (e.g. add biases towards some partitions, penalties for replication, or others). I Experimental evaluation & suggestion of improvements. David Broneske et al. Scientific Project 21
Topic 3 - Workload Forecasting w/TensorFlow Intro DBMSs should be able to use reliable workload forecasts, for I optimizing their performance. I Recurrent Neural Networks can produce such forecasts. I What are the challenges and opportunities for integrating efficient RNN forecasting into a DBMS? I For 3-5 Bachelor/Master Students Your Task I Literature research: Deep Learning for forecasting. I Prototypical implementation and tuning of a RNN for forecasting database workloads, using TensorFlow. I Evaluate the performance of the model (incl. GPU vs CPU), the quality of predictions and how a prototypical storage engine could benefit from them. I Suggest improvements and outline limitations. David Broneske et al. Scientific Project 22
Topic 4 - SIMD-Accelerated Hash Based Aggregation Intro Hashing can be used to perform grouped aggregation queries I While probing, insertion could be necessary I I Can be optimized using SIMD I For 3-5 Bachelor/Master students We’ve got I Framework for performing hash insertion while probing I Cuckoo hashing and linear probing with SIMD Your Task I Literature Research: Find additional hashing techniques I Understand working of these hashing techniques I Invent a concept to perform SIMD on hashed aggregation I Implementation of SIMD accelerated hash based aggregation I Performance comparison of new hashing techniques with existing ones David Broneske et al. Scientific Project 23
Elf I Multi-column selection predicates in DWH applications Overhead when scanning several columns I → Elf: Multi-dimensional main-memory index structure1 I Tree structure with fixed search path depending on # of columns I Prefix-redundancy elimination I Storage optimizations 1 www.elf.ovgu.de David Broneske et al. Scientific Project 24
Topic 5 - Cold Data Avoidance for Elf Intro I Cold data traversal for queries on a little amount of columns I Worst case: Mono-column selection predicates I For 3-5 Bachelor/Master students We’ve got I Elf implementation Your Task I Literature Research: Related index structures and cold data management I Understanding of the Elf and its optimization concepts I Implementation of Elfs for Mono-column selection predicates, Pointers into TID lists I Performance evaluation of the variants I Investigate ratio of storage overhead and performance gain David Broneske et al. Scientific Project 25
Topic 6 - SIMD for Elf Intro I Elf nodes similar to B-tree nodes I Zeuch et al. introduced SIMD for B-tree I For 3-5 Bachelor/Master students We’ve got I Elf implementation Your Task I Literature Research: SIMD for tree structures I Understanding of the Elf and its optimization concepts I Implementation of SIMD Elf and its performance evaluation David Broneske et al. Scientific Project 26
Topic 7 - Sort Queries in Elf Intro I Sorting is a data-intensive task I Elf stores data already sorted → column order determines effectiveness I For 3-5 Bachelor/Master students We’ve got I Elf implementation Your Task I Literature Research: Sorting queries on partially indexed data I Understanding of the Elf and its optimization concepts I Implementation of additional sortings for Elfs I Performance evaluation compared to a sort from scratch David Broneske et al. Scientific Project 27
Finding your Team Topics: I Topic 1 - Join-Order Optimization I Topic 2 - GridFormation.FPM I Topic 3 - WFTF!- Workload Forecasting w/TensorFlow (&RNNs) I Topic 4 - SIMD-Accelerated Hash Based Aggregation I Topic 5 - Cold Data Avoidance for Elf I Topic 6 - SIMD for Elf I Topic 7 - Sort Queries in Elf David Broneske et al. Scientific Project 28
Literature Research
How to Perform Literature Research I Efficient literature research requires I Knowledge of Where to search I Knowledge of How to search I Finding adequate search terms I Structured review of papers I Knowledge of how to find information in papers David Broneske et al. Scientific Project 30
Where to Search (I) I Different websites available that provide large literature databases 1. Google Scholar: http://scholar.google.de/ I Key word and conrete paper search I Often, PDFs are provided 2. DBLP: http://www.informatik.uni-trier.de/˜ley/db/ I Search for keyword, conferences, journals, author(s) I BibTex and references to other websites 3. Citeseer: http://citeseerx.ist.psu.edu/about/site I keyword, fulltext, author, and title search I BibTex and (partially) PDFs are provided David Broneske et al. Scientific Project 31
Where to Search (II) I Publisher sites are also a suitable target I ACM Digital Library: http://portal.acm.org/dl.cfm I Keyword, author, conference/literature (proceedings), and title search I Bibtex, mostly PDFs and other information are provided I IEEE Xplore: http://ieeexplore.ieee.org/Xplore/ guesthome.jsp?reload=true I Similar to ACM, but only few PDFs I Extended access within university network I Springer: http://www.springerlink.de/ I Similar to previous I Extended access within university Network I Further search possibilities: on author, research group or university sites David Broneske et al. Scientific Project 32
How to Search Some hints to not get lost in the jungle I Use distinct keywords (fingerprint vs. fingerprint data) I Keep keywords simple (at most three words) I Otherwise, search for whole title I Read abstract (and maybe introduction) ⇒ decision for relevance First insights I Read abstract, introduction and background/related work (coarse-grained) to I . . . get a first idea of the approach I . . . find other relevant papers David Broneske et al. Scientific Project 33
Information Retrieval Finding the required information I Read the paper carefully I Omit formal parts/sections I Try to classify (core idea, main characteristics) ⇒ develop classification/evaluation in mind I Understand the big picture I Make notes I Do NOT translate each sentence David Broneske et al. Scientific Project 34
Finding your Team Topics: I Topic 1 - Join-Order Optimization I Topic 2 - GridFormation.FPM I Topic 3 - WFTF!- Workload Forecasting w/TensorFlow (&RNNs) I Topic 4 - SIMD-Accelerated Hash Based Aggregation I Topic 5 - Cold Data Avoidance for Elf I Topic 6 - SIMD for Elf I Topic 7 - Sort Queries in Elf When do we meet for the programming test? David Broneske et al. Scientific Project 35
You can also read