Student Conference 2019 Topics - Gunter Saake, Jacob Krüger

Database Operation Tuning (David Broneske)
A current trend in database systems is to tune algorithms at a very fine granularity.
Code optimizations are discussed controversially, and clear guidance on when they
apply is missing. Consequently, discuss the applicability of a subset of the available
code optimizations to selected database algorithms.

• Bogdan Raducanu, Peter Boncz, Marcin Zukowski: Micro Adaptivity in Vectorwise
• Jingren Zhou, Kenneth A. Ross: Implementing Database Operations Using SIMD Instructions
• John L. Hennessy, David A. Patterson: Computer Architecture -- A Quantitative Approach

Database Operations on Modern Processing Devices (David Broneske)
Tuning database operations to the underlying hardware is a hot topic with the
increasing usage of co-processors. There are numerous publications involving
different algorithms and processing devices. Create a survey regarding database
operations on different processing devices.

• Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, Dinesh Manocha: Fast Computation of
  Database Operations Using Graphics Processors
• Rene Müller, Jens Teubner, Gustavo Alonso: Data Processing on FPGAs
• Thomas Willhalm, Yazan Boshmaf, Hasso Plattner, Nicolae Popovici, Alexander Zeier, Jan
  Schaffner: SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units

Evolution of column-oriented RDBMS operations (Bala Gurumurthy)
A current trend in RDBMSs is the close-to-the-metal re-implementation of
typical DBMS operations for the underlying hardware. With newer features
(such as multi-core and SIMD) and device architectures (GPUs, FPGAs) available
in the hardware landscape, research is being done on tuning operations to the
hardware. In this work, we survey the evolution of DBMS operations, using the
arrival of newer hardware as reference points. In the end, the work
provides a view of the hardware landscape alongside the changes applied to
DBMS operations, and identifies areas of dense and sparse research.

• GPU-Accelerated Database Systems: Survey and Open Challenges - Sebastian Breß
• Accelerating SQL database operations with CUDA - Peter Bakkum
• Relational co-processing in graphics processors - Bingsheng He
• Implementing Database Operations Using SIMD Instructions - J Zhou

GPU Cache management techniques for data processing environment (Bala Gurumurthy)
Due to the limited cache space in a GPU, not all input data can be stored and processed on
the GPU. As an alternative, it has been proposed to keep hot input data buffers on the GPU
for further processing without transfer overhead. In this work, we look into the issue of
caching on GPUs and list the possible alternatives. Since columns cannot be stored directly
within a GPU, we look for alternative data representations (such as bitmaps or position
lists) that are still sufficient for performing database operations on them.
Overall, the work presents the state-of-the-art techniques for intermediate representations
of columns on a GPU as well as the buffer management techniques used for caching
on GPUs.

• Waste Not... Efficient Co-Processing of Relational Data - Holger Pirk
• In-cache query co-processing on coupled CPU-GPU architectures - Jiong He
• Efficient Data Management for GPU Databases - Peter Bakkum
• Techniques for Caches in GPUs - Guenther Schindler
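
The two intermediate representations named in the description, bitmaps and position lists, can be illustrated with a minimal Python sketch. It runs on the CPU and only shows the data layout and how selections combine; it is not a GPU implementation. The function names, sample columns, and predicates are hypothetical and chosen for illustration only.

```python
from typing import Callable, List

def to_position_list(column: List[int], predicate: Callable[[int], bool]) -> List[int]:
    """Position list: the row indices of all qualifying values."""
    return [pos for pos, value in enumerate(column) if predicate(value)]

def to_bitmap(column: List[int], predicate: Callable[[int], bool]) -> int:
    """Bitmap: one bit per row, bit i is set when row i qualifies."""
    bitmap = 0
    for pos, value in enumerate(column):
        if predicate(value):
            bitmap |= 1 << pos
    return bitmap

def bitmap_to_positions(bitmap: int) -> List[int]:
    """Convert a bitmap back into a position list."""
    positions = []
    pos = 0
    while bitmap:
        if bitmap & 1:
            positions.append(pos)
        bitmap >>= 1
        pos += 1
    return positions

# Hypothetical sample columns of one table (same row order).
prices = [5, 42, 17, 8, 99, 3]
quantities = [10, 1, 7, 2, 4, 9]

cheap = to_bitmap(prices, lambda v: v < 20)      # rows 0, 2, 3, 5
large = to_bitmap(quantities, lambda v: v >= 7)  # rows 0, 2, 5

# A conjunction of two selections is a single bitwise AND on bitmaps.
both = cheap & large
print(bitmap_to_positions(both))  # [0, 2, 5]
```

A bitmap needs one bit per row regardless of selectivity, while a position list grows with the number of qualifying rows, so which one is more compact depends on how selective the predicate is.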

Paving the way from game theory to cooperative DB components (Gabriel Campero Durand)
Research in economics and game theory is rich in models that seek to understand how agents
compete for resources and how, through market design, they can be encouraged to collaborate,
converging to allocations that are optimal for the group. In data management research there have
been some attempts to adopt these models, for example in creating marketplaces for data
fragmentation in the Mariposa distributed database system. However, such models have not been
widely adopted. With the development of agent-based machine learning solutions for data
management, it is possible that these techniques will gain relevance. In this topic we aim to
start with a quick review of economic and game theory concepts, followed by a careful collection
and discussion of related work. We conclude by proposing, based on these discussions, potential
applications in storage engine management, query processing, and cloud computing.

• Pastine, Ivan, and Tuvana Pastine. Introducing Game Theory: A Graphic Guide. Icon Books Ltd, 2017.
• Marcus, Ryan, Olga Papaemmanouil, Sofiya Semenova, and Solomon Garber. "NashDB: An End-to-End
  Economic Method for Elastic Database Fragmentation, Replication, and Provisioning." In Proceedings of
  the 2018 International Conference on Management of Data, pp. 1253-1267. ACM, 2018.
• Pentaris, Fragkiskos, and Yannis Ioannidis. "Autonomic query allocation based on microeconomics
  principles." In 2007 IEEE 23rd International Conference on Data Engineering, pp. 266-275. IEEE, 2007.

Multi-agent deep reinforcement learning and databases (Gabriel Campero Durand)
The success of single-agent deep reinforcement learning naturally creates
interest in evolving to multi-agent solutions, like DeepMind's AlphaStar. These
are especially interesting since they address realistic use cases, where agents
are not in full control of a system. In this conference topic we will
categorize the state of the art in the field, highlighting challenges and
potentials of some approaches. In addition, we take a deep dive into one
mature approach. We conclude by considering the feasibility of applying
such an approach to a database task.

• Database Query Optimization with Deep Reinforcement Learning:
• Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Is multiagent deep reinforcement
  learning the answer or the question? A brief survey." arXiv preprint arXiv:1810.05587 (2018).

Learning from demonstrations with deep reinforcement learning (Gabriel Campero Durand)
Though reinforcement learning is a useful online method, it is often infeasible to train
agents by interacting with a real-world system. Moreover, simulated environments are
costly to produce. Thus, training agents in an offline manner, by using traces from an
expert interacting with the system, is particularly compelling for practitioners. In this
student conference topic we study such an approach in detail. We consider how it has been
used (or proposed to be used) in recent data management cases, and we list existing
frameworks for off-the-shelf learning from demonstrations.

• Hester, Todd, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan et al. "Deep
  Q-learning from demonstrations." In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
• Schaarschmidt, Michael, Alexander Kuhnle, Ben Ellis, Kai Fricke, Felix Gessert, and Eiko Yoneki. "LIFT: Reinforcement
  Learning in Computer Systems by Learning From Demonstrations." arXiv preprint arXiv:1808.07903 (2018).
• Marcus, Ryan, and Olga Papaemmanouil. "Towards a Hands-Free Query Optimizer through Deep Learning." arXiv
  preprint arXiv:1809.10212 (2018).
• Gauci, Jason, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, and
  Xiaohui Ye. "Horizon: Facebook's Open Source Applied Reinforcement Learning Platform." arXiv preprint
  arXiv:1811.00260 (2018).

Machine learning on networks and Graph-based recommenders (Gabriel Campero Durand)
Graph databases are a special kind of general data management system optimized for network-
oriented analytical queries and storage. They are mainly developed to support a specific
representation of a graph, namely property graphs. However, recent trends require further features
from these databases, either to support novel data representations (embeddings) or highly efficient
feature engineering processes. In this seminar topic we aim to study some of these trends, by
considering one of two applications: machine learning on networks, or graph-based recommenders.
For the chosen application, we carefully describe the domain, take a detailed look at a given
example study, and outline the implications for system development.

• Cao, Yixin, Xiang Wang, Xiangnan He, and Tat-Seng Chua. "Unifying Knowledge Graph Learning and Recommendation: Towards a
  Better Understanding of User Preferences." arXiv preprint arXiv:1902.06236 (2019).
• Hodler, Amy E., and Needham, Mark. "Graph Algorithms". O'Reilly Media, Inc. May 2019. ISBN: 9781492047681
• Mutlu, Ece C., and Toktam A. Oghaz. "Review on Graph Feature Learning and Feature Extraction Techniques for Link Prediction."
  arXiv preprint arXiv:1901.03425 (2019).
• Eksombatchai, Chantat, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec.
  "Pixie: A system for recommending 3+ billion items to 200+ million users in real-time." In Proceedings of the 2018 World Wide
  Web Conference on World Wide Web, pp. 1775-1784. International World Wide Web Conferences Steering Committee, 2018.

Interests in Systematic Software Reuse (Jacob Krüger)
Systematic software reuse in terms of software product lines is often only
introduced after a larger set of different variants has evolved. For varying reasons,
including cost reduction, faster development, or improved management, these
variants are merged and integrated into a platform (reverse engineering). While
there are several case studies that report on the migration processes and
experiences, we still need a detailed analysis of the actual industrial motivations
that lead to the adoption of product lines. To this end, we aim to analyze various
topics on the adoption of software product lines at different venues and in
different years. Topics may include the motivation and costs for extracting features,
the evolution of software, or synchronizing independent variants at SPLC, ICSE, or

• A selection of topics, venues, and years can be defined to scope the extent of the analysis
• Rabiser, R., Schmid, K., Becker, M., Botterweck, G., Galster, M., Groher, I., Weyns, D. (2018). A study and
  comparison of industrial vs. academic software product line research published at SPLC. International
  Conference on Systems and Software Product Line. 14-24. ACM.

Automated Test Refactoring (Jacob Krüger)
Software is regularly updated or refactored, for example, to remove errors,
introduce new features, or migrate towards a new technology. However, any
change in the production software also means that corresponding test cases
may break or may no longer be sufficient. The purpose of this survey is to
identify and summarize existing techniques on automated test case
refactoring, meaning techniques that track code changes and support
developers in maintaining the test cases for these artifacts.

• Peng-Hua Chu, Nien-Lin Hsueh, Hong-Hsiang Chen, and Chien-Hung Liu. 2012. A Test Case
  Refactoring Approach for Pattern-Based Software Development. Software Quality Journal
• Arie van Deursen, Leon Moonen, Alex van den Bergh, and Gerard Kok. 2002. Extreme
  Programming Perspectives. Chapter Refactoring Test Code

How do We Forget? (Jacob Krüger)
Understanding a program is an essential activity in software engineering and the
research area of program comprehension is extensively investigated. However,
most studies are concerned with recovering understanding of a program and how
to improve code design for this purpose. Such processes resemble learning of
artifacts. In contrast, the process of forgetting in software engineering is rarely
investigated. With this project, we aim to provide an overview of existing studies
that are concerned with forgetting in software engineering and of the factors that
affect developers' memory.

• Krüger, J., Wiemann, J., Fenske, W., Saake, G., Leich, T. (2018). Do you remember this source code?.
  International Conference on Software Engineering. 764-775. IEEE.
• Fritz, T., Murphy, G., Hill, E. 2007. Does a Programmer's Activity Indicate Knowledge of Code? Joint Meeting
  of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations
  of Software Engineering. ACM, 341–350.
• Kang, K., Hahn, J. (2009). Learning and Forgetting Curves in Software Development: Does Type of Knowledge
  Matter? International Conference on Information Systems.

Cloud-based Protein Identification (Roman Zoun)
Mass spectrometers are devices that digitize real-world samples, with growing success on the
market. The technology sequences proteins to identify protein biomarkers of biological
environments, such as oceans, humans, or microbial communities, which are used in the
research fields of proteomics, metaproteomics, and metabolomics. These biomarkers are
similar to a fingerprint and can be used to identify the sample data. Due to rapid quality
improvements, mass spectrometers produce ever-increasing amounts of data,
resulting in terabytes of output data from a single machine. The analysis step, the so-called
protein identification, is used to gain insights into the sample data. Protein
identification is now a big data problem.
Task: Find protein identification solutions which use big data technology and map them to
the big data landscape.

• R. Millioni, C. Franchin, P. Tessari, R. Polati, D. Cecconi, and G. Arrigoni. Pros and cons of peptide isoelectric focusing in
  shotgun proteomics. Journal of Chromatography A, 1293:19, June 2013.
• R. D. Bjornson, N. J. Carriero, C. Colangelo, M. Shifman, K.-H. Cheung, P. L. Miller, and K. Williams. X!!tandem, an improved
  method for running x!tandem in parallel on collections of commodity computers. Journal of Proteome Research,
  7(1):293–299, 2008. PMID: 17902638