Big Data Lab Course (Big Data Praktikum), Summer Semester 2018. Universität Leipzig, Institut für Informatik, Abteilung Datenbanken (Database Group), Prof. Dr. E. Rahm
Organization
Goal: design and implement an application or an algorithm using existing Big Data frameworks.
Schedule (the whole group must attend every Testat, i.e. graded checkpoint):
• By early May: first meeting with the supervisor (request an appointment by e-mail)
• End of May: Testat 1: getting to know the system / data import / solution sketch
• Mid to late July: Testat 2: present the implementation and results
• Early August: Testat 3: presentation, 15 minutes per group; attendance is mandatory for all lab participants
Technical Details
• Source code: GitHub repository
  • Group members are added as collaborators
  • Repositories are forked to https://github.com/leipzig-bigdata-lab after the lab course
• Java: Apache Maven 3 for project management
• Test-driven development is encouraged; see the documentation on unit tests for the respective framework
• Source code documentation is mandatory!
• Use stable versions (consult your supervisor if in doubt), e.g. Flink 1.4.2
• Solutions that run locally can be executed on a dedicated cluster; arrange an appointment in early July with franke@informatik.uni-leipzig.de
• Datasets: e.g. https://github.com/caesar0301/awesome-public-datasets
PPRL with Bloom Filters
Projects:
1.) Analyzing different BitSet implementations for Bloom-filter-based PPRL
2.) Analyzing different lengths of Bloom filters
3.) Analyzing XOR-folding for Bloom filters
Privacy-Preserving Record Linkage (PPRL)
• Find records in different databases that refer to the same real-world object
• No disclosure of sensitive personal information
BitSet Implementations for Bloom Filters
Problems:
• Different BitSet implementations can serve as the basis for a Bloom filter:
  • java.util.BitSet
  • OpenBitSet
  • boolean[]
• Benchmarks are missing or outdated
Task:
• Develop three Bloom filter implementations
• Performance benchmark (runtime, memory)
• Verify claims such as: “OpenBitSet is faster than java.util.BitSet in most operations and *much* faster at calculating cardinality of sets and results of set operations.”
Technologies:
• Java
• Apache Flink
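To make the task concrete, here is a minimal sketch of a Bloom filter backed by java.util.BitSet. It is not the lab's reference implementation: the class name BitSetBloomFilter, the "_" padding of values, the double-hashing scheme built from MD5 and SHA-1, and the Dice similarity method are all assumptions chosen for illustration. The OpenBitSet and boolean[] variants would expose the same operations on top of a different bit container.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Hypothetical illustration class, not part of the course material.
class BitSetBloomFilter {
    private final BitSet bits;
    private final int length;     // number of bit positions, e.g. 1000
    private final int numHashes;  // hash functions applied per token

    BitSetBloomFilter(int length, int numHashes) {
        this.bits = new BitSet(length);
        this.length = length;
        this.numHashes = numHashes;
    }

    // Splits a field value into padded bigrams and adds each one (padding is an assumption).
    void addBigrams(String value) {
        String padded = "_" + value.toLowerCase() + "_";
        for (int i = 0; i + 2 <= padded.length(); i++) {
            add(padded.substring(i, i + 2));
        }
    }

    // Double hashing: position_i = (h1 + i * h2) mod length.
    void add(String token) {
        int h1 = hash("MD5", token);
        int h2 = hash("SHA-1", token);
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, length));
        }
    }

    int cardinality() {
        return bits.cardinality();
    }

    // Dice coefficient: 2 * |A AND B| / (|A| + |B|).
    double dice(BitSetBloomFilter other) {
        BitSet common = (BitSet) bits.clone();
        common.and(other.bits);
        int total = bits.cardinality() + other.bits.cardinality();
        return total == 0 ? 0.0 : 2.0 * common.cardinality() / total;
    }

    // Uses the first four digest bytes as a 32-bit integer.
    private static int hash(String algorithm, String token) {
        try {
            byte[] d = MessageDigest.getInstance(algorithm)
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A benchmark would time cardinality() and dice() over many such filters for each of the three bit containers and compare memory footprints.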
Lengths of Bloom Filters
Problems:
• PPRL applications use a fixed Bloom filter length for encoded records (usually 1000 bits)
• Shorter Bloom filters are expected to give better performance
• But how does the length of the Bloom filter affect the quality of the results?
Task:
• Encode the given data sets with different parameters:
  • Length of the Bloom filters
  • Number of hash functions
• Evaluate the quality (recall, precision) of PPRL processing as a function of these parameters
• Explore the practical limits
Technologies:
• Java
• Apache Flink
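The quality evaluation named in the task can be sketched as follows. This is an illustrative helper only: the class name LinkageQuality and the encoding of a record pair as a string such as "a1|b1" are assumptions. Precision is the share of reported matches that are true matches; recall is the share of true matches that were reported.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for comparing predicted matches against a gold standard.
class LinkageQuality {

    // Number of predicted pairs that are also in the gold standard.
    static int truePositives(Set<String> predicted, Set<String> truth) {
        Set<String> tp = new HashSet<>(predicted);
        tp.retainAll(truth);
        return tp.size();
    }

    // precision = TP / (TP + FP); defined as 0 when nothing was predicted.
    static double precision(Set<String> predicted, Set<String> truth) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(predicted, truth) / predicted.size();
    }

    // recall = TP / (TP + FN); defined as 0 when there are no true matches.
    static double recall(Set<String> predicted, Set<String> truth) {
        if (truth.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(predicted, truth) / truth.size();
    }
}
```

Running both functions once per parameter combination (filter length, number of hash functions) yields the quality curves the task asks for.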
Analyzing XOR-Folding for Bloom Filters Ziad Sehili
XOR-Folding for Bloom Filters
Problems:
• The goal of PPRL is to hide personal data during matching by encoding the fields in a Bloom filter
• BUT some cryptanalysis methods can disclose the original data
• The main weakness of Bloom filters is the frequency of certain tokens (“er” is a frequent bigram, so many Bloom filters will have the same positions set to 1)
• Is it possible to hide or obfuscate these frequencies by XOR-folding the Bloom filter?
• What impact does the folding operation have on linkage quality?
Task:
• Implement several folding operations
• Evaluate quality (recall, precision)
Technologies:
• Java
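One possible folding operation, shown here only as a sketch under assumptions: the filter has an even length m and is folded once into m/2 bits by XOR-ing the first half with the second half, so that folded[i] = bits[i] XOR bits[i + m/2]. The class name XorFolding is hypothetical, and other folding variants (for example repeated folding) are equally valid subjects for the project.

```java
import java.util.BitSet;

// Hypothetical illustration of a single XOR-fold of a Bloom filter.
class XorFolding {

    // Folds a filter of even length m into m/2 bits: folded[i] = bits[i] XOR bits[i + m/2].
    static BitSet fold(BitSet bits, int length) {
        if (length % 2 != 0) {
            throw new IllegalArgumentException("length must be even");
        }
        int half = length / 2;
        BitSet folded = bits.get(0, half);   // copy of the first half, indices 0..half-1
        folded.xor(bits.get(half, length));  // second half, shifted down to index 0
        return folded;
    }
}
```

Because two set bits cancel each other out, a folded filter no longer maps a frequent token to a fixed set of 1-positions, which is the obfuscation effect the project is meant to evaluate against the loss in linkage quality.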
OSTMap Open Source Tweet Map Matthias Kricke & Martin Grimmer
OSTMap - Open Source Tweet Map
• https://github.com/IIDP/OSTMap
• OSTMap development started as a project in the IT-Ringvorlesung 2016.
• A team of six students (with some help from two big data experts) implemented OSTMap over a period of 6 weeks.
• OSTMap reads geotagged data from the Twitter stream.
• Tweets are stored in a Hadoop cluster running Apache Accumulo and Apache Flink.
Efficient Term Index for Twitter Data and Trend Visualization
• Part 1:
  • Currently the term search supports lookups for exactly one term, e.g. "bigdata"
  • We want to support fast multi-term queries such as: "the" "white" "house"
  • Keyword: document-partitioned indexing
• Part 2:
  • We want to visualize current trends
  • together with their geographic distribution and
  • their temporal spread.
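The idea behind document-partitioned indexing can be sketched in plain Java. This is an in-memory illustration only, not OSTMap's Accumulo schema: the class name, the partitioning by tweet id, and the whitespace tokenization are assumptions. Each partition keeps an inverted index over its own tweets, a multi-term query intersects the posting lists locally in every partition, and the local results are then merged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory sketch of a document-partitioned term index.
class DocPartitionedIndex {
    // One inverted index (term -> tweet ids) per partition.
    private final List<Map<String, Set<Long>>> partitions = new ArrayList<>();

    DocPartitionedIndex(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // A tweet and all of its terms live in exactly one partition.
    void add(long tweetId, String text) {
        int p = (int) Math.floorMod(tweetId, (long) partitions.size());
        Map<String, Set<Long>> partition = partitions.get(p);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                partition.computeIfAbsent(term, t -> new TreeSet<>()).add(tweetId);
            }
        }
    }

    // AND query: intersect posting lists inside each partition, then merge the local results.
    Set<Long> search(String... terms) {
        Set<Long> result = new TreeSet<>();
        if (terms.length == 0) {
            return result;
        }
        for (Map<String, Set<Long>> partition : partitions) {
            Set<Long> local = null;
            for (String term : terms) {
                Set<Long> postings =
                        partition.getOrDefault(term.toLowerCase(), Collections.emptySet());
                if (local == null) {
                    local = new TreeSet<>(postings);
                } else {
                    local.retainAll(postings);
                }
            }
            result.addAll(local);
        }
        return result;
    }
}
```

Since every term of a tweet sits in the same partition, the intersection never needs data from another partition, which is what makes the multi-term lookup fast in a distributed store.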
Sentiment Analysis for Twitter Data
• Part 1:
  • Use Java-based libraries for in-stream sentiment analysis of Twitter data
  • Batch-based sentiment analysis, e.g. with Spark MLlib's Naive Bayes classifier
  • Write the results of each approach to a table for sentiment analysis results
• Part 2:
  • Build a frontend in OSTMap in which users label the sentiment of randomly drawn tweets
  • Use this information to assess the quality of the sentiment analysis procedures and visualize the results in OSTMap
Polyglot DB Johannes Zschache
Polyglot DB
• Different applications require different types of databases: relational, key-value, document, graph, …
• In practice: several types are used side by side
• Advantage: the optimal DB for each use case
• Example:
  • Relational: safety, homogeneous data
  • Key-value: fast access, simple data structure
  • Document: flexible schema, search functions
  • Graph: relationships, traversal
• Task: What advantage does a graph database offer over a document database?
Application
• Yelp Dataset
• Document DB: MongoDB
  • Business information
  • Storage of reviews
  • Search by category
  • Geospatial queries
• Recommendations: similar restaurants, e.g. which restaurants did the same reviewer rate equally well or badly?
• Is a graph database (Neo4j) faster than MongoDB?
• Even with the synchronization overhead?
Bolt-on causal consistency Johannes Zschache
Bolt-on causal consistency
• Causal consistency is a compromise between sequential consistency and eventual consistency
• The order of operations is preserved, but only for causally related operations (happened-before relation)
• Less coordination increases availability
• It is the strongest consistency model that still permits availability (in particular for writes) under network partitions
• Only a few NoSQL databases support causal consistency
• Bolt-on = client-side implementation
• Paper: Bailis et al. (2013), http://www.bailis.org/papers/bolton-sigmod2013.pdf
• Prototype (GitHub): Java, Cassandra
Task
• Implementation with JavaScript, PouchDB and CouchDB
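The happened-before relation mentioned above can be illustrated with vector clocks. The sketch below is a generic textbook construction written in Java for consistency with the other examples; it is not taken from the Bailis et al. paper or its prototype, and the class and node names are hypothetical. A write w1 happened before w2 if every counter of w1 is less than or equal to the matching counter of w2 and at least one is strictly smaller; writes that are ordered in neither direction are concurrent.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical immutable vector clock, used only to illustrate happened-before.
final class VectorClock {
    private final Map<String, Integer> counters;

    VectorClock() {
        this(new HashMap<>());
    }

    private VectorClock(Map<String, Integer> counters) {
        this.counters = counters;
    }

    // Returns a new clock with the counter of the given node incremented by one.
    VectorClock tick(String node) {
        Map<String, Integer> next = new HashMap<>(counters);
        next.merge(node, 1, Integer::sum);
        return new VectorClock(next);
    }

    // Element-wise maximum, e.g. after reading a value written elsewhere.
    VectorClock merge(VectorClock other) {
        Map<String, Integer> merged = new HashMap<>(counters);
        other.counters.forEach((node, count) -> merged.merge(node, count, Math::max));
        return new VectorClock(merged);
    }

    // True if this clock is <= other in every component and < in at least one.
    boolean happenedBefore(VectorClock other) {
        boolean strictlySmaller = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            int mine = counters.getOrDefault(node, 0);
            int theirs = other.counters.getOrDefault(node, 0);
            if (mine > theirs) {
                return false;
            }
            if (mine < theirs) {
                strictlySmaller = true;
            }
        }
        return strictlySmaller;
    }

    // Neither clock precedes the other and they are not identical.
    boolean concurrentWith(VectorClock other) {
        return !happenedBefore(other)
                && !other.happenedBefore(this)
                && !counters.equals(other.counters);
    }
}
```

A client-side layer could use such a check to hold back a write until everything that happened before it is visible locally, while concurrent writes may be applied in any order.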
Creation and visualization of temporal graphs Christopher Rost
Creation and visualization of temporal graphs
• "Graphs are everywhere": friendship networks on Facebook, community interactions on Stack Overflow, video likes and channel subscriptions on YouTube, citation networks
• Real-world graphs change over time: additions, deletions and updates of edges, vertices and their properties
• Much work has been done on analysing and visualizing static graphs
• Open questions for temporal graphs:
  • "How do communities evolve over a specific time range?"
  • "At what time does the number of citations grow rapidly? Did other citations influence that?"
[1] Aynaud, Thomas & Fleury, Eric & Guillaume, Jean-Loup & Wang, Qinna. (2013). Communities in Evolving Networks: Definitions, Detection, and Analysis Techniques. Modeling and Simulation in Science, Engineering and Technology. 2. 159-200. 10.1007/978-1-4614-6729-8_9.
Creation and visualization of temporal graphs
Tasks
• Create a temporal EPGM from a network dataset
• Query graph data by time range
• Visualize the graph in an interactive web application
Size of graph
• Stack Overflow temporal network dataset [3]: 2,601,977 nodes, 63,497,050 edges
[Figure: timeline from 1990 to now]
[2] A. Beveridge and J. Shan, "Network of Thrones", Math Horizons Magazine, Vol. 23, No. 4 (2016), pp. 18-22.
[3] Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. "Motifs in Temporal Networks." In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017.
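Querying graph data by time range can be sketched as follows. This is plain Java rather than the Gradoop API, and the class, field and method names are hypothetical. It assumes that every edge carries a half-open validity interval [validFrom, validTo) and that a query returns all edges whose interval overlaps the requested range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory temporal edge store; timestamps are plain long values.
class TemporalGraph {

    static final class Edge {
        final long source;
        final long target;
        final long validFrom;  // inclusive
        final long validTo;    // exclusive; Long.MAX_VALUE means "still valid"

        Edge(long source, long target, long validFrom, long validTo) {
            this.source = source;
            this.target = target;
            this.validFrom = validFrom;
            this.validTo = validTo;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void addEdge(long source, long target, long validFrom, long validTo) {
        edges.add(new Edge(source, target, validFrom, validTo));
    }

    // Returns all edges whose validity interval overlaps the query range [from, to).
    List<Edge> edgesBetween(long from, long to) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.validFrom < to && e.validTo > from) {
                result.add(e);
            }
        }
        return result;
    }
}
```

With tens of millions of edges, as in the Stack Overflow dataset, the linear scan would be replaced by a distributed filter or an interval index; the overlap condition stays the same.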
FastText on Spark Victor Christen
FastText on Spark
Word2Vec
• Words are represented by vectors
• Trained on a large corpus, taking the context of words into account
• Skip-gram model
FastText on Spark
Issues
• Unknown words in the test corpus
• Missing fuzzy component
Solution
• FastText
  • Uses n-gram sequences to represent words
  • Can generate embeddings even for words that are not in the vocabulary
Task
• Understanding FastText
  • Representation of words
  • Neural network
• Implementation with DeepLearning4j
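The n-gram representation can be sketched in a few lines. The helper below is an illustration, not FastText's own code, and its class and method names are hypothetical. It follows the scheme described for FastText, in which a word is wrapped in the boundary markers "<" and ">" and then split into all character n-grams of the chosen lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that extracts FastText-style character n-grams.
class CharNGrams {

    // Returns all character n-grams of "<word>" with lengths minN..maxN.
    static List<String> ngrams(String word, int minN, int maxN) {
        String wrapped = "<" + word + ">";
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= wrapped.length(); i++) {
                result.add(wrapped.substring(i, i + n));
            }
        }
        return result;
    }
}
```

For the word "where" and n = 3 this yields <wh, whe, her, ere, re>. An unseen word can still be embedded by combining the vectors of its n-grams, which is what supplies the fuzzy component missing from plain Word2Vec.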
Distributed FastText on TensorFlow
Issues
• Unknown words in the test corpus
• Missing fuzzy component
Solution
• FastText
  • Uses n-gram sequences to represent words
  • Used to generate embeddings
Task
• Distributed implementation of FastText on TensorFlow
Color Recognition for Products (Farberkennung von Produkten) Eric Peukert
Color Recognition for Products
• Images can be very helpful for identifying duplicates in product catalogs
• Goal:
  • Extract the color information from product images
  • Segment and annotate foreground and background, possibly further categories
• Technology:
  • Convolutional neural networks (usable e.g. via TensorFlow)
• Data:
  • 90,000 product images from WebDataSolutions GmbH
Analytics of BitCoin Transaction Data Eric Peukert
Analytics of BitCoin Transaction Data
• Parse the Bitcoin blockchain
• Process updates caused by new transactions
• Build a graph in Gradoop
• Analyze it using Gradoop analytical workflows
• At most 2 students
  • with good Java programming skills
  • Flink experience or the lecture Cloud Data Management is a prerequisite
Webgraph Analysis with GRADOOP Moritz Wilke
Webgraph Analysis with GRADOOP
• commoncrawl.org: three-monthly snapshots of the web graph at host level
• Questions:
  • How is the {University of Leipzig, Bach Digital project} interlinked with other institutions and research projects?
  • How did this change over time?
  • Are there interesting structures or missing links (e.g. triangle closing)?
• Tasks:
  • Data import into GRADOOP, preprocessing
  • Data exploration
  • Development of analytical questions
  • Data analysis with GRADOOP operators
  • Visualization / reporting
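The triangle-closing question can be illustrated with a small plain-Java sketch. It does not use GRADOOP operators, and the class and method names are hypothetical. It also treats the host graph as undirected, which is a simplification, since web links are directed. For every host, each pair of its neighbours that is not itself linked forms an open triangle, and the missing edge is reported as a candidate link.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: candidate missing links found via open triangles.
class TriangleClosing {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Adds an undirected edge between two hosts.
    void addEdge(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // Returns pairs "x -- y" that share a neighbour but are not linked directly.
    Set<String> missingLinks() {
        Set<String> candidates = new TreeSet<>();
        for (Set<String> neighbourSet : adjacency.values()) {
            List<String> neighbours = new ArrayList<>(neighbourSet);
            for (int i = 0; i < neighbours.size(); i++) {
                for (int j = i + 1; j < neighbours.size(); j++) {
                    String x = neighbours.get(i);
                    String y = neighbours.get(j);
                    boolean linked = adjacency
                            .getOrDefault(x, Collections.emptySet())
                            .contains(y);
                    if (!linked) {
                        // Normalize the pair so each candidate appears only once.
                        candidates.add(x.compareTo(y) < 0 ? x + " -- " + y : y + " -- " + x);
                    }
                }
            }
        }
        return candidates;
    }
}
```

On a real host-level snapshot the pairwise neighbour loop is too expensive for high-degree hosts, so a distributed join over edge pairs would be the more realistic route.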
Topic | Frameworks | #Students | Supervisor
PPRL: Analyzing different BitSet Implementations for Bloom-Filter-based PPRL | Java / Apache Flink | 2 | Franke
PPRL: Analyzing different lengths of Bloom Filters | Java / Apache Flink | 2 | Gladbach
PPRL: Analyzing XOR-Folding for Bloom Filters | Java / Apache Flink | 2 | Sehili
OSTMap: Efficient Termindex for Twitter Data and Trend Visualization | Java / Apache Flink / Apache Accumulo / JavaScript | 2 | Grimmer
OSTMap: Sentiment Analysis for Twitter Data | Java / Apache Flink / Apache Accumulo / JavaScript | 2 | Kricke
Creation and visualization of temporal graphs | Java / Apache Flink / Gradoop / JavaScript | 2 | Rost
Polyglot DB | Java, MongoDB, Neo4j | 2 | Zschache
Bolt-on causal consistency | JavaScript, CouchDB, PouchDB | 2 | Zschache
FastText on Spark | Spark, DeepLearning4j | 2 | Christen
Distributed FastText on TensorFlow | TensorFlow | 2 | Alkhouri
Color Recognition for Products (Farberkennung von Produkten) | TensorFlow | 2 | Peukert
Analysis of the BitCoin-Blockchain | Java / Apache Flink / Gradoop | 2 | Peukert
Webgraph Analysis | Java / Apache Flink / | 2 | Wilke