The Alan Turing Institute Internship Programme 2018

Contents

Project 1 – An interdisciplinary approach to programming by example
Project 2 – Algorithms for automatic detection of new word meanings from social media to understand language and social dynamics
Project 3 – High performance, large-scale regression
Project 4 – Design, analysis and applications of efficient algorithms for graph based modelling
Project 5 – Privacy-aware neural network classification & training
Project 6 – Clustering signed networks and time series data
Project 7 – Uncovering hidden cooperation in democratic institutions
Project 8 – Deep learning for object tracking over occlusion
Project 9 – Listening to the crowd: Data science to understand the British Museum visitors
Project 1 – An interdisciplinary approach to programming by example

Project Goal
To compare approaches to the versatile idea of 'programming by example', which has relevance in many different fields and contexts, and to design new interdisciplinary techniques.

Project Supervisors
Adria Gascon (Research Fellow, The Alan Turing Institute, University of Edinburgh)
Nathanaël Fijalkow (Research Fellow, The Alan Turing Institute, University of Warwick)
Brooks Paige (Research Fellow, The Alan Turing Institute, University of Cambridge)

Project Description
Programming by example is a very natural and simple approach to programming: instead of writing a program, give the computer a desired set of inputs and outputs, and hope that the program will write itself from these examples. In general, nothing prevents the computer from relying on training data, initiating an interactive dialogue with the user to resolve uncertainties, or even relying on the Internet, e.g. StackOverflow, to produce a solution that realises the user's intent.

A typical application is an Excel sheet: you write 2, 4, 6 and click "continue", hoping that the computer will output 8, 10, 12, and so on. Another application is robotics, where programming by example is often called programming by demonstration. The goal there is to teach robots complicated behaviours, not by hardcoding them, which would be too costly and complicated, but by showing a few examples and asking the robot to imitate them.

Automated program synthesis, namely having programs write correct programs, is a problem with a rich history in computer science that dates back to the origins of the field itself. In particular, the simple paradigm of "programming by example" has been independently developed within several subfields (at least formal verification, programming languages, and learning) under different names and with different approaches.
This project is about understanding the tradeoffs between these techniques, comparing them, and possibly devising one to beat them all.

Programming by example can be seen as a concrete framework for program synthesis. In synthesis, the specification for the program is given at a high level, for instance as a logical formula. The special case where only inputs and outputs are given is nonetheless pertinent in synthesis (see, for example, https://dspace.mit.edu/openaccess-disseminate/1721.1/90876). Adria Gascon has extensive experience in synthesis, in particular using SMT solvers. This will be the first approach to look at.

Programming by example can also be attempted with neural networks and probabilistic inference. There is recent work in this direction which attempts to solve the program induction problem directly (see for instance https://arxiv.org/abs/1703.04990), as well as work which adopts deep learning as a way to provide assistance to SMT solvers (e.g. https://arxiv.org/abs/1611.01989). Brooks Paige is familiar with such approaches. This will be the second approach to look at.

Programming by example can also be seen as an automaton learning task. In this scenario, the goal is to learn a weighted automaton, which is a simple recursive finite-state machine outputting real numbers. There are powerful techniques for learning weighted automata, for instance spectral techniques. Nathanaël Fijalkow has worked on these questions. This will be the third approach to look at.

Besides studying their formal guarantees, we plan to empirically evaluate our algorithms, and hence the project will involve a significant amount of coding.
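To make the SMT-based view concrete, the following is a minimal sketch of programming by example as constraint solving. It assumes the z3-solver Python bindings and restricts the hypothesis space to a toy linear template f(x) = a*x + b; the project itself prescribes neither this library nor this template.

```python
from z3 import Ints, Solver, sat

# input-output examples, in the spirit of the 2, 4, 6 -> 8, 10, 12 spreadsheet case
examples = [(1, 2), (2, 4), (3, 6)]

a, b = Ints("a b")  # unknown coefficients of the template f(x) = a*x + b
solver = Solver()
for x, y in examples:
    solver.add(a * x + b == y)  # each example becomes a constraint

if solver.check() == sat:
    model = solver.model()
    print(f"synthesised: f(x) = {model[a]}*x + {model[b]}")  # here: f(x) = 2*x + 0
else:
    print("no program in this template fits the examples")
```

Richer template languages, counterexample-guided loops, and combinations with the learning-based approaches above are where the project's real questions begin.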
Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Interest in theoretical computer science in general
• Interest in various computational models: automata, neural networks
• Interest in programming languages
• Interest in interdisciplinarity (within mathematics and computer science), as the different techniques to be understood and compared are rather diverse
• Coding skills

Desired Skills and Knowledge
• Previous experience in SMT solving
• Previous experience in neural networks
• Previous experience in automata learning

Return to Contents
Project 2 – Algorithms for automatic detection of new word meanings from social media to understand language and social dynamics

Project Goal
To develop computational methods for identifying the emergence of new word meanings using social media data, to advance understanding of cultural and linguistic interaction online, and to improve natural language processing tools.

Project Supervisors
Barbara McGillivray (Research Fellow, The Alan Turing Institute, University of Cambridge)
Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)
Scott Hale (Turing Fellow, The Alan Turing Institute, University of Oxford)

Project Description
This project focuses on developing a system for identifying new word meanings as they emerge in language, focussing on words entering English from other languages and on changes in their polarity (e.g. from neutral to negative or offensive). An example is the word kaffir, which, starting from a neutral meaning, has acquired an offensive use as a racial or religious insult. The proposed research furthers the state of the art in Natural Language Processing (NLP) by developing better tools for processing language data semantically, and has impact on important social science questions.

Language evolves constantly through social interactions. New words appear, others become obsolete, and others acquire new meanings. Social scientists and linguists are interested in investigating the mechanisms driving these changes. For instance, analysing the meaning of loanwords from foreign languages using social media data helps us understand the precise sense of what is communicated, how people interact online, and the extent to which social media facilitate cross-cultural exchanges. In the case of offensive language, understanding the mechanisms by which it is propagated can inform the design of collaborative online platforms and provide recommendations to limit offensive language where this is desired.
Detecting new meanings of words is also crucial to improving the accuracy of NLP tools for downstream tasks, for example in the estimation of the "polarity" of words in sentiment analysis (e.g. sick has recently acquired a positive meaning of 'excellent' alongside the original meaning of 'ill'). Work to date has mostly focused on changes over longer time periods (cf., e.g., Hamilton et al. 2016). For instance, awful in texts from the 1850s was a synonym of 'solemn' and nowadays stands for 'terrible'. New data on language use and new data science methods allow for studying this change at finer timescales and higher resolutions. In addition to social media, online collaborative dictionaries like Urban Dictionary are excellent sources for studying language change as it happens; they are constantly updated, and the threshold for including new material is lower than for traditional dictionaries.

The meaning of words in state-of-the-art NLP algorithms is often expressed by vectors in a low-dimensional space, where geometric closeness stands for semantic similarity. These vectors are usually fed into neural architectures built for specific tasks.

The proposed project aims at capturing meaning change on a fine-grained, short time scale. We will use the algorithm developed by Hamilton et al. (2016), who used it to identify new meanings using Google Books. We will train in-house vectors on multilingual Twitter data collected from 2011 to 2017. Through this process we will identify meaning change candidates and evaluate them against the dictionary data, focussing on analysing the factors that drive foreign words to enter the English language and to change their polarity. In doing so, we will shed light on the extent to which the detected meaning changes are driven by linguistically internal rather than external (e.g. social, technological) factors.
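As a rough illustration of the embedding-based methodology (not the exact pipeline of Hamilton et al. 2016), the sketch below aligns two toy embedding spaces with an orthogonal Procrustes map and ranks words by how far their vectors move. It assumes numpy and scipy, and uses random matrices in place of vectors actually trained on the Twitter time slices.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# stand-in embeddings for a shared vocabulary (rows = words); in practice these
# would be trained separately on, e.g., 2011 and 2017 slices of the Twitter data
vocab = ["sick", "awful", "kaffir", "museum"]
rng = np.random.default_rng(0)
emb_early = rng.normal(size=(len(vocab), 100))
emb_late = rng.normal(size=(len(vocab), 100))

# rotate the later space onto the earlier one so that the two are comparable
R, _ = orthogonal_procrustes(emb_late, emb_early)
emb_late_aligned = emb_late @ R

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# words whose aligned vectors moved furthest are candidates for meaning change
shift = {w: 1 - cosine(emb_late_aligned[i], emb_early[i]) for i, w in enumerate(vocab)}
for word, distance in sorted(shift.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} semantic shift = {distance:.3f}")
```

Candidates surfaced in this way would then be checked against Urban Dictionary and other dictionary data, as described in the tasks below.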
The original contributions of this research are:
• The development of an NLP system for detecting meaning change occurring over a relatively short time period, so as to further the state of the art in NLP.
• The design of an evaluation framework which compares automatically derived candidates for meaning change against dictionary data.
• The analysis of subsets of such candidates to answer social science questions about the dynamics of human behaviour online.

The specific tasks of this project are:
a) Implement existing algorithms for identifying words that acquire new meanings as they appear in the English language, using social media data from Twitter collected over a multi-year period (2011-2017).
b) Validate candidate words from task (a) against Urban Dictionary and other dictionaries.
c) Evaluate word meaning change in areas such as foreign loanwords and polarity change, and address research questions regarding cultural and linguistic exchanges online, as well as the creation and propagation of offensive language online.
d) Prepare an article to be submitted to a journal or conference.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• All interns will need to have advanced NLP skills, an interest in linguistics, and experience of working with large datasets and cloud computing. At least one of the interns should have some social data science experience.
• Experience developing R packages would be beneficial, although training can be provided.

Desired Skills and Knowledge
• Previous experience in SMT solving
• Previous experience in NNs
• Previous experience in automata learning

Return to Contents
Project 3 – High performance, large-scale regression

Project Goal
To investigate distributed, scalable approaches to the standard statistical task of high-dimensional regression with very large amounts of data, with the ultimate goal of informing current best practice in terms of algorithms, architectures and implementations.

Project Supervisors
Anthony Lee (Research Fellow, The Alan Turing Institute, University of Cambridge)
Rajen Shah (Turing Fellow, The Alan Turing Institute, University of Cambridge)
Yi Yu (University of Bristol)

Project Description
The ultimate goal is to critically understand how different, readily available, large-scale regression algorithms, software and frameworks perform on distributed systems, and to isolate both computational and statistical performance issues. A specific challenging dataset will also be included to add additional focus, and there is the opportunity to investigate more sophisticated, but less readily available, algorithms for comparison.

This project aligns with the Institute's strategic priorities in establishing leadership and providing guidance for common data analysis tasks at scale. It can feed into a larger data science at scale software programme around performance and usability, which it is hoped will be developed in 2018.

Phases:

First phase: benchmark and profile available approaches on the Cray Urika-GX, and potentially other architectures, for a scalable example class of models with carefully chosen characteristics. Different regimes can be explored where there are substantial effects on performance.

Second phase: use the benchmarks and profiling information to identify which, if any, recently proposed approaches to large-scale regression may improve performance, with the advice of Yi Yu and Rajen Shah.

Third phase: apply the skills and software developed to a large and challenging data set.
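For orientation, here is a minimal sketch of a distributed regression fit with Spark ML; the file name, column names and regularisation settings are placeholders, and the benchmarking work would of course go well beyond a single fit.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("large-scale-regression").getOrCreate()

# hypothetical dataset with numeric predictors x1..x3 and a response y
df = spark.read.csv("regression_data.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df).select("features", "y")

# elastic-net regularised least squares, fitted in a distributed fashion
lr = LinearRegression(labelCol="y", featuresCol="features",
                      regParam=0.1, elasticNetParam=0.5)
model = lr.fit(train)
print("coefficients:", model.coefficients, "intercept:", model.intercept)

spark.stop()
```

Profiling how such a fit scales across executors, architectures and competing frameworks is the substance of the first phase.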
Throughout the project, documentation will be written to enable other data scientists to perform large-scale regressions with greater ease, and to understand the implications of using different architectures, frameworks, algorithms, and implementations.

This project is supported by Cray Computing.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Familiarity with a cluster computing framework for data science / machine learning, e.g. Spark
• Basic statistical understanding of regression

Desirable Skills and Knowledge
• Some experience with high-performance computing

Return to Contents
Project 4 – Design, analysis and applications of efficient algorithms for graph based modelling

Project Goal
To develop fast and efficient numerical methods for optimization problems on graphs, making use of continuum (large data) limits in order to develop multi-scale methods, with real-world applications in medical imaging and time series data.

Project Supervisors
Matthew Thorpe (University of Cambridge)
Kostas Zygalakis (Turing Fellow, The Alan Turing Institute, University of Edinburgh)
Carola-Bibiane Schönlieb (Turing Fellow, The Alan Turing Institute, University of Cambridge)
Elizabeth Soilleux (University of Cambridge)
Mihai Cucuringu (Research Fellow, The Alan Turing Institute, University of Oxford)

Project Description
Many machine learning methods use a graphical representation of data in order to capture its geometry in the absence of a physical model. If we consider the problem of classifying a large data set, say 10^7 data points, then one common approach is spectral clustering. The idea behind spectral clustering is to project the data onto a small number of discriminating directions along which the data should naturally separate into classes. In practice one uses the eigenvectors of the graph Laplacian as directions and then applies off-the-shelf methods such as k-means for the clustering. Importantly, this methodology easily extends to the semi-supervised learning context.

A bottleneck in the above approach is the computation of eigenvectors of the graph Laplacian. The dimension of the graph Laplacian is equal to the number of data points, and the computation therefore becomes infeasible for large data sets. Our approach is to use continuum (large data) limits of the graph Laplacian to approximate the discrete problem with a continuum PDE problem. We can then use standard methods to discretise the continuum PDE problem on a potentially much coarser scale than the original discrete problem. In particular, instead of computing eigenvectors of the graph Laplacian, one would compute eigenfunctions of the continuum limit of the graph Laplacian and use these instead. This should remove the bottleneck in spectral clustering methods for large data.
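The discrete baseline that the continuum approach aims to accelerate can be sketched in a few lines. The snippet below assumes numpy, scipy and scikit-learn, and uses a small synthetic data set rather than anything of the order of 10^7 points.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

# two noisy point clouds standing in for a real data set
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

# symmetrised k-nearest-neighbour graph and its normalised graph Laplacian
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = 0.5 * (W + W.T)
L = laplacian(W, normed=True)

# eigenvectors of the smallest eigenvalues give the discriminating directions,
# which are then clustered with off-the-shelf k-means
eigenvalues, eigenvectors = np.linalg.eigh(L.toarray())
embedding = eigenvectors[:, :2]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embedding)
print(np.bincount(labels))
```

It is exactly the eigen-decomposition step that becomes infeasible at scale, and that the continuum-limit eigenfunctions are intended to replace.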
The approach is amenable to multi-scale methods, in particular by computing coarse approximations and iteratively refining them using known scaling results.

The project will start by implementing modifications of existing algorithms; in particular, we will replace bottlenecks such as computing eigenvalues with an approximation based on continuum limits. Once we have a working algorithm, we aim to take the project further by developing classification algorithms for diagnosing coeliac disease from medical images. In particular, using our algorithms, we aim to improve on the current state-of-the-art method of diagnosing coeliac disease (microscopic examination of biopsies), which is inaccurate, with around 20% misclassification.

Number of Students on Project: 1

Internship Person Specification

Essential Skills and Knowledge
• Good scientific computing skills, preferably in either Matlab or Python
• Competence in basic linear algebra
• Some functional analysis and PDEs
• Strong communication skills

Desirable Skills and Knowledge
• Experience with implementing Bayesian methods

Return to Contents
Project 5 – Privacy-aware neural network classification & training

Project Goal
To invent new encrypted methods for neural network training and classification.

Project Supervisors
Matt Kusner (Research Fellow, The Alan Turing Institute, University of Warwick)
Adria Gascon (Research Fellow, The Alan Turing Institute, University of Warwick)
Varun Kanade (Turing Fellow, The Alan Turing Institute, University of Oxford)

Project Description
Neural networks crucially rely on significant amounts of data to achieve state-of-the-art accuracy. This makes paradigms such as cloud computing and learning on distributed datasets appealing. In the former setting, computation and storage are efficiently outsourced to a trusted computing party, e.g. Azure, while in the latter, the computation of accurate models is enabled by aggregating data from several sources. However, for regulatory and/or ethical reasons, data cannot always be shared. For instance, many hospitals may have overlapping patient statistics which, if aggregated, could produce highly accurate classifiers; aggregating them, however, may compromise highly personal data. This kind of privacy concern prevents useful analysis of sensitive data.

To tackle this issue, privacy-preserving data analysis has emerged as an area involving several disciplines, including statistics, computer science, cryptography, and systems security. Although privacy in data analysis is not a solved problem, many theoretical and engineering breakthroughs have turned privacy-enhancing technologies such as homomorphic encryption, multi-party computation, and differential privacy into approaches of practical interest. However, such generic techniques do not scale to the input sizes required for training accurate deep learning models, and custom approaches that carefully combine them are necessary to overcome scalability issues.
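To give a flavour of the cryptographic building blocks involved, the sketch below uses the python-paillier ("phe") package to show additively homomorphic aggregation of model updates. It illustrates only one primitive, not the protocol the project would design, and assumes that package is available.

```python
from phe import paillier  # python-paillier: an additively homomorphic scheme

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# two data holders encrypt their local gradient updates under the same public key
gradients_a = [0.12, -0.05, 0.33]
gradients_b = [0.08, 0.02, -0.11]
encrypted_a = [public_key.encrypt(g) for g in gradients_a]
encrypted_b = [public_key.encrypt(g) for g in gradients_b]

# an untrusted aggregator can add ciphertexts without learning either input
encrypted_sum = [ca + cb for ca, cb in zip(encrypted_a, encrypted_b)]

# only the private-key holder recovers the aggregated update
print([round(private_key.decrypt(c), 4) for c in encrypted_sum])
```

Making such primitives work at the scale of neural network training, with low-precision weights and simple activations, is precisely the gap the project targets.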
Recent work on sparsifying neural networks and discretising the weights used during training suggests suitable avenues for enabling the application of modern encryption techniques. However, issues such as highly non-linear activation functions and the requirement for current methods to keep track of some high-precision parameters may inhibit direct application.

The project will focus on both of these aspects:
• Designing training procedures that use only low-precision weights and simple activation functions.
• Adapting cryptographic primitives, such as those used in homomorphic encryption and multi-party computation, to enable private training with these modified training procedures.

The ultimate goal of the project is to integrate both of these aspects into an implementation of a provably privacy-preserving system for neural network classification and training.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Interest in theoretical aspects of computer science
• Knowledge of public-key cryptography (RSA, Paillier, GSW)
• Knowledge of machine learning and neural networks (residual networks, convolutional networks)
• Experience in implementing secure and/or data analysis systems
• Experience in implementing distributed systems

Desired Skills and Knowledge
• Experience implementing cryptographic protocols
• Experience implementing multi-party computation protocols

Return to Contents
Project 6 – Clustering signed networks and time series data

Project Goal
To implement and compare several recent algorithms, and potentially develop new ones, for clustering signed networks, with a focus on correlation matrices arising from real-world multivariate time series data sets.

Project Supervisors
Mihai Cucuringu (Research Fellow, The Alan Turing Institute, University of Oxford)
Hemant Tyagi (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description
Clustering is one of the most widely used techniques in data analysis, and aims to identify groups of nodes that exhibit similar features. Spectral clustering methods have become a fundamental tool with a broad range of applications in areas including network science, machine learning and data mining. The analysis of signed networks - with negative weights denoting dissimilarity or distance between a pair of nodes - has become an increasingly important research topic in recent times. Examples include social networks that contain both friend and foe links, and bipartite shopping networks that encode like and dislike relationships between users and products. When analysing time series data, the most popular measure of linear dependence between variables is the Pearson correlation, taking values in [−1, 1], and clustering such correlation matrices is important in certain applications.

This proposal will develop k-way clustering in signed weighted graphs, motivated by social balance theory, where the task of clustering is to decompose the network into disjoint groups such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are connected by as many negative edges as possible.

We expect that the low-dimensional embeddings obtained via the various approaches we will investigate could be of independent interest in the context of robust dimensionality reduction in multivariate time series analysis. Of particular interest is learning nonlinear mappings from time series data which are able to exploit (even weak) temporal correlations inherent in sequential data, with the end goal of improving out-of-sample prediction.
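As a baseline for the approaches listed below, here is a small sketch (assuming numpy and scikit-learn) of spectral clustering with a signed Laplacian on a toy correlation-like matrix; the project's actual formulations, e.g. the generalised eigenproblem or SDP versions, refine this idea.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy signed, weighted adjacency matrix: positive ties within two groups,
# negative ties across them, plus noise (a stand-in for an empirical
# correlation matrix from multivariate time series)
rng = np.random.default_rng(2)
n = 40
truth = np.array([0] * 20 + [1] * 20)
A = np.where(truth[:, None] == truth[None, :], 1.0, -1.0) + 0.3 * rng.normal(size=(n, n))
A = 0.5 * (A + A.T)
np.fill_diagonal(A, 0.0)

# signed Laplacian L = D_abs - A, where D_abs holds the row sums of |A|
D_abs = np.diag(np.abs(A).sum(axis=1))
L = D_abs - A

# embed nodes with the eigenvectors of the smallest eigenvalues, then run k-means
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, :2]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embedding)
print(np.bincount(labels))
```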
We will focus on a subset of the following problems.

(1) Signed Network Embedding via a Generalised Eigenproblem. This approach is inspired by recent work that relies on a generalised eigenvalue formulation which can be solved extremely fast thanks to recent developments in Laplacian linear system solvers, making the approach scalable to networks with millions of nodes.

(2) Signed clustering via Semidefinite Programming (SDP). This approach relies on a semidefinite programming formulation, inspired by recent work in the context of community detection in sparse networks. We solve the SDP efficiently via a Burer-Monteiro approach, and extract clusters via minimal spanning tree-based clustering.

(3) An MBO scheme. Another direction relates to graph-based diffuse interface models utilising Ginzburg-Landau functionals, based on an adaptation of the classic numerical Merriman-Bence-Osher (MBO) scheme for minimising such graph-based functionals. This approach has the advantage that it can easily incorporate labelled data, in the context of semi-supervised clustering.

(4) Clustering time series using the Fréchet distance. The existing algorithm in the literature is quite complicated and not directly implementable in practice. It essentially involves a pre-processing step where each time series is replaced with a lower-complexity version via its "signature". This leads to a faster algorithm for clustering (in theory). The approach via signatures could prove powerful, and one could consider forming the signature via randomised sampling of the "segments" of the time series.

(5) Graph motifs. This approach relies on extending recent work on clustering the motif/graphlet adjacency matrix, as proposed recently in a Science paper by Benson, Gleich, and Leskovec.
(6) Spectrum-based deep nets. A recent approach in the literature focuses on fraud detection in signed graphs with very few labelled training sample points. This problem and its setup are very similar to the topic of an ongoing research grant, "Accenture and Turing alliance for Data Science", which uses network analysis tools for fraud detection and could benefit from any algorithmic developments that take place during the internship. The approach proposes a novel framework that combines deep neural networks and spectral graph analysis, relying on the low-dimensional spectral coordinates (extracted by approaches (1)-(5) detailed above) as input to deep neural networks, making the latter computationally feasible to train.

Approaches (1), (2) and (3) already have a working MATLAB implementation available, which could be built upon and compared to (4) and (5). Time permitting, (6) can also be explored. There will be freedom to pursue any subset of the above topics that aligns best with the candidates' backgrounds and maximises the chances of a publication.

A strong emphasis will be placed on assessing the performance of the algorithms on real-world, publicly available data sets arising in economic data science, meteorology, medical monitoring or finance.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Both students will have familiarity with the same programming language (either R, Python, or MATLAB)
• Solid knowledge of linear algebra and algorithms
• Familiarity with basic machine learning tools such as clustering, linear regression and PCA

Desirable Skills and Knowledge
• Basic familiarity with spectral methods, optimization, nonlinear dimensionality reduction, graph theory, model selection, LASSO/Ridge regression, SVMs, NNs (desirable but not required)

Return to Contents
Project 7 – Uncovering hidden cooperation in democratic institutions

Project Goal
To generalise the method of Vote-Trading Networks, previously developed to study hidden cooperation in the US Congress, to a wider set of democratic institutions, developing a research programme in the measurement and characterisation of hidden cooperation on a large scale.

Project Supervisors
Omar A Guerrero (Research Fellow, The Alan Turing Institute, University College London)
Ulrich Matter (University of St Gallen)
Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description
The project aims at improving our understanding of cooperation in democratic institutions. In particular, it will shed new light on cooperative behaviour that is intentionally 'hidden'. An example of such hidden cooperation is when two legislators agree to support each other's favourite bills, despite their ideological preferences, and/or despite such support being disapproved of by their respective voters or campaign donors. This kind of behaviour is key to the passage or blockage of critical legislation; however, we know little about it due to its unobservable nature. The objective of this project is to exploit newly available big data on voting behaviour from different institutional contexts, together with state-of-the-art methods from data science, in order to develop two distinct research papers with clear policy implications for the design and evaluation of political institutions.

Political institutions, such as parliaments and congresses, shape the life of every democratic society. Hence, understanding how legislative decisions arise from hidden agreements has direct implications for the guidelines that governments follow when conducting policy interventions. Moreover, decision making by voting is common in areas other than legislative law-making: it is prevalent in courts and international organisations, as well as in the board rooms of private enterprises.

The supervisors have collected comprehensive data sets on two institutions: the US Supreme Court and the United Nations General Assembly.
Each intern will work on one institution, using the data provided by the supervisors and, sometimes, collecting complementary data (through web scraping). The work conducted on the two institutions will share a set of tools and methods, but will also have unique requirements. In order to streamline the workflow, the internship will be structured in three phases.

Every week, there will be a group meeting where each intern will give a presentation of his or her progress. This will be an opportunity to share ideas, questions, challenges and solutions that the interns have encountered. It will also serve to evaluate progress and adjust goals and objectives. In addition, the documentation of their progress will form the basis for a final report to be handed in during the last week.

Phase 1: Introduction (1 to 1.5 weeks)
The interns will receive an introduction to the topic of cooperation in social systems, with a particular focus on political institutions and on situations in which cooperation is intentionally hidden, such as vote trading, and hence unobservable in real-world data. Some specifics about this phase are the following:
• Introduction to vote trading in democratic institutions, its societal relevance, evidence, measurements and challenges.
• Introduction to web scraping and text mining.
• Tutorial on network science.
• Tutorial on stochastic and agent-based models.
• Tutorial on the Vote-Trading Networks framework.

Phase 2: Work with Data (3 to 4 weeks)
In this phase, the interns will conduct independent work to prepare their datasets and perform statistical analysis to understand their structure. The supervisors will provide the 'core' datasets, which will then be processed, pruned and analysed by the interns. Preparation work varies depending on the project. The intern working with US Supreme Court data will apply natural language processing (NLP) techniques to a large set of raw text documents, and then match the extracted information to voting records. Given the nature of the problems related to NLP, this work could require substantially more time than the UN project; hence, the goals and timelines for this project will be adjusted according to progress. The intern working with UN data will extend a web scraper, previously developed by the supervisors, in order to download data from the UN Library on resolutions, and match it to voting data from the UN General Assembly.

Once the data sets have been prepared, the interns will conduct statistical analysis. This will help the group to gain a better understanding of the composition of the population, their characteristics, voting patterns, voting outcomes, etc.
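By way of illustration, a small sketch of the kind of descriptive statistics meant here is given below, using pandas on a hypothetical roll-call table; the real data sets, variable names and coding schemes will be those provided by the supervisors.

```python
import pandas as pd

# hypothetical roll-call records: one row per (member, resolution) vote
votes = pd.DataFrame({
    "member":     ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "resolution": ["R1", "R2", "R3", "R1", "R2", "R3", "R1", "R2", "R3"],
    "vote":       ["yes", "no", "yes", "yes", "yes", "yes", "no", "no", "yes"],
})

# pivot to a member-by-resolution matrix and compute pairwise agreement rates,
# a first descriptive step before any vote-trading network estimation
matrix = votes.pivot(index="member", columns="resolution", values="vote")
members = matrix.index
agreement = pd.DataFrame(
    [[(matrix.loc[i] == matrix.loc[j]).mean() for j in members] for i in members],
    index=members, columns=members,
)
print(agreement)
```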
Phase 3: Computational Analysis (rest of the internship)
In this phase, the interns will bring together their understanding of the institutions, the ideas behind hidden cooperation, the data sets and the computational methods. The interns will write up their results, with the goal of publishing two distinct research articles.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Knowledgeable in the Python and/or R programming languages
• Familiar with statistical concepts such as random variables, probability distributions and hypothesis testing
• Experience working with empirical micro-data

Desirable Skills and Knowledge
• Knowledgeable about political institutions and economic behaviour
• Familiar with complexity science and complex networks
• Familiar with agent-based modelling and Monte Carlo simulation

Return to Contents
Project 8 – Deep learning for object tracking over occlusion

Project Goal
To use deep learning to discover occluded objects in an image.

Project Supervisors
Vaishak Belle (Turing Fellow, The Alan Turing Institute, University of Edinburgh)
Chris Russell (Turing Fellow, The Alan Turing Institute, University of Surrey)
Brooks Paige (Research Fellow, The Alan Turing Institute, University of Cambridge)

Project Description
Numerous applications in data science require us to parse unstructured data in an automated fashion. However, many of the resulting models are not human-interpretable. Given the increasing need for explainable machine learning, an inherent challenge is whether interpretable representations can be learned from data.

Consider the application of object tracking. Classically, algorithms simply track the changing positions of objects across frames. But in many complex applications, ranging from robotics to satellite imagery to security, objects become occluded and thus disappear from the observational viewpoint. The first task here is then to learn semantic representations for concepts such as "inside", "behind" and "contained in".

The first supervisor (V. Belle) has written a few papers on using probabilistic programming languages to define such occlusion models -- in the sense of instantiating them as graphical models -- and on using that construction in particle filtering (PF) problems and decision-theoretic planning problems. However, the main barrier to success was that these occlusion models needed to be defined carefully by hand, which makes them difficult to deploy in new contexts. The main challenge of this internship is to take steps towards automating the learning of these occlusion models directly from data. Specifically, the idea is to jointly train a state estimation model -- specifically a particle filter (PF) -- with a background vision segmentation model so that we can predict the next position of an occluded object.
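To fix ideas about the state-estimation side, here is a minimal constant-velocity particle filter sketch in numpy that keeps propagating its belief when no detection arrives (an occluded frame). It is only a toy, with resampling omitted, and is not the PF-RNN architecture the project would ultimately train.

```python
import numpy as np

rng = np.random.default_rng(3)
n_particles = 500

# particle state: [x, y, vx, vy], initialised around the last confirmed detection
particles = rng.normal([0.0, 0.0, 1.0, 0.5], [0.5, 0.5, 0.1, 0.1], (n_particles, 4))
weights = np.full(n_particles, 1.0 / n_particles)

def predict(p, dt=1.0, noise=0.05):
    p = p.copy()
    p[:, :2] += dt * p[:, 2:]               # constant-velocity motion model
    return p + noise * rng.normal(size=p.shape)

def update(w, p, observation, sigma=0.5):
    d2 = ((p[:, :2] - observation) ** 2).sum(axis=1)
    w = w * np.exp(-0.5 * d2 / sigma ** 2)   # Gaussian observation likelihood
    return w / w.sum()

# None stands for an occluded frame: predict, but skip the measurement update
observations = [np.array([1.0, 0.5]), np.array([2.1, 1.1]), None, None]
for t, obs in enumerate(observations):
    particles = predict(particles)
    if obs is not None:
        weights = update(weights, particles, obs)
    estimate = (weights[:, None] * particles[:, :2]).sum(axis=0)
    print(f"t={t} estimate={estimate.round(2)} occluded={obs is None}")
```

Jointly learning the motion and occlusion structure from data, rather than hand-specifying it as above, is the core of the internship.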
The second supervisor (C. Russell) has extensive experience in vision and segmentation and will serve as the principal point of contact at the ATI for the interns. (The first supervisor will also make regular visits during the initial stages.) We will focus on using variational autoencoders, recurrent neural nets or other relevant deep learning architectures, such as sum-product networks, to enable the learning of semantic representations. For instantiating deep learning architectures, B. Paige will contribute his recent approaches to integrating the learning framework with PyTorch and/or Pyro, the latter recently proposed by Uber.

For the data, we plan on using two kinds of data sets. From the object tracking community, we will use tracking videos and clips, annotating occluded objects in order to train the models. (Russell's working relationship with our new partner QMUL gives us direct access to their tracking expertise, and to sports and commuter tracking datasets. In consultation with them, we intend to apply PF-RNNs to these problems.)

The expected outcome is the following: a learned model M such that, for any clip C in which an object O becomes occluded at some point, a query about the position of O against M would correctly identify that O is occluded and estimate where its position is, based on the velocity of O's movement and its position before it became occluded.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Background in machine learning and deep learning
• Preferably a background in handling image data
• A background in sum-product networks or PyTorch, Pyro, etc. would be beneficial

Return to Contents
Project 9 – Listening to the crowd: Data science to understand the British Museum visitors

Project Goal
To analyse and understand the British Museum visitors' behaviour and feedback, using different sets of data, including Trip Advisor feedback, WiFi access data and "intelligent counting" data, and methods such as natural language processing and time series analysis.

Project Supervisors
Taha Yasseri (Turing Fellow, The Alan Turing Institute, University of Oxford)
Coline Cuau (British Museum)
Harrison Pim (British Museum)

Project Description
There is more to the British Museum than Egyptian mummies and the Rosetta Stone - more than 6 million people walk through the doors each year, travelling from every corner of the globe to see the Museum's collection and get a better understanding of their shared histories. Those visitors offer us a unique test bed for data science and real-world testing at scale.

In order to address some of the challenges of welcoming such a large number of visitors, the British Museum is constantly gathering feedback and information about the visiting experience. Research about visitors informs decisions made by teams around the Museum and helps the Museum evolve along with its audience. The tools at the Museum's disposal include direct feedback channels (such as email or comment cards), "intelligent counting" data, WiFi data, audio guide data, social media conversations, satisfaction surveys, on-site observation and conversations on online review sites such as Trip Advisor.

Trip Advisor reviews are one of the largest and richest qualitative datasets the Museum has access to. On average, over 1,000 visitors review their visit on the platform every month. These reviews are written in over 10 languages by visitors from all parts of the world, and historical data stretches back over two years.
In these comments, visitors discuss the positive and negative aspects of their visits, make recommendations to others, and rate their satisfaction. The data set is an opportunity for the Museum to learn more about its visitors, to understand what the most talked-about topics are, and which factors have the biggest impact on satisfaction.

This research project aims to dig into a rich set of qualitative data, uncovering actionable insights which will have a real impact on the Museum. The research will have an immediate and tangible effect and will help the organisation improve the visiting experience currently on offer at the Museum. The Museum is currently undergoing pivotal strategic change, and the insights will also feed into future iterations of the display and audience strategies. As far as we know, the British Museum is the first institution of its kind to take a programmatic approach to this kind of qualitative data. This pioneering research could potentially impact the rest of the cultural sector and show the way to a new method of evaluation and visitor research.

Some of the questions we hope to answer with this data are:
• Understanding satisfaction – what it means, how it affects propensity to recommend, and which aspects of a visit have the biggest impact on overall satisfaction.
• Analysing the different topics talked about in different languages. Do positive and negative experiences vary according to language?
• Analysing which parts of the collection or which objects visitors talk about the most, and how feedback differs from one area of the Museum to another.
• Tracking comments regarding a variety of key topics, and understanding how they relate to one another (tours and talks, audio guides, access, facilities, queues, overcrowding…).
• Understanding and anticipating external factors which might impact decisions made to visit (economy, weather, security concerns, strikes, politics…).

The Museum has recently set up a partnership with Trip Advisor, which gives us access to the reviews in an XML format. This file includes the date and URL of each review, as well as its title, score, language and full review text. The Museum could take a manual approach to tagging and analysing reviews, but we believe that more insight can be generated through computational approaches.

The proposed research project will therefore involve heavy use of modern Natural Language Processing (NLP) techniques. The complete corpus of review text consists of approximately 7,500,000 words in 50,000 distinct reviews. Recent advances in machine learning and NLP provide a wide range of potential approaches to the subject; suggested methods include the following (a toy topic-modelling sketch is given after the list):
• Topic modelling
• Clustering/classifying reviews by topic or sentiment
• word2vec-style approaches to training/using word embeddings
• Automating the tagging of new reviews
• Time series analysis and principal component analysis
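As a taste of the topic-modelling route, the sketch below fits a tiny LDA model with scikit-learn on invented review snippets; the real corpus, preprocessing and multilingual handling would look quite different.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# invented snippets standing in for the Trip Advisor review corpus
reviews = [
    "amazing collection the rosetta stone was a highlight of our trip",
    "too crowded and long queues at the entrance on saturday",
    "great audio guide and friendly helpful staff",
    "queues and overcrowding spoiled an otherwise lovely visit",
    "the egyptian mummies and the great court are stunning",
]

vectoriser = CountVectorizer(stop_words="english")
counts = vectoriser.fit_transform(reviews)

# fit a two-topic LDA model and print the top terms per topic
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vectoriser.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top_terms)}")
```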
Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge
• Familiarity with large-scale data analysis
• Experience in scientific programming (R or Python)
• Interest in natural language processing techniques
• Interest in or past experience with advanced statistical methods such as time series analysis and PCA

Desirable Skills and Knowledge
• Interest in culture and museums, and familiarity with the context of the project

Return to Contents