Communication Analysis through Visual Analytics: Current Practices, Challenges, and New Frontiers - arXiv
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Communication Analysis through Visual Analytics: Current Practices, Challenges, and New Frontiers Maximilian T. Fischer, Frederik L. Dennig, Daniel Seebacher, Daniel A. Keim, and Mennatallah El-Assady III Visual Interface Properties of the representation Features of the interaction I IV Particularities of the refinement process Input: Data and Information Knowledge Generation Information about the message and data Type of output Context and participant information Evaluation and factors for verification II Properties / strategies of the environment Intended knowledge gain and power arXiv:2106.14802v1 [cs.HC] 28 Jun 2021 Processing and Models Analytical goals of the approach Scope of the models used Processing methods and properties Fig. 1. Four main dimensions of our conceptual framework for communication analysis systems along with their characterization, which is based on the visual analytics process model by Keim et al. [40] and adapted by us to the communication analysis domain. Abstract— The automated analysis of digital human communication data often focuses on specific aspects like content or network structure in isolation, while classical communication research stresses the importance of a holistic analysis approach. This work aims to formalize digital communication analysis and investigate how classical results can be leveraged as part of visually interactive systems, which offers new analysis opportunities to allow for less biased, skewed, or incomplete results. For this, we construct a conceptual framework and design space based on the existing research landscape, technical considerations, and communication research that describes the properties, capabilities, and composition of such systems through 30 criteria in four analysis dimensions. We make the case how visual analytics principles are uniquely suited for a more holistic approach by tackling the automation complexity and leverage domain knowledge, paving the way to generate design guidelines for building such approaches. Our framework provides a common language and description of communication analysis systems to support existing research, highlights relevant design areas while promoting and supporting the mutual exchange between researchers. Additionally, our framework identifies existing gaps and highlights opportunities in research areas that are worth investigating further. With this contribution, we pave the path for the formalization of digital communication analysis through visual analytics. Index Terms—Communication analysis, visual analytics, conceptual framework, design space, state-of-the-art. 1 I NTRODUCTION Human communication has been fundamentally transformed within nication (in the following: communication analysis systems). For the the digital era. While digital forms of communication have slowly in- purpose of this paper, we define these systems as largely automated creased in the second half of the last century, the pace has fundamentally applications that also employ visual components to interactively ana- accelerated within the last two decades. Nowadays, digital communica- lyze digital human communication. Most commonly, this can be chats, tion is the norm rather than the exception, offering cost-effectiveness e-mail, etc. We do not consider approaches that focus primarily on the and instant access anywhere, leading to a change in communication content itself (e.g., only sentiment analysis on a document collection behavior [57] – possibly permanently. alone), but the approach must thereby focus on distributed information With this transformation to digital communication [70], new research exchanges by several participants. Consequently, public exchanges opportunities have emerged in a wide variety of different domains, rang- (e.g., Twitter or forums), even though they are somewhat related, do ing from engineering to social sciences to business. For example, it not fit directly as all information is available to all participants directly has been studied how visualization can show the evolution of dynamic after a post (similar to a single group communication). communication networks [75], how discourse analysis for digital com- For such communication analysis systems, we notice a peculiar munication can be enhanced [34], or how team performance in business oddity: As we highlighted in previous work [18], and summarize in settings can be evaluated based on communication behavior [20]. For this paragraph shortly, the research into such systems is split into two such analyses, digital analysis methods are often used to aid and support fractions: the vast majority of existing approaches focus on either the the (semi-)manual, domain-specific research methodologies. content of communication or on the network aspect in isolation in- In this paper, we focus on the field of interactive human communi- stead of taking into account the fundamental dynamics holistically. In cation analysis and specifically on automated and interactive commu- contrast, this distinction between content and network is not present nication analysis systems that primarily target written human commu- in the seminal works on human communication research [55, 72, 78]. Neither does it occur in more recent textbooks [62] or in current com- munication research [20, 54]. Even when digital methods are used to • Maximilian T. Fischer, Frederik L. Dennig, Daniel Seebacher, Daniel A. aid the manual analysis, for example, in psychology, often a holistic Keim, and Mennatallah El-Assady are with the University of Konstanz. view is taken. Indeed, the holistic analysis of the network structure, E-mail: {max.fischer, frederik.dennig, daniel.seebacher, keim, the communication patterns as well as the content plays an integral mennatallah.el-assady}@uni-konstanz.de. part [68, 78] in this research field. As such, the analysis of the con- This is a preprint and pre-peer-reviewed version of this article. tent or the network/metadata aspect alone can sometimes lead to an incomplete or even biased view, skewing the analysis results. Using
independent approaches often introduces discontinuities, makes tasks 2 BACKGROUND more complicated, requires more manual work, and makes the detection Communication analysis can use a variety of different techniques to of cross-matches more difficult. However, the development of such analyze communication behavior. Like communication itself is an approaches is difficult as human communication data can be very com- interactive exchange between sender and receiver, its analysis is not plex, consisting of meta-data, network relations, and content, including bound by a strict set of instructions. It builds upon the iterative analysis explicit and implicit connotations, with some meaning only evident and progressive understanding of the meaning of the communication when considering everything together and in context. Nevertheless, this and its context, which makes automation difficult. As such, the area of comparisons highlights a gap in such communication analysis systems. semi-automated communication analysis fits very well with the concept We have begun to address these shortcoming in previous work [18] of visual analytics, which is also based on the concept of iterative and by presenting a prototype for one of the first holistic approaches to interactive knowledge generation. In the following, we first describe the communication analysis. previous work in human sensemaking using visual analytics principles However, the research in such systems has not yet received enough and how it integrates with communication analysis as well as the origins interest yet, in part due to the fact that the area is not yet accurately and main concepts of communication analysis itself. and systematically described. The need for such a formalization has been recognized [76] in communication sciences, but so far the oppor- 2.1 Visual Analytics and Knowledge Generation tunities, challenges, and pitfalls have primarily been described from Visual analytics [40] describes the concept of combining human ana- a application-oriented perspective in the social sciences. While an lytical reasoning with computational data analysis through interactive effort has recently been made to map digital communication systems visualizations to gain knowledge. This knowledge generation process as a whole [19], the focuses was largely on content, infrastructure, has been formalized by Sacha et al. [66] by explicitly describing the and policy aspects, while leaving out the analysis and technical as- human cognition as three circular loops that build on each other: (1) pects. The same is true for the classical communication analysis re- exploration, (2) verification, and (3) knowledge generation. In the first search [55, 62, 72, 78]. One can base considerations on these historic loop (exploration), the user explores the data and the initial results the works, but on the one hand it involves many aspects with no direct system presents. In communication analysis, this could, for example, counterpart in digital communication. For example, facial expressions encompass tasks such as finding the communication participants most and gestures are not retained and accessible for the recipient of written central in a network or find those messages which contain specific exchanges. On the other hand and on a more fundamental level, digital semantic concepts. In the second loop (verification), a confirmatory communication has also transformed the way we communicate and analysis is conducted to either confirm or reject a hypothesis. Specific the modalities we use [59] Digital communication has its own partic- tasks here could include comparison between different results, further ularities and characteristics, which are, naturally, not discussed in the inspection, or aggregation. At last, in the third loop (knowledge gener- seminal works. However, when we instead focus on approaches de- ation), the user gains new knowledge and insights about the data and scribing actual systems including the technical aspects, many findings the analysis. Every step in this iterative process is essential, as it allows from communication research like the importance of a holistic view the analysts to form connections between different analysis modalities. are not heeded. Instead, the vast majority of approaches fall into the This is especially important, because these different modalities are very two fractions discussed above, focusing either on concepts from natural common in communication analysis [39], which makes visual analytics language processing [53] for content analysis or discussing the field of especially suited for a holistic, non-specialized analysis. social network analysis [71] for the network aspects. In this work, we want to bridge the gap between communication 2.2 Communication Analysis research and modern communication analysis system development. As The origins of communication research can be traced backed to ancient evident in a few academic works [18, 28, 42, 79] and recent commercial times, with the study of rhetoric and oratory as well as persuasion in systems [13, 58, 60], visualization and interactive user steering is a ancient Greece and Rome. However, the formal study of communica- promising way [33] to begin to tackle the gap between the two different tion as a process did not really start until the early 20th century. Of analysis modalities. So far, the field of communication analysis systems particular importance in this early phase are the works of Cooley [11] is not clearly defined from a scientific, technical perspective, lacking a and Lippmann [48], which provided one of the first formal definitions common language to describe it. Consequently, so far, it has not been of communication and established the influence of social aspects in systematically explored how visual analytics principles are – and could communication, as well as the work of Simmel [72], studying social be – employed in communication analysis, how such systems can be relationships and their influence. These human interrelations were described and categorized, for example, by their data, methodology, and further studied by Moreno [55], who is often credited with laying the information handling, and what a relevant taxonomy would look like. foundations for social network analysis (SNA). These early works were This lack of formalization makes developing such systems complicated then further extended by Bavelas [5] and Leavitt [46], who introduced a and hinders comparison between approaches. taxonomy for communication patterns in groups. Only a few years later, As part of this work, we aim to formalize digital communication Savage and Deutsch [68] paved the way to analyze communication analysis by constructing a design space of interactive digital communi- through computer aided modeling and processing. However, manual cation analysis systems, making the following contributions: analysis and research was still the leading methodology. In the coming years, Watzlawick et al. [78] developed a theory of communication, while other work [69] further described the particularities of human • The creation of a conceptual framework of communication anal- communication. Around this time, the field was further extended by ysis systems, based on communication research, technical consid- Roger [65] and adapted to the modern understanding of communication erations, and systematically reviewing the properties of existing networks. Recent works highlight the main findings [62]. approaches and the associated tasks. With the advent of digital processing and computational power came • A survey and comparison of existing approaches to assess their the shift from laboursome manual analysis [25] first to digitally sup- maturity and coverage, using our framework. ported [50] and later to highly automated analysis. All three levels are • A discussion on the open challenges and a list of future research fully justified and cover important areas in their fields. However, it is opportunities on communication analysis systems. noticeable, while not completely surprising, that the completeness of the analysis developed in the opposite direction of automation level, with solutions being ever more specialized and focusing on just specific Our work fills the gap in the formalization of communication analy- aspects the more they are automated. For example, modern systems sis systems. With this contribution, we aim to better guide the research allow us to analyze communication behavior and social ties using cen- when assessing and developing approaches for communication analysis trality measures [52] or describe complete artificial networks in social using visual analytics principles. sciences [6]. Specialized toolkits have been developed to analysis such
network structures, like Pajek [4] or Gephi [3]. All these approaches primarily focus on the network aspects, omitting most of the meta- Automatic 1285 Manual 58 Manual 44 Manual 37 f Filtering Filtering Coding Validation data and especially the content. Other approaches instead focus only on it, like approaches that use keyword-based searches [81] to filter communication for certain content, or those which aim to improve the Fig. 2. The paper collection and coding process. It consists of four understanding of the meaning of communication using concepts such main steps: (1) Automated filtering, (2) manual filtering, (3) manual as sentiment analysis [61] or topic modeling [64]. coding, and (4) manual validation. From a visual perspective, several approaches primarily follow a node-link-diagram-based approach, like Gephi [3], and many commer- cial solutions like IBM’s i2 Analyst’s Notebook [38], Pajek [4], Palantir Venue #Coll. #Filtered #Coded #Final Gotham [60], DataWalk [13], and Nuix Discover and Nuix Investi- gate [58]. Another class of approaches uses matrix-based approaches IEEE TVCG 790 35 27 23 to analyze the communication relations, for example, MatrixExplorer Computer Graphics Fo. 495 17 11 8 [31] or NodeTrix [32]. Another set of approaches uses timeline designs Commercial Systems - 6 6 6 like CloudLines [43], while others like Fu et al. [24] modify graph pre- Total 1285 58 44 37 sentations through multiple planes. For a detailed discussion, we refer to previous work [18], where we discuss the existing visual analytics Table 1. Publications per venue and paper count for collection step. research landscape for communication analysis systems extensively. 3 M ETHODOLOGY The goals and methods employed in communication analysis are diverse these approaches would go beyond the scope of this paper, given its and manifold, as illustrated by the broadness of existing approaches. size constraints. In the following, we aim towards tackling the central question of a To broaden the perspective, in a second step, we consulted with eight common description of communication analysis: How can the different domain experts about approaches used in practice. The experts be- approaches in communication analysis systems be described within a long to the field of law enforcement, working for various European law common, conceptual framework? enforcement agencies, and each has extensive experience with digital Framework Basis — We propose to base such a framework on three investigations, including communication analysis, working in the field areas of consideration: (1) Communication research offers decades from ten to over 30 years. They contributed a collection of six actual of research on the particularities of (often non-technical) communica- systems [3, 4, 13, 38, 58, 60] applied in the field (including commercial). tion analysis. For this work, we integrate concepts primarily from later Those approaches were included in the Selection Methodology from summary works [62, 69, 78]. We discuss the concepts and ideas which Step 2 onward and were not simultaneously discovered during the seed are based on communication research in detail in the relevant parts paper selection. However, some approaches [3, 38] can be considered of Section 4. (2) Technical considerations of possible approaches al- to be universally known in the community. low for production-based selection criteria, taking into account design Selection Methodology — For the actual paper selection method- properties such as analyzable data types, data representation, and flow, ology, we follow a four-step approach (see also Figure 2 and Table 1). or limitations like scalability from a technical standpoint. These are First, we conducted a keyword-based seed search for the words commu- closely linked to the last category. (3) The existing research landscape nication and analysis on the titles, abstract, index terms, and contents of interactive communication analysis systems provides a foundation of publications in each of the venues described above. Secondly, we for the classification of approaches based on measures such as analyt- went through all these papers’ titles and abstracts manually, discarding ical goals, visualization and interaction methods, or the power of the those which clearly are not concerned with communication analysis knowledge generation process. While the first two considerations are systems, reducing the selection significantly. In this step, we included references directly when describing the framework in Section 4, the approaches suggested by the domain experts. Third, we manually third requires a thorough preparation: looked at the remaining papers and decided if they indeed describe Seed Papers for Existing Research Landscape — To analyze a communication analysis system. In the final step, we validated the state-of-the-art interactive communication analysis systems and con- results by checking borderline cases, consequently removing seven tribute a first set of classification criteria, we conducted a seed literature papers. Or final collection includes 37 approaches. survey to get an initial idea of a large set of existing approaches. First, we collected papers using a keyword-based search, and as we 4 C ONCEPTUAL F RAMEWORK are considering only approaches from visual data analysis, restrict this search to the following high-quality journals and conferences: In the following section, we aim to construct a possible conceptual framework that encompasses discerning aspects of communication • IEEE Transactions of Visualization and Computer Graphics analysis systems. As with any taxonomy, the development took dozens (TVCG, including IEEE VIS proceedings) of iterations, and we cannot cover its complete evolution here. Instead, • Computer Graphics Forum (including EuroVis and EuroVA) we highlight the main considerations and, for each property, give a justification. For the complete conceptual framework, see Figure 3). In We consider approaches published in the last 15 years (from 2006 to the following, we focus on the argumentation and reasoning, categoriz- 2021) to focus on recent developments for all venues. Note that we ing the existing approaches just in an exemplary manner. For the full intentionally restricted our search space to these venues, considering classification of the selected approaches from above, consult Table 2. only well known information visualization publications. The goal of this paper is not to construct a complete survey of approaches – we aim Main Considerations When designing a conceptual framework, for a large and diverse enough sample to make justifiable conclusions the structuring methodology is of central importance. One standard (for a more detailed discussion and a possible future survey, see also methodology is to use a task-based grouping. However, when analyzing Section 5). Indeed, we are well aware that other valid approaches ex- the different approaches, we realized that sometimes very different ists, being published in other venues. For example, we did not include methods are employed for the same task. For example, to discover key papers from human–computer interaction issues (e.g., CHI [22, 83]). persons within a communication network, one can either use an SNA- Also, the survey on text visualization by Kucher et al. [44], especially based approach using node-link visualizations and centrality measures the group discourse analysis, provides some further approaches that or instead leverage a geometric deep learning model that is interactively have been published in IUI [36, 37] or TiiS [21]. Further, other venues visualized using a matrix-based method. These methods have very like PacificVis [24, 83] or domain specific journals like Digital Inves- different side effects, visualization, and interaction techniques and tigation [28] also contributed valid approaches. The inclusion of all make very different assumptions and requirements on the data.
The primary goal is to design a conceptual framework of commu- sentiment analysis) while judging support for explicit (e.g., keyword- nication analysis in visual analytics from a communication theoretical based search like [38]) and also implicit (e.g., named entity recognition standpoint, but one which can be easily adapted and applied to such ap- like [15]) content. Most approaches consider the explicit level, and proaches. Therefore, we base our framework on the established visual several, especially text-based ones, also consider the implicit level. analytics process model by Keim et al. [40] to cover the main areas. We slightly modify the terminology, proposing four main dimensions, 4.1.2 Communication Participants which are characterized further in Figure 1: (I) Input: Data and Infor- A second, very important aspect in communication research, is the mation (4.1), (II) Processing and Models (4.2), (III) Visual Interface relationship of the communication participants. It is usually analyzed (4.3), and (IV) Knowledge Generation (4.4). both from the perspective of scope or by their power relationship. In our work, we develop these dimensions specifically for commu- Target Scope — The target scope can be oneself (e.g., in a note to nication analysis. To achieve this, we include properties relevant in oneself), a single person, or a group of persons. The analysis of the communication research as well as discerning technical properties, existing approaches shows that most systems focus on a single target. based on the existing research landscape. In the following, we discuss Power Relationship — It comes again from classical communica- each dimension individually. As a note of caution, we stress that many tion research [78], which proposes a distinction into symmetrical (equal properties we present in the following are themselves multifaceted and, grounds) or complementary (dependence). The relationship context is for a detailed analysis, benefit from a more nuanced discussion in more rarely taken into account explicitly in existing research (e.g., [8]). detail than we can include within the limited space available here. 4.1.3 Environment 4.1 Input: Data and Information The context of communication is very important [78], because it can This category focuses on information, context and environment of the influence the exchange itself, such that message properties have to be communication. In contrast to the process model by Keim et al. [40], interpreted differently. we specifically understand this category to focus on the information Collection Type — The collection type defines how the data was theoretical aspects and the collection process itself, not so much on any acquired and how it is leveraged. We propose to categorize it into preprocessing steps. As such, parts of the structuring are partly based a quadrant, consisting of the targeting methodology on the one side on classical communication research [62, 69, 78], which has extensively and the anonymity level on the other side (see the relevant part in discussed the content and meaning, context, and relationship aspects Figure 3). The former can be either targeted (specific communica- of communication. Building on this, we propose three main subcat- tion from a restricted set of users due to a trigger event) or untargeted egories: message, communication participant, and environment. We (unfound bulk collection). The latter can either be high (anonymized complement these aspects with properties relevant for digital systems, or pseudo-anonymized) or low (identifiable). Regarding the analy- like the data type and collection properties. sis, this can make a significant difference. For example, the targeted analysis of identifiable communication participants can focus on the 4.1.1 Message actual exchange and leverage context and relationship information. The untargeted analysis of pseudo-anonymized communication instead Communication can be regarded as the exchange of symbols. In com- often results in a search for the needle in the hay-stack and can not al- puter science, the term message is more common. The main considera- ways leverage background information. This results in vastly different tions are which meaning it contains and how it is transported: application development, scalability, and inference capabilities. Data Type — The data type refers to the content type of the data Analysis Coupling — Closely related is the analysis coupling, from a technical point of view. When looking at data classification in which can influence the communication coding due to privacy con- information visualization [10], we can identify several data types which siderations. Again, we categorize it into a quadrant, this time between are relevant for communication analysis: text, audio, image/video, (expected) knowledge about the analysis, which can be either known or network, and time-series. Based on the usage in current approaches unknown to be true. Note that the inverse, known to be false, is only (see Section 3), the two most relevant ones are text data (e.g., [12]) achievable with a specific degree of certainty. The coupling refers to and network (structure) data (e.g., [3]), encompassing the content and the effect that the (expected) analysis can influence the measurement. the communication network, respectively. However, the other data If participants expect their communication to be analyzed, they might types are also relevant. For example, communication can happen via communicate differently, being less honest, implicit, or communicate audio (e.g., telephone or VoIP) or via video chats, comprising audio not at all. This effect also depends on the targeting: If it is known that and moving images, i.e., video data. While our framework focuses all e-mails in the world are collected and analyzed (untargeted), they primarily on this written (i.e., text) communication, it is agnostic to the might have less influence than knowing that their e-mails are monitored data type of information. For example, it would be possible to include for sure (targeted), as this can influence the degree of scrutiny and so- the detection of facial expressions using deep learning [56] to analyze phistication of the analysis employed. For example, diplomatic cables the analogical code and set it in context. The last data type is time- are often very frank, mainly because they are not expected to be leaked series data (e.g., [43]), which is often counted to the meta-data and, for (assume untargeted, identifiable), while public policy discussions are example, can contain information about communication regularity. much more considerate (targeted, identifiable). Code/Intent — The meaning is the central aspect of the commu- Confidentiality — The previous point relates to the confidentiality nication, as evident by the plethora of approaches that aim to learn it. of the communication channel, which can influence the communication Based on Watzlawick et al. [78], the communication can be regarded coding, and therefore should be taken into account. as code, either digital or analogical (sic!). In a simplified expression, Strategy — A technical consideration which influences the analysis the digital code refers to the actual meaning of the transmitted infor- possibilities is the collection strategy. Is the data leveraged just a one- mation in a symbolic system, while the analogical code refers to how off dataset that contains past information only (static S ) like [4], or something is communicated, including cues, and context information. does the data collection reach the present, possibly is actively running, While the former is, by definition, present in virtually all approaches and the information increases over time (dynamic D ) like [60]. They we compared (e.g., [31, 42]), analogical analysis is rare (e.g., [35]). differ as the extracted knowledge might (significantly) change over Expressiveness — Similar, but conceptually slightly different, is the time with new information available, affecting the analytical process. expressiveness, which describes the nature of the information extracted Continuity — Along with the previous consideration, one can con- from the content. It can be explicit factual information that is evident sider the technical realization of a dynamic data collection. The options directly from the content or information implicitly contained and must are either a one time only O approach, where a fixed dataset is loaded be inferred, for example from the semantic. Code and expressiveness once (e.g., [23]), a batch B approach that supports an extension with together allow to classify systems by their capability to leverage both a new data (e.g., [38]) or, a streaming S approach for a continuous digital (e.g., actual content information) and also analogical codes (e.g., integration (e.g., [8]). However, most approaches only support O .
Input: Data and Information Processing and Models Visual Interface Knowledge Generation → Section 4.1 → Section 4.2 → Section 4.3 → Section 4.4 Message Analytical Goal Representation Output Properties Supported methodologies Properties Direct Outcome Data Type Representation Method Explanation Text Node-link based Numerical Confirmatory Audio Timeline-based Textual Image/Video Matrix-based Graphical Exploratory Network Other None Time-series Multiple Predictive Transition Function Code / Intent Pane Model Scope Machine Model Digital Aspects 2D Mental Model Analogical 3D None Modality VR Expressiveness Verification Approach Explicit (Factual) Meta-Data Interaction Aspects Implicit (Semantic) Network Model-View Relationship Content Factors Comm. Participants Collection Operation Method Target and relationship Confidence Single-Environment Select Trust Multi-Environment Explore Privacy Target Scope Reconfigure Encode Evaluation Self (note) Data-Mapping Abstract / Elaborate Single Filter Case Study Group Connect Comparison Processing Power Relationship Qualitative Methods and Properties Quantitative Symmetrical Manipulation Complimentary Direct Knowledge Gain Group Communic. (e.g., data objects, Environment visual objects) Inferred Knowlegde . l G ua d id up te iv Indirect es ro d Collection properties In G N Target (e.g., parameters) Individual Time Dimensionality Source Group Collection Type en ized e Refinement bl m ia t en ny tif re Nested G. es st o tu An Pa Fu Id Pr Targeted Model-View Relationship Prediciton Past Untargeted Basis Latency Present Analysis Coupling Goal Future X X X d In-Person te d ge te ge ar nt r Ta Live Model Tuning U Unknown Delayed Data Tuning Predictive Power Known Other t en re Confidentality es Scalability st tu Pa Fu Pr Time Strategy Result Explanation Private Data Ingress Transition Public Search / Analysis Active Learning Iterative Strategy Progressive Domain-spec. Asp. Static (Past Only) Dynamic (Current/Future) Continuity One-time Only Batch Stream Fig. 3. Our conceptual framework of communication analysis systems. It consists of the four main dimensions Input: Data and Information, Processing and Models, Visual Interface, and Knowledge Generation, described in more detail in Section 3 and also in Figure 1. Each category contains several properties and sub-properties, which allow for systematical analysis of such systems. Note that the graphic highlights some of the most relevant aspects for individual properties, which can be used for a simplified comparison. However, the properties themselves are multifaceted and, for a detailed analysis, should be discussed more nuanced and in more detail than indicated by these examples.
Table 2. Communication analysis system classification summarizing the different properties of the approaches. The classification criteria are formed on a subset based on the conceptual framework we developed in Section 4 and which is shown in Figure 3. The selected approaches and most categories are also available on a dedicated website at communication-analysis.dbvis.de (for details, see Section 5.2). Generic Properties: Approach supports a specific property or not #, or support is only very basic or limited G # (e.g., show as associated entry without analysis). † : Indicates approaches which are commercial systems or widely used in the industry. The following encodings have special symbols: Strategy: The approach supports a static S or a dynamic D environment. Continuity: The approach supports either a one-time only data loading O , iterative batches of data B or a data stream S . Collection: The approach supports either the analysis of single environments S or of multiple environment M simultaneously. Group Communication: The approach supports different types of group communication analysis, as encoded in a 3x3 matrix. The rows (source) and columns (target) specify individual, group, and nested groups, respectively. For example, encodes that the approach supports only the analysis of communication between individuals and, additionally, of individual to groups. Groups as source are not supported and nested groups not at all. Latency: The approach supports either the analysis of in-person communication I , the live L analysis of communication or a delayed D analysis a posteriori. Scalability: The approach supports analysis of tens (I), thousands (II), or millions (IIII) of cases for analysis. Evaluation: Case study or example , comparison with other techniques , and expert study . Time Dimensionality: The approach supports different time dimensionalities, as encoded in a 2x3 matrix. The basis of the knowledge is encoded in the rows (either past or present), and the prediction columns specify if knowledge about the past, present, or future is inferred. For example, encodes that the approach only uses past knowledge to learn about the past, e.g., to find a historic pattern. The encoding instead indicates the approach uses past and present data to predict future developments. Predictive Power: The approach’s predictive power encodes in a 2x3 matrix the type of output (explanation or transition, as rows) in relation to the time dimensionalities past, present, and future, as columns. For example, encodes a explanation about the past, while can also give explanations about the future. Instead, the encoding indicates that the approach supports both a explanation but also a transition function to explain past as well as classify present and predict future developments. Predictive Power Transition Func. Analysis Coup. Expressiveness Methodologies Data-Mapping Group Comm. Operation M. Manipulation Code / Intent Explanation Power Rel. Time Dim. Evaluation Continuity Scalability Collection Data Type Modality Strategy Strategy Latency Factors Goal Pane Active Learning Machine Model Representation Model Tuning Mental Model Confirmatory Single/Multi Data Tuning Reconfigure Time-Series Exploratory Progressive Confidence Analogical Meta-Data Numerical Reference Predictive Graphical Support? Support? Analysis Network Network Abstract Connect Iterative Explicit Implicit Content Support Explore Indirect Encode Privacy I./L./D. Textual Ingress Digital Matrix Matrix Matrix Audio Image O/B/S Direct Select Types Filter Trust Text S/D VR 2D 3D Approach Pajek† [4] # # # # # # # # S O # # G # S # D IIII II # # G # # # # # G # G # Gephi† [3] # # # # # # # # S B # # G # M # G D IIII II # G # # G # # # # # G # G # SaNDVis [63] # # # # # # # S O # # S # D IIII II # # # # # # # # # # # Ghani et al. [26] # # # # # # # S B # # G S # D II II # # # # # # # # # # # Node-link Elzen et al. [77] # G # # # # # # # S B # # G S # D II II # # # # # # # # # # NEREx [15] # # # # # # # S O # # S # D IIII II # # # # # # # # i2 Analyst’s NB† [38] # G # G # G # # # # S B M D IIII I # # # # G # G # G - Palantir Gotham† [60] # G # G # G # # # # D S M L D IIII I # # # # G # G # G - DataWalk† [13] # G # G # G # # # # S B M D IIII I # # # # # G # G # G - Nuix D./I.† [58] # G # G # G # # # # D B M L D IIII II # # # G # G # G - Reccurence Plots [1] # # # # # G # # # S O # # # S # D II II # # # # # # # # # # # # # # Matrix GestaltMatrix [7] # # # # # # # # S O # # # # S # D II I # # # # # # # # # # # # # # # # # # # # SmallMultiPiles [2] # # # # # # # # S O # # # S # D II II # # # # # # # # # # # HyperMatrix [17] # # # # # # G # S O # G S # D II I # # # G # # TextFlow [12] # # # # G # # # S O # # S # D II II # # # # # # # # # # CloudLines [43] # # # # # # # S O # # # G S # D IIII I # # # # # # # # # # # # # # DecisionFlow [27] # G # # # # # # # S O # # # G S # D II II # # # # # # # # # # OpinionFlow [79] # # # # G # G # # S O # G # S # D IIII I # # # # # # # # # # Timline Han et al. [30] # # # # # G # G # # S O # G # # G S # D IIII II # # # # # # # # # # # # Liu et al. [49] # # # # # # # S O # # S # D II I # # # # # # # # # # # # # ThreadReconst. [16] # # # # # # S O # # S # D IIII II # # # # # netflower [74] # # # # # # S O # # # G S # D II I # # # # # # # # # WeSeer [47] # # G # G # G # # # S O # G S # D IIII II # # # # # # # # MatrixExplorer [31] # # # # # # # S O # # S # D II I # # # # # # # # NodeTrix [32] # # # # # # # S O # # S # D II I # # # # # # # # Hadlak et al. [29] # # # # # # # S O # # S # D II I # # # G # # # # # # # Multi-Paradigm ConVis [35] # # # S O # S # D II II # # # # # # # # TargetVue [9] # # G # G # G # # S O # S # D II II # # # # # # # # MediaDiscourse [51] # # # # # # # S O # # S # D II I # # # # # # # # VisOHC [45] # # # # G # # S O # # S # D II I # # # # # # iForum [23] # # # G # G # # S O # S # D II II # # # # # # # # VASSL [41] # # G # G # # # # S S # # S # D II I # # # # # # # G # # CommAID [18] # # G # # G # G S O # M # G D II I # # # # Zhao et al. [82] # # # # G # G # # # S O # # S # D II I # # # # # # # Whisper [8] # # G # G # # S O # S # D IIII I # # # # # # # # # Other ConToVi [14] # # # G # # # S O # # # S # D IIII II # # # # # # # # # # Beagle [42] # # # G # # # S O # S # D IIII II # # # # # # # # # #
4.2 Processing and Models 4.3 Visual Interface This category focuses on the particularities of the processing and the Here, we describe the visual interface on an abstract level. models, mainly from a technical perspective in visual interactive data analysis, in the three subcategories, namely the analytical goal (why?), 4.3.1 Representation the model scope (what?) and the processing (how?). The central aspect of visualization systems are its visualizations. Method — We follow the established nomenclature of visualization 4.2.1 Analytical Goal techniques for the visualization method. However, we only chose One essential distinguishing criteria between approaches are the sup- the three most common ones, which account for the vast majority of ported methodologies to achieve the analytical goal. We align this approaches, comping up with the following selection: node-link-based classification on the standard definition of analytical tasks in visual ana- (e.g., [4]), timeline-based (e.g., [27]), and matrix-based (e.g., [17]). lytics [67]. This includes the category representation for a fixed analysis Alternatively, there is an option for other (e.g., [14]) techniques and task, confirmatory analysis for a directed search and exploratory analy- multiple (e.g., [23]) for multi-paradigm approaches. sis for a undirected search (supported by most approaches). Another Pane — The different visualization methods can be employed in goal in communication analysis involves predictive analysis (e.g., [79]), different visualization panes. We consider the three major ones, namely which could also be regarded as a sub-goal of all methodologies. 2D, 3D, and VR. For example, a communication network can be visual- ized as a node-link diagram in 2D or in 3D, or in a virtual environment, 4.2.2 Model Scope and each choice may influence the interaction concepts. Here, we discuss generic capabilities of the analysis models. Modality — The analysis modality refers to the employed concepts 4.3.2 Interaction and categorizes into the three common aspects: meta-data (e.g., [43]), The interaction methods are of central importance in visual analytics. network (e.g., [3]), and content (e.g., [16]), as used in many technical Operation Method — We classify the approaches based on descriptions of communication analysis. their interaction method according to the classification developed Collection — The different approaches provide various degrees by Yi et al. [80], namely Select, Explore, Reconfigure, Encode, Ab- of integration level regarding multiple collection environments. For stract/Elaborate, Filter, and Connect. In visual analytics, some are example, many systems can only consider a single dataset (e.g., a set of extremely common, while others like encode depend on the capabilities. e-mails) and therefore support only single-environment S (e.g., [26]), Manipulation — The manipulation of the elements can be either while a few also support the integration of multiple data sources, which direct, for example when interacting with data or visual objects. Alterna- makes them multi-environment M systems (e.g., [38]). tively, it can be indirect, for example when only modifying parameters. Data-Mapping — While most systems require a fixed data format, Most approaches support both manipulation methods. some also support a flexible import system which allows to map prop- erties and, in case they support multiple environments, to correlate 4.3.3 Refinement different aspects, for example e-mail sender and usernames. In addition to the interaction concept, other discerning factors are the 4.2.3 Processing particularities of the refinement. For these concepts, we also align with standard visual analytics concepts. Due to the breadth of different methods, we focus on generic aspects, Goal — Two primary goals can be differentiated: is the goal to tune namely the support for group communication analysis, and considera- an underlying model (e.g., [17]) or the data (e.g., [49])? tions for latency and scalability. Strategy — Depending on the approach, the strategy can be very Group Communication — One rather interesting aspect we dis- different: does it follow an active learning approach (e.g., [17]), or covered during the analysis of the approaches is the varying support is the process either iterative (e.g., [43]) or progressive (e.g., [14]) of group communications. Practically all approaches support an indi- instead? vidual to individual analysis. We propose to structure the approaches based on their support to analyze the communication between a source 4.4 Knowledge Generation and a target in a 3x3 matrix, e.g., individual to individual is encoded as (e.g., [9]). These entities can be either individuals, groups, or The last category is concerned with the particularities of the knowledge groups within groups. The difference between the two latter is subtle that can be generated and learned. We propose the three subcategories but involves the support of boundary restrictions on sub-groups instead output, verification approach, and knowledge gain. of overlapping group assignments. One might expect the matrix to be 4.4.1 Output symmetric, but this might not be always the case. Latency — Research [62] indicates that latency in the communica- Based on the classification of Spinner et al. [73], we propose two tion can significantly affect it. The two primary options are (nearly) distinct categories for the learned knowledge type: An explanation instantaneous communication, like in an active live L chat (e.g., [60]) can consist, for example, of numerical (e.g., [3]), textual (e.g., [74]), or delayed D communication (e.g., [45]), such as e-mail or as a doc- or graphical representations (e.g., [7]). It represents knowledge but ument. Note that the choice is not purely categorical, and from a in a factual representation that is not easily transferable and can be technical perspective, this could also be considered a numerical prop- regarded as a (final) result of the existing data. In contrast to this, erty. Differentiation into these two groups [62] is often enough for another type of result can be a transition function, which is closer to an most differences in reaction and behavior, although the latency can, of actual model, one example being the analysts mental model. Another course, play a role (e.g., answering under time pressure). In consider- type is a machine model (e.g., [79], for example, a trained classifier ation of analogical codes, however, it makes sense to include a third, that encapsulates learned knowledge and can be applied to new data. orthogonal category: In-person I communication (a subgroup of live communication) can exhibit markedly different results compared to, 4.4.2 Verification Approach for example, live chat conversations. The presentation of knowledge raises the question of its verification. Scalability — The scalability can be defined on two levels: First, Factors — Visual analytics is very well suited to address important on the data-ingress level, which defines the amount of input data the factors like confidence, trust and privacy. For example, probability system can import, analyze, and visualize initially. The second, more scores could be used to estimate the results’ confidence stemming from subtle aspect is the scalability on the search and analysis side, for automatic processes (e.g., [17]) and visualize it to the expert. Other example, during exploratory analysis. For example, how many results examples include analysis logfiles, integrity protection, traceability, and can be shown simultaneously? For both aspects, we roughly categorize verify-ability, as well as a provenance history. Lack of such certainty them into few (less than ten, I), medium (order of hundredths-thousands, measures might exclude systems from sensitive areas. While essential, II), and huge (more than 10k, IIII). many approaches do not consider this as much as would be desirable.
Evaluation — To evaluate approaches, the paper authors own eval- when network data is included, the visualization is often node-link- uation can be considered. It can be described as an in-situ analysis of based or multi-paradigm, while for time-series, it is either timeline or the approach itself. Several options are possible: Either a convincing multi-paradigm. Given the scope of the survey, the lack of audio and example or a case study (e.g., [32]). A second option would be to image is not surprising. Given that all the approaches belong to the conduct a comparison with similar, existing approaches (e.g., [17]). category of visual analytics, it is also unsurprising that virtually all A third option would be a qualitative interview (e.g., [18]). A support representative, confirmatory and exploratory analysis as their fourth option would be a quantitative user study . analysis methodology, their operation methods consist of many of the possible options, and their explanation is at least always graphical. 4.4.3 Knowledge Gain More interesting, however, are the differences and the research As a final step in the learning process, the question arises which knowl- opportunities we can conclude from their discrepancies, which we edge is actually gained and how powerful the process is. highlight in the following for each category (see Figure 1). We highlight Time Dimensionality — The time dimensionality describes the the, in our consideration, top five opportunities with an star: . relationship between data and knowledge generation. A 2x3 matrix shows the possible combinations of data basis and prediction type, each I.1 Analysis of the Meaning and Analogical Code with the entries past, present, and future. For example (e.g., [29]), Only a subset of approaches analyzes the implicit meaning of the a system can use past data and predict past data, for example for a communication (e.g., [15,23,35]). However, almost none analyze search. Another example (e.g., [47]) would be a article analysis the analogical code of the communication, which can offer im- and prediction system which has been trained on past data to analyze a portant cues and influence the communication significantly, and text, either an existing one or one on the fly in the present and future. is an important research gap. Yet another example for a future prediction would be a model that forecasts communication activity based on past events. Note that the I.2 Include Power Relations future cannot be the basis due to causality. Again, almost no approach considers the power relations between Predictive Power — A second important consideration describes the participants, which can similarly influence the communication the predictive power of the knowledge generated, which is represented semantics, meaning, and modalities. as a 2x3 matrix, where the result (explanation or transition function) and the time are combined. For example , can a system explain I.3 Dynamic Analysis (i.e. show) past events (virtually all systems)? Or is it able to While some might consider this a production problem, the devel- provide factual information for future events (e.g., [47], internal model opment of systems that support the dynamic analysis of commu- is inaccessible). A more powerful example is a model that can be nication data and batch/stream approaches also sets considerable controlled and which can explain and predict (e.g., [79]). hurdles to established analysis and visualization methods, which Domain-specific Aspects — Last, but not least, depending on dif- makes it an interesting academic research problem. ferent analysis tasks, more specific options might be of interest. Listing these here would be out of scope and space, as discussed above. How- I.4 Research the Analysis Coupling ever, we could imagine this as future work (see Section 5). There is practically no research on how a system can mitigate a possible analysis coupling. 5 D ISCUSSION AND F UTURE W ORK This section discusses the main findings and lessons learned before II.1 Multi-Environment Inclusion reflecting on the difficulties in creating a conceptual framework for Many approaches can only handle a single data source (environ- communication analysis. In particular, we discuss the potential impli- ment) and offer no data mapping (like, e.g., [38]). Instead, these cations and opportunities for future research while highlighting the still steps are outsourced to preprocessing. Their inclusion could make existing shortcomings in the current formalization. leveraging multiple data sources simultaneously less complicated. We have defined four main dimensions containing 30 different prop- erties that provide a conceptual framework to describe interactive com- II.2 Analyze Group Communication munication analysis systems which employ visual analytics principles. Only a few approaches support the analysis of group communica- We imagine several applications for this framework: Its primary tion (e.g., [14]), and almost none support nested groups. However, goal is to structure the research field and provide a common language much communication actually happens inside or between groups, for the community. Simultaneously, it can aid researchers in identifying which can involve specific particularities. research gaps, thereby supporting the transfer of theoretical concepts into practical systems. Further, it allows practitioners to chose the best fitting approach depending on their needs or checking out similar III.3 Explore Visualization Panes approaches, while enabling developers to judge the domain cover and Practically all approaches use the 2D pane for visualizations. It maturity of their systems. Lastly, it can form the foundation for a more might be worthwhile to explore 3D and VR concepts. Especially, detailed survey in the field of automated communication analysis. for example, for network analysis, VR might offer benefits. 5.1 Survey Findings and Research Opportunities IV.1 Model / Transfer Function / Knowledge Gain Few approaches contain an actual, powerful machine model to an- Visual analytics is especially suited to support semi-automatic com- alyze communication (e.g., [17, 58]). Using such models can po- munication analysis [33, 39]. The complexity, multi-modality, and tentially support the analysis, such as active learning, intelligent ambiguity of communication make it an ideal companion to interac- filtering, or confidence-based predictions. Transfer Functions tively combine domain knowledge from experts and computing power. would allow for a more universal machine learning, applying Concepts like interactive learning allows to refine models, while visu- knowledge to new problems, increasing the predictive power. alizations that support uncertainty awareness support the judgment of automatic results, fostering user trust and possibly identifying bias. Findings — To apply the framework, we have taken the 37 selected IV.2 Confidence, Trust, and Privacy approaches (see Section 3) and coded them according to our conceptual These factors are insufficiently considered the majority of ap- framework. Based on the results in Table 2, we can discover several proaches, leading to a black-box analysis. Here it would be possi- interesting aspects. For once, for the data type, while the analysis of ble to include confidence estimates (e.g., [17]), logs, provenance text data seems mostly universal across representation methods, this histories (e.g., [18]), data minimization, and other concepts. is not the case for the other data types. Somewhat unsurprisingly,
IV.3 Comparisons and Quantitative User Studies approaches and justified when required. However, given sparse taxon- While many approaches include case studies and (qualitative) omy and non-standardized vocabulary, some groupings and namings expert interviews, almost none do actual comparisons with related could arguably have been chosen differently with the same validity. To approaches or conduct quantitative user studies. advance research in this area, we decided to propose our framework as a first possible draft and one step towards a universally accepted frame- O.1 Holistic Approaches work. We, therefore, invite the research community to give feedback to Only a few approaches work towards a holistic analysis by consid- stimulate the scientific discussion and extend upon the framework in ering multiple analysis aspects together and in context, covering further iterations, which can be enhanced further by more input from all modalities. Even fewer, for example, support cross-matches. diverse research communities. As part of this process, the individual, As communication analysis is highly heterogeneous, such a per- multi-faceted aspects can also be formalized in more detail. spective can increase the analysis capability and considerably. Another aspect is the extension of the framework to non-human communication. Several aspects of the framework could similarity be applied to communication in general, for example, machine to machine. O.2 Context / Analysis Reference Window Indeed nothing in the framework is specifically tailored to a human Similar to the holistic analysis, a specific focus on the communica- communicator, as all the properties could also apply to other types tion context in reference to each other should be explored further of communication. However, human communication is often more for both inter- and intra-modality analysis. Few approaches con- nuanced than machine communication, making parts of the framework sider other modalities or external factors to explain particularities. less relevant, while other features (e.g., structuring, exchange content For example, a break in a communication sequence might appear scope) might be missing so far. as a gap, but when combined with location information (e.g., same building) might indicate that the participants might have 6 C ONCLUSION met for lunch and continued their conversation there, indicate an air travel period, etc. In summary, the correct interpretation In the last decades, communication analysis has experienced a shift of communication is extremely context depended. This context, away from manual analysis to computer-aided or even highly auto- however, is variable and many different possible contexts exist. mated approaches. However, to the same extend as automation levels It seems worthwhile to explore concepts to support such analyses increased, the analysis itself has often become more specialized, mov- better, because existing support is often very basic. ing away from an overarching exploration. This trend is in contradiction to traditional communication research, which stresses the importance of an holistic approach to capture the full meaning and context of com- 5.2 Limitations and Future Work munication. As a result, many modern digital communication analysis While our conceptual framework offers exciting research opportunities systems are highly adapted to a narrow range of tasks, often either for more powerful communication analysis systems and the application in the area of content or in network analysis. While this might be of visual analytics concepts, the area itself is under active research with perfectly sufficient and suitable for their intended use, such an isolated several limitations and unsolved challenges. analysis can sometimes lead to an less effective exploration and lead to One problem with the described taxonomy is the basis it is designed incomplete or biased results. Using separate approaches requires more upon (completeness). We are confident we included the essential re- manual work, often complicates analysis tasks, can introduce domain sults from classical communication research, technical considerations, discontinuities, and increase the struggle domain experts face when and many aspects from the surveyed approaches. However, one issue trying to integrate their domain knowledge. Further, isolated analysis is the completeness of the surveyed approaches. A significant prob- may not be sufficient to capture the full available information and can lem in this research area is that relevant approaches are rarely labeled make the automatic as well as the manual detection of cross-matches as belonging to communication analysis and are not easily definable more difficult. The development of more holistic and advanced ap- by a set of keywords. We initially thought about compiling a list of proaches for automated communication analysis systems is thereby domain-specific keywords to select papers by, for example, social net- hindered by the lack of a clear framework and the absence of a common work analysis, sentiment analysis, e-mail analysis, etc. However, we language which combines both technical aspects as well as results from found it highly likely that such a selection would be highly biased by traditional communication research. our knowledge of relevant topics, which is why we decided to go for a We address this challenge by developing and formalizing a design more extensive drag-net search to include previously unknown areas. space for digital communication analysis systems based on the existing However, while we took great care when defining the seed search and tool landscape and communication research, while making the case how then selecting the approaches, it is inadvertently likely we missed indi- visual analytics principles can be employed for a more holistic approach. vidual ones, not least by restricting the target journals and conferences. By systematically discussing and structuring the different analysis areas Also, it could happen that a few approaches fell through our automatic and aspects of the design space we arrive at an conceptual framework to or manual search pattern (see Section 3), while still being applicable provide an overview and assess the maturity of communication analysis to communication analysis. Some missing approaches may contain systems. As part of an initial survey, we have also categorized a large novel aspects which might widen the framework and introduce new set of existing approaches using our framework. elements or pose problems for existing ones. However, as this paper By bridging the gap in the formalization of digital communication is not a complete survey and the collection of approaches is primarily analysis systems by describing a design space for communication anal- used as one of three parts to design a helpful formalization scheme, the ysis, we aim to provide researchers with a common language, provide collection only has to cover most techniques and different methods and guidelines for building and assessing the maturity of such approaches, still be beneficial for many approaches. Nevertheless, to address the as well as point out gaps in the literature which offer exciting research issue of missing approaches, we created an accompanying survey-like opportunities. The results of this work are widely applicable in a va- website available at communication-analysis.dbvis.de, which lists the riety of domains which are concerned with communication analysis approaches we considered and also allows readers to submit methods like civil security, the digital humanities, or business intelligence, both they consider relevant and that have not yet been included. from a theoretical point of view as well as for the development of more While the previous limitations were concerned with the selected powerful communication analysis systems. approaches, another possible limitation concerns our framework it- self. It may contain overlaps, and the basis it provides is not entirely orthogonal. Given the complexity of the topic, we think it is chal- lenging to create a wholly consistent yet easily usable taxonomy, and there is a need to balance the different trade-offs. The choices we made for selecting the categories are based on existing research and
You can also read