DEPARTMENT OF INFORMATICS
TECHNISCHE UNIVERSITÄT MÜNCHEN

Master's Thesis in Informatics

IT-Assisted Provision of Product Data to Online Retailers in the Home & Living Sector
IT-Unterstützte Bereitstellung von Produktdaten an Onlineshops im Home- & Living-Bereich

Author: Philipp Schlieker
Supervisor: Prof. Dr. Florian Matthes
Advisor: Tim Schopf
Submission Date: 15.08.2021
I confirm that this master's thesis in informatics is my own work and I have documented all sources and material used.

Munich, 15.08.2021

Philipp Schlieker
Acknowledgments

I want to thank all the people who helped me write this thesis. First, I would like to thank everyone who took time out of their busy schedules to help me conduct my interviews. This was truly helpful and, beyond the thesis itself, allowed me to learn a lot. Next, I would like to thank my advisor Tim for all his input, ideas, and feedback. Further, I would like to thank Prof. Matthes for his precise and pointed questions. I am also grateful for the support and understanding of my co-founder Daniel during the last months. Last but not least, I would like to show my appreciation for the never-ending help of my girlfriend Anika.
Abstract

The proliferation of e-commerce in the Home & Living industry has increased the importance of product data, such as information about size, color, and material. In most cases, online retailers require their suppliers to provide this information about their products. This is mainly done by entering the information into Excel templates provided by the online retailers, which define the syntactic and semantic structure. Due to a lack of systems to support the suppliers and the differences among these templates, this process is largely manual. This thesis first establishes a clear picture of this process in practice by conducting interviews and analyzing the data structures of manufacturers and online retailers. The limited data quality of manufacturers, as well as large syntactic and semantic differences among the online retailers' templates, pose challenges to automated exchange. Based on these results, different IT-based approaches to assisting with the provision of product data in the Home & Living industry are explored. A best-practice approach that leverages a common ontology and separates concerns is presented and evaluated as a proof of concept. Further interviews confirm the proposed system.
Kurzfassung

Die Verbreitung des E-Commerce im Home & Living-Bereich hat die Bedeutung von Produktdaten wie Informationen zu Größe, Farbe und Material gesteigert. In den meisten Fällen verlangen Online-Händler von ihren Lieferanten, dass diese ihnen die Informationen zu ihren Produkten zur Verfügung stellen. Dies geschieht hauptsächlich durch das Einfüllen der Informationen in Excel-Vorlagen der Online-Händler, welche die syntaktische und semantische Struktur definieren. Aufgrund fehlender Systeme zur Unterstützung der Lieferanten und der Unterschiede zwischen diesen Vorlagen wird dieser Prozess größtenteils manuell durchgeführt. In dieser Arbeit wird der Prozess zunächst anhand von Interviews und der Analyse der Datenstruktur von Herstellern und Online-Händlern dargestellt. Die eingeschränkte Datenqualität der Hersteller sowie große syntaktische und semantische Unterschiede zwischen den Vorlagen der Online-Händler stellen den automatisierten Austausch vor Herausforderungen. Basierend auf diesen Ergebnissen werden verschiedene IT-basierte Ansätze für die IT-unterstützte Bereitstellung von Produktdaten an Onlineshops im Home- & Living-Bereich untersucht. Ein Best-Practice-Ansatz, der eine gemeinsame Ontologie nutzt und Zuständigkeiten trennt, wird als Proof of Concept vorgestellt und bewertet. Durchgeführte Interviews bestätigen das vorgeschlagene System.
Contents

Acknowledgments
Abstract
Kurzfassung
1. Introduction
2. Foundations
   2.1. Research Methodology
   2.2. Research Questions
3. Related Work
   3.1. Data Exchange in the Home & Living Sector
   3.2. Industrial Information Integration
      3.2.1. Metamodel-Based Information Integration
      3.2.2. Ontology & Schema Matching
      3.2.3. Intra-Organizational Information Integration
      3.2.4. Inter-Organizational Information Integration
   3.3. Product Catalog Integration
      3.3.1. Layered Integration
      3.3.2. Syntactic Integration using XML
      3.3.3. Semantic Integration using Ontologies
      3.3.4. Integration using Mediators
      3.3.5. Information Extraction
4. State of the Art
5. IT-Assisted Provision of Product Data
   5.1. Problem Definition
      5.1.1. Source Formats of Manufacturers
      5.1.2. Target Formats of Online Shops
      5.1.3. Analysis of Transformations between Formats
   5.2. Approaches
      5.2.1. Theoretical Evaluation Schemata
      5.2.2. Possible Approaches
      5.2.3. Theoretical Evaluation
   5.3. Design Principles
      5.3.1. Metamodel-Based Integration
      5.3.2. Separation of Concerns
      5.3.3. Mediator
      5.3.4. Ontology
   5.4. Architecture
      5.4.1. Syntax Layer
      5.4.2. Normalization Layer
      5.4.3. Data Model Layer
      5.4.4. Ontology Layer
      5.4.5. Enrichment Layer
   5.5. Design of Common Ontology
      5.5.1. Methodology
      5.5.2. Purpose and Scope
      5.5.3. Building of Ontology
   5.6. Implementation
   5.7. Evaluation
      5.7.1. Quantitative Evaluation
      5.7.2. Qualitative Evaluation
6. Discussion
7. Conclusion
A. General Addenda
   A.1. Interview Guide
   A.2. Interview Summaries
      A.2.1. Interview 1
      A.2.2. Interview 2
      A.2.3. Interview 3
      A.2.4. Interview 4
      A.2.5. Interview 5
      A.2.6. Interview 6
List of Figures
List of Tables
Acronyms
Bibliography
1. Introduction

E-commerce is currently one of the fastest-growing sales channels [1]. In recent years this has also applied to the Home & Living industry [2], with 34% of customers in Germany expressing a preference for online over offline shopping in a 2017 survey [3]. Interviews have shown that product data, such as descriptions of size, material, and color, plays a key role in customers' purchasing decisions [4]. The present work addresses the exchange of this data between manufacturers in the Home & Living industry and their retail partners.

The exchange of product data in the Home & Living industry brings with it the data exchange challenges typical of B2B transactions. In the 2000s, the advent of XML documents that follow a predefined Document Type Definition (DTD) resolved the first interoperability challenges in many industries on a purely syntactic level. Subsequently, many different standards for these XML documents emerged, leaving the challenge of integrating them, especially on a semantic level [5]. The usage of ontologies as a shared definition of the vocabulary used, or as data sources including semantics, has been proposed as a solution [6]. Large initiatives have tried to introduce common standards and ontologies with varying success. One reason for this is that different stakeholders often do not agree on the proper structure of product data and have differing requirements. The lack of standards in many cases leaves the challenge of product data integration [7]. This is also the case in the Home & Living industry.

The online share within the Home & Living industry has increased, and with it the importance of product data. Interviews with suppliers of online retailers in the Home & Living industry have shown that the requirements towards product data have drastically increased. Whereas before only common information, such as packaging information, was required, this has extended to very granular information, e.g., on the material. The provision of such product data is currently a mainly manual process that consumes large amounts of resources and is error-prone. The importance of, as well as the challenges behind, the process of product data exchange in other industries has long been a topic of research [8][5][9][7]. Therefore, the present work explores approaches towards the IT-assisted provision of product data to online retailers in the Home & Living industry. The objective of the present work is to provide a clear picture of the current state of the art, an analysis of possible approaches, and a best-practice approach to guide future development.

The thesis uses the following approach. First, the current state of the art is analyzed. This is done through semi-structured expert interviews to ensure practical relevance. Throughout the interviews, it became clear that no standard exists within the industry. This leaves the challenge of product data exchange and integration. The most common approach is the exchange of Excel files between manufacturers and online retailers. In most cases, the online retailer will provide a template with prefilled values indicating the required structure. The manufacturer will then fill in the information of the products to be sold (a minimal sketch of this template-filling task follows at the end of this chapter). Next, in order to get a deeper understanding of the requirements for product data integration, the product data structures of manufacturers and online retailers are analyzed. Based on the challenges found, best practices are identified within the literature. These are used to answer the question of which IT-based approaches could support users with the provision of product data. The approaches are then compared, and the most promising one is implemented and evaluated as a Proof of Concept (PoC). The comparison is done through experiments and semi-structured expert interviews. Last but not least, an outlook on future work is given.
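To make the template-filling task concrete, the following is a minimal sketch of a naive automation in Python; the column names, file names, and the one-to-one mapping are hypothetical and only illustrate the task, not a system proposed in this thesis.

    import pandas as pd

    # Hypothetical one-to-one mapping from a manufacturer's column names to
    # one retailer's template columns; in practice every retailer template
    # differs syntactically and semantically, which is what makes the manual
    # process so costly.
    COLUMN_MAPPING = {
        "ArtNo": "SKU",
        "ProductName": "Title",
        "WidthCM": "Width (cm)",
        "Material": "Main Material",
    }

    def fill_template(manufacturer_file: str, template_file: str, out_file: str) -> None:
        source = pd.read_excel(manufacturer_file)
        # Read the retailer template only to obtain its required columns.
        template_columns = pd.read_excel(template_file).columns
        mapped = {dst: source[src] for src, dst in COLUMN_MAPPING.items()
                  if src in source.columns}
        # reindex() keeps the template's column order; unmapped columns stay empty.
        pd.DataFrame(mapped).reindex(columns=template_columns).to_excel(out_file, index=False)

    # Hypothetical file names:
    fill_template("manufacturer.xlsx", "retailer_template.xlsx", "filled.xlsx")

Even this toy version shows where the real effort lies: the mapping dictionary has to be rediscovered and maintained for every retailer template.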
2. Foundations

2.1. Research Methodology

The research methodology applied in this work is based on the Design Science Research (DSR) framework introduced by Hevner, March, Park, and Ram [10]. DSR aims to unify two components that are deemed fundamental to the Information System (IS) discipline. On the one side stands behavioral science, which aims to develop and verify theories connected to human or organizational behavior. On the other side stands design science, which seeks to innovate by creating new artifacts and thereby extending human possibilities. DSR combines these two sides into a framework for understanding, executing, and evaluating research. Figure 2.1 shows the resulting overall framework.

Figure 2.1.: Overview of the Information Systems Research Framework by [10]

The overall framework contains three main components: the environment, IS research, and the knowledge base. The environment describes the problem space which is addressed. This mainly includes the people, organizations, and their existing or planned technologies. Together these describe the goals, tasks, problems, and opportunities the people perceive within the organization. This defines the business needs, also called the problem. By addressing
business needs, relevance is ensured. These business needs are then addressed within IS research. IS research consists of a cycle between the development and building of artifacts or theories and their justification and evaluation. The knowledge base provides the raw materials used within IS research and encompasses the foundations and methodologies to be applied. The results are added back to the knowledge base for further research and practice. The knowledge base thereby ensures rigor within DSR.

These three components structure the thesis. The environment is analyzed based on semi-structured interviews, which were conducted with experts in the field. The interviews were conducted after the implementation of the PoC in order to include feedback on it. The interview guide is included in addendum A.1. The interviews were summarized according to the interview guide and can be found in addendum A.2. The identified business needs are discussed in chapter 4 on the State of the Art. The knowledge base is analyzed in chapter 3 on Related Work. By analyzing prior work, common solution approaches as well as best practices are identified. These are applied within chapter 5 on the IT-Assisted Provision of Product Data. This chapter is dedicated to IS research and extends the prior knowledge.

The product data structures of 15 different manufacturers with product assortments ranging from accessories, lighting, kitchen supplies, and bedding to small furniture are analyzed. Based on the classification of Implisense (https://blog.implisense.com/neue-einstufung-fuer-unternehmensgroessen-im-implisense-datenbestand/), they are micro- to medium-sized companies with fewer than 249 employees and less than €50 million in annual revenue. The selection was restricted to manufacturers that sell to large online retailers and can provide their product data in one file. The target formats of eight online shops carrying products of the Home & Living sector are examined. Four of them are among the top ten revenue leaders of online shops in the Home & Living industry [11]. The requirements for transforming the data between source and target formats are evaluated by experiment.

Various approaches to the automation of the transformation process were developed within the loop of developing, building, justifying, and evaluating. These are analyzed from a theoretical point of view based on informed argument. The most promising approach was implemented as a PoC. The result is the artifact in the form of a model as well as the implementation of a PoC system to solve the challenge at hand. The outcome was evaluated, on the one side, through experimentation concerning the required manual work and, on the other side, through informed argument by presenting the results to experts during the semi-structured interviews.

2.2. Research Questions

The research questions answered within this work are the following:

RQ1: How do manufacturers in the Home & Living sector provide product data to online retailers?

In order to ensure the relevance of the conducted research, the problem space is evaluated. This is done by providing a thorough analysis of the status quo of how product data is currently provided to online retailers in the Home & Living industry. On the one side,
semi-structured interviews are conducted. On the other side, product data is transformed by experimentation to understand the problem better.

RQ2: What are IT-based approaches for assisting manufacturers with the provision of product data to online retailers?

In general, a variety of different solution approaches towards the provision of product data are possible. By analyzing common approaches within the literature and in other industries, a set of possible approaches is developed.

RQ3: What approaches for the assisted provision of product data provide the greatest benefit for the user?

These different approaches are then compared with regard to their benefit for the user. The user benefit is defined as the expected Return on Investment (ROI). In the first step, this is done from a theoretical point of view. In the second step, the most promising approach is developed as a PoC and evaluated. The evaluation is based on experimentation with sample data as well as semi-structured expert interviews.
3. Related Work

3.1. Data Exchange in the Home & Living Sector

Over the years, approaches towards product data exchange, specifically in the furniture industry, have been proposed. One of them is the FunStep initiative (http://www.funstep.org/), which together with its partners strives to facilitate and support interoperability within the worldwide furniture industry by developing and implementing e-business activities. This especially keeps in mind the requirement for information exchange along the supply chain with different external business partners [12]. The main motivations behind the initiative and its creation can be found in [13]. It is worth highlighting that Nobilia, a large kitchen manufacturer, is among the six mentioned initial members. Hence, planning-intensive products, in this case especially kitchens, were the original focus. This means that data such as planning and order management data plays an important role. In order to support the overall goal of interoperability, the ISO norm 10303-236 (https://www.iso.org/standard/42340.html) under the title “Industrial automation systems and integration — Product data representation and exchange — Part 236: Application protocol: Furniture catalog and interior design“ was introduced. In addition, an ontology was proposed which, among others, covers different pieces of furniture as well as services, detailed logistics, and manufacturing processes and techniques in the furniture industry [14]. Nevertheless, since its publication in 2006, neither the ISO norm nor the ontology has seen widespread adoption in the industry or in scientific publications.

The ISO norm 10303-236 was applied within a large Brazilian furniture company. The learnings are discussed in [15], which highlights the process of transforming industry- and company-specific knowledge into the ontology for seven different product pieces. The authors showcase the steps of interviewing relevant stakeholders and integrating this information into the ontology using Protégé. As challenges, they identify vague definitions within the norm from a technical and user point of view. More specifically, they note that the norm was not always very clear to the members of the furniture industry. Last but not least, they emphasize the flexibility of the standard and point towards the risk that this flexibility will lead to ongoing challenges in data exchange, hence not resolving its difficulties. These conclusions drawn by [15] are, however, limited by the fact that the work focuses on the adoption only within the company and does not include any learnings from using the norm for data exchange.

[16] study the information resources in the furniture industry as part of the Business Innovation and Virtual Enterprise Environment (BIVEE) project in Spain. The BIVEE project strives
to promote innovations and production improvements in Small and Medium Enterprises (SMEs). In order to achieve this goal, the authors analyze various SMEs concerning their needs and challenges with respect to information resources. They thereby include the requirements of AIDIMA (Technology Institute of Furniture, Wood, and Packaging), which was also part of the previously mentioned FunStep initiative, as the end user. For the SMEs they worked with, they highlight the successful implementation of Enterprise Resource Planning (ERP) systems. However, they point to challenges in production planning. Regarding the previously mentioned FunStep ontology, they note that, from their point of view, the ontology lacks references to production technologies. Last but not least, the work addresses challenges with regard to change management when introducing new systems.

As mentioned when looking at Nobilia, manufacturers of planning-intensive products, such as kitchens, are faced with a variety of challenges based on the large number of possible configurations. [17] develop an ontology for Verso Design Furniture Inc., a furniture customization company, to address the challenge of deciding whether a particular furniture combination is possible or not. Even though they were not able to prove that the ontology will not allow furniture configurations that are not physically possible, their results seem promising considering that all existing joint combinations could be successfully represented. They point towards the opportunities presented by mass customization that can be enabled by ontologies.

As another planning-intensive furniture segment, parts of the German office furniture industry have adopted a standard called OFML, which is driven by the Industrieverband Büro und Arbeitswelt (industry association for office and working environments) [18]. The adoption of OFML has simplified the exchange of data relevant for the planning of larger offices, including 3D data and aspects related to order management. [19] highlight the chances of a highly integrated production environment from a manufacturer's point of view. Nevertheless, based on our interviews, the adoption remains limited to the office furniture segment and is not fully adapted to the needs of online retailers with regard to product information.

[20] analyzes the status of integrations between businesses within the German furniture industry by conducting interviews. He notes that a wide variety of integrations can be found even within the German furniture industry and selects the upholstery and kitchen segments for further analysis. Within the two segments, he mentions that the kitchen industry has widely adopted the IDM-Kitchen standard. In contrast, such a standard has not been adopted in the upholstery segment, even though a standard, IDM-Upholstery, is available. [20] therefore compares the influences on the German upholstery and kitchen industries with regard to the establishment of infrastructures facilitating data exchange.

A few common factors can be identified from the limited number of publications on data exchange in the furniture industry. The complexity involved in planning-intensive products drives the need for standardization. The kitchen industry has therefore seen interest in this regard from research as well as industry [20][13]. The office furniture industry has widely adopted a standard [18][19].
Other segments, such as upholstery as well as other customizable furniture, have seen efforts in this direction with varying success [12][13][17][15]. The success of these approaches is highly dependent on the segment. In addition, these efforts are all restricted to specific planning-intensive furniture segments. Further, the adoption of
some of these standards is rather geographically limited, e.g., to Germany or Spain.

3.2. Industrial Information Integration

The exchange of data between different systems and organizations can be seen in the general context of Industrial Information Integration. [21] defines the engineering of industrial information integration as a “complex giant system that can advance and integrate the concepts, theory, and methods in each relevant discipline and open up a new discipline for industry information integration purposes“. Following this definition, Industrial Information Integration Engineering is the set of concepts and techniques that enable the integration process between different systems, especially with regard to information integration [22]. The general discipline can be structured along the addressed discipline, e.g., engineering, management, social science, and the application engineering field, e.g., chemical engineering, civil engineering, and material engineering. Taking these structures into account, [22] provides a thorough literature analysis of the discipline. Looking at the different approaches in different industries, the wide variety of challenges becomes evident. The following sections present common best practices within different areas. First, general approaches based on metamodels and ontologies are presented. Then information integration within organizations is discussed, followed by a discussion of approaches between organizations.

3.2.1. Metamodel-Based Information Integration

Generally speaking, integration problems can be described using metamodels. A metamodel is a model of models. Thus, a metamodel defines which models are valid within the space of a certain modeling language. One of the most popular metamodels in software engineering is the Unified Modeling Language (UML), originally defined by the Object Management Group (OMG). Its architecture encompasses four different layers, with each layer being the type model of the layer below [23]. Figure 3.1 shows this hierarchy.

Figure 3.1.: The four-level metamodel hierarchy defined by the Object Management Group [23]

As in the case of [24], this hierarchy can be used to clarify the different abstraction levels of a model, in their case a product ontology. Metamodels have also been used in information integration. [25] showcase a metamodel-based approach towards information integration at industrial scale. The approach is demonstrated by an example from the oil and gas industry. Their example transforms engineering assets, such as a fuel pump, between different standards, such as ISO and MIMOSA. They motivate their work by noting that the main challenge in information integration is created by the constant change of information systems and their models on the one side, and the constant change of information requirements of applications and users on the other side. They explain that current approaches do not have enough flexibility to accommodate this constant change. Hence, they propose to address the integration at a higher level of abstraction through metamodel-based information integration. This means that the mappings between models become more flexible and reusable through mapping templates. The integration decisions are then made for small generic fragments of the models, e.g., a single conceptual entity such as a fuel pump in their case. Their approach contains three different levels: the metamodel level, the model level, and the instance level. At the metamodel level, the formats to be integrated are defined as entities, relationships, and mapping operators. At the model level, mapping templates are specified that define how one part of the source metamodel is to be represented in the target metamodel. These are then instantiated at the instance level by the application user, who applies the mapping templates to his source model. Figure 3.2 shows this conceptual view.

Figure 3.2.: Conceptual view of metamodel-based integration approaches [25]

Coming back to the presented example from the oil and gas sector, the end user can automatically create the ISO-compliant representation of a fuel pump from the MIMOSA model, hence automatically transforming the representation in one norm into the semantically equivalent representation in the other norm. [25] explain that, among other advantages, this approach greatly decreases complexity by separating different integration tasks into different layers and making these smaller integration decisions reusable.
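As a rough illustration of this three-level idea, the following sketch defines a reusable mapping template at the model level and applies it at the instance level; the entity and attribute names are invented for illustration and are not taken from [25].

    from dataclasses import dataclass

    # Model level: a reusable mapping template describing how one small
    # fragment of the source model (here: a "Pump" entity in a MIMOSA-like
    # model) is represented in the target model (an ISO-like model).
    @dataclass
    class MappingTemplate:
        source_entity: str
        target_entity: str
        attribute_map: dict  # source attribute -> target attribute

    PUMP_TEMPLATE = MappingTemplate(
        source_entity="mimosa:Pump",
        target_entity="iso:FuelPump",
        attribute_map={"serialNo": "SerialNumber", "maxFlow": "MaximumFlowRate"},
    )

    # Instance level: applying the template to a concrete asset record.
    def apply_template(template: MappingTemplate, instance: dict) -> dict:
        target = {"entity": template.target_entity}
        for src_attr, dst_attr in template.attribute_map.items():
            if src_attr in instance:
                target[dst_attr] = instance[src_attr]
        return target

    pump = {"entity": "mimosa:Pump", "serialNo": "P-4711", "maxFlow": 120}
    print(apply_template(PUMP_TEMPLATE, pump))
    # {'entity': 'iso:FuelPump', 'SerialNumber': 'P-4711', 'MaximumFlowRate': 120}

The argument of [25] carries over even to this toy version: the template is defined once, can be reused for every pump instance, and further templates can be added without touching existing ones.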
[26] present a metamodel for ontology mappings based on set and relation theory. They first motivate their work by explaining that many different ontologies exist which cover overlapping concepts. This creates the challenge of exchanging information between them, reusing adjacent parts of other ontologies, or synchronizing changes. For this integration, mappings between common concepts within the ontologies are needed. For the management of these mappings, they present a metamodel for ontology mappings. They define each single mapping between two sets of concepts of two different ontologies as a mapping. Further, they denote the set of mappings between some ontology models as a mapping model. These mapping models contain common elements and associations. The metamodel introduces the common structure of these mapping models. As components of this metamodel, they define the different elements of an ontology (e.g., OntologyElement, OESet, OESetGroup) and the different elements required for the mappings (e.g., Mapping, MappingClassification, MappingDefinitionRule). Since this is based on set and relation theory, further properties can be used, e.g., for the synchronization of ontologies and the automatic generation of mappings among them.

[27] define a generic metamodel for schema merging. Schema merging is the task of combining several heterogeneous schemas into one unified schema. For this process, [27] give a formal definition of the resulting schema together with an algorithm to implement it. Similar to [26], they explain that the mapping between the elements of two schemas to be unified is not a simple set of one-to-one correspondences and thus represents a mapping model. Their approach is based on GeRoMe, a generic metamodel that, in contrast to other metamodels, includes semantic information to resolve conflicts in mappings and can be used with different metamodels, e.g., XML schemas. Based on GeRoMe, [27] give formal definitions of models, mappings, and the merging operator.

3.2.2. Ontology & Schema Matching

Ontologies provide a common vocabulary for a certain domain of interest. Depending on the specific definition, this encompasses several data and conceptual models, including terms, classifications, and database schemas [28]. Schemas are a formal definition of the structure of an artifact, such as a SQL schema, XML schema, interface definition, or ontology description [29]. Since both ontologies and schemata share the similarity of providing a vocabulary of terms, matching both is often done with similar solutions. Therefore, solutions from both areas are discussed within this chapter [28]. Schema and ontology matching can be defined as the problem of finding correspondences between elements of different schemas. Correspondences are relationships between elements, e.g., representing the same notion or information [29]. [30] define a classification of these approaches, incorporating previous classifications such as [31]. Figure 3.3 displays the overview.

Figure 3.3.: Classification of ontology & schema matching approaches [30]

From the top-down view, the perspective of granularity / input interpretation, the classification distinguishes the following elements:
• Element-level vs. structure-level: Element-level matching techniques only take an element in isolation into account when calculating correspondences. In contrast, structure-level approaches consider the relations of elements to each other when calculating correspondences.

• Semantic vs. syntactic: Syntactic approaches follow clearly stated algorithms which analyze the input based on its structure alone. Semantic approaches use some formal semantics, such as model-theoretic semantics, to analyze the input and justify the results. Exact semantic algorithms are complete with respect to the semantics.

Reading the classification bottom-up, from the perspective of origin / kind of input, the classification provides the following categories:

• Context-based: Context-based approaches do not restrict the information to a single ontology or schema, but rather consider information coming from external resources, such as other ontologies or a thesaurus describing the terms of the ontology. These external resources are referred to as context.

  – Semantic: As previously seen, semantic approaches follow formal semantics for matching.

  – Syntactic: Syntactic approaches could in this case be further differentiated into terminological, structural, and extensional, as for content-based approaches. Due to their limited application in practice, they are grouped together under syntactic approaches.

• Content-based: Content-based approaches limit the information taken into account to the content of a single ontology or schema.

  – Terminological: Terminological approaches consider their input as strings.

  – Structural: Structural approaches look at the structure of elements (classes, individuals, relations) within the ontology.

  – Extensional: Extensional approaches use data instances to find correspondences.

  – Semantic: Semantic approaches work based on a semantic interpretation of the input, usually using a reasoner.

Further information on the concrete classes can be found in [30]. A more detailed literature review of current advances in the area is provided by [32]. The visualization of mappings, especially when schemas and mappings are larger, is challenging. Therefore, approaches towards the visualization of mappings have been proposed [33].
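As a minimal illustration of an element-level, terminological technique, the following sketch scores string similarity between attribute names of two schemas; the attribute names are invented, and real matchers combine many more signals than plain string similarity.

    from difflib import SequenceMatcher
    from itertools import product

    source_attrs = ["ArticleNumber", "Colour", "WidthCm"]   # hypothetical schema A
    target_attrs = ["ArtNo", "Color", "Width (cm)"]         # hypothetical schema B

    def normalize(name: str) -> str:
        # Crude normalization: lowercase and drop non-alphanumeric characters.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    # Element-level matching: each pair is scored in isolation, without
    # considering the surrounding structure of the schemas.
    for src, dst in product(source_attrs, target_attrs):
        score = SequenceMatcher(None, normalize(src), normalize(dst)).ratio()
        if score > 0.7:
            print(f"{src} <-> {dst} (similarity {score:.2f})")

Run as-is, this finds "Colour <-> Color" and "WidthCm <-> Width (cm)" but misses "ArticleNumber <-> ArtNo", which is exactly the kind of correspondence that structure-level or context-based techniques are meant to recover.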
3.2.3. Intra-Organizational Information Integration

The engineering field in particular has long dealt with the challenge of information integration, especially concerning different data representations. [34] presents an ontology-based approach towards data integration within the design process of chemical plants. The requirement for an ontology-based approach arises because different phases of the design process and different disciplines require different viewpoints. Making use of reasoning within the ontology, they can satisfy these information demands and provide compliance checks. In the same direction, [35] propose the use of ontologies for information integration in industrial environments with multiple applications and data sources. They point out that the technical integration has been widely solved. Nevertheless, combining the semantics of the sources remains a challenge. In constantly changing environments, point-to-point integrations are expensive. As a solution, they propose to introduce an integration ontology. For this, they describe each data source as an ontology model and map these together to generate a uniform integration ontology.

[36] and [37] address the challenge of different information requirements within one organization during product development. This challenge is driven by the fact that the design of complex systems requires interactions between experts from different areas, such as Computer-Aided Design (CAD), Engineering (CAE), and Manufacturing (CAM). The challenge they address is maintaining consistency of related information across the different design systems. [36] therefore apply a Model-Driven Engineering approach from software design and propose a metamodel to integrate the information from each domain into a shared reference. This is done by first creating a mapping between the generic concepts and the specific knowledge model. Then, based on the mapping result, the generic model is instantiated.

[38], [39], [40], and [41] address the challenge of integrating the different data sources for product data at a somewhat different level through the use of product data management systems. Such systems integrate and manage all information related to a product across different life cycle stages such as design, manufacturing, and end-user support. Hence, they integrate different areas to ensure that the correct information is available in the proper form for the end user [38]. [38] provides a review of web-based product data management systems. [39] propose a distributed, open, and intelligent product data management system. By supporting standards such as the Standard for the Exchange of Product model data (STEP), they achieve the suggested openness. [40] propose an ontology-based Product Data Management (PDM) system. More recently, [41] present a holistic view of this topic for the practitioner.

3.2.4. Inter-Organizational Information Integration

Moving from the intra-organizational perspective to an inter-organizational perspective, the role of standards increases. Standards, as support for joint agreements, enable communication throughout different systems for a variety of user requirements in order to improve economic efficiency [42]. [42] show the use of different standards for engineering assets, noting that ISO and MIMOSA are the leading bodies for defining such standards. They provide a comprehensive review of the standards for the integration of engineering assets. Again for the oil and gas sector, [25] highlight the challenges involved in the use of different standards. The large number of different standards creates the challenge of integrating them. Therefore, they showcase a metamodel-based approach towards integrating different ISO and MIMOSA norms. Other approaches towards the transformation of ISO norms using ontologies can be found in [43].

One area of inter-organizational communication that has seen significant interest is e-procurement or e-business, which refers to the use of electronic communications for the
business processes between sellers and buyers. E-procurement integrates inter-organizational business processes and systems to automate the requisition and approval of purchase orders and the connected accounting processes using Internet-based protocols. Thereby, it can improve the efficiency not only of single purchases but also of the overall administration and functioning of markets. Hence, it is seen as a strategic tool to improve the competitiveness of organizations as well as to generate scale economies for both sellers and buyers [44]. Apart from the legal framework, resolving technical issues for the proper integration of heterogeneous environments is a key success factor. [44] review the current state of the art concerning the integration and list controlled vocabularies, ontologies, frameworks, as well as e-procurement platforms as the commonly proposed solutions. They consider Electronic Data Interchange (EDI), company websites, B2B hubs, e-procurement systems, and web services as relevant architectures. [45] compare the different standards and platforms used for order management. They conclude that there is currently a lack of common standards. [46] point towards EDI as being used in the domain for supporting the inter-organizational information exchange with a focus on order management. However, as [46] point out, different systems still require complex message transformations in order to become compatible with each other. A range of publications have suggested solutions. [46] propose a visual mapping system to handle different EDI- and XML-based applications. [47] suggest creating an ontology for EDI to support easy integration. Among others, [48] and [44] discuss the creation of B2B marketplaces or B2B hubs in order to facilitate the data exchange.

Taking a look at this wide range of challenges, the general context of product data exchange within the domain of Industrial Information Integration becomes clear. First and foremost, the challenge of integrating different data sources needs to be resolved on an intra-organizational level. Especially in the field of engineering, ontologies and central representations have been proposed for this. Concerning product data, PDM or Product Information Management (PIM) systems are the standard solution for integrating information about products in a central place. In order to create a shared understanding for data exchange at an inter-organizational level, standards are a common solution. Nevertheless, because of the number of different standards, integrations are still often needed. In e-procurement, EDI has long been a standard. Nonetheless, EDI has not entirely resolved the challenge of interoperability in this area.

3.3. Product Catalog Integration

As seen in the previous section, data integration across different organizations remains a challenge. As [7] note, this applies especially to product data and product descriptions due to the autonomy of vendors in describing their products. Similar to the engineering domain, standards for product data and catalog exchange have evolved. [49] discuss the design of the catalog exchange process and review four different XML-based standards for this. This is done by considering the whole process of product catalog exchange and defining the requirements of different stakeholders. They conclude that none of the four selected standards, BMEcat, cXML, OAGIS, and xCBL, satisfy the identified requirements, especially regarding requirements from e-markets and content hubs.
They point out that none of the standards includes further semantic checks or a feedback mechanism for potential errors during import. [50] provide an even more granular analysis of the functionality of each data format. In addition to the XML-based standards, [48] mention two non-XML catalog formats: EDIFACT, a format approved by the United Nations Economic Commission for Europe, and ISO 10303-41, known as part of the STEP family. Due to the complexity of EDIFACT, the United Nations Centre for Trade Facilitation and Electronic Business has already published an XML format for EDIFACT. The same applies to STEP.

Apart from pure product information, the categorization of products and services is often of interest, e.g., for accounting purposes. [49] note that none of the reviewed standards contains these. [51] present an analysis of the different categorization standards eCl@ss, UNSPSC, eOTD, and RosettaNet, noting that they differ in structural properties as well as content. [52] describe the shortcomings of the UNSPSC from a practitioner's point of view.

The lack of standards and the challenges remaining when standards are used underline the point made by [7] and [53]. They argue that the current degree of acceptance and the multiplicity of standards hinder the progress of standardization. Further, [54] explain that standards are slow to adapt to changes and emergent requirements. As seen in section 3.1, this also applies to the Home & Living industry. [7] explain that the alternative to simplifying the integration using standards is product schema integration, also referred to as product data mapping. Product schema integration can be defined as the process of building mappings between different product attributes from different product descriptions [7]. Using these mappings, product data from different sources can be integrated and unified.

3.3.1. Layered Integration

The challenges within product schema integration are mainly twofold: syntactic and semantic integration [49]. [55] suggest separating concerns into different layers in order to reduce complexity in data integration on the Semantic Web. Their concept includes three layers: a syntax layer, an object layer, and a semantic layer. The syntax layer is responsible for serializing and de-serializing objects stored in a given file, thereby handling the encoding and file format. The object layer provides object-oriented access for the application that later uses the data. This also includes the provision of identities and binary relationships, as well as basic typing. Last, the semantic layer provides an interpretation of the object model from the object layer. Hence, the objects are mapped onto physical or abstract objects such as books, airplane tickets, and paragraphs of text.

[56] follow this concept and introduce three different layers in their model. Generally speaking, [56] showcase a system that allows the transformation of XML catalogs between different structures and formats. As later seen in section 3.3.2, they demonstrate that a set of direct rules can be used to translate a catalog directly from one format into another. However, they note that this approach is not suitable for building a scalable mapping service. Using direct rules makes these rules very difficult to write and hence also to maintain. Further, the reuse of the rules is limited. In consequence, they propose the usage of three distinct layers to separate different concerns.
This allows dividing complex transformations into smaller, simpler rules, which are then concatenated. [56] suggest that the identification of reusable
rule patterns for these smaller rules is then feasible. In order to achieve this goal, [56] use three different layers, which align with the previously seen concept of [55]. The first layer is the syntax layer, which is responsible for handling the de-serialization of the XML documents. In the data model layer, the products are then represented by object-property-value triplets, removing differences imposed by different representations in the syntax layer. This means that the properties are normalized and aligned with the structure of the following layer, the ontology layer. The ontology layer contains the actual mappings between different elements. Taking the example of an address, the address is first de-serialized from the XML document. Next, it is normalized, e.g., street name and house number are separated into different fields. Last, the position in the target document is assigned according to the ontology. Figure 3.4 shows the model of this integration approach.

Figure 3.4.: Layers of integration of the approach by [56]

[57] also use different layers to transform product information between different sources. However, their approach does not add a dedicated syntax layer. Instead, it adds a layer to support differences imposed by geography. Their concept includes three different layers: source, local, and common. The source-to-local mapping normalizes and extends the information by adding implicit information, e.g., the currency of the source catalog. The local-to-common mapping then maps this information into a common format that is parsed back into the local and source schemas. As mentioned, this separation allows geographic differences, such as the language of the source and the target catalog, to be handled efficiently.
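The address example can be made concrete with a small sketch of the three layers; the source XML structure, the splitting heuristic, and the target field names are illustrative assumptions rather than the actual rules of [56].

    import re
    import xml.etree.ElementTree as ET

    # Syntax layer: de-serialize the (hypothetical) source XML.
    doc = ET.fromstring("<address><line>Boltzmannstr. 3</line></address>")
    raw_line = doc.findtext("line")

    # Normalization / data model layer: split street and house number and
    # represent the result as object-property-value triplets.
    match = re.match(r"^(.*?)\s+(\d+\w*)$", raw_line)
    triplets = [
        ("address1", "street", match.group(1)),
        ("address1", "houseNumber", match.group(2)),
    ]

    # Ontology layer: map the normalized properties onto the target
    # format's fields.
    TARGET_FIELD = {"street": "StreetName", "houseNumber": "HouseNo"}
    target = {TARGET_FIELD[prop]: value for _, prop, value in triplets}
    print(target)  # {'StreetName': 'Boltzmannstr.', 'HouseNo': '3'}

The payoff of the separation is that the splitting rule lives entirely in the normalization step and the field assignment entirely in the ontology step, so each can be reused or replaced independently.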
[52] explain that B2B business transactions over the internet present the challenge of integrating information from many different sources. By acting as an intermediate layer, B2B marketplaces propose a solution to this challenge. However, this requires the B2B marketplaces to integrate the different sources. Their work elaborates in great detail on the different challenges encountered when integrating multiple product catalogs and transforming them between different formats. In addition to the requirements already seen, they recognize that the product descriptions of the different catalogs are often unstructured and hence not easily computer-interpretable. Hence, they add another layer to extract and structure the information.

As seen throughout this section, product catalog integration consists of a variety of different tasks, and the following sections are structured along these. The first section considers syntactic integration, hence parsing the given input file into objects and thereby removing differences imposed by different file formats and encodings. Since most literature only considers XML documents, the focus lies on syntactically integrating these. Afterward, approaches to semantic integration focusing on ontologies are presented. As already highlighted, the integration of different sources remains a challenging task under any circumstances. The subsequent section explores approaches using a mediator, also referred to as an intermediate layer, to reduce the number of integrations needed [52]. Last but not least, different approaches towards information extraction from unstructured sources are discussed.

3.3.2. Syntactic Integration using XML

XML is the prevailing data format within the literature, as it is the underlying format for the previously presented standards. It remains to note that others have also considered formats such as HTML and Microsoft Excel [58]. However, as these works do not specifically tackle the challenge of product data integration, they are not further explored here. The main reason for the adoption of XML is that it is seen as an important step towards reducing the challenges involving the heterogeneity of data exchange between different systems [59]. This is the case even though the standard does not touch the structural and semantic differences, which means that semantically identical properties can be encoded in XML elements with different names. Moreover, elements with the same name do not necessarily have the same semantics. Further, the order of XML tags is of relevance and can differ. For better understanding, figure 3.5 shows part of such an XML document.

    <ADDRESS>
      <ADDRLINE>Boltzmannstr. 3</ADDRLINE>
      <ADDRTYPE>office</ADDRTYPE>
      <CITY>Garching by Munich</CITY>
      <COUNTRY>Germany</COUNTRY>
      <POSTALCODE>85748</POSTALCODE>
      <TELEPHONE>089 189659220</TELEPHONE>
    </ADDRESS>

Figure 3.5.: Example of an address in XML using the OAGIS standard [48]

Taking into consideration XML as the
underlying data format, [59] propose to extend the XSLT language to generate transformations between different XML documents. The XSLT language (eXtensible Stylesheet Language for Transformations) was originally developed for rendering and transforming XML documents. It allows defining a set of rules for transforming a source tree of an XML document into a target tree. The rules defined with XSLT can themselves be expressed as an XML document, which allows their validation and parsing through XML parsers. XSLT references to the input tree can be used to create the nodes of the target tree. Figure 3.6 shows an example of such a transformation.

Figure 3.6.: Example of a one-to-one mapping in XSLT notation [48]

Further, XSLT can be extended by using XPath. [48] give the example of an address where this becomes necessary. One document might combine the street name and the house number within one field, whereas another document separates these into different elements. XPath expressions such as select="substring-after($addrline,', ')" can be used to extract the relevant part. [48] note that there are four possible ways in which elements of different XML documents can be related: one-to-one, one-to-many, many-to-one, and many-to-many mappings. Their research on transforming documents in the xCBL, IOTP, OAGIS, and RETML standards shows that 89% of all mappings are one-to-one mappings. However, as [5] also point out, this approach is focused on syntactic integration, for reasons described earlier, such as the missing commitment to a domain-specific vocabulary, which makes the names of XML tags ambiguous.
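To give an impression of what such a one-to-one XSLT mapping combined with an XPath extraction can look like, consider the following sketch using the third-party lxml library; the element names and the stylesheet are illustrative and do not reproduce the original figure from [48].

    from lxml import etree

    # Illustrative XSLT: a one-to-one mapping of <Name>, plus an XPath
    # substring-after() call that splits a combined address line.
    XSLT = b"""<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/Supplier">
        <Vendor>
          <VendorName><xsl:value-of select="Name"/></VendorName>
          <HouseNo><xsl:value-of select="substring-after(AddrLine, ', ')"/></HouseNo>
        </Vendor>
      </xsl:template>
    </xsl:stylesheet>"""

    SOURCE = b"<Supplier><Name>ACME</Name><AddrLine>Main Street, 12</AddrLine></Supplier>"

    transform = etree.XSLT(etree.XML(XSLT))
    result = transform(etree.XML(SOURCE))
    print(etree.tostring(result, pretty_print=True).decode())
    # <Vendor><VendorName>ACME</VendorName><HouseNo>12</HouseNo></Vendor>

Because the stylesheet is itself XML, it can be validated, stored, and generated by tools, which is exactly the property the approaches above exploit when compiling mapping rules into XSLT.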
3.3.3. Semantic Integration using Ontologies

After going over different approaches towards syntactic integration, the challenge of the semantic integration of different product data remains. For this, ontologies are a promising approach. Ontologies provide support for integrating heterogeneous and distributed data [7]. Ontologies can be defined as “a formal, explicit specification of a shared conceptualization“ [60, p. 11]. In this definition, conceptualization refers to the abstract model of a phenomenon in the world, which describes the relevant concepts of that phenomenon. Defining the types of concepts as well as the constraints on their use makes this model explicit. Being formal makes an ontology machine-readable. This means that ontologies allow machines to understand the semantics of data [7]. [7] further explain that ontologies support the organization, browsing, searching, and more intelligent access to online information and services. They argue that building reusable and agreed-upon product catalogs is, at its core, building ontologies for the respective domain.

[5] discuss the semantic integration of various information sources leveraging ontologies. They explore the shortcomings of XML, mainly arguing that often a common vocabulary is missing, making direct semantic integration infeasible. They present their approach towards the integration of documents with different structures and vocabularies. This includes the creation of a common vocabulary in the form of ontologies. The integration is done by creating mappings between the semantic terms of ontologies and the structure of XML documents. They review the benefits of this ontology-based approach. The presented software supports multiple use cases, among others a top-down approach in which the ontology is defined, e.g., by a consortium. The respective XML data structure is then created from it. The different parties involved use this generated XML data structure to exchange data using the common vocabulary, removing the need for semantic integration. [5] also present the use case relevant to the scenario discussed in this work as the bottom-up approach. In the bottom-up approach, XML documents with different structures need to be integrated. In this case, the mapping between the data structures is done as shown in figure 3.7 and described as follows. First, an ontology is created from the structure of the source XML document. The structure of the XML document is referred to as the DTD. This means that concepts and relationships within the XML schema are identified. The focus lies on the reengineering of the conceptual models. In the next step, mapping rules are created between the source and the target ontology. This is done semi-automatically by providing a GUI for the evaluation and application of the automatically generated rules. In the last step, these rules are compiled into a set of XSLT transformations that can be applied to transform an XML document in the source schema into an XML document in the target schema.

Figure 3.7.: Structure of the integration approach by [5]

[61] present the architecture of an ontology-based approach towards product data integration, focusing mainly on the integration between different applications. This is done by adding an ontology layer between the data sources and consumers, e.g., a database and CAD or ERP
systems. [61] analyze the advantages of such an approach with regard to data exchange. Using a common ontology helps make the intrinsic semantics of concepts explicit. This allows information to be exchanged based not only on the syntax of various modeling languages but also on a common understanding of the semantics. Further, they argue that creating this ontology will help organizations structure and reorganize product data more thoughtfully. In addition, the ontology acts as a buffer between different syntactic representations. This is relevant as the syntax of product data changes over time, whereas the semantics usually stay the same. Their work further details the implementation by providing examples and showcasing DAML and OIL. DAML is used to construct ontologies and create the markup for their exchange; OIL is used for the exchange of these ontologies.

[58] present an approach towards the semi-automatic integration of existing standards and initiatives for the classification of products and services through ontological mappings. The concept has six steps. First, the standards and joint initiatives for product classification of the respective domain that will be integrated are selected. Next, the knowledge models are semi-automatically extracted from these. Afterward, the relationships between concepts in the different models are identified, and a multi-layered knowledge architecture is manually designed. Based on these mappings, the knowledge models are then integrated. Afterward, the attributes of the newly created integrated ontology can be enriched using additional information included in the standards. In the last step, the ontology can be automatically exported into different formats. For the first step, the selection of standards, the work describes the standards UNSPSC, RosettaNet, eCl@ss, and one additional product catalog from an existing e-commerce platform. Tools including the ontological engineering platform WebODE and its companion for data extraction, WebPicker, are presented for the subsequent steps. The identification of common concepts between the different standards for creating the mappings is done using a multi-layered approach. [58] explain that ontologies can be classified and layered along their use case and specificity. They propose to align the integration and mapping with these layers, reducing complexity and allowing for the interoperability of vertical markets from specialized domains. Apart from these two benefits, using a multi-layered ontology allows reasoning based on the taxonomy of concepts. The mappings between ontologies are expressed using notions such as equivalence, subclass-of, and union-of.
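To illustrate what such mapping notions look like when stated formally, here is a minimal sketch using the rdflib library and the OWL/RDFS vocabulary; the two classification terms and their namespaces are invented, and [58] used WebODE rather than this tooling.

    from rdflib import Graph, Namespace, OWL, RDFS

    # Hypothetical namespaces for two product-classification ontologies.
    A = Namespace("http://example.org/standardA#")
    B = Namespace("http://example.org/standardB#")

    g = Graph()
    # Equivalence: both standards mean the same concept by these terms.
    g.add((A.Lamp, OWL.equivalentClass, B.LightingFixture))
    # Subclass-of: one standard's concept is narrower than the other's.
    g.add((A.DeskLamp, RDFS.subClassOf, B.LightingFixture))

    print(g.serialize(format="turtle"))

Stating the mappings as triples is what makes the reasoning mentioned above possible: a reasoner can, for instance, infer that any instance of A.DeskLamp is also an instance of B.LightingFixture.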
[62] discuss the usage of ontologies for the integration of product design and catalog information. Over the life cycle of a product, the data used primarily for manufacturing and engineering during the initial design and manufacturing phase needs to be transferred into catalog information targeted towards sales management. The core component for integrating these two phases is the semantic mapping between the two ontologies using description logic. The mapping is initially found using heuristic methods based on names, structures, and types. Considering that both design and sales information belong to the same upper ontology, the range of possible mappings is further reduced. In the last step, the user can adjust the previously automatically created mapping rules. The work goes into detail about each of these components. Among others, the similarity calculations are based on the structure of WordNet. The structural and type matching are done based on the upper ontology shared by both ontologies. After the correction by the user, the mapping rules enable the automatic