Data Mining definition services in Cloud Computing with Linked Data

Manuel Parra-Royon (1), Ghislain Atemezing (2), and J.M. Benítez (1)

(1) Department of Computer Science and Artificial Intelligence, Universidad de Granada, Spain. {manuelparra, j.m.benitez}@decsai.ugr.es
(2) Mondeca, Paris, France. ghislain.atemezing@mondeca.com

arXiv:1806.06826v2 [cs.DC] 24 Jun 2018

Abstract

In recent years Cloud Computing service providers have been adding Data Mining (DM) services to their catalogues. Several syntactic and semantic proposals have been presented to address the problem of defining and describing Cloud Computing services in a comprehensive way. Considering that each provider defines its own service logic for DM, we find that by using semantic languages and following the Linked Data proposal it is possible to design a specification for the exchange of data mining services, achieving a high degree of interoperability. In this paper we propose a schema for the complete definition of DM Cloud Computing services, considering key aspects such as pricing, interfaces and experimentation workflow, among others. Our proposal leverages the power of Linked Data, and we validate its usefulness by defining various DM services as complete Cloud Computing services.

1 Introduction

Cloud Computing has entered our daily lives in a completely transparent and frictionless way. The ease of Internet access and the exponential increase in the number of connected devices have made it even more popular. Adopting Cloud Computing means a fundamental change in the way Information Technology services are explored, consumed, deployed or simply used. Cloud Computing is a model for providing services to companies, entities and users following the utility model, like energy or gas, among others. The objective is therefore a service provision model in which computing resources and computing power are contracted through the Internet of services [34]. Internet services bring great advantages to an organization: they can be accessed from anywhere and from a variety of devices, and they are very cost-effective in terms of software, hardware and technical maintenance. But if there is one thing they stand out for, it is scalability, a fundamental feature of any business service.

Major Cloud Computing providers are currently leveraging their vast computing infrastructure to offer a wide range of services to enterprises, organizations and users. These services run within a Cloud Computing platform and are
thus ready to be consumed by users. The definition of a service is not only given by its functionality; it also involves factors specific to the consumption of the service by users, such as the methods of access and interaction, Service Level Agreements (SLA), Service Level Objectives (SLO), pricing, authentication, the catalogue of services and others.

The effectiveness of Cloud Computing would be greatly improved if there were a general standard for service definition. The definition of Cloud Computing services lacks clear standardization [17]. Each service provider offers a narrow definition of services, which is generally incompatible with those of other providers. For instance, where one provider has an SLA term, another provider uses a different name and features for that term, although the two might be the same. This makes it difficult to define services or service models independently of the provider, as well as to compare services through a service broker [33]. A standardized definition of services would boost competitiveness, allowing third parties to operate with these services in a totally transparent way, abstracting away the individual details of the providers.

For just a few years now, service providers have been adding specific services for Data Mining. The volume of data generated by companies and organizations is growing at an extremely high rate. According to Forbes(1), the growth is expected to continue and data generation is predicted to increase by up to 4,300% by 2020, driven by the large amount of data generated by service users. By 2020, it is estimated that more than 25 billion devices will be connected to the Internet, according to Gartner [26], and that they will produce more than 44 billion GB of data annually. In this scenario, Cloud Computing providers have successfully included data analysis services in their catalogues, in response to the demand for massive data processing. This massive data gathering, storage and processing is known as Big Data [16], which covers not only the sheer volume of data but also aspects of knowledge extraction, decision making or prediction, among others. Applying these techniques to large-scale and ever-growing data requires a redesign of algorithms [42] and new hardware and software infrastructures to deal with analysis and processing that would otherwise be impossible.

These services allow the application of Artificial Intelligence (AI) and Machine Learning (ML) techniques to a large variety of data, and they offer an extensive catalogue of techniques, algorithms and workflows related to DM. Services such as Amazon SageMaker(2) or Microsoft Azure Machine Learning Studio(3), among others (Table 1), offer these algorithms as services within cloud platforms. Along the same line, other Cloud Computing platforms such as Algorithmia(4) or Google Cloud ML(5) offer ML services at a higher level, providing specific services for object detection in photographs and video, sentiment analysis, text mining or prediction, for instance.

There are several proposals for the definition of services.
These proposals cover a wide variety of both syntactic and semantic languages aimed at a correct definition and modelling of services. For this type of DM service there is no specific proposal, due to the complexity of the services represented. Solutions based on the Linked Data proposal [7] can address the problem of defining services from a more comprehensive perspective.

Linked Data builds on models and structures of the Semantic Web, a technology that aims to expose data on the web in a more reusable way, interoperable with other applications, exploiting the relationships that exist between data, both internal and external. The Linked Data proposal makes it possible to link data and concept definitions from multiple domains through the use of the Semantic Web [5], articulated with RDF [28] or Turtle [3], among other languages. Linked Data is based on four key principles: a) the use of URIs to name things, real-world objects and abstract concepts; b) the use of HTTP URIs, which makes it easier to find the resources that are published; c) querying RDF resources with SPARQL [41] for the discovery of relationships and entities; and d) the inclusion of links to other URIs, which complement the definitions with external ones, thus maximizing the reuse and interlinking of existing data. The result is a richly interconnected network of resources, definitions and relationships that can be processed programmatically by machines.

The main objective of this work is the definition of DM and Machine Learning (ML) services for Cloud Computing using the Linked Data principles. The definition is not only focused on the service itself, but also allows the definition and modelling of prices, authentication, SLA or catalogue. For the modelling of the different components of the service, existing schemes and vocabularies have been used and adapted to the problem of defining DM services. In addition, new vocabularies have been created to cover specific elements of the problem domain that were not available until now. We present dmcc-schema, a proposal based on Linked Data that covers the entire definition of DM/ML cloud services. The dmcc-schema proposal provides a complete vocabulary for the exchange, consumption and negotiation of DM services in Cloud Computing.

The paper is structured as follows. In the next section we present related work on the definition of Cloud Computing services. Section 3 presents our proposal, the dmcc-schema, in detail and depicts all its components and their interactions. Section 4 describes use cases for a real service, providing a Random Forest algorithm as a service and including the additional aspects related to service definition in Cloud Computing, such as the SLA or the prices of the service, among others. Finally, conclusions and future work are included in Section 5.

(1) https://www.forbes.com/sites/bernardmarr/2016/04/28/big-data-overload-most-companies-cant-deal-with-the-data-explosion/
(2) https://aws.amazon.com/sagemaker/
(3) https://azure.microsoft.com/en-us/services/machine-learning-studio/
(4) https://algorithmia.com/
(5) https://cloud.google.com/products/machine-learning/

2 Related work

DM and ML are highly relevant topics nowadays. These areas of knowledge provide entities, organizations and individuals with tools for data analysis and knowledge extraction at every stage. In addition, with the growth of Cloud Computing and Edge Computing [43], DM services are taking a significant position in the catalogues offered by providers. The complexity of DM/ML services requires a complete specification that is not limited to technical or conceptual considerations. The definition, therefore, must integrate the key aspects of the service in the Cloud Computing environment.
The definition of services in general has been approached from multiple perspectives. At the syntactic level, Service-Oriented Architecture [38] (SOA) and languages derived from XML [8] (Extensible Markup Language) such as WSDL [9], WADL [20], SoaML [38] or UDDI [4] (together with UML [11]) have attempted to specify, at a technical level, the modelling of any Internet service.

The definition of services through semantic languages enables a more flexible modelling of services. The flexibility of these languages allows capturing functional and non-functional features. OWL-S [37] integrates key factors for these services, such as service discovery, process modelling and service interoperability details. Web services can also be described with WSMO [12], which makes it possible to describe semantically the discovery, invocation and logic aspects of a service. Schemes such as SAWSDL [30] integrate both semantic and syntactic schemas by attaching semantic annotations to WSDL elements. Others, including SA-REST [27], WSMO-Lite [31] or the Minimal Service Model [44] (MSM), belong to the group of proposals for the exchange, automation and composition of services that are more centred on the final service. On the other hand, USDL [29] is a global approach to service modelling that emphasizes business, entity, price, legal, composition and technical considerations. Linked-USDL [40] makes this idea concrete for services integrated into the Cloud Computing model. Linked-USDL covers an important part of the key aspects that a service must provide, extending its usefulness to the ability to create cloud services from scratch; the modelling of entities, catalogue, SLA or the interaction among them are all considered.

These proposals address the modelling and definition of services generically, without dealing with the specific details of DM/ML services in Cloud Computing. For these problems, software platforms such as Weka [21], KNIME [6] or Orange [10] are typically used, together with frameworks and libraries integrated into modern and efficient programming languages such as C, Java, Python, R, Scala and others. By their nature, these environments lack elements for modelling Cloud Computing services. The basic idea of this type of modelling is that it allows the DM/ML service on Cloud Computing to be defined, which makes it necessary to include semantic aspects for a complete definition.

There are several approaches to ontology-based definition languages for DM/ML experimentation and workflows. For example, Exposé [47] focuses on the experimentation workflow [36]. OntoDM [39], built on BFO (Basic Formal Ontology), aims to create a framework for data mining experimentation and replication. MEXcore [15] and MEXalgo [14] give a complete specification of the process of describing experimentation in DM problems; together with MEXperf [15], which adds performance measures to the experimentation, the definition of this type of scheme is completed. Another approach similar to MEX is ML-Schema [13], which attempts to establish a standard scheme for algorithms, data and experimentation. ML-Schema considers terms such as Task, Algorithm, Implementation, HyperParameter, Data, Feature or Model, as can be found in [1].
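As an illustration of the kind of metadata these vocabularies capture, a consumer of experiment descriptions expressed with the ML-Schema core terms listed above could retrieve, for each run, the implemented algorithm and the resulting model with a query along the following lines. This is only a sketch: the mls namespace URI and the exact property names are assumptions here and should be checked against the ML-Schema specification [13].

  PREFIX mls: <http://www.w3.org/ns/mls#>   # namespace assumed, not declared in this paper

  SELECT ?run ?algorithm ?model
  WHERE {
    # a run executes a concrete implementation of an abstract algorithm
    ?run            mls:executes   ?implementation .
    ?implementation mls:implements ?algorithm .
    # the run produces a model as output
    ?run            mls:hasOutput  ?model .
    ?model          a              mls:Model .
  }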
The most recent proposals try to unify different schemes and vocabularies previously developed, following the Linked Data [5] guidelines. With Linked Data, vocabularies, schemes and concepts can be reused. This significantly enriches the definition of schemas, allowing a model definition to be built on top of other existing schemas and vocabularies. Linked-USDL, MEX (core, algo, perf), ML-Schema, OntoDM or Exposé, among others, can be considered when creating a workflow for DM/ML service definition using Linked Data. These proposals provide a consistent definition of the core DM part of the service and, thanks to the Linked Data properties, enable the inclusion of other external schemas that complement the key aspects of a fully defined Cloud Computing service.

3 dmcc-schema: Data Mining services with Linked Data

The Semantic Web applied to the definition of Cloud Computing services allows tasks such as discovery, negotiation, composition and invocation with a high degree of automation. This automation, based on Linked Data, is fundamental in Cloud Computing because it allows services to be discovered and explored for consumption by programmatic entities using the full potential of RDF and SPARQL (a sketch of such a discovery query is shown after the list of service aspects below). Linked Data [23] offers a growing body of reusable schemes and vocabularies for the definition of Cloud Computing services of any kind. On this basis, the LOV (Linked Open Vocabularies) [46] web platform is a good example; its objective is to index schemas and vocabularies within Linked Data, monitored by a group of experts.

In this article we propose dmcc-schema, a scheme designed to address the problem of specifying and defining DM/ML services in Cloud Computing. It not only focuses on the specific modelling problem, with the definition of experimentation and algorithms, but also includes the main aspects of a well-defined Cloud Computing service. dmcc-schema can be considered a Linked Data proposal for DM/ML services. Existing Linked Data vocabularies have been integrated into dmcc-schema, and new specific vocabularies have been created ad hoc to cover aspects that are not implemented by other external schemes. Vocabularies have been reused following the Linked Data recommendations, covering important parts such as the definition of experiments and algorithms, as well as interaction and authentication, which were already defined in other vocabularies.

A service offered by a Cloud Computing provider must cover the following aspects for its complete specification:

• Interaction. Access points and interfaces through which users and entities consume the services.
• Authentication. The service or services require authenticated access.
• SLA/SLO. The services have service level agreements and associated objectives.
• Prices. The services offered have a cost.
• Entities. Services involve entities that offer or consume services.
• Catalogue. The provider has a catalogue of discoverable and consumable services.
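As a sketch of how such discovery could work once a provider publishes definitions covering these aspects, the following SPARQL query lists the DM/ML services of a provider together with their function and interaction point. The class and property names are those introduced by dmcc-schema (Figure 1); the namespace URI is a placeholder, since it is not declared inline in this paper.

  PREFIX dmcc: <http://example.org/dmcc#>   # placeholder namespace for dmcc-schema

  SELECT ?provider ?service ?function ?interaction
  WHERE {
    ?provider a                        dmcc:MLServiceProvider ;
              dmcc:hasMLService        ?service .
    ?service  dmcc:hasFunction         ?function ;
              dmcc:hasInteractionPoint ?interaction .
  }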
There is no standardization of which elements a Cloud Computing service should include for its complete definition. According to NIST(6), it must cover aspects such as self-service (discovery) and measurement (prices, SLA), among others. For the development of dmcc-schema, a comprehensive study of the features and services available on DM/ML Cloud Computing platforms has been carried out. Table 1 lists the providers and the names of the DM/ML services analysed: leading providers such as Google, Amazon, Microsoft Azure, IBM and Algorithmia, for which the SLA and SLO, the pricing for the different variants and conditions, the service catalogue (DM/ML algorithms), the methods of interaction with the service, and the authentication have been studied.

Table 1: Cloud Computing Data Mining services analysed.

  Provider      Service name
  Google        Cloud Machine Learning Engine
  Amazon        Amazon SageMaker, Amazon Machine Learning
  IBM           Watson Machine Learning, Data Science
  Microsoft     Azure Machine Learning Studio
  Algorithmia   Algorithm bundles

Our scheme and vocabulary have been complemented with external vocabularies: the interaction with the service, using schema.org [19]; authentication, with Web API Authentication (waa) [35]; price design, with GoodRelations [24]; and DM/ML experimentation and algorithms, using ML-Schema (mls) together with the ccdm vocabulary of dmcc-schema. Table 2 shows the vocabularies reused in dmcc-schema to define each module of the full service.

Table 2: Vocabularies.

  Module             Vocabularies
  SLA                gr, schema, ccsla
  Pricing            gr, schema, ccpricing, ccinstance, ccregion
  Authentication     waa
  Interaction        schema
  Data Mining        mls, ccdm
  Service Provider   gr, schema

The high-level diagram of the dmcc-schema entities can be seen in Figure 1; the detailed definition of each entity is developed in the following subsections. With dmcc-schema we have tried to emphasize independence from the cloud platform: dmcc-schema does not define aspects of service deployment or elements close to implementation tools. The dmcc-schema definition stays at the highest level and only addresses the definition and modelling aspects of the service, without going into the details of infrastructure deployment.

Figure 1: Main classes and relations of dmcc-schema (dmcc).

(6) National Institute of Standards and Technology, www.nist.gov

3.1 SLA: Service Level Agreement

The trading of Cloud Computing services sets up a series of contractual agreements between the stakeholders involved in the services. Both the provider and the consumer of the service must agree on service terms. The service level agreements define technical criteria, relating to availability, response time or error recovery, among others, that the provider must fulfil to deliver the service to the consumer. The SLO are specific measurable features of the SLA, such as availability, throughput, frequency, response time or quality. In addition to the SLO, the agreement needs to contemplate the actions taken when its terms cannot be met and a compensation is offered.

The SLA studied for the DM/ML service environment are established by the terms and definitions of the agreement ([2], [32], [48]). These terms may have certain conditions associated with them which, in case of violation, involve a compensation under the guarantee of the service. In general terms, an SLA for Cloud Computing services is given by an agreement term that contains one or more definitions similar to Monthly Uptime Percentage (MUP), which is the maximum available minutes less downtime, divided by the maximum available minutes in a billing month. For this term, a metric or interval is established over which a compensation is applied when the agreement term is not satisfied. In this context a scheme named ccsla(7) has been created for the definition of all the components of an SLA. Table 3 shows an example with a pair of MUP intervals and the related compensation. Figure 2 illustrates the complete scheme of the SLA modelling that has been designed.

Table 3: MUP term and related compensation.

  Monthly Uptime Percentage (MUP)   Compensation
  [0.00%, 99.00%]                   25%
  [99.00%, 99.99%]                  10%

Figure 2: SLA schema for Cloud Computing services (ccsla).

(7) Available at http://lov.okfn.org/dataset/lov/vocabs/ccsla

3.2 Pricing

Like any other utility-oriented service, Cloud Computing services are subject to costs and pricing that vary with the service offered. Following the pay-as-you-go model, the costs of using the service are directly related to the characteristics of the service and the use made of it. The pricing of services in Cloud Computing is a complex task: not only is there a price plan for the temporary use of resources, but it is also affected by technical aspects, configuration or the location of the service. Table 4 collects some of the aspects that modify the price of using a DM/ML service.

Table 4: Price components for each Cloud Computing provider analysed.

  Provider / Service        Region   Instances   Storage
  Amazon SageMaker          y        y           y
  Amazon Lambda             y        n           n
  Azure Machine Learning    y        y           y
  Azure Functions           y        n           n
  IBM BlueMix               y        y           n
  Google Machine Learning   y        y           n
  Algorithmia               n        n           n

In the specific segment of data mining services in Cloud Computing there are several elements to consider. The following components of the vocabulary stand out in the diagram of Figure 3:

• ccpricing:PricingPlan. Defines the attributes of the price plan and its details, for example a free or premium plan. It must match the attributes MinPrice and MaxPrice.
• gr:PriceSpecification. Provides all the tools to define the prices of the services at the highest level of detail. It comes from GoodRelations (gr).
• ccpricing:Compound. The components of the price of the service. Services are charged according to values related to certain attributes of the service configuration; a compound can include aspects such as the type of instance or the cost of the region, among others.
• ccinstances:Instance. Provides the specific vocabulary for defining the price attributes related to the instance or instances where the service is executed.
• ccregions:Region. Allows the use of the region management scheme offered by providers, so that the region becomes an additional component of the cost of the service, as part of the price.

Figure 3: Pricing schema with ccpricing.

For the modelling of this part, three vocabularies supplementing the definition of prices have been developed, since they were not previously available: ccinstances(8), which contains everything necessary for the definition of instances (CPU, number of CPU cores, model, RAM or HD, among others); ccregions(9), which provides the general modelling of service regions and locations in Cloud Computing; and ccpricing(10), the specific schema for price modelling.

(8) http://lov.okfn.org/dataset/lov/vocabs/cci
(9) http://lov.okfn.org/dataset/lov/vocabs/ccr
(10) http://lov.okfn.org/dataset/lov/vocabs/ccp

3.3 Authentication

Nowadays, security in computer systems, and in particular in Cloud Computing platforms, is an aspect that must be taken very seriously when developing Cloud Computing services and applications. A service that is reliable and robust must also be secure against potential unauthorised access. At the outer layer of security in the consumption of Cloud Computing services lies authentication, which should be considered a fundamental part of a Cloud Computing service definition.

Authentication of cloud services today covers a wide range of possibilities. For the vast majority of services, the most commonly used options for managing user access are API keys and OAuth, among other mechanisms. waa:WebApiAuthentication has been included for authentication modelling. This scheme allows many of the available authentication systems to be modelled, not only those of the major providers but also those of more modest environments that do not offer OAuth access. Figure 4 depicts the model for the definition of authentication in the service, together with the details of the authentication mechanisms.

Figure 4: Authentication schema.

3.4 Interaction

The interaction with DM/ML services is generally done through a RESTful API. This API provides the basic functionality for interacting with the service consumer, who must be previously authenticated to use the services identified in this way. For the interaction, the Action entity of the schema.org vocabulary is used. With this definition, the service entry points, methods and interaction variables are fully specified for every service exposed by the API.

3.5 Data Mining service

For the main part of the service, where the experimentation and execution of algorithms is specified and modelled, parts of ML-Schema (mls) have been chosen. MEXcore, OntoDM, DMOP or Exposé also provide an adequate abstraction to model the service, but they are more complex and their vocabularies are more extensive. ML-Schema simplifies the modelling of DM/ML experiments and can be brought into line with what Cloud Computing providers offer. We have extended ML-Schema by adapting its model to a specific one and inheriting all its features (Figure 5). The following vocabulary components are highlighted (ccdm is the name of the schema used):

• ccdm:MLFunction. Sets the operation, function or algorithm to be executed, for example Random Forest or KNN.
• ccdm:MLServiceOutput. The output of the algorithm or the execution. Here the output of the experiment is modelled as a Model, a ModelEvaluation or a DataSet.
• ccdm:MLServiceInput. The algorithm input, which corresponds to the setting of the algorithm implementation. Here the data input of the experiment is modelled, such as the dataset and the values specific to the algorithm being executed (ccdm:MLServiceInputParameters).
• mls:Model. Contains the information specific to the model generated by the run.
• mls:ModelEvaluation. Provides the performance measurements of the model evaluation.
• mls:Data. Contains the information of complete tables, of attributes only (a table column), of instances only (a row), or of a single value.
• mls:Task. A part of the experiment that needs to be performed in the DM/ML process.

Figure 5: DM experimentation and algorithm execution.

The complete diagram of dmcc-schema is available on the project website(11), from which the relationships, entities, classes and attributes of each of the modules that make up the proposal can be explored.

(11) http://dicits.ugr.es/linkeddata/dmservices

4 Use cases and test platform

For reasons of space we will not detail all the data or attributes in depth; we only consider what is most important for the basic specification of the service and its comprehension. All detailed information on each component, together with examples and datasets, is available on the project's supplementary information website. In this section we illustrate how dmcc-schema can be used effectively to define a specific DM service, including the additional aspects related to Cloud Computing such as SLA, interaction, pricing or authentication. To do this we create a service instance for a Random Forest [25] algorithm for classification and a K-means [22] algorithm for clustering. For the Random Forest service implementation we use the default function model and parametrization of this algorithm in the R [45] language. We have chosen the parametrization format of the R algorithms because of the strong growth this programming language is currently experiencing in data science and DM/ML.

For the instantiation of the service we use vocabularies such as dcterms (dc:) [?], GoodRelations (gr:) or schema.org (s:). In these use cases we refer to dmcc-schema with the dmcc namespace prefix. The entry point for defining the service is the instantiation of the dmcc:ServiceProvider entity, including its data, using the class dmcc:MLServiceProvider. We specify an example DM service hosted on our Cloud Computing platform dicits.ugr.es.
1 _:MLProvider a dmcc:MLServiceProvider ;
2   rdfs:label "ML Provider"@en ;
3   dc:description
4     "DICITS ML SP"@en ;
5   gr:name "DICITS ML Provider" ;
6   gr:legalName "U. of Granada" ;
7   gr:hasNAICS "541519" ;
8   s:url <http://www.dicits.ugr.es> ;
9   s:serviceLocation
10    [ a s:PostalAddress ;
11      s:addressCountry "ES" ;
12      s:addressLocality "Granada" ;
13    ] ;
14  s:contactPoint
15    [
16      a s:ContactPoint ;
17      s:contactType "Customer Service" ;
18      s:availableLanguage [
19        a s:Language ;
20        s:name "English" ; ] ;
21      s:email "ml@dicits.ugr.es" ;
22    ] ;
23  dmcc:hasMLService
24    _:MLServiceDicitsRF ,
25    _:MLServiceDicitsKMeans ;
26  dmcc:hasOfferCatalog
27    _:MLServiceDicitsCatalog ;
28  .

Listing 1: New DM service definition.

The code above shows the specific details of the service provider, such as legal aspects, location or contact information. The service provider dmcc:MLServiceProvider has the property dmcc:hasMLService, with which two services, :MLServiceDicitsRF and :MLServiceDicitsKMeans, are defined (listing 1, lines 24-25). Listing 2 shows the instantiation of each of the entities for :MLServiceDicitsRF. The service definition for the :MLServiceDicitsKMeans entities is done in a similar manner, but using the corresponding properties that differ from :MLServiceDicitsRF, that is, a different interaction point, a different pricing plan and so on, wherever applicable. :MLServiceDicitsRF allows all the service entities to be defined, such as the SLA (listing 2, line 8), the algorithm (listing 2, line 10), the prices (listing 2, line 15), the interaction points (listing 2, line 6) and the authentication (listing 2, line 12).

1 _:MLServiceDicitsRF a dmcc:MLService ;
2   rdfs:label
3     "ML Service dicits.ugr.es"@en ;
4   dc:description
5     "DICITS ML Service"@en ;
6   dmcc:hasInteractionPoint
7     _:MLServiceInteraction ;
8   dmcc:hasServiceCommitment
9     _:MLServiceSLA ;
10  dmcc:hasFunction
11    _:MLServiceFunction ;
12  dmcc:hasAuthentication
13    _:MLServiceAuth ;
14  dmcc:hasPricingPlan
15    _:MLServicePricing ;
16  .

Listing 2: Components of the DM service.

The interaction with the service (listing 3) is performed using the dmcc:Interaction class, which includes the property dmcc:hasEntryPoint that allows an Action on a resource or object to be defined. In this case we use the schema.org vocabulary to set the method of access to the Random Forest service, for which we specify that the service will be consumed through a RESTful API, with the HTTP POST access method, the URL template http://dicits.ugr.es/ml/rf/ (listing 3, line 13), and its parameters (listing 3, from line 14).

1 _:MLServiceInteraction
2   a dmcc:Interaction ;
3   rdfs:label "Service Interaction"@en ;
4   dc:description
5     "Service EntryPoint"@en ;
6   dmcc:hasEntryPoint [ a s:Action ;
7     s:target [
8       a s:EntryPoint ;
9       s:httpMethod "POST" ;
10      s:contentType
11        "x-www-form-urlencoded" ;
12      s:urlTemplate
13        "http://dicits.ugr.es/ml/rf/" ;
14      s:formula-input [
15        a s:PropertyValueSpecification ;
16        s:valueRequired true ;
17        s:valueName "formula" ; ] ;
18      s:dataset-input [
19        a s:PropertyValueSpecification ;
20        s:valueRequired true ;
21        s:valueName "data" ; ] ;
22  ...

Listing 3: Service interaction.

The selected service requires authentication and, for the modelling, API-key authentication has been chosen. The instantiation of the service authentication is defined by requesting authentication with waa:requiresAuthentication and setting the authentication mechanism waa:hasAuthenticationMechanism to waa:Direct; an API key is used as the credentials, as shown in listing 4.

1 _:MLServiceAuth
2   a dmcc:ServiceAuthentication ;
3   rdfs:label
4     "Service Authentication"@en ;
5   dc:description "Service Auth"@en ;
6   waa:requiresAuthentication waa:All ;
7   waa:hasAuthenticationMechanism
8     [ a waa:Direct ;
9       waa:hasInputCredentials
10        [ a waa:APIkey ;
11          waa:isGroundedIn "key" ;
12        ] ;
13      waa:wayOfSendingInformation
14        waa:ViaURI ;
15    ]
16  .

Listing 4: Service authentication.

For the definition of the SLA in our example service we have taken some providers, such as Amazon or Azure, as a reference. The following terms and agreements have been established: an agreement term called Monthly Uptime Percentage (MUP), which corresponds to subtracting from 100 the percentage of minutes in the month during which any of the services were unavailable. For the cases where this occurs, the SLA defines two ranges: below 99.00 the compensation is 30 credits (in service usage), and the range from 99.00 to 99.99 is compensated with 10 credits. To model the SLA we use ccsla:SLA together with the property ccsla:containsTerm :SLATermMUP_A, whose specific term is defined in listing 5.

1 _:SLATermMUP_A a ccsla:Term ;
2   rdfs:label "SLA MUP"@en ;
3   dc:description
4     "SLA Term: Monthly Uptime %"@en ;
5   dc:comment """Calculated by
6     subtracting from 100% the
7     percentage of minutes
8     during ..."""@en ;
9   ccsla:hasCompensation
10    _:SLACompensationMUP_A ,
11    _:SLACompensationMUP_B ;
12  ccsla:includeDefs
13    _:SLADefinition_A ,
14    _:SLADefinition_B ;
15  .

Listing 5: SLA terms and compensation.

The associated compensations :SLACompensationMUP_A and :SLACompensationMUP_B and the range definitions :SLADefinition_A and :SLADefinition_B are referenced in lines 10-11 and 13-14 of listing 5, respectively. The range of the term is defined in :SLADefinition_A, as shown in listing 6.

1 _:SLADefinition_A a ccsla:Definition ;
2   ccsla:hasDefinitionValue [
3     a s:StructuredValue ;
4     s:value [
5       a s:QuantitativeValue ;
6       s:maxValue 99.99 ;
7       s:minValue 99.00 ;
8       s:unitText "Percentage" ;
9     ] ;
10  ] ;
11  .

Listing 6: SLA term definition.

The scheme for the SLA (ccsla) and further examples with the SLA of Amazon and Microsoft Azure can be accessed from the website(12).

(12) ccsla: http://dicits.ugr.es/linkeddata/dmservices/#ccsla

For the definition of the economic cost of the service we have considered two variants in the example. The first is free use of the service, limited to 250 hours of algorithm execution on an instance (virtual machine) with an Intel i7 CPU, 64 GB of RAM and one region. The second is a pricing model where the consumption of the service is charged in USD/h, for one instance and one region.

In listing 7 we define the two pricing plans (the free plan :PricingPlanFree and the basic plan :PricingPlanRegular, lines 7-8), and in listing 8 we specify the :ComponentsPricePlanFree plan, where the price components and modifiers are set. The price modelling is done with our proposal for the definition of prices, ccpricing.

1 _:MLServicePricing a
2   ccpricing:ServicePricing ;
3   rdfs:label "Service Pricing"@en ;
4   dc:description
5     "Service Pricing"@en ;
6   ccpricing:hasPricing
7     _:PricingPlanFree ,
8     _:PricingPlanRegular ;
9 .

Listing 7: Service pricing (free plan and regular plan).

For each price plan we take into account the variables and features that affect the price. These are the region, the instance type and other components, grouped in :ComponentsPricePlanFree (listing 8).

1 _:ComponentsPricePlanFree
2   a ccpricing:Compound ;
3   ccpricing:hasRegion _:RegionFree ;
4   ccpricing:hasInstance _:InstanceFree ;
5   ccpricing:MaxCompound _:MaxUsageFree ;
6 .

Listing 8: Pricing components.

For example, to define the features of the type of instance used in the free plan, we use ccinstances:Instance and a few attributes such as RAM or CPU, as seen in listing 9. More examples for the ccinstances schema, including a small dataset of Amazon instances(13) (types T1 and M5), are available on the supplementary information website of the paper.

(13) ccinstances: http://dicits.ugr.es/linkeddata/dmservices/#ccinstances-1

1 _:InstanceFree
2   a ccinstances:Instance ;
3   ccinstances:hasRAM [
4     a ccinstances:ram ;
5     s:value "64" ;
6     s:unitCode "E34" ;
7   ] ;
8   ccinstances:hasCPU [
9     a ccinstances:cpu ;
10    ccinstances:cpu_model
11      "Intel i7" ;
12  ] ;
13  .

Listing 9: Pricing and instances.

For example, in order to define :MaxUsageFree, we need to specify the free access plan to the service and its limitation to 250 compute hours. For this purpose we use the gr:PriceSpecification and gr:Offering classes, as shown in listing 10.

1 _:MaxUsageFree a
2   gr:PriceSpecification ,
3   gr:Offering ;
4   gr:max 0.00 ;
5   gr:priceCurrency "USD" ;
6   gr:includesObject [
7     a gr:TypeAndQuantityNode ;
8     gr:amountOfThisGood "250" ;
9     gr:hasUnitOfMeasurement "HRS" ;
10  ] ;
11  .

Listing 10: Price specification for the free plan.

For the regular price plan :PricingPlanRegular, defined in listing 7, line 8, we use a similar price modelling, but including another type of instance and the same region, as shown in listing 11. In this case, the regular price plan offers consumption of the instance type :instance02 at a cost of $0.0464 per hour in the :RegionNVirginia region.

1 _:PricingPlanRegular a ccpricing:Compound ;
2   rdfs:label "Price Components" ;
3   rdfs:comment
4     "List of components and limits" ;
5   ccpricing:hasRegion
6     _:RegionNVirginia ;
7   ccpricing:hasInstances
8     _:instance02 ;
9   ccpricing:withMaxCompound [
10    a gr:Offering ;
11    gr:hasPriceSpecification [
12      a gr:PriceSpecification ;
13      gr:hasCurrency
14        "USD"^^xsd:string ;
15      gr:hasCurrencyValue
16        "0.0464"^^xsd:float ;
17    ] ;
18    gr:includesObject [
19      a gr:TypeAndQuantityNode ;
20      gr:amountOfThisGood
21        "1"^^xsd:integer ;
22      gr:hasUnitOfMeasurement
23        "HRS"^^xsd:string ;
24    ] ;
25  ] ;
26  .

Listing 11: Region and instance pricing.

Additional examples for the ccpricing schema, including a dataset of Amazon SageMaker(14) pricing plans, are available.

(14) ccpricing: http://dicits.ugr.es/linkeddata/dmservices/#ccpricing-1

Finally, we define the service algorithm using ccdm:MLFunction for the definition of a Random Forest function, where we specify the input parameters (dataset and hyper-parameters) and the output of the algorithm, among others (see listing 12).

1 _:RandomForest_Function
2   a ccdm:MLFunction ;
3   ccdm:hasInputParameters
4     _:RF_InputParameters ;
5   mls:hasInput
6     _:RF_Input ;
7   mls:hasOutput
8     _:RF_Output .

Listing 12: Operations for the algorithm/function.

The input and output data of the algorithm must be included in the definition of the data mining operation to be performed. The input can consist of parameters, :RF_InputParameters, or datasets, :RF_Input. The input parameters of the algorithm, ccdm:MLServiceInputParameters, and the parameter list :parameter_01, [...] are shown in listing 13.

1 _:RF_InputParameters
2   a ccdm:MLServiceInputParameters ;
3   ccdm:Parameters
4     _:parameter_01 ,
5     [...]
6   dc:description
7     "Input Parameters" ;
8   dc:title "Input"
9 .

Listing 13: Input parameters definition.

The definition referenced by ccdm:hasInputParameters, :RF_InputParameters, allows the general input parameters of the algorithm to be specified: for example, for Random Forest, dc:title "ntrees" (the number of trees generated), as well as whether the parameter is mandatory (ccdm:mandatory "false") and its default value, if it exists. Listing 14 shows the definition of one of the parameters, parameter_01; the other algorithm parameters are defined in the same way.

1 _:parameter_01
2   a ccdm:MLServiceInputParameter ;
3   ccdm:defaultvalue "100" ;
4   ccdm:mandatory "false" ;
5   dc:description
6     "Number of trees" ;
7   dc:title "ntrees" .

Listing 14: Example of a parameter and its features.

An mls:Model and an mls:ModelEvaluation have been considered for specifying the results of the Random Forest service execution in mls:hasOutput :RF_Output. The model evaluation holds the specific results when the algorithm returns a value or a set of values; when the service algorithm performs preprocessing, the result is a dataset. For the model, it must be defined, for example, whether the results are in PMML [18] (for example, :RF_Model a ccdm:PMML_Model), as shown in listing 15.

1 _:KMeans_Model
2   a ccdm:PMML_Model ;
3   ccdm:storagebucket
4     <dicits://models/> ;
5   dc:description
6     "PMML model" ;
7   dc:title "PMML Model" .

Listing 15: PMML model and storage of the service output.

The second service offered by our provider, a K-means algorithm (listing 1, line 25), requires a definition, :MLServiceDicitsKMeans, following the same pattern as for Random Forest; in this case, however, a different set of input parameters has to be generated to perform the K-means service. Listing 16 contains an example of two input parameters for the algorithm.

1 _:parameter_01
2   a ccdm:MLServiceInputParameter ;
3   ccdm:defaultvalue "" ;
4   ccdm:mandatory "true" ;
5   dc:description
6     "numeric matrix of data" ;
7   dc:title "x"
8 .
9 _:parameter_02
10  a ccdm:MLServiceInputParameter ;
11  ccdm:defaultvalue "3" ;
12  ccdm:mandatory "true" ;
13  dc:description
14    "either the number of clusters." ;
15  dc:title "centres"
16 .
17 ...

Listing 16: Example of parameters and features for K-means.

Other services implemented with dmcc-schema as examples, such as NaiveBayes, LinearRegression, SVM or OPTICS, are available on the supplementary website of the paper(15).

(15) http://dicits.ugr.es/linkeddata/dmservices/#ccdm

In addition, to validate the service created, several SPARQL queries have been run, extracting price information, SLA and interaction data for different service configurations.
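As an indication of the shape of those queries, the following sketch retrieves, for a given service, the hourly price of its regular plan and the compensation attached to its SLA terms. The property names follow listings 2, 5, 7 and 11; the dmcc, ccpricing and ccsla namespace URIs are placeholders here, while the GoodRelations namespace is the published one.

  PREFIX dmcc:      <http://example.org/dmcc#>        # placeholder namespaces for the
  PREFIX ccpricing: <http://example.org/ccpricing#>   # schemas published through LOV
  PREFIX ccsla:     <http://example.org/ccsla#>
  PREFIX gr:        <http://purl.org/goodrelations/v1#>

  SELECT ?service ?hourlyPrice ?slaTerm ?compensation
  WHERE {
    # price of the regular plan, following listings 2, 7 and 11
    ?service        dmcc:hasPricingPlan       ?servicePricing .
    ?servicePricing ccpricing:hasPricing      ?plan .
    ?plan           ccpricing:withMaxCompound ?offer .
    ?offer          gr:hasPriceSpecification  ?priceSpec .
    ?priceSpec      gr:hasCurrencyValue       ?hourlyPrice .

    # SLA terms and their compensation, following listings 2 and 5
    ?service        dmcc:hasServiceCommitment ?sla .
    ?sla            ccsla:containsTerm        ?slaTerm .
    ?slaTerm        ccsla:hasCompensation     ?compensation .
  }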
5 Conclusion

In this paper we have presented dmcc-schema, a scheme and vocabulary to model the definition of DM and ML services in Cloud Computing. It has been built on the Semantic Web using an ontology language and following the Linked Data proposal. In this proposal we have encouraged the reuse of external vocabularies and schemes, which enrich our scheme and avoid redefining Cloud Computing concepts.

The use of dmcc-schema for service definition enables the design of interoperable and lightweight DM services. In addition, it not only focuses on the definition of experiments or algorithms, but also considers fundamental elements for the consumption and discovery of services in Cloud Computing platforms. As mentioned above, it covers all the aspects necessary for the portability of service definitions, such as pricing or SLA, among others.

A key element of dmcc-schema is that it makes it possible to model a Cloud Computing DM provider service, supporting the specific properties and characteristics of the service offered. Our scheme has been designed taking into account the different types of concept definition that are present in the most well-known Cloud Computing providers. The development of the use cases, in which a pair of DM/ML services have been defined, confirms the usefulness of dmcc-schema, allowing all key aspects of these specific services to be defined and modelled. The definition of services in dmcc-schema considers the key aspects of the service in a simple way and with an appropriate level of detail.

Service modelling with dmcc-schema can also be used for cloud brokerage services. For a cloud broker for DM/ML, the design of dmcc-schema allows concepts and terms that have different definitions across the different Cloud Computing providers to be grouped and integrated into a single scheme. This abstracts each provider's particular service and allows working with a service definition that models the differences between them, thus enabling the comparison of DM/ML services in Cloud Computing.

As future work, dmcc-schema will be further enhanced to integrate newer Cloud Computing services such as serverless DM/ML functions and algorithm bundles such as text mining or image classification. It will also be integrated into a DM/ML cloud platform under development, which will offer DM/ML services using the dmcc-schema service definitions. This platform will allow working with the DM/ML service definitions and building the full service from them.

Acknowledgment

M. Parra-Royon holds an "Excelencia" scholarship from the Regional Government of Andalucia (Spain). This work was supported by the Research Projects P12-TIC-2958 and TIN2016-81113-R (Ministry of Economy, Industry and Competitiveness - Government of Spain).
References

[1] ML-Schema: machine learning and data mining ontologies. http://ml-schema.github.io/documentation/ML%20Schema.html#mapping. Accessed: 2018-05-20.

[2] Bhagyashree Ambulkar and Vaishali Borkar. Data mining in cloud computing. In MPGI National Multi Conference, volume 2012, 2012.

[3] David Beckett. Turtle - terse RDF triple language. http://www.ilrt.bris.ac.uk/discovery/2004/01/turtle/, 2008.

[4] Tom Bellwood, Luc Clément, David Ehnebuske, Andrew Hately, Maryann Hondo, Yin Leng Husband, Karsten Januszewski, Sam Lee, Barbara McKee, Joel Munter, et al. UDDI version 3.0. Published specification, OASIS, 5:16–18, 2002.

[5] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The semantic web. Scientific American, 284(5):28–37, 2001.

[6] Michael R. Berthold, Nicolas Cebron, Fabian Dill, Thomas R. Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Christoph Sieb, Kilian Thiel, and Bernd Wiswedel. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.

[7] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.

[8] Tim Bray, Jean Paoli, C. Michael Sperberg-McQueen, Eve Maler, and François Yergeau. Extensible Markup Language (XML). World Wide Web Journal, 2(4):27–66, 1997.

[9] Erik Christensen, Francisco Curbera, Greg Meredith, and Sanjiva Weerawarana. Web Service Definition Language (WSDL). Technical report, World Wide Web Consortium, March 2001.

[10] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, and Blaž Zupan. Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14:2349–2353, 2013.

[11] Alan Dennis, Barbara Haley Wixom, and David Tegarden. Systems Analysis and Design: An Object-Oriented Approach with UML. John Wiley & Sons, 2015.

[12] John Domingue, Dumitru Roman, and Michael Stollberg. Web service modeling ontology (WSMO) - an ontology for semantic web services, 2005.

[13] Diego Esteves, Agnieszka Lawrynowicz, Panče Panov, Larisa Soldatova, Tommaso Soru, and Joaquin Vanschoren. ML Schema Core Specification. Technical report, World Wide Web Consortium, October 2016.
[14] Diego Esteves, Diego Moussallem, Ciro Baron Neto, Jens Lehmann, Maria Cláudia Reis Cavalcanti, and Julio Cesar Duarte. Interoperable machine learning metadata using MEX. In International Semantic Web Conference (Posters & Demos), 2015.

[15] Diego Esteves, Diego Moussallem, Ciro Baron Neto, Tommaso Soru, Ricardo Usbeck, Markus Ackermann, and Jens Lehmann. MEX vocabulary: a lightweight interchange format for machine learning experiments. In Proceedings of the 11th International Conference on Semantic Systems, pages 169–176. ACM, 2015.

[16] Salvador García, Sergio Ramírez-Gallego, Julián Luengo, José Manuel Benítez, and Francisco Herrera. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1):9, 2016.

[17] Souad Ghazouani and Yahya Slimani. A survey on cloud service description. Journal of Network and Computer Applications, 91:61–74, 2017.

[18] Alex Guazzelli, Michael Zeller, Wen-Ching Lin, Graham Williams, et al. PMML: An open standard for sharing models. The R Journal, 1(1):60–65, 2009.

[19] R. V. Guha, Dan Brickley, and Steve MacBeth. Schema.org: Evolution of structured data on the web. Queue, 13(9):10:10–10:37, November 2015.

[20] Marc J. Hadley. Web Application Description Language (WADL). 23, 2006.

[21] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18, November 2009.

[22] John A. Hartigan and Manchek A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.

[23] Tom Heath and Christian Bizer. Linked Data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[24] Martin Hepp. GoodRelations: An ontology for describing products and services offers on the web. Knowledge Engineering: Practice and Patterns, pages 329–346, 2008.

[25] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.

[26] Gartner Inc. Gartner says 6.4 billion connected 'things' will be in use in 2016. Technical report, 2016.

[27] Matthias Klusch. Service discovery. In Encyclopedia of Social Network Analysis and Mining, pages 1707–1717. Springer, 2014.

[28] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and abstract syntax. 2006.
[29] Srividya Kona, Ajay Bansal, Luke Simon, Ajay Mallya, Gopal Gupta, and Thomas D. Hite. USDL: A service-semantics description language for automatic service discovery and composition. International Journal of Web Services Research, 6(1):20, 2009.

[30] Jacek Kopecký, Tomas Vitvar, Carine Bournez, and Joel Farrell. SAWSDL: Semantic annotations for WSDL and XML Schema. IEEE Internet Computing, 11(6), 2007.

[31] Jacek Kopecký, Tomas Vitvar, Dieter Fensel, and Karthik Gomadam. hRESTS & MicroWSMO. STI International, Tech. Rep., 2009.

[32] Xiaoyong Li and Junping Du. Adaptive and attribute-based trust model for service-level agreement guarantee in cloud computing. IET Information Security, 7(1):39–50, 2013.

[33] Dan Lin, Anna C. Squicciarini, Venkata N. Dondapati, and Smitha Sundareswaran. A cloud brokerage architecture for efficient cloud service selection. IEEE Transactions on Services Computing, 2016.

[34] Ling Liu. Services computing: from cloud services, mobile services to internet of services. IEEE Transactions on Services Computing, 9(5):661–663, 2016.

[35] Maria Maleshkova, Carlos Pedrinaci, John Domingue, Guillermo Alvaro, and Ivan Martinez. Using semantics for automating the authentication of web APIs. The Semantic Web – ISWC 2010, pages 534–549, 2010.

[36] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. A workflow management system for scalable data mining on clouds. IEEE Transactions on Services Computing, 2016.

[37] David Martin, Mark Burstein, Jerry Hobbs, Ora Lassila, Drew McDermott, Sheila McIlraith, Srini Narayanan, Massimo Paolucci, Bijan Parsia, Terry Payne, et al. OWL-S: Semantic markup for web services. W3C Member Submission, 22:2007–04, 2004.

[38] Eric Newcomer and Greg Lomow. Understanding SOA with Web Services. Addison-Wesley, 2005.

[39] Pance Panov, Sašo Džeroski, and Larisa Soldatova. OntoDM: An ontology of data mining. In Data Mining Workshops, 2008. ICDMW'08. IEEE International Conference on, pages 752–760. IEEE, 2008.

[40] Carlos Pedrinaci, Jorge Cardoso, and Torsten Leidig. Linked USDL: a vocabulary for web-scale service trading. In European Semantic Web Conference, pages 68–82. Springer, 2014.

[41] Eric Prud'hommeaux, Andy Seaborne, et al. SPARQL query language for RDF. 2006.

[42] Sergio Ramírez-Gallego, Alberto Fernández, Salvador García, Min Chen, and Francisco Herrera. Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Information Fusion, 42:51–61, 2018.
[43] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5):637–646, 2016.

[44] Mohsen Taheriyan, Craig A. Knoblock, Pedro Szekely, and José Luis Ambite. Rapidly integrating services into the linked data cloud. In International Semantic Web Conference, pages 559–574. Springer, 2012.

[45] R Core Team et al. R: A language and environment for statistical computing. 2013.

[46] Pierre-Yves Vandenbussche, Ghislain A. Atemezing, María Poveda-Villalón, and Bernard Vatant. Linked Open Vocabularies (LOV): a gateway to reusable semantic vocabularies on the web. Semantic Web, 8(3):437–452, 2017.

[47] Joaquin Vanschoren and Larisa Soldatova. Exposé: An ontology for data mining experiments. In International Workshop on Third Generation Data Mining: Towards Service-Oriented Knowledge Discovery (SoKD-2010), pages 31–46, 2010.

[48] Zibin Zheng, Jieming Zhu, and Michael R. Lyu. Service-generated big data and big data-as-a-service: an overview. In Big Data (BigData Congress), 2013 IEEE International Congress on, pages 403–410. IEEE, 2013.