A distributed framework for information retrieval, processing and presentation of data
2018 22nd International Conference on System Theory, Control and Computing (ICSTCC)

Adrian Alexandrescu
Department of Computer Science and Engineering, Faculty of Automatic Control and Computer Engineering, "Gheorghe Asachi" Technical University of Iași, Iași, Romania
aalexandrescu@tuiasi.ro

Abstract—The Internet is ever growing and its information is highly unstructured. Finding the best-priced product when shopping online is a difficult task, but shopping search engines help the user identify the desired product. Starting from this reality, this paper presents a distributed framework for information retrieval, processing and presentation of data, which has three main components: a data server cluster, a distributed crawler with a product extraction feature, and a web server for presenting the processed data. The considered use case consists of extracting information about board and card games from specialized websites and presenting the data on a user-friendly website. Efficiency is determined by means of performance evaluation metrics that include crawl and extraction times, and through a discussion in terms of scalability. The high modularization of the framework also makes it an effective teaching tool: each module can easily be swapped with a custom implementation. The proposed framework offers students and researchers an environment for developing and testing a wide range of algorithms and programming solutions, e.g., load balancers, crawlers, reverse engineering of web templates, product extraction from web pages, data normalization, notification services, web APIs, and custom databases.

Keywords—information retrieval, distributed crawler, shopping search engine, data mining, web scraping, micro service architecture, learning framework
I. INTRODUCTION

The information available on the Internet grows continuously, and the number of users with Internet access exceeded 50% of the world population in 2017 [1]. This means that online businesses now have a broader audience to which they can sell their products and services. From a user's point of view, finding a specific product online can sometimes prove difficult, especially when looking for the most cost-effective offer. Shopping search engines, such as Google Shopping, Shopzilla and PriceGrabber [2], help narrow down the search. When developing a shopping search engine, the biggest problem is from where and how to gather the product information.

There is a comprehensive literature on information retrieval in the general sense; this includes the required mathematical model, structuring the data, constructing the index, and computing scores in search systems [3]. Much of the existing related work focuses on having a distributed crawler and on parsing web pages to extract words, as opposed to extracting product information, which requires a more complex approach to parsing.

In [4] the authors present a distributed web crawler that uses multi-threaded agents to scan the web. An assignment function ensures that each agent receives approximately the same number of websites, and that set of websites changes when an agent activates or deactivates. In the solution proposed here, each crawler obtains the next URL to process from a centralized database, so there is no need for website reallocation, and fault tolerance is achieved by means of the database cluster. Other existing crawler solutions use idle computing resources [5] or sitemaps to find the site URLs [6]. Regarding structured data extraction, the authors of [7] present a survey of existing applications and techniques; they refer to a commercial data extraction solution whose implementation is a proprietary secret. In [8] the authors use deep neural networks to determine the page template; this approach will be considered in future improvements of our Template Provider module.

Building an effective shopping search engine must start with a stable and, especially, scalable system architecture. This paper proposes a highly modularized and distributed framework for information retrieval, processing and presentation of data. The context and the issues that arise from tackling the considered problem are discussed in the Problem Statement section. Afterwards, the proposed framework and all of its components are presented, while the next section describes a practical and working use-case scenario built on the proposed framework: a shopping search engine for board and card games. Lastly, the conclusions are presented, with an emphasis on the future work and research that can be done using the proposed system.

The novelty of this paper consists in the high modularization and distribution of the framework and in its great potential for researching and developing new techniques for the different modules (e.g., increasing the system scalability, product extraction, data normalization).

II. PROBLEM STATEMENT

The considered problem is twofold. Firstly, it means finding an efficient solution for gathering product information from different shopping websites, processing that information, and offering the user a single access point for finding the products that best suit the user's needs. Secondly, it means designing a framework that allows researchers and students to develop, test and evaluate algorithms for the many problems that the aforementioned scenario entails. The ideal goal is to have a shopping search engine where the user can look for any product or service that is sold on the web.

One of the first problems is obtaining a list of all the shopping websites. This means crawling the web and, for each website, determining whether that website sells products. Then, for each shopping website, another problem is determining which pages contain product information and where exactly on the page that data is located. If a website is built on a known shopping platform (e.g., BigCommerce, WIX, Weebly, Zen Cart, OpenCart) [9][10], the problem becomes simpler, as long as the platform's naming conventions remain the same. For example, on the WIX platform, the product name can be found on the product page in the h2 HTML element having the product-name class. Otherwise, each shopping website has its own product page structure, which has to be determined.
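To make the selector-based lookup concrete, the following minimal sketch uses the Jsoup library (which the implementation section later names as the framework's HTML parsing tool); the URL is illustrative, and the h2.product-name selector is the WIX convention described above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProductNameExtractor {
    public static void main(String[] args) throws Exception {
        // Illustrative URL; any product page of a WIX-based shop would do.
        Document doc = Jsoup.connect("https://example-shop.com/product/123").get();
        // On the WIX platform the product name sits in an h2 element
        // with the "product-name" class, as described above.
        Element nameElement = doc.selectFirst("h2.product-name");
        if (nameElement != null) {
            System.out.println("Product name: " + nameElement.text());
        } else {
            System.out.println("Template does not match this page.");
        }
    }
}
```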
Each product has a list of mandatory attributes (i.e., name, manufacturer and price) and other optional attributes that depend on the product type (e.g., description, size). In order to simplify the problem, the considered attributes are product name, description, image, price and availability. However, this restriction does not affect the efficacy of the desired solution.

After the product information is extracted from the different shopping websites, another problem is product normalization, i.e., determining whether two products taken from different websites are in fact the same product and, if the product names are written slightly differently, which of them is the proper product name. For example, in [11] the authors propose a method for normalizing mobile phone names in Internet forums. Gathering all this information requires significant Internet bandwidth, high-performance computing capabilities and a large amount of storage. Cloud solutions [12] such as Google Cloud Platform or Amazon EC2 provide turn-key platforms for deploying the multiple crawlers, parsers, extractors and other modules that such a complex system requires. In order for the system to be efficiently deployed in a cloud environment, one of the most important aspects is scalability. If the system is highly modular and the modules have a high degree of independence, then there can be as many instances of each module as required to achieve maximum efficiency.

In terms of presenting the data to the user, a shopping search engine must provide product information quickly and accurately. The main problems here are how to index the data and how to decide which product ranks higher or lower in the search results. This process is similar to the page ranking performed by general-purpose search engines, but it can also be refined and fine-tuned to the user's needs based on the product characteristics (e.g., color, weight, size, shape) and the user's shopping behavior (i.e., previous purchases).

As shown, there are many aspects to designing and implementing an efficient shopping search engine. The novel solution presented in this paper focuses on the overall architecture of such a system, the modularization of the problem and the relationships between the different system modules. Some of the modules have trivial functionality and are not within the scope of this paper, but the overall system design that is proposed is meant to be a practical solution to the shopping search engine problem.

A positive by-product of having a highly modular system is the possibility of swapping its modules with custom implementations. Therefore, another problem treated in this paper concerns teaching and researching solutions to common problems that occur in distributed systems. Teaching algorithms to students while offering little practical context leads to a decrease in the students' interest and in their ability to fully understand those algorithms. A practical, distributed and modular system allows students to test their implementations and see the effects of their solutions on the overall performance of the system. From a researcher's point of view, developing a shopping search engine and having a working framework provides multiple opportunities to discover better and more efficient solutions to the many problems that such a system presents.

III. PROPOSED SOLUTION

A. System Architecture

This paper proposes a novel, highly modularized framework for information retrieval, processing and presentation of data. Communication between the modules is done similarly to the pipes and filters architectural pattern [13], i.e., the modules are chained so that the output of one module is the input of another. The proposed solution is designed for gathering product information, but it can easily be adapted to other web data extraction tasks by simply swapping a single module.
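As an illustration of this chaining, a module contract could be expressed as in the sketch below; the Module interface and the then method are assumptions made for illustration, not the framework's actual API.

```java
// Minimal sketch of a pipes-and-filters module contract: each module
// consumes the output of the previous one; I and O are input/output types.
interface Module<I, O> {
    O process(I input);

    // Chain this module with the next filter in the pipeline.
    default <R> Module<I, R> then(Module<O, R> next) {
        return input -> next.process(this.process(input));
    }
}
```

With such a contract, swapping a single module (e.g., replacing the product extractor with a generic data extractor) leaves the rest of the chain untouched.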
The system architecture of the proposed framework is presented in Fig. 1. The overall data flow is as follows: multiple Site Crawlers coordinated by a Crawler Manager gather product information from the web and store that data on a Data Server and on a Product Web Server by means of micro web services. There are three main components: the distributed crawler, the persistent storage (i.e., the Data Server) and the product web server used for presentation. Therefore, the framework can be seen as a three-tier application where the communication between tiers is achieved via micro web services and where each component is highly modularized. Each of the modules is discussed below in terms of its current implementation and of how it can be improved. In a teaching environment, the students can use their own implementations for each module.

Initializer. The crawl process is initialized with a seed file, which contains the list of sites that are about to be crawled, and a config file, which contains general crawl parameters.

Scheduler. Product prices can change often due to temporary promotions or permanent increases or decreases in the product cost. Therefore, the crawler must extract information at least daily in order to offer accurate information to the clients. The role of the Scheduler module is to trigger the recrawl of the product websites. Currently it is set to recrawl daily at 8 o'clock, but another option would be an artificial intelligence approach that recrawls each site with a different periodicity, based on the product types being sold and on that website's history of fluctuating prices.
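A minimal sketch of such a daily trigger, assuming a scheduled executor and a placeholder recrawl call, could look as follows:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Scheduler {
    public static void main(String[] args) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        LocalDateTime now = LocalDateTime.now();
        LocalDateTime nextRun = now.toLocalDate().atTime(LocalTime.of(8, 0));
        if (!nextRun.isAfter(now)) {
            nextRun = nextRun.plusDays(1); // 8:00 has already passed today
        }
        long initialDelayMinutes = Duration.between(now, nextRun).toMinutes();
        // Trigger the recrawl every 24 hours, starting at the next 8:00.
        executor.scheduleAtFixedRate(
                () -> System.out.println("Triggering recrawl..."), // placeholder for the Crawler Manager call
                initialDelayMinutes, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
    }
}
```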
Crawler Manager. Once a crawl is scheduled, the Instance Manager module of the Crawler Manager decides how many crawler instances are required to finish crawling in a timely manner while abiding by crawl best practices (i.e., obeying each site's robots.txt file and not making so many requests that the crawler gets banned). When the required instances are determined, the Load Balancer assigns sites to each crawler. Currently, the Load Balancer is based on the Thread Pool design pattern: a fixed number of crawlers are available and each crawler processes a single website; after a site is processed, that crawler moves on to the next website, and so on.
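The following simplified sketch illustrates this Thread Pool behaviour, with a plain in-memory queue standing in for the database-backed site list (the site names are those used later in the use-case scenario):

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LoadBalancer {
    public static void main(String[] args) {
        Queue<String> sites = new ConcurrentLinkedQueue<>(List.of(
                "www.lexshop.ro", "www.pionul.ro",
                "www.redgoblin.ro", "www.regatuljocurilor.ro"));
        int crawlerCount = 4; // decided by the Instance Manager
        ExecutorService pool = Executors.newFixedThreadPool(crawlerCount);
        for (int i = 0; i < crawlerCount; i++) {
            pool.submit(() -> {
                String site;
                // Each crawler processes one site at a time, then takes the next.
                while ((site = sites.poll()) != null) {
                    System.out.println(Thread.currentThread().getName()
                            + " crawling " + site); // placeholder for the Site Crawler
                }
            });
        }
        pool.shutdown();
    }
}
```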
Site Crawler. The largest component of the framework is the Site Crawler, which receives a web site name from the Crawler Manager and starts processing it. This component has multiple modules, which are discussed next. The framework has support for using a third-party crawler, but an adapter has to be written in order for that crawler to be properly used by the Crawler Manager.

Starter. Firstly, this module checks whether all the pages of the web site have finished being processed in the previous crawl, and it sets each page state to waiting. This check is required in order to allow multiple crawlers to process the same website; in that case, only the first crawler resets the page status. It makes sense to have multiple crawlers on the same website, especially if each crawler has a different real IP address.

Fig. 1. System architecture for the proposed distributed framework for information retrieval, processing and presentation of data

Crawler WebAPI Manager. All the communication with the persistent storage is done by means of micro web services, which are discussed in the section describing the Data Server.

Template Provider. The module with the biggest research potential is the Template Provider. As the name suggests, this module provides the HTML template of the product pages. Ideally, this template is determined by analyzing the structure of the pages and determining which pages contain product information and where it is located on the page (e.g., what the hierarchy of HTML tags leading to the product name is). At this stage of the framework, the template for each site is given in the seed file in the form of a JSON file.

Fetcher. A critical module is the Fetcher, which retrieves each page from the web and passes that data, along with the template, to the Extractor.

Extractor. The Extractor module allows different sub-modules to extract information from the web page. A mandatory implementation for a crawler is the Link Extractor, which extracts the URLs from the page and stores them in the database. Because the goal is to have a shopping search engine, a Product Extractor module is also mandatory. It coordinates four modules: the Product Page Filter, which determines whether the page contains product information; the Data Field Extractor, which finds a specific product attribute (e.g., product name, price) and extracts its value; the Data Normalizer, which removes unnecessary characters from the product attribute values (this module has research potential for finding a solution to link products from different websites that have slightly different product names but are in fact the same product); and the Notification Service, which sends a notification when it detects that the existing template cannot be applied to a product page, usually because of an update to the site structure.
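A sketch of how the Product Extractor could coordinate these four modules is shown below; the interfaces and method names are assumptions made for illustration, and only name extraction is shown.

```java
import org.jsoup.nodes.Document;

// Hypothetical coordination logic for the Product Extractor; the four
// collaborators mirror the modules described in the text.
public class ProductExtractor {
    interface ProductPageFilter { boolean containsProductInfo(Document page); }
    interface DataFieldExtractor { String extract(Document page, String selector); }
    interface DataNormalizer { String normalize(String rawValue); }
    interface NotificationService { void templateMismatch(String pageUrl); }

    private final ProductPageFilter filter;
    private final DataFieldExtractor fields;
    private final DataNormalizer normalizer;
    private final NotificationService notifier;

    ProductExtractor(ProductPageFilter f, DataFieldExtractor d,
                     DataNormalizer n, NotificationService s) {
        filter = f; fields = d; normalizer = n; notifier = s;
    }

    /** Returns the normalized product name, or null if extraction fails. */
    public String extractName(Document page, String nameSelector) {
        if (!filter.containsProductInfo(page)) {
            return null; // not a product page, nothing to extract
        }
        String raw = fields.extract(page, nameSelector);
        if (raw == null) {
            // The site structure likely changed; flag the template as stale.
            notifier.templateMismatch(page.location());
            return null;
        }
        return normalizer.normalize(raw);
    }
}
```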
WebAPI Manager. Whenever a product is extracted, the WebAPI Manager modules call micro web services to store the product information. Besides updating the Data Server, the manager also updates the Product Web Server. This is done so that there is a clear separation between the database used by the crawler and the database used by the shopping search engine, in order to increase security and performance. The Product WebAPI Manager does not send the entire product data if that product is already in the database; rather, it sends only the product attributes that have changed since the previous crawl. It can also send batch updates, so that fewer calls are made to the WebAPI Server.

Data Server. The site, web page and product information are stored in a database cluster by means of a WebAPI Server that uses access tokens for an extra layer of security. The exact type and implementation of the database cluster are beyond the scope of this paper, but a practical and simple approach is presented in the Use-Case Scenario section.

Product Web Server. The user interface consists of a web server that permits product searches by querying a database by means of web services. At this point, the user can only search by name, thanks to the indexing of every word in the product names. Another important feature is the price history for each product, which allows implementing notifications to the user about price changes, based on different criteria.

B. WebAPI Server

Micro web services employing the REST standard are used in order to keep the three main components separate. There are two WebAPI Servers: one at the Data Server and the other at the Product Web Server. There are some differences regarding the HTTP methods accepted by each of the two servers, as shown in Table I. For the sake of brevity, the web service methods that deal with the price history (/products/id/priceHistory), the crawl reports (/reports) and the system notifications (/notifications) are omitted from the table.

TABLE I. MICRO WEB SERVICES FOR THE FRAMEWORK RESOURCES

| URL | HTTP method | Description |
|---|---|---|
| /sites | GET | Returns the list of web sites |
| /sites | POST (a) | Adds a new web site |
| /sites/sid | GET | Returns the site with the specified id |
| /sites/sid/template | PUT (a) | Updates the template for the site id |
| /sites/sid/pages?state=waiting&limit=1 | GET (a) | Returns the next page waiting to be processed |
| /sites/sid/pages | POST (a) | Adds a new page |
| /sites/sid/pages/[pid] | PATCH (a) | Updates one or more pages (e.g., page status, retrieved date) |
| /products?search=terms | GET | Performs a product search by name |
| /products | POST (a) | Adds a new product |
| /products | PATCH (b) | Batch-updates product details |

a. Method available only at the Data Server, and not at the Product Web Server.
b. Method called only from the WebAPI Manager(s), and not from the Web Site Business Logic.
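For example, a batch update against the /products resource from Table I could be issued as in the sketch below, which uses the JDK's standard HTTP client rather than the framework's actual Jersey-based manager; the host name, token and JSON payload are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProductBatchUpdate {
    public static void main(String[] args) throws Exception {
        // Only the changed attributes are sent, for each product in the batch.
        String body = "[{\"id\": \"42\", \"price\": 119.99, \"availability\": \"in stock\"},"
                    + " {\"id\": \"43\", \"price\": 89.50}]";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://data-server.example.com/products"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer <access-token>") // token-secured WebAPI
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
    }
}
```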
C. Scalability

System scalability is achieved mainly by allowing multiple crawler instances to process the web sites. The Instance Manager and Load Balancer modules ensure that the system handles any load, as long as there are computational resources available. When using a cloud solution, the system scales seamlessly. The Crawler Manager can be improved by using neural network regression [14] to dynamically scale up and down depending on the load at a specific point in time.

In order to increase the scalability of the system even more, each module of the Site Crawler component can be a separate process on a different computer. The main issue is the time it takes to transmit data between the processes, but in certain instances it makes sense to perform this separation. For example, the Template Provider can easily be decoupled to perform the computing-intensive task of determining the template. Another example is separating the Fetcher from the Extractor, especially if extracting the information takes much longer than getting the web page content. In this case, it is better for the Fetcher to store the pages in the database and then for multiple Extractors to process those pages.
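Within a single process, this Fetcher/Extractor decoupling can be approximated with a producer-consumer queue, as in the simplified sketch below (in the framework itself, the database plays the role of the queue):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class FetcherExtractorPipeline {
    public static void main(String[] args) {
        // Fetched page contents waiting to be processed by an extractor.
        BlockingQueue<String> fetchedPages = new ArrayBlockingQueue<>(100);

        Thread fetcher = new Thread(() -> {
            for (String url : List.of("/page1", "/page2", "/page3")) {
                try {
                    // Placeholder for the real page retrieval.
                    fetchedPages.put("<html>content of " + url + "</html>");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        // Several extractors can consume pages in parallel; one is shown.
        Thread extractor = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("Extracting from: " + fetchedPages.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        extractor.setDaemon(true); // sketch only: let the JVM exit when fetching ends

        fetcher.start();
        extractor.start();
    }
}
```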
The current framework design uses two databases with somewhat duplicate site and product information, for security reasons. Instead, a single database cluster could be used, with different authentication tokens for each module that interacts with the database by means of the micro web services.

IV. USE-CASE SCENARIO, RESULTS AND DISCUSSION

The considered use-case scenario implies having a list of known product websites that sell board and card games and extracting that information so that the user can search for the best-priced products. The crawled websites are: www.lexshop.ro, www.pionul.ro, www.redgoblin.ro and www.regatuljocurilor.ro. All the crawling is done on the same computer in order to accurately measure performance.

Regarding the implementation, the framework is written in Java; the Data Server is an Apache Tomcat server with a MongoDB database, while the Product Web Server is an App Engine server [15] deployed on the Google Cloud Platform with Google Datastore [16] as the database. The reasoning for using the Google Cloud Platform is that it is free as long as the daily quotas are not exceeded. Optimizations were made using special indexes and memory caches in order to prevent those quotas from being exceeded too quickly. For the web services, the Jersey library was used on all three servers. The website for the board game search engine described in this section is shown in Fig. 2 and can be accessed at http://boardgamesearch.h23.ro.

Fig. 2. The search page with results for the board game search engine website: http://boardgamesearch.h23.ro

Next, the implementation of each framework module is presented, together with a discussion of the decisions that were made in order to have a simple yet effective working example. The emphasis is on the behavioral differences of each module compared to the descriptions from the previous section.

Regarding the framework's database, the main entities are shown in Fig. 3. For security and performance reasons, the Product Web Server database uses only the Sites table (only the id, name, urlBase and logoUrl attributes), the Products table and the PriceHistoryItem table. The web services make the access to the databases schema-agnostic, so either of the two databases can easily be relational or non-relational.

Fig. 3. Entity-relation diagram showing the main entities of the proposed framework's database: Sites (id, name, urlBase, logoUrl, seedList, pageRegEx, templateId), Pages (id, siteId, urlPath, addedDate, retrievedDate, state), Template (id, base, name, description, imageUrl, price, availability), Products (id, siteId, url, name, description, imageUrl, price, addedDate, retrievedDate, availability, inCurrentRun) and PriceHistoryItem (id, productId, productUrl, retrievedDate, price, availability)
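Based on the attributes in Fig. 3, a minimal Product entity could look like the following sketch (the field names are taken from the diagram, while the Java types are assumptions):

```java
import java.util.Date;

// Product entity mirroring the attributes in Fig. 3 (types are assumed).
public class Product {
    private String id;
    private String siteId;        // FK to the Sites entity
    private String url;
    private String name;
    private String description;
    private String imageUrl;
    private double price;
    private Date addedDate;
    private Date retrievedDate;
    private String availability;
    private boolean inCurrentRun; // seen during the current crawl?
    // getters and setters omitted for brevity
}
```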
The framework uses the config and seed files to run the crawler daily at 8 o'clock on the four websites. Each time the recrawl starts, the Crawler Manager creates four crawlers that run in parallel: one for each web site. In this scenario, there is no need for multiple crawlers to process the same website, because the waiting time between consecutive page retrievals from a specific site is between 1 and 5 seconds, so that the site does not ban the crawler's IP.
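This politeness delay could be implemented along the following lines (a simplified sketch; the fetch call is a placeholder):

```java
import java.util.List;
import java.util.Random;

public class PoliteCrawler {
    public static void main(String[] args) throws InterruptedException {
        Random random = new Random();
        for (String url : List.of("/category/1", "/category/2", "/category/3")) {
            System.out.println("Fetching " + url); // placeholder for the Fetcher
            // Wait between 1 and 5 seconds before the next request to the
            // same site, so the crawler's IP does not get banned.
            Thread.sleep(1000 + random.nextInt(4000));
        }
    }
}
```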
After a crawler sets all the page states for its site to waiting, the Template Provider module simply returns a JSON file containing the selectors (i.e., the template), which were determined by manually analyzing the pages. Then the Fetcher retrieves the first page in the waiting state, as returned by the WebAPI Server. Once the content of the page is received, the Link Extractor finds all the URLs in the page and updates them in the database via the micro web services. In parallel, the Product Extractor uses the template to get the product information, which gets stored on the Data Server and on the Product Web Server.

Table II presents a few experimental results from crawling the four aforementioned websites. The values shown represent the average results obtained by running the crawlers five times, at different times of day. For this use-case scenario, some optimizations were made on the proposed framework in order to speed up the crawling. Only specific web pages are processed, because the Product Page Filter allows extraction of products only from multi-product pages. Also, only multi-product page links are added to the database by the Link Extractor (which uses the same filter). This way, fewer pages are processed: an average of 1792 products per site were extracted at a rate of 161 pages per site.

In terms of the extraction times, the link extraction depends significantly on the number of links that pass the filter and are stored in the database. For example, on the Red Goblin website there are on average 678 links per page, but only 47 relevant links. When it comes to products, the extraction and upload times depend on the number of products on each page.

The total processing time of each web page is the sum of the times it takes to retrieve the page from the web, extract the links and the products from that page, and upload the product information to the remote presentation server. Each page takes an average of 2.15 seconds to process. The time it takes to process all four websites is around 92 minutes, during which a total of 7167 products are extracted from 642 pages.

TABLE II. EXPERIMENTAL RESULTS OBTAINED BY USING THE FRAMEWORK TO PROCESS FOUR BOARD GAME SHOPPING SITES

| | Lexshop | Pionul | Red goblin | Regatul jocurilor |
|---|---|---|---|---|
| Total number of unique products | 1420 | 318 | 1603 | 3826 |
| Total number of processed pages | 178 | 35 | 48 | 381 |
| Extracted links count (a) | 281 | 86 | 678 | 268 |
| Extracted links time (a)(b) | 51 ms | 80 ms | 47 ms | 43 ms |
| Extracted product count (a) | 12 | 20 | 51 | 21 |
| Extracted product time (a) | 73 ms | 66 ms | 172 ms | 50 ms |
| Retrieval time (a) | 662 ms | 711 ms | 2741 ms | 2485 ms |
| Product upload time (a) | 330 ms | 333 ms | 409 ms | 346 ms |
| Total processing time (a) | 1116 ms | 1191 ms | 3370 ms | 2924 ms |

a. Average values per page.
b. Includes the time it takes to store the link in the database.

The scope of the board game search engine scenario is to show the usefulness of the framework and to emphasize its ease of use. Further research will be conducted to find new and better solutions for each of the modules and to obtain a fully functional shopping search engine.

V. CONCLUSION AND FUTURE WORK

The scope of the research presented in this paper was to design and implement a novel, modularized, distributed framework for information retrieval, processing and presentation, and to use that framework as a stepping stone for researching the different aspects that the considered environment presents. Besides the highly emphasized advantage of high modularization, the proposed framework is an effective solution for developing a complex shopping search engine. The final goal is to have a fully autonomous system that detects product websites, correctly extracts products and links the same products across different websites, so as to provide an excellent purchase experience in determining the most cost-effective products.

In its current form, the framework is a straightforward shopping search engine solution. On the other hand, the potential of the proposed framework is great, as it can provide a development environment (for researchers and students alike) for problems specific to distributed systems (load balancing, fault tolerance, replication, scalability), product detection and extraction from websites, distributed relational and non-relational database systems, indexing techniques, notification services, and recommendation systems.

Improvements can be made to speed up the extraction, because the current framework implementation of the Extractor uses the Jsoup selector syntax to find the information on the page; a sequential parsing approach would probably have yielded better results. Another optimization can be made to the method of uploading the links and products to the database, e.g., by using a local cache and making batch uploads. Also, it is important to find a good solution for the Data Normalizer module: for the presented scenario, the implementation allows multiple copies of the same product in the database, as long as they come from different websites. In terms of the Template Provider, the deep neural network approach will be considered in the future research that will be done to expand the proposed framework.

REFERENCES

[1] Internetworldstats.com, "Internet growth statistics", 2018. Available: https://www.internetworldstats.com/emarketing.htm [Accessed: 04.04.2018].
[2] Searchenginewatch.com, "The 10 Best Shopping Engines", 2014. Available: https://searchenginewatch.com/sew/study/2097413/shopping-engines [Accessed: 04.04.2018].
[3] R. R. Larson, "Introduction to information retrieval", Journal of the American Society for Information Science and Technology, vol. 61, no. 4, pp. 852-853, 2010.
[4] P. Boldi, B. Codenotti, M. Santini and S. Vigna, "UbiCrawler: A scalable fully distributed web crawler", Software: Practice and Experience, vol. 34, no. 8, pp. 711-726, 2004.
[5] M. Thelwall, "A web crawler design for data mining", Journal of Information Science, vol. 27, pp. 319-325, 2001.
[6] S. B. Brawer, I. B. E. L. Max, R. M. Keller and N. Shivakumar, "Web crawler scheduler that utilizes sitemaps from websites", Google Inc., U.S. Patent 9,355,177, 2016.
[7] E. Ferrara, P. De Meo, G. Fiumara and R. Baumgartner, "Web data extraction, applications and techniques: A survey", Knowledge-Based Systems, vol. 70, pp. 301-323, 2014.
[8] T. Gogar, O. Hubacek and J. Sedivy, "Deep neural networks for web page information extraction", IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 154-163, Springer, Cham, 2016.
[9] Top10ecommercesitebuilders.com, "Best Ecommerce Site Builders of 2018", 2018. Available: https://www.top10ecommercesitebuilders.com/ [Accessed: 11.05.2018].
[10] Ecommerce-platforms.com, "11 Best Open Source and Free Ecommerce Platforms for 2018", 2018. Available: https://ecommerce-platforms.com/articles/open-source-ecommerce-platforms [Accessed: 11.05.2018].
[11] Y. Yao and A. Sun, "Product name recognition and normalization in internet forums", SIGIR Symposium on IR in Practice (SIGIR Industry Track), 2014.
[12] Q. Zhang, L. Cheng and R. Boutaba, "Cloud computing: state-of-the-art and research challenges", Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7-18, 2010.
[13] C. Wulf, N. C. Ehmke and W. Hasselbring, "Toward a generic and concurrency-aware pipes & filters framework", Symposium on Software Performance 2014: Joint Descartes/Kieker/Palladio Days (SOSP 2014), 2014.
[14] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat and R. Adams, "Scalable Bayesian optimization using deep neural networks", International Conference on Machine Learning, pp. 2171-2180, 2015.
[15] Cloud.google.com, "Google App Engine", 2018. Available: https://cloud.google.com/appengine/ [Accessed: 30.03.2018].
[16] I. Shabani, A. Kovaçi and A. Dika, "Possibilities offered by Google App Engine for developing distributed applications using datastore", 2014 Sixth International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), pp. 113-118, IEEE, 2014.