Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra
A Quick Reference Configuration Guide

Kris Applegate – kris_applegate@dell.com
Solution Architect, Dell Solution Centers

Dave Jaffe – dave_jaffe@dell.com
Solution Architect, Dell Solution Centers

Armando Acosta – armando_acosta@dell.com
Big Data Product Manager, Dell Revolutionary Cloud and Big Data Group

Rob Wilbert – robert_wilbert@dell.com
Solution Architect, Dell Solution Centers
Executive Summary

This document details the configuration set-up for DataStax Enterprise (DSE) software on PowerEdge R-Series servers. The intended audience is customers and solution architects looking for information on configuring DSE clusters within their information technology environment for “always on” transaction processing.

The reference configuration introduces the server set-ups that can run the DataStax Enterprise stack. The document focuses only on configuration; it does not go into detail about DSE or Apache Cassandra solution software components, or about resiliency, performance, or software considerations. It does not cover best practices or a complete architecture for a DSE solution. Additional DataStax Enterprise installation, administration, and optimization guides are available on the websites listed under Links at the end of this document. Dell developed this document to help streamline configuration for the DataStax Enterprise software.

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

© 2014 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Intel and Xeon are registered trademarks of Intel Corp. Red Hat is a registered trademark of Red Hat Inc. Linux is a registered trademark of Linus Torvalds. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in trademarks and trade names other than its own.
Introduction

In the age of Big Data, applications operate on a global scale, and they must meet the always-on demands of their developers and their users. DataStax Enterprise is uniquely suited to address the database demands of continuously available, globally distributed online applications.

Over the last two to three years, customers have used Hadoop as a tool to help analyze large volumes of structured, semi-structured, and unstructured data. Hadoop is a valuable tool, yet as customers' use-cases evolve, new tools are emerging that continue to add value to the Big Data ecosystem. NoSQL database technologies are a prime example of a new tool being integrated with Hadoop to allow low-latency read/write access to data.

Apache Cassandra is one such NoSQL database. It provides a single database that is distributed geographically over multiple data centers, which gives it a high level of reliability. Cassandra's architecture captures data at extremely high ingest rates, which is valuable for Internet-of-Things applications that capture large quantities of time-series data that is then analyzed to provide value to the community of users. DataStax Enterprise enhances the capabilities of Apache Cassandra by providing management services that facilitate cluster operations and maintenance.

DataStax Enterprise and Hadoop are complementary. In a number of use-cases, a NoSQL database such as DataStax Enterprise serves as the real-time read/write, always-available database, while Hadoop serves as the back-end engine that helps users analyze large volumes of structured, semi-structured, and unstructured data in more of a batch fashion. Within this integrated data hub, customers can run algorithms on integrated, disparate data from relational database management systems, enterprise data warehouses, and other sources.
Additionally, a data science workbench may be layered on top to provide analytics tools that transform the results into actionable information using search, data visualizations, and reporting/analysis. These new environments are applicable across multiple vertical markets, including Government Intelligence, Healthcare, Financials, Manufacturing, Telco/Media, Retail, Web 2.0, and more.

To help support this customer use-case, Dell is partnering with DataStax to execute a reference configuration in the Dell Solution Center. DataStax Enterprise is a NoSQL big data platform powered by production-certified Apache Cassandra that is architected for today's line-of-business applications and designed to securely manage real-time, analytic, and search data all in the same database cluster.

DataStax Enterprise uses a peer-to-peer distributed architecture model in which all nodes inside a cluster are the same. Data is automatically partitioned and distributed among all the nodes. Often, two or more data center locations are used and nodes are distributed among the physical locations. OpsCenter is a global management and monitoring tool that administers Cassandra and DSE clusters.
Reference Configuration

Apache Cassandra is an open source, massively scalable NoSQL (non-relational) database. DataStax is a Dell partner that, in addition to contributing to the Apache Cassandra project, offers a commercial version in both community and enterprise editions. DataStax Enterprise is available for multiple distributions of Linux. This initial configuration targets deployment on bare-metal servers running DataStax Enterprise 3.2.1 on Red Hat Enterprise Linux 6.4.

DataStax Enterprise can be used to rapidly ingest transactional data for a variety of emerging workloads. These workloads share a common need for a continuously available, distributed, read/write capable database that has no single point of failure.

Use Cases for NoSQL Online Data Ingestion:

- Time-series data
- Device/Sensor/Data “exhaust” systems
- Distributed applications
- Media streaming
- Online Web retail (transactional, shopping carts, etc.)
- Online gaming
- Recommendation engines
- Real-time data analytics
- Social media capture and analysis
- Web click-stream analysis
- Write-intensive transactional systems

The Cassandra ring topology allows multiple nodes to service both read and write requests with tunable consistency: both the number of replicas and the point at which a write is acknowledged can be configured.
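The trade-off behind tunable consistency can be illustrated with a little arithmetic. The sketch below is not taken from the reference configuration; the replication factor of 3 is an assumed example value, and the function names are hypothetical. It uses the usual majority rule (a quorum is more than half of the replicas) and the overlap rule that a read is guaranteed to include the latest acknowledged write when the number of write acknowledgements plus the number of replicas read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_replicas: int) -> bool:
    """True when every set of read replicas shares at least one node
    with every set of replicas that acknowledged the write."""
    return write_acks + read_replicas > replication_factor


rf = 3  # assumed replication factor, chosen only for illustration
print(quorum(rf))                      # 2
print(read_overlaps_write(rf, 2, 2))   # True: quorum writes and quorum reads overlap
print(read_overlaps_write(rf, 1, 1))   # False: fastest setting, no overlap guarantee
```

Waiting for fewer acknowledgements lowers write latency, while waiting for more strengthens the guarantee that a later read sees the write.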
Figure 1. Logical Diagram of Cassandra Ring
[Figure: application server(s) read from and write to a ring of data nodes; each write is replicated n times.]

Server Roles

Cassandra Data Node(s) – The data nodes perform the principal functions in a Cassandra cluster (a cluster contains multiple nodes). To provide rapid response times during data ingestion, these nodes are configured for rapid input/output (IO) to disk. As IO arrives, the following process commences:

1. Incoming data is assigned to a data node, using a data key determined by hashing the incoming data. Each data node owns a specific hash range, and the incoming data is assigned to the data node that owns the hash range the data key falls into
2. IO is written to a disk-based commit log on the assigned node
3. IO is also simultaneously written to a table in memory
4. Steps 1-3 are repeated on one or more additional data nodes in order to meet replication/durability requirements, if any
5. IO is acknowledged back to the requestor
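The five steps above can be sketched as a toy model. This is a simplified illustration rather than Cassandra's actual implementation: the 32-bit hash space, the evenly spaced hash ranges, the MD5-based key hash, the choice of the next nodes around the ring as replicas, and all class and function names are assumptions made for the example. It also acknowledges only after every replica has been written, whereas the point of acknowledgement is tunable in a real cluster.

```python
import bisect
import hashlib

HASH_SPACE = 2 ** 32  # toy hash space, for illustration only


def key_hash(key: str) -> int:
    """Hash a data key into the toy hash space (step 1)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class DataNode:
    def __init__(self, name: str):
        self.name = name
        self.commit_log = []  # stands in for the disk-based commit log
        self.memtable = {}    # stands in for the table in memory

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # step 2: append to the commit log
        self.memtable[key] = value            # step 3: update the in-memory table


class Ring:
    def __init__(self, node_names, replicas: int = 3):
        self.nodes = [DataNode(name) for name in node_names]
        self.replicas = min(replicas, len(self.nodes))
        count = len(self.nodes)
        # Node i owns the hash range that ends at tokens[i] (evenly spaced here).
        self.tokens = [(i + 1) * HASH_SPACE // count for i in range(count)]

    def write(self, key: str, value: str):
        # Step 1: find the node whose hash range contains the key's hash.
        first = bisect.bisect_left(self.tokens, key_hash(key))
        # Step 4: repeat the write on the next nodes around the ring.
        owners = [self.nodes[(first + i) % len(self.nodes)]
                  for i in range(self.replicas)]
        for node in owners:
            node.write(key, value)
        # Step 5: acknowledge the write back to the requestor.
        return [node.name for node in owners]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replicas=3)
print(ring.write("sensor-42", "21.5C"))  # names of the three nodes that stored the write
```

Because the owner is derived from the key's hash, any node or client that knows the hash ranges can locate the data without a central directory.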
This process allows the cluster to maintain a tunable number of replicas across nodes, racks, and datacenters. Since the IO isn’t acknowledged until it is written to a disk-based commit log, the commit log should reside on high-performance storage, such as solid-state drives (SSD). SSDs are also common for read-heavy workloads, since a read can involve many random IOs. Performance may be increased by adding data nodes to the cluster/ring, since Cassandra is linearly scalable.

Application Server(s) – Application servers reside on the outer edge of the cluster/ring. They are the interface between the Cassandra ring and the outside world. Data may be streamed from an application server programmatically (via APIs for all the popular languages) or through Cassandra’s built-in query language (CQL).

DataStax OpsCenter Node – The DataStax OpsCenter Node runs the management interface. In a production environment, the OpsCenter server may need to run on a dedicated physical node; however, for the purposes of this document’s testing, OpsCenter was installed on a virtual machine (VM).

Figure 2. DataStax OpsCenter Interface

Node Count Recommendations

Dell recognizes that use-cases for Cassandra range from early-stage development and testing clusters through large multi-datacenter installations. Dell and DataStax have services that can help appropriately size a cluster based on customer budget, performance, security, and data consistency requirements. All node-count recommendations are for the Data Nodes only. DataStax OpsCenter, application servers, and additional infrastructure services may be needed to complete the environment.
As a starting point, three cluster configurations can be defined for typical use:

DataStax Recommended Starter Cluster – The low-tier configuration is targeted at basic usage for online database applications and, in some cases, may even be built from existing equipment; however, the performance of these types of clusters can be significantly increased if SSD drives are added. For this configuration, only a single processor is defined. If more services (such as DataStax Search) are added, performance may suffer.

DataStax Recommended Standard Cluster – This configuration is a good starting point for clusters that have the potential to scale. It includes dual processors to improve performance when using DataStax’s search capabilities.

DataStax Recommended Professional Cluster – This configuration represents the top tier of hardware recommended to run Cassandra. Adding further performance to individual nodes (e.g. four processors, additional memory, etc.) will result in diminishing benefit. Rather, adding nodes yields a greater return on investment when scaling the cluster.

Table 1. Recommended Cluster Sizes

| | DataStax Recommended Starter Cluster [3] | DataStax Recommended Standard Cluster [3] | DataStax Recommended Professional Cluster [3] | Dell Tested Configuration |
|---|---|---|---|---|
| Server Model [1] | (5) PowerEdge R320 | (5) PowerEdge R420 | (5) PowerEdge R620 | (5) PowerEdge R720 |
| Processor(s) | Single Intel Xeon E5-2420 v2 | Dual Intel Xeon E5-2430 v2 | Dual Intel Xeon E5-2650 v2 | Dual Intel Xeon E5-2650 |
| RAM | 64 GB | 128 GB | 256 GB | 128 GB |
| Storage [2] | (4) 1 TB SATA Drives | (6) Intel 3700 Series Read Intensive SSD 400GB 3Gbps | (6) Intel 3700 Series SSD 400GB 6Gbps | (6) Intel 3700 Series SSD 800GB 6Gbps |
| Application Network Cards | (2) Intel X520 DP 10GbE DA/SFP+ | (2) Intel X520 DP 10GbE DA/SFP+ | (2) Intel X520 DP 10GbE DA/SFP+ | (2) Intel X520 DP 10GbE DA/SFP+ |
| Data Switches | (2) Dell Networking 8164F 10GbE SFP+ | (2) Dell Force 10 S4810 10GbE SFP+ | (2) Dell Force 10 S4810 10GbE SFP+ | (2) Dell Force 10 S4810 10GbE SFP+ |
| Management Switches | (2) Dell Networking 6248 | (2) Dell Networking 6248 | (2) Dell Networking 6248 | (2) Dell Force 10 S60 1GbE |
| Rack Units | 9U | 9U | 9U | 14U |
| DataStax Edition | DataStax Enterprise Standard | DataStax Enterprise Pro | DataStax Enterprise Max | DataStax Enterprise Standard |

[1] Any Dell server that is capable of running the supported OSs should work. These specific models were selected for their targeted price brackets.
[2] Only SSDs should be considered for any high-ingestion use-cases.
[3] The recommended hardware is for Data Nodes only. DataStax OpsCenter, application servers, and additional infrastructure services may be needed to complete the environment.
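As a rough illustration of what the storage row in Table 1 implies, the sketch below multiplies node count, drives per node, and drive size for each column. The drive counts and sizes are read from Table 1 (with 1 TB taken as 1000 GB); the replication factor of 3 and the function names are assumptions for the example, and the result ignores space reserved for the operating system, the commit log, and compaction headroom.

```python
# (data nodes, drives per node, drive size in GB), as read from Table 1
CONFIGS = {
    "Starter": (5, 4, 1000),
    "Standard": (5, 6, 400),
    "Professional": (5, 6, 400),
    "Dell Tested": (5, 6, 800),
}

REPLICATION_FACTOR = 3  # assumed value, not specified in this document


def raw_capacity_gb(nodes: int, drives: int, drive_gb: int) -> int:
    """Total raw disk capacity across all data nodes."""
    return nodes * drives * drive_gb


def unique_data_gb(raw_gb: int, replication_factor: int) -> float:
    """Upper bound on unique data when every row is stored replication_factor times."""
    return raw_gb / replication_factor


for name, spec in CONFIGS.items():
    raw = raw_capacity_gb(*spec)
    unique = unique_data_gb(raw, REPLICATION_FACTOR)
    print(f"{name}: {raw} GB raw, at most {unique:.0f} GB unique at RF={REPLICATION_FACTOR}")
```

Under these assumptions the Starter cluster offers 20,000 GB of raw capacity, the Standard and Professional clusters 12,000 GB each, and the tested configuration 24,000 GB.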
Figure 3. Physical networking diagram

Tested Configuration

For the purposes of this document, a small DataStax cluster was deployed as shown in Table 1. The specific software revisions used in the test are shown in Table 2. The hardware listed should be used as initial guidance only. Additional configurations are possible and will likely be required, as each customer’s environment and use-case is unique. Customers should consult with DataStax Professional Services to arrive at an optimal design customized to their use-case. Common parameters that could differ include:

1. Node Count – Adding nodes is the best way to scale capacity and performance for a Cassandra cluster. In most cases, the benefit of adding nodes outweighs that of increasing disk size or memory on existing nodes
2. Disks – SSD technology is critical for maintaining the performance necessary to ingest data at a high rate. Keeping both the initial commit log and the sorted string table (SST) disk space on SSDs is strongly recommended
3. Memory – Memory should be sized relative to the use-case. The cluster will benefit from additional memory when using DataStax Solr Search or other memory-intensive features
4. Processors – Data ingestion is not particularly CPU intensive in and of itself. However, additional processing power is required as additional capability is added (e.g. Solr Search, etc.) or as the workload on a DataStax cluster increases
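Because node count is the main scaling lever, a first-pass estimate of how many data nodes a dataset needs can be sketched as follows. This is a back-of-the-envelope illustration, not a sizing method from Dell or DataStax: the function name, the replication factor, and the 50% fill ratio (headroom left free on each node) are all assumptions, and a real design should come from the sizing services mentioned above.

```python
import math


def nodes_needed(unique_data_gb: float, replication_factor: int,
                 disk_gb_per_node: float, fill_ratio: float = 0.5) -> int:
    """Estimate the data node count for a dataset.

    fill_ratio is the assumed fraction of each node's disk that may hold data;
    the rest is left as headroom. The result is never below the replication
    factor, since each replica lives on a different node.
    """
    total_gb = unique_data_gb * replication_factor
    usable_gb_per_node = disk_gb_per_node * fill_ratio
    return max(replication_factor, math.ceil(total_gb / usable_gb_per_node))


# Example: 10,000 GB of unique data, three replicas, and nodes with
# six 800 GB SSDs (4,800 GB each), as in the tested configuration.
print(nodes_needed(10_000, 3, 6 * 800))  # 13
```

In this model, doubling the dataset roughly doubles the node count, which matches the guidance to scale out rather than build larger individual nodes.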
Table 2. Software Revisions (As Tested)

| Component | Revision |
|---|---|
| Red Hat Enterprise Linux | 6.4 |
| DataStax Enterprise | 3.2.4 |
| Cassandra Version | 1.2.13.2 |

Integration with Other Solutions

For customers interested in using DataStax Cassandra to complement other Big Data solutions, DataStax Cassandra can act as a low-latency point of ingestion for data that can later be fed to other tools, including data warehouses and Dell’s Apache Hadoop solutions, for running deep and heavy analytics. Displaying data directly from Cassandra is also possible via Dell’s data visualization tools, such as the Dell Kitenga Analytics Suite and the Dell Quest TOAD BI Suite.

Figure 4. Physical networking diagram
Dell Solution Centers

The Dell Solution Centers are a global network of connected labs that allow Dell to help customers architect, validate, and build solutions. With multiple footprints in every region, they help customers understand anything from simple hardware platforms to more complex solutions. These engagements range from an informal 30-60 minute briefing, through a longer half-day workshop, to a proof-of-concept that allows customers to kick the tires of their solution prior to signing on the dotted line. Customers may engage with their account team and have them submit a request to take advantage of these free services.

Links

DataStax Enterprise Cassandra – http://DataStax.com/
Planet Cassandra Community – http://planetcassandra.org/
Apache Cassandra Open Source Project – http://cassandra.apache.org/