Advanced Distributed Systems
Search Middleware (Google Infrastructure)
MSc in Advanced Computer Science
Gordon Blair (gordon@comp.lancs.ac.uk)
Introducing Google

About Google
- World's leading search engine (~53.6% of market share)
- Increasing focus on cloud computing, incl. Google Apps
  - Software as a Service
  - Platform as a Service

Another view
- World's largest and most ambitious distributed system, with (perhaps) around 500,000 machines
- Extremely challenging requirements in terms of:
  - Scalability
  - Reliability
  - Performance
  - Openness
  => A great distributed systems case study
A Few Words on Scalability

- Scalability with respect to size, e.g. supporting massive growth in the number of users of a service
- Scalability with respect to geography, e.g. supporting distributed systems that span continents (dealing with latencies, etc.)
- Scalability with respect to administration, e.g. supporting systems which span many different administrative organisations
Designing for Scalability

At the hardware level
- Use of top-end hardware, multi-core processors
- Adoption of cluster computers

At the operating system level
- Multi-threading
- High-performance protocols, fast context switches, etc.

At the middleware level
- Use of caching (client and server side)
- Use of replication, load balancing, functional decomposition, migration of load to the client
- Use of P2P techniques, agents

See lecture on DS design
But remember… it is not just about scalability:
- Scalability
- Reliability
- Performance
- Openness
Physical Infrastructure: From this… [figure]
To this… [figure]
Scalability: The Google Perspective

Focus on the data dimension
- Assume the web consists of over 20 billion web pages at 20 KB each = over 400 terabytes of data
- Now assume you want to read the web: one computer can read around 30 MB/sec from disk => over 4 months to read the web
- 1000 machines can do this in < 3 hours => more web queries, better results
- But it is about more than just reading => a crucial need for supporting infrastructure … a search/cloud middleware
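As a back-of-the-envelope check of these figures, here is a minimal sketch in Python; the page count, page size and per-disk read rate are the slide's own assumptions:

    # Back-of-the-envelope arithmetic using the slide's assumed figures
    pages = 20e9            # over 20 billion web pages
    page_size = 20e3        # ~20 KB per page
    read_rate = 30e6        # ~30 MB/sec from one disk

    total_bytes = pages * page_size              # 4e14 bytes = ~400 TB
    one_machine = total_bytes / read_rate        # ~1.3e7 seconds
    print(total_bytes / 1e12, "TB")              # ~400 TB
    print(one_machine / 86400, "days")           # ~154 days, i.e. over 4 months
    print(one_machine / 1000 / 3600, "hours")    # ~3.7 hours across 1000 machines
                                                 # (the slide's "< 3 hours" implies a
                                                 # slightly faster per-disk read rate)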
The Anatomy of a Large-Scale Hypertextual Web Search Engine

"The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers." (~20m queries/day)
Brin and Page, 1998
Distributed Systems Infrastructure at Google

A layered view (top to bottom):
- Services and Applications: Web search, GMail, Ads system, Google Maps
- Distributed systems infrastructure
- Computing platform: cheap PC hardware, Linux, physical networking
The Computing Platform [figure]
Some Facts and Figures
- Each PC is a bog-standard commodity PC running a cut-down Linux and featuring around 2 terabytes of storage
- A given rack consists of 80 PCs, 40 on each side, with redundant Ethernet switches (=> 160 terabytes)
- A given cluster may consist of 30 racks (=> 4.8 petabytes)
- It is unknown exactly how many clusters Google currently has, but say the number is 200 (=> 960 petabytes, or just short of 1 exabyte = 10^18 bytes)
- Also around 500,000 PCs in total
- 2-3% of PCs are replaced per annum due to failure (hardware failures are dwarfed by software failures)
Distributed Systems Infrastructure [figure]
RPC through Protocol Buffers

What are protocol buffers?
- An open-source, fast and efficient interchange format for serialisation and subsequent storage or transmission
- Intended to be simple, fast and efficient (cf. XML)
- Similar to Facebook's Thrift

Support for RPC
- A simple style of RPC is supported, where operations have one argument and one result (cf. REST)
- Protocol buffers are sent through RpcChannels and monitored through RpcControllers
  - These must be implemented by the programmer
Simple Examples

Setting up a message:

    message Point {
      required int32 x = 1;
      required int32 y = 2;
      optional string label = 3;
    }

    message Line {
      required Point start = 1;
      required Point end = 2;
      optional string label = 3;
    }

    message Polyline {
      repeated Point point = 1;
      optional string label = 2;
    }

Setting up an RPC:

    service SomeService {
      rpc SomeOp (type1) returns (type2);
    }
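As an illustration of how such definitions are used, here is a minimal Python sketch. It assumes the messages above live in a file shapes.proto that has been compiled with protoc into a module shapes_pb2 (the file and module names are assumptions):

    # Assumes: protoc --python_out=. shapes.proto has generated shapes_pb2.py
    import shapes_pb2

    point = shapes_pb2.Point(x=3, y=4, label="start")
    data = point.SerializeToString()    # compact binary wire format (cf. XML)

    decoded = shapes_pb2.Point()
    decoded.ParseFromString(data)       # round-trip the bytes back into an object
    print(decoded.x, decoded.y, decoded.label)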
Focus on GFS: Requirements

- Assume that components will fail, and hence support constant monitoring and automatic recovery
- Optimise for very large (multi-GB) files
- Workload consists mainly of reads (typically large streaming reads) together with some writes (typically large, sequential appends)
- High sustained bandwidth is more important than latency
- Support multiple concurrent appends efficiently
- … and of course scalability
GFS Architecture Highlights

- Each cluster has one master and multiple chunkservers
- Files are replicated on, by default, three chunkservers
- The system manages fixed-size chunks of 64 MB (very large)
- Over 200 clusters, some with up to 5000 machines, offering 5 petabytes of data and 40 gigabytes/sec throughput (old data)
GFS and Scalability

Hardware and operating system
- Traditional Google approach of a very large number of commodity machines

Caching strategy
- No client caching, due to i) the size of working sets and ii) the streaming nature of many data sources
- No server caching: reliance on the Linux buffer cache to maintain files in main memory
- In-memory meta-data on the master

Distributed systems support
- Replication, as mentioned above
- A centralised master allows optimum management decisions, e.g. placement
- Separation of control and data flows (including delegation of consistency management via chunk leases)
- Relaxed consistency management, optimised for appends
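A minimal sketch of the GFS read path, showing the separation of control flow (small metadata RPCs to the master) from data flow (bulk reads direct from chunkservers). All class and method names here are illustrative assumptions, not the real GFS API:

    CHUNK_SIZE = 64 * 1024 * 1024   # GFS's very large fixed-size chunks (64 MB)

    class FakeChunkserver:
        def read_chunk(self, handle, offset, length):
            return b"\x00" * length              # stand-in for real chunk data

    class FakeMaster:
        def lookup(self, filename, chunk_index):
            # The master holds all meta-data in memory; it returns a chunk
            # handle plus the list of (by default three) replica locations
            return ("chunk-handle-1", [FakeChunkserver() for _ in range(3)])

    class GFSClientSketch:
        def __init__(self, master):
            self.master = master
            self.location_cache = {}             # cache chunk locations, not data

        def read(self, filename, offset, length):
            key = (filename, offset // CHUNK_SIZE)
            if key not in self.location_cache:
                # Control flow: one small RPC to the (single) master
                self.location_cache[key] = self.master.lookup(*key)
            handle, replicas = self.location_cache[key]
            # Data flow: bulk bytes come straight from a chunkserver replica,
            # keeping the centralised master off the data path
            return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)

    client = GFSClientSketch(FakeMaster())
    print(len(client.read("/logs/crawl-00", 0, 1024)))   # 1024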
The Chubby Lock Service

What is Chubby?
- Provides a simple file-system API
- Supports coarse-grained synchronisation in loosely coupled and large-scale distributed systems

Used for many purposes
- As a small-scale filing system for, e.g., meta-data
- As a name server
- To achieve distributed consensus/locking, using Lamport's Paxos algorithm
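A minimal sketch of these uses: a tiny in-process model of a Chubby-like cell in which a small file doubles as a coarse-grained lock for primary election and as a name-service entry. The API below is an assumption modelled on the published description, not Chubby's actual client library:

    class FakeLockFile:
        """Stand-in for a small Chubby file that doubles as a lock."""
        def __init__(self):
            self.holder = None
            self.contents = b""

    class ChubbyLikeCellSketch:
        """Illustrative only: in-process model of a Chubby-like cell."""
        def __init__(self):
            self.files = {}

        def try_acquire(self, path, client_name):
            f = self.files.setdefault(path, FakeLockFile())
            if f.holder is None:                      # coarse-grained: held for a
                f.holder = client_name                # long time, not milliseconds
                f.contents = client_name.encode()     # lock file names the primary
                return True
            return False

        def read(self, path):
            return self.files[path].contents          # name-service style lookup

    cell = ChubbyLikeCellSketch()
    print(cell.try_acquire("/ls/cell/service/leader", "replica-1"))  # True: primary
    print(cell.try_acquire("/ls/cell/service/leader", "replica-2"))  # False
    print(cell.read("/ls/cell/service/leader"))                      # b'replica-1'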
Passive Replication in Chubby [figure]
Paxos: Step 1

Electing a coordinator
- Coordinators can fail => a flexible election process is adopted which can result in multiple coordinators co-existing, with the goal of rejecting messages from old coordinators
- To identify the right coordinator, an ordering is given to coordinators through the attachment of a sequence number
- Each replica maintains the highest sequence number seen so far and, if bidding to be a coordinator, will pick a higher (and unique) number and broadcast this to all replicas in a propose message (effectively bidding to be the coordinator)
- If other replicas have not seen a higher bidder, they reply with a promise message (indicating that they promise not to deal with older coordinators, that is, those with lower sequence numbers)
- If a majority of promise messages are received, this replica can act as the coordinator
Paxos: Steps 2 and 3

Seeking consensus
- An elected coordinator may submit a value and then broadcast an accept message with this value to all replicas (attempting to have it accepted as the agreed consensus)
- Other replicas either acknowledge this message, if they want to accept this as the agreed value, or (optionally) reject it
- The coordinator waits, possibly indefinitely in the algorithm, for a majority of replicas to acknowledge the accept message

Reaching consensus
- If a majority of replicas do acknowledge, then consensus has effectively been achieved
- The coordinator then broadcasts a commit message to notify replicas of this agreement
Paxos Message Exchange (if no failures) [figure]
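A failure-free, single-process simulation of this exchange: a sketch of the propose/promise/accept/acknowledge/commit flow from the slides above, not Google's production Paxos, and all names are illustrative:

    class ReplicaSketch:
        def __init__(self):
            self.highest_seen = 0     # highest sequence number seen so far
            self.value = None

        def on_propose(self, seq):
            # Step 1: promise only if no higher bid has been seen
            if seq > self.highest_seen:
                self.highest_seen = seq
                return "promise"
            return None

        def on_accept(self, seq, value):
            # Step 2: acknowledge the coordinator's proposed value
            if seq >= self.highest_seen:
                self.value = value
                return "ack"
            return None

    def run_round(replicas, seq, value):
        # Step 1: propose; a majority of promises lets us act as coordinator
        promises = [r for r in replicas if r.on_propose(seq) == "promise"]
        if len(promises) <= len(replicas) // 2:
            return False
        # Step 2: broadcast accept and wait for a majority of acknowledgements
        acks = [r for r in replicas if r.on_accept(seq, value) == "ack"]
        if len(acks) <= len(replicas) // 2:
            return False
        # Step 3: consensus reached; the coordinator would now broadcast commit
        return True

    replicas = [ReplicaSketch() for _ in range(5)]
    print(run_round(replicas, seq=1, value="chosen-value"))   # True if no failures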
Bigtable

What is Bigtable?
- Bigtable is what it says it is: a very large-scale distributed table indexed by row, column and timestamp (built on top of GFS and Chubby)
- Supports semi-structured and structured data on top of GFS (but without full relational operators, e.g. join)
- Used by over 60 Google products, including Google Earth

How does it work?
- The table is split by row into fixed-size tablets, optimised for GFS
- Same client/master/worker pattern as with GFS (workers manage tablets), but also with client-side caching of tablet locations
- Rows are clustered together by naming conventions to optimise access
- Often used with MapReduce for large-scale computation on datasets
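A minimal sketch of the (row, column, timestamp) data model described above; the dict-based store and its methods are illustrative assumptions, not Bigtable's client API:

    import time

    class BigtableModelSketch:
        """Maps (row, column, timestamp) -> value; newest version first."""

        def __init__(self):
            self.cells = {}   # (row, column) -> list of (timestamp, value)

        def put(self, row, column, value, ts=None):
            ts = ts if ts is not None else time.time()
            versions = self.cells.setdefault((row, column), [])
            versions.append((ts, value))
            versions.sort(key=lambda tv: tv[0], reverse=True)   # newest first

        def get(self, row, column):
            versions = self.cells.get((row, column), [])
            return versions[0][1] if versions else None

    t = BigtableModelSketch()
    # Reversed-domain row keys cluster pages from one site together,
    # illustrating the naming-convention point above
    t.put("com.example.www/index.html", "contents:", "<html>...</html>")
    print(t.get("com.example.www/index.html", "contents:"))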
MapReduce

What is MapReduce?
- A service offering a simple programming model for large-scale computation over very large data-sets, e.g. a distributed grep
- Hides parallelisation, load balancing, optimisation and failure/recovery strategies

How does it work?
- The programmer defines several user-level functions:
  - Map: processes a key/value pair to generate a set of intermediate key/value pairs (extract something important)
  - Reduce: merges intermediate values (aggregate, summarise, filter, transform)
  - Partition (optional): supports further parallelisation of reduce operations
- The system deals with partitioning, scheduling, communication, and failure detection and recovery transparently
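The classic word-count example, as a minimal single-process sketch of this programming model; the toy runner below stands in for the distributed system, which in reality handles the partitioning, scheduling and recovery transparently:

    from collections import defaultdict

    # User-level functions, in the style described above
    def map_fn(doc_name, text):
        # Emit an intermediate (word, 1) pair per occurrence; the input
        # key (doc_name) is unused in word count
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Merge all intermediate values for one key
        return (word, sum(counts))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        """Toy sequential runner; the real system parallelises each phase."""
        intermediate = defaultdict(list)
        for key, value in inputs:
            for k, v in map_fn(key, value):
                intermediate[k].append(v)            # shuffle/group by key
        return [reduce_fn(k, vs) for k, vs in intermediate.items()]

    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))    # [('the', 2), ...]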
MapReduce (cont) [figure]
Additional Reading: Google

Lots of material at Google Labs:
- The Anatomy of a Large-Scale Hypertextual Web Search Engine
- The Google File System
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Paxos Made Live – An Engineering Perspective
- MapReduce: Simplified Data Processing on Large Clusters
- Bigtable: A Distributed Storage System for Structured Data
- Interpreting the Data: Parallel Analysis with Sawzall

http://research.google.com/pubs/papers.html
Expected Learning Outcomes

At the end of this session:
- You should have an appreciation of the challenges of web search and also of the provision of cloud services
- You should have an understanding of the problem of scalability as it applies to the above problems, and also of the trade-offs with other dimensions (reliability, performance, openness)
- You should understand the architecture of the Google infrastructure and how such a middleware solution can support the business
- You should also be aware of the specific techniques used within Google relating to GFS, Bigtable, Chubby and MapReduce (and also why they are designed the way they are and how this contributes to delivering the goals of Google)