ELF: Efficient Lightweight Fast Stream
Processing at Scale
Liting Hu, Karsten Schwan, Hrishikesh Amur, and Xin Chen, Georgia Institute of Technology
https://www.usenix.org/conference/atc14/technical-sessions/presentation/hu
This paper is included in the Proceedings of USENIX ATC ’14:
2014 USENIX Annual Technical Conference.
June 19–20, 2014 • Philadelphia, PA
978-1-931971-10-2
Open access to the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference is sponsored by USENIX.

ELF: Efficient Lightweight Fast Stream Processing at Scale
Liting Hu, Karsten Schwan, Hrishikesh Amur, Xin Chen
Georgia Institute of Technology
{foxting,amur,xchen384}@gatech.edu, schwan@cc.gatech.edu
Abstract

Stream processing has become a key means for gaining rapid insights from webserver-captured data. Challenges include how to scale to numerous, concurrently running streaming jobs, to coordinate across those jobs to share insights, to make online changes to job functions to adapt to new requirements or data characteristics, and for each job, to efficiently operate over different time windows. The ELF stream processing system addresses these new challenges. Implemented over a set of agents enriching the web tier of datacenter systems, ELF obtains scalability by using a decentralized "many masters" architecture where for each job, live data is extracted directly from webservers, and placed into memory-efficient compressed buffer trees (CBTs) for local parsing and temporary storage, followed by subsequent aggregation using shared reducer trees (SRTs) mapped to sets of worker processes. Job masters at the roots of SRTs can dynamically customize worker actions, obtain aggregated results for end user delivery, and/or coordinate with other jobs.

An ELF prototype implemented and evaluated for a larger scale configuration demonstrates scalability, high per-node throughput, sub-second job latency, and sub-second ability to adjust the actions of jobs being run.

[Figure 1: Examples of diverse concurrent applications. App 1 computes the top-k products by #clicks; App 2 computes product bundles (e.g., Angry Birds with Angry Birds Star Wars, Angry Birds Seasons, and Angry Birds Trilogy); App 3 answers sale-prediction (yes/no) queries by combining the two, operating over per-minute streams and 5-minute batches of click events.]

1 Introduction

Stream processing of live data is widely used for applications that include generating business-critical decisions from marketing streams, identifying spam campaigns for social networks, performing datacenter intrusion detection, etc. Such diversity engenders differences in how streaming jobs must be run, requiring synchronous batch processing, asynchronous stream processing, or combining both. Further, jobs may need to dynamically adjust their behavior to new data content and/or new user needs, and coordinate with other concurrently running jobs to share insights. Figure 1 exemplifies these requirements, where job inputs are user activity logs, e.g., clicks, likes, and buys, continuously generated from, say, the Video Games directory in an e-commerce company.

In this figure, the micro-promotion application extracts user clicks per product for the past 300 s, and lists the top-k products that have the most clicks. It can then dispatch coupons to those "popular" products so as to increase sales. A suitable model for this job is synchronous batch processing, as all log data for the past 300 s has to be on hand before grouping clicks per product and calculating the top-k products being viewed.

For the same set of inputs, a concurrent job performs product-bundling, by extracting user likes and buys from logs, and then creating 'edges' and 'vertices' linking those video games that are typically bought together. One purpose is to provide online recommendations for other users. For this job, since user activity logs are generated in realtime, we will want to update these connected components whenever possible, by fitting them into a graph and iteratively updating the graph over some sliding time window. For this usage, an asynchronous stream processing model is preferred, to provide low latency updates.

The third, sale-prediction job states a product name, e.g., Angry Birds, which is joined with the product-bundling application to find out what products are similar to Angry Birds (indicated by the 'typically bought together' set). The result is then joined with the micro-promotion application to determine whether Angry Birds and its peers are currently "popular". This final result can be used to predict the likely market success of launching a new product like Angry Birds, and obtaining it requires interacting with the first and second applications.

Finally, all of these applications will run for a considerable amount of time, possibly for days. This makes it natural for the application creator to wish to update job functions or parameters during ongoing runs, e.g., to change the batching intervals to adjust to high vs. low traffic periods, to flush sliding windows to 'reset' job results after some abnormal period (e.g., a flash mob), etc.
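The micro-promotion job just described is, at its core, a batched groupby-count followed by a top-k selection over a 300 s window. The following minimal sketch shows those per-window semantics; it is hypothetical illustration code (the class name TopKBatch and its signature are invented), not code from ELF itself:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the micro-promotion job's per-window semantics:
// collect all click events in one batch window, group by product,
// and return the k products with the most clicks.
public class TopKBatch {
    // events: one product-name entry per click observed in the window
    public static List<String> topK(List<String> events, int k) {
        Map<String, Long> counts = events.stream()
            .collect(Collectors.groupingBy(p -> p, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The synchronous-batch requirement in the text corresponds to the fact that topK can only run once the window's full event list is on hand.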
Distributed streaming systems are challenged by the requirements articulated above. First, concerning flexibility, existing systems typically employ some fixed execution model, e.g., Spark Streaming [28] and others [11, 12, 17, 22] treat streaming computations as a series of batch computations, whereas Storm [4] and others [7, 21, 24] structure streaming computations as a dataflow graph whose vertices asynchronously process incoming records. Further, these systems are not designed to be naturally composable, so as to simultaneously provide both of their execution models, and they do not offer functionality to coordinate their execution. As a result, applications desiring the combined use of their execution models must use multiple platforms governed via external controls, at the expense of simplicity.

A second challenge is scaling with job diversity. Many existing systems inherit MapReduce's "single master/many workers" infrastructure, where the centralized master is responsible for all scheduling activities. How can such a system scale to hundreds of parallel jobs, particularly jobs with diverse execution logics (e.g., pipeline or cyclic dataflow), different batch sizes, differently sized sliding windows, etc.? A single master governing different per-job scheduling methods and carrying out cross-job coordination will be complex, creating a potential bottleneck.

A third challenge is obtaining high performance for incremental updates. This is difficult for most current streaming systems, as they use an in-memory hashtable-like data structure to store and aggressively integrate past states with new state, incurring substantial memory consumption and limited throughput when operating across large-sized history windows.

The ELF (Efficient, Lightweight, Flexible) stream processing system presented in this paper implements novel functionality to meet the challenges listed above within a single framework: to efficiently run hundreds of concurrent and potentially interacting applications, with diverse per-application execution models, at levels of performance equal to those of less flexible systems.

[Figure 2: Dataflow of ELF vs. a typical realtime web log analysis system, composed of Flume, HBase, HDFS, Hadoop MapReduce and Spark/Storm. ELF nodes form a DHT directly on the webservers, with one master per application, bypassing the Flume/Chukwa/Kafka collection tier and the HBase/HDFS storage tier that feed Spark/Storm and MapReduce.]

As shown in Figure 2, each ELF node resides in a webserver. Logically, these nodes are structured as a million-node overlay built with the Pastry DHT [25], where each ELF application has its own respective set of master and worker processes mapped to ELF nodes, self-constructed as a shared reducer tree (SRT) for data aggregation. The system operates as follows: (1) each line of logs received from the webserver is parsed into a key-value pair and continuously inserted into ELF's local in-memory compressed buffer tree (CBT [9]) for pre-reducing; (2) the distributed key-value pairs from CBTs "roll up" along the SRT, which progressively reduces them until they reach the root to output the final result. ELF's operation, therefore, entirely bypasses the storage-centric data path, to rapidly process live data. Intuitively, with a DHT, the masters of different applications will be mapped to different nodes, thus offering scalability by avoiding the potential bottleneck created by many masters running on the same node.

ELF is evaluated experimentally over 1000 logical webservers running on 50 server-class machines, using both batched and continuously streaming workloads. For batched workloads, ELF can process millions of records per second, outperforming general batch processing systems. For a realistic social networking application, ELF can respond to queries with latencies of tens of milliseconds, equaling the performance of state-of-the-art, asynchronous streaming systems. New functionality offered by ELF is its ability to dynamically change job functions at sub-second latency, while running hundreds of jobs subject to runtime coordination.

This paper makes the following technical contributions:

1. A decentralized 'many masters' architecture assigning each application its own master capable of individually controlling its workers. To the best of our knowledge, ELF is the first to use a decentralized architecture for scalable stream processing (Sec. 2).
2. A memory-efficient approach for incremental updates, using a B-tree-like in-memory data structure to store and manage large stored states (Sec. 2.2).
3. Abstractions permitting cyclic dataflows via feedback loops, with additional uses of these abstractions including the ability to rapidly and dynamically change job behavior (Sec. 2.3).
4. Support for cross-job coordination, enabling interactive processing that can utilize and combine multiple jobs' intermediate results (Sec. 2.3).
5. An open-source implementation of ELF and a comprehensive evaluation of its performance and functionality on a large cluster using real-world webserver logs (Sec. 3, Sec. 4).
2 Design

This section describes ELF's basic workflow, introduces ELF's components, shows how applications are implemented with its APIs, and explains the performance, scalability, and flexibility benefits of using ELF.

2.1 Overview

As shown in Figure 3, the ELF streaming system runs across a set of agents structured as an overlay built using the Pastry DHT. There are three basic components. First, on each webserver producing data required by the application, there is an agent (see Figure 3 bottom) locally parsing live data logs into application-specific key-value pairs. For example, for the micro-promotion application, batches of logs are parsed as a map from product name (key) to the number of clicks (value), and groupby-aggregated like (a, 1), (b, 1), (a, 1), (b, 1), (b, 1) → (a, 2), (b, 3), labelled with an integer-valued timestamp for each batch.

The second component is the middle-level, application-specific group (Figure 3 middle), composed of a master and a set of agents as workers that jointly implement (1) the data plane: a scalable aggregation tree that progressively 'rolls up' and reduces those local key-value pairs from distributed agents within the group, e.g., (a, 2), (b, 3).., (a, 5).., (b, 2), (c, 2).. from the leaves are reduced to (a, 7), (b, 5), (c, 2).. at the root; and (2) the control plane: a scalable multicast tree used by the master to control the application's execution, e.g., when necessary, the master can multicast to its workers within the group, to notify them to empty their sliding windows and/or synchronously start a new batch. Further, different applications' masters can exchange queries and results using the DHT's routing substrate, so that given any application's name as a key, queries or results can be efficiently routed to that application's master (within O(log N) hops), without the need for coordination via some global entity. The resulting model supports the concurrent execution of diverse applications and flexible coordination between those applications.

The third component is the high-level ELF programming API (Figure 3 top) exposed to programmers for implementing a variety of diverse, complex jobs, e.g., streaming analysis, batch analysis, and interactive queries. We next describe these components in more detail.

[Figure 3: High-level overview of the ELF system. Level 1: agents with per-application CBTs on each webserver; Level 2: per-application groups of masters and workers (e.g., Group 1 for App 1, Group 2 for App 2); Level 3: the ELF API, with aggregation and multicast among masters.]

2.2 CBT-based Agent

Existing streaming systems like Spark [28] and Storm [4] typically consume data from distributed storage like HDFS or HBase, incurring cross-machine data movement. This means that data might be somewhat stale when it arrives at the streaming system. Further, for most realworld jobs, their 'map' tasks could be 'pre-reduced' locally on webservers with the most parallelism, so that only the intermediate results need to be transmitted over the network for data shuffling, thus decreasing processing latency and most of the unnecessary bandwidth overhead.

ELF adopts an 'in-situ' approach to data access, in which incoming data is injected into the streaming system directly from its sources. ELF agents residing in each webserver consume live web logs to produce succinct key-value pairs, where a typical log event is a 4-tuple of (timestamp, src ip, priority, body): the timestamp is used to divide input event streams into batches of different epochs, and the body is the log entry body, formatted as a map from a string attribute name (key) to an arbitrary array of bytes (value).

Each agent exposes a simple HiveQL-like [26] query interface with which an application can define how to filter and parse live web logs. Figure 4 shows how the micro-promotion application uses ELF QL to define the top-k function, which calculates the top-10 popular products that have the most clicks at each epoch (30 s), in the Video Games directory of the e-commerce site.

[Figure 4: Example of ELF QL query. An example log event, such as

  {"created_at":"23:48:22 +0000 2013",
   "id":299665941824950273,
   "product":"Angry Birds Season",
   "clicks_count":2,
   "buys_count":0,
   "user":{"id":343284040, "name":"@Curry", "location":"Ohio", ...} ...}

is processed with the ELF QL query

  SELECT product, SUM(clicks_count)
  FROM *
  WHERE store == 'video_games'
  GROUP BY product
  SORT BY SUM(clicks_count) DESC
  LIMIT 10
  WINDOWING 30 SECONDS;]

Each ELF agent is designed to be capable of holding a considerable number of 'past' key-value pairs, by storing such data in compressed form, using a space-efficient, in-memory data structure termed a compressed buffer tree (CBT) [9]. Its in-memory design uses an (a, b)-tree with each internal node augmented by a memory buffer. Inserts and deletes are not immediately performed, but buffered in successive levels of the tree, allowing better I/O performance.
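The agent-side parse step just described can be sketched as follows. This is hypothetical illustration code: the 4-tuple event fields and the product/clicks_count attribute names mirror the examples in the text, but the class and method names (AgentParse, parse) are invented, and the body is simplified to a string-to-string map:

```java
import java.util.*;

// Hypothetical sketch of an agent's local parse step: each log event is
// a 4-tuple (timestamp, srcIp, priority, body); the timestamp assigns
// the event to an epoch, and the body yields an application-specific
// key-value pair, here (product -> clicks), groupby-aggregated per epoch.
public class AgentParse {
    public static class Event {
        final long timestamp; final String srcIp; final String priority;
        final Map<String, String> body;
        public Event(long timestamp, String srcIp, String priority,
                     Map<String, String> body) {
            this.timestamp = timestamp; this.srcIp = srcIp;
            this.priority = priority; this.body = body;
        }
    }

    // Returns epoch -> (product -> total clicks in that epoch)
    public static Map<Long, Map<String, Long>> parse(List<Event> events,
                                                     long epochSeconds) {
        Map<Long, Map<String, Long>> epochs = new TreeMap<>();
        for (Event e : events) {
            long epoch = e.timestamp / epochSeconds;   // batch by timestamp
            long clicks = Long.parseLong(e.body.getOrDefault("clicks_count", "0"));
            epochs.computeIfAbsent(epoch, k -> new TreeMap<>())
                  .merge(e.body.get("product"), clicks, Long::sum);
        }
        return epochs;
    }
}
```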
As shown in Figure 5(a), first, each newly parsed key-value pair is represented as a partial aggregation object (PAO). Second, the PAO is serialized and the tuple (hash, size, serializedPAO) is appended to the root node's buffer, where hash is a hash of the key, and size is the size of the serialized PAO. Unlike a binary search tree, in which inserting a value requires traversing the tree and then performing the insert at the right place, the CBT simply defers the insert. Then, when a buffer reaches some threshold (e.g., half its capacity), it is flushed into the buffers of nodes at the next level. To flush the buffer, the system sorts the tuples by hash value, decompresses the buffers of the nodes at the next level, and partitions the tuples into those receiving buffers based on the hash value. Such an insertion behaves like a B-tree: a full leaf is split into a new leaf and the new leaf is added to the parent of the leaf. More detail about the CBT and its properties appears in [9].

[Figure 5: Between intervals, new k-v pairs are inserted into the CBT; the root buffer is sorted and aggregated; the buffer is then split into fragments according to the hash ranges of children, and each fragment is compressed and copied into the respective child node; at each interval, the CBT is flushed.]

Key for ELF is that the CBT makes possible rapid incremental updates over considerable time windows, i.e., extensive sets of historical records. Toward that end, the following APIs are exposed for controlling the CBT: (i) "insert" to fill, (ii) "flush" to fetch the tree's entire groupby-aggregate results, and (iii) "empty" to empty the tree's buffers, which is necessary when the application wants to start a new batch. By using a series of "insert", "flush", and "empty" operations, ELF can implement many of the standard operations in streaming systems, such as sliding windows, incremental processing, and synchronous batching.

For example, as shown in Figure 5(b), let the interval be 5 s: a sale-prediction application tracks the up-to-date #clicks for the product Angry Birds, by inserting new key-value pairs and periodically flushing the CBT. The application obtains the local agent's results for intervals [0,5), [0,10), [0,15), etc. as PAO5, PAO10, PAO15, etc. If the application needs a windowing value for [5,15), rather than repeatedly adding the counts in [5,10) and [10,15) with multiple operations, it can simply perform one single operation PAO15 ⊖ PAO5, where ⊖ is an "invertible reduce". In another example, using synchronous batching, an application can start a new batch by erasing past records, e.g., tracking the promotion effect when a new advertisement is launched. In this case, all agents' CBTs coordinate to perform a simultaneous "empty" operation via a multicast protocol from the middle-level's DHT, as described in more detail in Sec. 2.3.

Why CBTs? Our concern is performance. Consider using an in-memory binary search tree to maintain key-value pairs as the application's states, without buffering and compression. In this case, inserting an element into the tree requires traversing the tree and performing the insert, a read and a write operation per update, leading to poor performance. It is not necessary, however, to aggregate each new element in such an aggressive fashion: integration can occur lazily. Consider, for instance, an application that determines the top-10 most popular items, updated every 30 s, by monitoring streams of data from some e-commerce site. The incoming rate can be as high as millions per second, but CBTs need only be flushed every 30 s to obtain the up-to-date top-10 items. The key to efficiency lies in that the "flush" is performed in relatively large chunks, amortizing its cost across a series of small "insert new data" operations: decompression of the buffer is deferred until we have batched enough inserts in the buffer, thus enhancing throughput.

2.3 DHT-based SRT

ELF's agents are analogous to stateful vertices in dataflow systems, constructed into a directed graph in which data passes along directed edges. Using the terms vertex and agent interchangeably, this subsection describes how we leverage DHTs to construct these dataflow graphs, thus obtaining unique benefits in flexibility and scalability. To restate our goal, we seek a design that meets the following criteria:

1. capability to host hundreds of concurrently running applications' dataflow graphs;
2. with each dataflow graph potentially including millions of vertices, as our vertices reside in distributed webservers; and
3. where each dataflow graph can flexibly interact with others for runtime coordination of its execution.

ELF leverages DHTs to create a novel 'many masters' decentralized infrastructure. As shown in Figure 6, all agents are structured into a P2P overlay with DHT-based routings.
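To make Sec. 2.2's invertible-reduce windowing concrete: because each flushed PAO snapshot is a cumulative groupby-aggregate, a window [t1, t2) is one per-key subtraction of snapshot t1 from snapshot t2. A minimal sketch, hypothetical illustration code with click counts as the aggregate (the class name WindowBySubtraction is invented):

```java
import java.util.*;

// Hypothetical sketch of the windowing arithmetic: each CBT flush yields
// a cumulative snapshot PAO_t (product -> clicks since time 0), and the
// window [t1, t2) is computed as PAO_t2 "minus" PAO_t1, a per-key
// subtraction, i.e., one invertible-reduce operation.
public class WindowBySubtraction {
    public static Map<String, Long> window(Map<String, Long> paoT2,
                                           Map<String, Long> paoT1) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> e : paoT2.entrySet()) {
            long diff = e.getValue() - paoT1.getOrDefault(e.getKey(), 0L);
            if (diff != 0) out.put(e.getKey(), diff);   // drop empty keys
        }
        return out;
    }
}
```

With the counts from Figure 5(b) (100 clicks by t=5, 250 by t=15), the window [5,15) comes out as 150, matching the figure.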
[Figure 6: Shared reducer tree construction for many jobs. JOIN messages keyed by hash('micro-promotion'), hash('product-bundling'), or hash('sale-prediction') converge on per-application masters; key-value pairs such as (a, 2), (a, 5), (b, 2), (c, 1) at the leaves are aggregated level by level into (a, 7), (b, 5), (c, 2) along each application's path.]

Each agent has a unique, 128-bit nodeId in a circular nodeId space ranging from 0 to 2^128 - 1. The set of nodeIds is uniformly distributed; this is achieved by basing the nodeId on a secure hash (SHA-1) of the node's IP address. Given a message and a key, it is guaranteed that the message is reliably routed to the node with the nodeId numerically closest to that key, within log_{2^b} N hops, where b is a base with typical value 4. SRTs for many applications (jobs) are constructed as follows.

The first step is to construct application-based groups of agents and ensure that these groups are well balanced over the network. For each job's group, this is done as depicted in Figure 6 left: each agent parsing the application's stream routes a JOIN message using appId as the key. The appId is the hash of the application's textual name concatenated with its creator's name. The hash is computed using the same collision-resistant SHA-1 hash function, ensuring a uniform distribution of appIds. Since all agents belonging to the same application use the same key, their JOIN messages will eventually arrive at a rendezvous node, with nodeId numerically closest to the appId. The rendezvous node is set as the job's master. The unions of all messages' paths are registered to construct the group, in which each internal node, as a forwarder, maintains a children table for the group containing an entry (IP address and appId) for each child. Note that the uniformly distributed appIds ensure the even distribution of groups across all agents.

The second step is to "draw" a directed graph within each group to guide the dataflow computation. Like other streaming systems, an application specifies its dataflow graph as a logical graph of stages linked by connectors. Each connector can simply transfer data to the next stage, e.g., a filter function, or shuffle the data using a partitioning function between stages, e.g., a reduce function. In this fashion, one can construct the pipeline structures used in most stream processing systems, but by using feedback, we can also create nested cycles in which a new epoch's input is based on the last epoch's feedback result, explained in more detail next.

Pipeline structures. We build aggregation trees using DHTs for pipeline dataflows, in which each level of the tree progressively 'aggregates' the data until the result arrives at the root. For a non-partitioning function, the agent as a vertex simply processes the data stream locally using the CBT. For a partitioning function like TopKCount, in which the key-value pairs are shuffled and gradually truncated, we build a single aggregation tree, e.g., Figure 6 middle shows how the groupby, aggregate, and sort functions are applied at each level-i subtree's root for the micro-promotion job. For partitioning functions like WordCount, we build m aggregation trees to divide the keys into m ranges, where each tree is responsible for the reduce function of one range, thus avoiding overload at the root when aggregating a large key space. Figure 6 right shows how the 'fat-tree'-like aggregation tree is built for the product-bundling job.

Cycles. Naiad [20] uses timestamp vectors to realize dataflow cycles, whereas ELF employs multicast services operating on a job's aggregation tree to create feedback loops in which the results obtained for a job's last epoch are re-injected into its sources. Each job's master has complete control over the contents of feedback messages and how often they are sent. Feedback messages, therefore, can be used to go beyond supporting cyclic jobs to also exert application-specific controls, e.g., set a new threshold, synchronize a new batch, install new job functionality for agents to use, etc.

Why SRTs? The use of DHTs affords the efficient construction of aggregation trees and multicast services, as their converging properties guarantee aggregation or multicast to be fulfilled within only O(log N) hops. Further, a single overlay can support many different independent groups, so that the overheads of maintaining a proximity-aware overlay network can be amortized over all those group spanning trees. Finally, because all of these trees share the same set of underlying agents, each agent can be an input leaf, an internal node, the root, or any combination of the above, keeping the computation load well balanced. This is why we term these structures "shared reducer trees" (SRTs).

Implementing feedback loops using DHT-based multicast benefits load and bandwidth usage: each message is replicated at the internal nodes of the tree, at each level, so that only m copies are sent to each internal node's m children, rather than having the tree root broadcast N copies to N total nodes. Similarly, coordination across jobs via the DHT's routing methods is entirely decentralized, benefiting scalability and flexibility, the latter because concurrent ELF jobs use event-based methods to remain responsive to other jobs and/or to user interaction.
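The rendezvous step in SRT construction, JOIN messages keyed by appId converging on the node numerically closest to that id, can be sketched as follows. This is hypothetical illustration code: the id space is shrunk to small longs on a toy ring, whereas ELF uses 128-bit SHA-1 ids and Pastry's prefix routing:

```java
import java.util.*;

// Hypothetical sketch of rendezvous: all agents of one application route
// JOIN messages keyed by the same appId, and the DHT delivers them to the
// live node whose nodeId is numerically closest to appId on the circular
// id space. That node becomes the job's master (the SRT root).
public class Rendezvous {
    public static long master(Collection<Long> nodeIds, long appId, long ringSize) {
        long best = -1, bestDist = Long.MAX_VALUE;
        for (long id : nodeIds) {
            long d = Math.abs(id - appId);
            d = Math.min(d, ringSize - d);        // circular (ring) distance
            if (d < bestDist) { bestDist = d; best = id; }
        }
        return best;
    }
}
```

Because every agent of a job uses the same key, all of their JOIN paths converge on this one node, and the union of those paths forms the group's tree.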
2.4 ELF API

Table 1 and Table 2 show ELF's data plane and control plane APIs, respectively. The data plane APIs concern data processing within a single application. The control plane APIs are for coordination between different applications.

Subscribe(Id appid)
    Vertex sends a JOIN message to construct the SRT whose root's nodeId equals appid.
OnTimer()
    Callback. Invoked periodically. This handler has no return value. The master uses it for its periodic activities.
SendTo(Id nodeid, PAO paos)
    Vertex sends the key-value pairs to the parent vertex with nodeid, resulting in a corresponding invocation of OnRecv.
OnRecv(Id nodeid, PAO paos)
    Callback. Invoked when the vertex receives serialized key-value pairs from the child vertex with nodeid.
Multicast(Id appid, Message message)
    Application's master publishes control messages to vertices, e.g., synchronizing CBTs to be emptied; the master publishes the last epoch's result, encapsulated into a message, to all vertices for iterative loops; or the master publishes new functions, encapsulated into a message, to all vertices for updating functions.
OnMulticast(Id appid, Message message)
    Callback. Invoked when the vertex receives a multicast message from the application's master.

Table 1: Data plane API

Route(Id appid, Message message)
    Vertex or master sends a message to another application. The appid is the hash value of the target application's name concatenated with its creator's name.
Deliver(Id appid, Message message)
    Callback. Invoked when the application's master receives an outsider message from another application with appid. This outsider message is usually a query for the application's status, such as results.

Table 2: Control plane API

A sample use, shown in Figure 7, contains partial code for the micro-promotion application. It multicasts update messages periodically to empty agents' CBTs for synchronous batch processing. It multicasts top-k results periodically to agents. Upon receiving the results, each agent checks whether it has a top-k product, and if so, an extra discount will appear on the web page. To implement the product-bundling application, the agents subscribe to multiple SRTs to separately aggregate key-value pairs, and the agents' associated CBTs are only flushed (without being synchronously emptied), to send a sliding window value to the parent vertices for asynchronous processing. To implement the sale-prediction application, the master encapsulates its query and routes it to the other two applications to get their intermediate results using Route.

ArrayList<String> topk;

void OnTimer() {
    if (this.isRoot()) {
        this.Multicast(hash("micro-promotion"), new topk(topk));
        this.Multicast(hash("micro-promotion"), new update());
    }
}

void OnMulticast(Id appid, Message message) {
    // if it is a topk message, show the extra discount
    if (message instanceof topk) {
        for (String product : message.topk) {
            if (this.hasProduct(product)) {
                // display the extra discount on the web page ...
            }
        }
    }
    // if it is an update message, start a new batch
    else if (message instanceof update) {
        // if a leaf, flush the CBT and send the update to the parent vertex
        if (!this.containsChild(appid)) {
            PAO paos = cbt.get(appid).flush();
            this.SendTo(this.getParent(appid), paos);
            cbt.get(appid).empty();
        }
    }
}

Figure 7: ELF implementation of the micro-promotion application.

3 Implementation

This section describes ELF's architecture and prototype implementation, including its methods for dealing with faults and with stragglers.

3.1 System Architecture

[Figure 8: Components of ELF. Master processes (job builder, job parser, job tracker with fault and straggler handling, job feedback, job coordinator) and worker processes (input receiver, job parser, PAO execution, CBT, SRT proxy) are mapped onto the agents.]

Figure 8 shows ELF's architecture. We see that unlike other streaming systems with static assignments of nodes to act as masters vs. workers, all ELF agents are treated equally. They are structured into a P2P overlay, in which each agent has a unique nodeId in the same flat circular 128-bit node identifier space. After an application is launched, agents that have the target streams required by the application are automatically assembled into a shared reducer tree (SRT) via their routing substrates. It is only at that point that ELF assigns one or more of the following roles to each participating agent:

Job master is the SRT's root, which tracks its own job's execution and coordinates with other jobs' masters. It has four components:

• Job builder constructs the SRT to roll up and aggregate the distributed PAO snapshots processed by local CBTs.
local CBTs.
• Job tracker detects key-value errors, recovers from faults, and mitigates stragglers.
• Job feedback is continuously sent to agents for iterative loops, including the last epoch's results to be iterated over, new job functions for agents to be updated on-the-fly, application-specific control messages like 'new discount', etc.
• Job coordinator dynamically interacts with other jobs to carry out interactive queries.

Job worker uses a local CBT to implement some application-specific execution model, e.g., asynchronous stream processing with a sliding window, synchronous incremental batch processing with historical records, etc. For consistency, job workers are synchronized by the job master to 'roll up' the intermediate results to the SRT for global aggregation. Each worker has five components:
• Input receiver observes streams. Its current implementation assumes logs are collected with Flume [1], so it employs an interceptor to copy stream events, then parses each event into a job-specified key-value pair. A typical Flume event is a tuple with timestamp, source IP, and event body that can be split into columns based on different key-based attributes.
• Job parser converts a job's SQL description into a workflow of operator functions f, e.g., aggregations, grouping, and filters.
• PAO execution: each key-value pair is represented as a partial aggregation object (PAO) [9]. New PAOs are inserted into and accumulated in the CBT. When the CBT is "flushed", new and past PAOs are aggregated and returned, e.g., <argu, 2, f:count()> merges with <argu, 5, f:count()> to become the PAO <argu, 7, f:count()>.
• CBT resides in the local agent's memory, but can be externalized to SSD or disk, if desired.
• SRT proxy is analogous to a socket, used to join the P2P overlay and link with other SRT proxies to construct each job's SRT.

A key difference from other streaming systems is that ELF seeks to obtain scalability by changing the system architecture from 1:n to m:n, where each job has its own master and an appropriate set of workers, all of which are mapped to a shared set of agents. With many jobs, therefore, an agent may act as one job's master and another job's worker, or any combination thereof. Further, using DHTs, jobs' reducing paths are constructed with few overlaps, resulting in ELF's management being fully decentralized and load balanced. The outcome is straightforward scaling to large numbers of concurrently running jobs, with each master controlling its own job's execution, including reacting to failures, mitigating stragglers, altering a job as it is running, and coordinating with other jobs at runtime.

3.2 Consistency

The consistency of states across nodes is an issue in streaming systems that eagerly process incoming records. For instance, in a system counting page views from male users on one node and female users on another, if one of the nodes is backlogged, the ratio of their counts will be wrong [28]. Some systems, like Borealis [6], synchronize nodes to avoid this problem, while others, like Storm [4], ignore it.

ELF's consistency semantics are straightforward, leveraging the fact that each CBT's intermediate results (PAO snapshots) are uniquely named for different timestamped intervals. Like a software combining tree barrier, each leaf uploads the first interval's snapshot to its parent. If the parent discovers that it is the last one in its direct list of children to do so, it continues up the tree by aggregating the first interval's snapshots from all branches; else it blocks. Figure 9 shows an example in which agent1 loses snapshot0, and thus blocks agent5 and agent7. Proceeding in this fashion, a late-coming snapshot eventually blocks the entire upstream path to the root. All snapshots from distributed CBTs are thus sequentially aggregated.

3.3 Fault Recovery

ELF handles transmission faults and agent failures, transient or permanent. For the former, the current implementation uses a simple XOR protocol to detect the integrity of records transferred between each source and destination agent. Upon an XOR error, records will be resent. We deal with agent failures by leveraging CBTs and the robust nature of the P2P overlay. Upon an agent's failure, the dataset cached in the agent's CBT is re-issued, the SRT is re-constructed, and all PAOs are recomputed using a new SRT from the last checkpoint.

In an ongoing implementation, we also use hot replication to support live recovery. Here, each agent in the overlay maintains a routing table, a neighborhood set, and a leaf set. The neighborhood set contains the nodeIds hashed from the webservers' IP addresses that are closest to the local agent. The job master periodically checkpoints each agent's snapshots in the CBT, by asynchronously replicating them to the agent's neighbors.

Figure 9: Example of the leaping straggler approach. Agent1 notifies all members to discard snapshot0. (PAO snapshots are sent sequentially; a key-value error at agent1 blocks agents 5 and 7 until the job master either resends or leaps past interval 0.)
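The per-interval blocking rule of Section 3.2 can be sketched in code. This is a minimal single-parent model for intuition only; the class and method names are ours, not ELF's actual API, and snapshots are reduced to plain counts:

```java
import java.util.*;

// Minimal sketch of per-interval snapshot aggregation at one SRT parent.
// The parent buffers children's snapshots by interval and merges an
// interval only once every direct child has reported it; a missing
// snapshot blocks all later intervals, so merged results come out in
// strict interval order, as in the combining-tree barrier of Sec. 3.2.
class ParentVertex {
    private final Set<String> children;
    private final Map<Integer, Map<String, Long>> pending = new HashMap<>();
    private int nextInterval = 0;                 // lowest interval not yet merged
    private final List<Long> merged = new ArrayList<>();

    ParentVertex(Set<String> children) { this.children = children; }

    // Called when a child uploads its snapshot (here, just a count).
    void onSnapshot(String child, int interval, long count) {
        pending.computeIfAbsent(interval, k -> new HashMap<>()).put(child, count);
        // Merge strictly in order; an incomplete interval blocks the rest.
        while (complete(nextInterval)) {
            long sum = pending.remove(nextInterval).values()
                              .stream().mapToLong(Long::longValue).sum();
            merged.add(sum);
            nextInterval++;
        }
    }

    private boolean complete(int interval) {
        Map<String, Long> got = pending.get(interval);
        return got != null && got.keySet().containsAll(children);
    }

    List<Long> mergedResults() { return merged; }
}
```

In this model, a child that never delivers interval 0 stalls every later interval on the upstream path, which is exactly the situation the leaping-straggler option of Section 3.4 resolves by multicasting "discard snapshot0".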
ELF's approach to failure handling is similar to that of Spark and other streaming systems, but has potential advantages in data locality because the neighborhood set maintains geographically close (i.e., within the rack) agents, which in turn can reduce synchronization overheads and speed up the rebuild process, particularly in datacenter networks with limited bisection bandwidths.

3.4 Straggler Mitigation

Straggler mitigation, including dealing with transient slowdown, is important for maintaining low end-to-end delays for time-critical streaming jobs. Users can instruct ELF jobs to exercise two possible mitigation options. First, as in other stream processing systems, speculative backup copies of slow tasks could be run in neighboring agents, termed the "mirroring straggler" option. The second option, in actual current use by ELF, is the "leaping straggler" approach, which skips the delayed snapshot and simply jumps to the next interval to continue the stream computation.

Straggler mitigation is enabled by the fact that each agent's CBT states are periodically checkpointed, with a timestamp at every interval. When a CBT's PAO snapshots are rolled up from leaves to root, a straggler will cause all of its upstream agents to be blocked. In the example shown in Figure 9, agent1 has a transient failure and fails to resend the first checkpoint's data for some short duration, blocking the computations in agent5 and agent7. Using a simple threshold to identify it as a straggler – whenever its parent determines it to have fallen two intervals behind its siblings – agent1 is marked as a straggler. Agent5 can then use the leaping straggler approach: it invalidates the first interval's checkpoints on all agents via multicast, and then jumps to the second interval.

The leaping straggler approach leverages the streaming nature of ELF, maintaining timeliness at reduced levels of result accuracy. This is critical for streaming jobs operating on realtime data, as when reacting quickly to changes in web user behavior or when dealing with realtime sensor inputs, e.g., indicating time-critical business decisions or analyzing weather changes, stock ups and downs, etc.

4 Evaluation

ELF is evaluated with an online social network (OSN) monitoring application and with the well-known WordCount benchmark application. Experimental evaluations answer the following questions:
• What performance and functionality benefits does ELF provide for realistic streaming applications (Sec. 4.1)?
• What are the throughput and processing latency seen for ELF jobs, and how does ELF scale with the number of nodes and number of concurrent jobs (Sec. 4.2)?
• What are the overheads of ELF in terms of CPU, memory, and network load (Sec. 4.3)?

4.1 Testbed and Application Scenarios

Experiments are conducted on a testbed of 1280 agents hosted by 60 server blades running Linux 2.6.32, all connected via Gigabit Ethernet. Each server has 12 cores (two 2.66GHz six-core Intel X5650 processors), 48GB of DDR3 RAM, and one 1TB SATA disk.

ELF's functionality is evaluated by running an actual application requiring both batch and stream processing. The application's purpose is to identify social spam campaigns, such as compromised or fake OSN accounts used by malicious entities to execute spam and spread malware [13]. We use the most straightforward approach to identify them – clustering all spam containing the same label, such as a URL or account, into a campaign. The application consumes the events, all labeled as "sales", from the replay of a previously captured data stream from Twitter's public API [3], to determine the top-k most frequently twittering users publishing "sales". After listing them as suspects, job functions are changed dynamically, to investigate further, by setting different filtering conditions and zooming in to different attributes, e.g., locations or numbers of followers.

The application is implemented to obtain online results from live data streams via ELF's agents, while in addition also obtaining offline results via a Hadoop/HBase backend. Having both live and historical data is crucial for understanding the accuracy and relevance of online results, e.g., to debug or improve the online code. ELF makes it straightforward to mix online with offline processing, as it operates in ways that bypass the storage tier used by Hadoop/HBase.

Specifically, live data streams flow from webservers to ELF and to HBase. For the web tier, there are 1280 emulated webservers generating Twitter streams at a rate of 50 events/s each. Those streams are directly intercepted by ELF's 1280 agents, which filter tuples for processing; concurrently, the unchanged streams are gathered by Flume to be moved to the HBase store. The storage tier has 20 servers, in which the name node and job tracker run on a master server, and the data nodes and task trackers run on the remaining machines. Task trackers are configured to use two map and two reduce slots per worker node. HBase coprocessor, which is analogous to Google's BigTable coprocessor, is used for offline batch processing.

Comparison with Muppet and Storm. With ad-hoc queries sent from a shell via ZeroMQ [5], comparative results are obtained for ELF, Muppet, and Storm, with varying sizes of sliding windows. Figure 10a shows that ELF consistently outperforms Muppet. It achieves performance superior to Storm for large window sizes, e.g., 300 s, because the CBT's data structure stabilizes 'flush' cost by organizing the compressed historical records in an (a, b) tree, enabling fast merging with large numbers of past records.
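The benefit ELF draws from the CBT here is that inserts do not pay aggregation cost up front; merging is deferred to flush time. The pattern can be illustrated with a toy buffer (ours, not the actual compressed buffer tree, which additionally keeps the buffered PAOs compressed in an (a, b) tree):

```java
import java.util.*;

// Toy illustration of lazy aggregation: insert() is a cheap append with
// no merging, and all aggregation work is deferred to flush(), which
// merges the buffered (key, count) pairs in one pass. The real CBT
// extends this idea with a tree of compressed buffers to keep per-key
// memory low over large time windows.
class LazyAggregator {
    private final List<Map.Entry<String, Long>> buffer = new ArrayList<>();

    void insert(String key, long count) {          // O(1) append, no merging
        buffer.add(new AbstractMap.SimpleEntry<>(key, count));
    }

    Map<String, Long> flush() {                    // merge everything at once
        Map<String, Long> mergedCounts = new HashMap<>();
        for (Map.Entry<String, Long> e : buffer)
            mergedCounts.merge(e.getKey(), e.getValue(), Long::sum);
        buffer.clear();
        return mergedCounts;
    }
}
```

Deferring the merge is what makes large windows cheap: the cost of combining past and new records is paid once per flush rather than once per arriving record.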
Figure 10: Processing times, latency of function changes, and recovery results of ELF on the Twitter application. (a) Comparison of processing times for Storm, Muppet, and ELF with 5 s to 300 s windows. (b) Latency for job function changes (message: 7.3 ms; filter function: 204.6 ms; map/reduce function: 561.3 ms). (c) Fault and straggler recovery times vs. the number of faulty or straggling agents (1 to 128).

Job function changes. Novel in ELF is the ability to change job functions at runtime. As Figure 10b shows, it takes less than 10 ms for the job master to notify all agents about some change, e.g., to publish discounts in the microsale example. To zoom in on different attributes, the job master can update the filter functions on all agents, which takes less than 210 ms, and it takes less than 600 ms to update user-specified map/reduce functions.

Fault and Straggler Recovery. Fault and straggler recovery are evaluated with methods that use human intervention. To cause permanent failures, we deliberately remove some working agents from the datacenter to evaluate how fast ELF can recover. The time cost includes recomputing the routing table entries, rebuilding the SRT links, synchronizing CBTs across the network, and resuming the computation. To cause stragglers via transient failures, we deliberately slow down some working agents, by co-locating them with other CPU-intensive and bandwidth-aggressive applications. The leap-ahead method for straggler mitigation is fast, as it only requires the job master to send a multicast message notifying everyone to drop the intervals in question.

Figure 10c reports recovery times with varying numbers of agents. The top curve shows that the delay for fault recovery is about 7 s, with a very small rate of increase with increasing numbers of agents. This is due to the DHT overlay's internally parallel nature of repairing the SRT and routing table. The bottom curve shows that the delay for ELF's leap-ahead approach to dealing with stragglers is less than 100 ms, because the multicast and subsequent skipping time costs are trivial compared to the cost of full recovery.

4.2 Performance

Data streams propagate from ELF's distributed CBTs as leaves, to the SRT for aggregation, until the job master at the SRT's root has the final results. Generated live streams are first consumed by CBTs, and the SRT only picks up truncated key-value pairs from CBTs for subsequent shuffling. Therefore, the CBT, as the starting point for parallel streaming computations, directly influences ELF's overall throughput. The SRT, as the tree structure for shuffling key-value pairs, directly influences total job processing latency, which is the time from when records are sent to the system to when results incorporating them appear at the root. We first report the per-node throughput of ELF in Figure 11, then report data-shuffling times for different operators in Figures 12a and 12b. The degree to which loads are balanced, an important ELF property when running a large number of concurrent streaming jobs, is reported in Figure 12c.

Throughput. ELF's high throughput for local aggregation, even with substantial amounts of local state, is based in part on the efficiency of the CBT data structure used for this purpose. Figure 11 compares the aggregation performance and memory consumption of the Compressed Buffer Tree (CBT) with a state-of-the-art concurrent hashtable implementation from Google's sparsehash [2]. The experiment uses a microbenchmark running the WordCount application on a set of input files containing varying numbers of unique keys. We measure the per-unique-key memory consumption and throughput of the two data structures. Results show that the CBT consumes significantly less memory per key, while yielding similar throughput compared to the hashtable. Tests are run with equal numbers of CPUs (12 cores), and hashtable performance scales linearly with the number of cores.

Figure 11: Comparison of CBT with Google Sparsehash: memory per key (bytes) and throughput (x10^6 keys/s) vs. the number of unique keys (x10^6).
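The WordCount-style aggregation measured above boils down to merging partial aggregation objects per key, as in the <argu, 2, f:count()> + <argu, 5, f:count()> = <argu, 7, f:count()> example of Section 3.1. A minimal sketch of a count() PAO (class and field names are illustrative, not ELF's actual API):

```java
// Sketch of a partial aggregation object (PAO) for the count() function:
// two PAOs with the same key merge by summing their partial counts,
// e.g. ("argu", 2) merged with ("argu", 5) yields ("argu", 7). Merging is
// associative, so PAOs can be combined in any order, locally in a CBT or
// across branches of the SRT.
class CountPAO {
    final String key;
    final long value;

    CountPAO(String key, long value) { this.key = key; this.value = value; }

    CountPAO merge(CountPAO other) {
        if (!key.equals(other.key))
            throw new IllegalArgumentException("PAOs merge per key");
        return new CountPAO(key, value + other.value);
    }
}
```

Associativity is what lets the same merge run both inside a CBT flush and at every level of the shared reducer tree.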
Figure 12: Performance evaluation of ELF on data-shuffling time and load balance. (a) Data-shuffling times of the max, min, sum, and avg operators run separately, vs. the number of agents (10 to 1280). (b) The same operators run simultaneously. (c) Normal probability plot of the number of SRT roots per agent when deploying 500, 1000, and 2000 SRTs.

Table 13a. Per-node runtime overheads:

SPS        CPU (%used)   Memory (%used)   I/O (wtps)   C-switch (cswsh/s)
ELF        2.96          5.73             3.39         780.44
Flume      0.14          5.48             2.84         259.23
S-master   0.06          9.63             2.96         652.22
S-worker   1.17          15.91            11.47        11198.96

SPS: stream processing system. wtps: write transactions per second. cswsh/s: context switches per second.

Figure 13: Overheads evaluation of ELF on runtime cost and network cost. (a) Runtime overheads of ELF vs. others (table above). (b) Additional packets per agent per second and (c) additional bytes per agent per second, for 5 s, 10 s, and 30 s update intervals, vs. the number of agents.

ELF's per-node throughput of over 1,000,000 keys/s is in a range similar to Spark Streaming's reported best throughput (640,000 records/s) for Grep, WordCount, and TopKCount when running on 4-core nodes. It is also comparable to the speeds reported for commercial single-node streaming systems, e.g., Oracle CEP reports a throughput of 1 million records/s on a 16-core server and StreamBase reports 245,000 records/s on an 8-core server.

Operators. Figures 12a and 12b report the data-shuffling time of ELF when running four operators separately vs. simultaneously. By data-shuffling time, we mean the time from when the SRT fetches a CBT's snapshot to when the result incorporating it appears at the root. max sorts key-value pairs in descending order of value, and min sorts in ascending order. sum is similar to WordCount, and avg refers to the frequency of words divided by the occurrence of key-value pairs. As sum does not truncate key-value pairs like max or min do, and avg is based on sum, naturally, sum and avg take more time than max and min.

Figures 12a and 12b demonstrate low performance interference between concurrent operators, because the data-shuffling times seen for both separate and concurrent operators are less than 100 ms. Given the fact that concurrent jobs reuse operators if processing logic is duplicated, the interference between concurrent jobs is also low. Finally, these results also demonstrate that the SRT scales well with the datacenter size, i.e., the number of webservers, as the reduce times increase only linearly with an exponential increase in the number of agents. This is because reduce times are strictly governed by an SRT's depth O(log16 N), where N is the number of agents in the datacenter.

Balanced Load. Figure 12c shows the normal probability plot for the expected number of roots per agent. These results illustrate a good load balance among participating agents when running a large number of concurrent jobs. Specifically, assuming the root is the agent with the highest load, results show that 99.5% of the agents are the roots of fewer than 3 trees when there are 500 SRTs total; 99.5% of the agents are the roots of fewer than 5 trees when there are 1000 SRTs total; and 95% of the agents are the roots of fewer than 5 trees when there are 2000 SRTs total. This is because of the independent nature of the trees' root IDs, which are mapped to specific locations in the overlay.

4.3 Overheads

We evaluate ELF's basic runtime overheads, particularly those pertaining to its CBT and SRT abstractions, and compare them with Flume and Storm. The CBT requires additional memory for maintaining intermediate results, and the SRT generates additional network traffic to maintain the overlay and its tree structure. Table 13a and Figures 13b and 13c present these costs, explained next.

Runtime overheads. Table 13a shows the per-node runtime overheads of ELF, Flume, Storm master, and Storm worker. Experiments are run on 60 nodes, each with 12 cores and 48GB RAM. As Table 13a shows, ELF's runtime overhead is small, comparable to Flume's, and much less than that of Storm master and Storm worker. This is because both ELF and Flume use a decentralized architecture that distributes the management load across the datacenter, which is not the case for Storm master. Compared to Flume, which only collects and aggregates streams, ELF offers the additional functionality of providing fast, general stream processing along with per-job management mechanisms.
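The O(log16 N) depth bound cited in Section 4.2 is easy to sanity-check: each routing hop in a Pastry-like overlay resolves one base-16 digit of the 128-bit nodeId, so tree depth grows with the base-16 logarithm of the agent count. A back-of-the-envelope helper (our illustration, not part of ELF):

```java
// Approximate SRT depth for a Pastry-like overlay that routes by base-16
// digits of the nodeId: depth ~= ceil(log16(N)) hops for N agents.
class SrtDepth {
    static int depth(long agents) {
        return (int) Math.ceil(Math.log(agents) / Math.log(16));
    }
}
```

For the paper's 1280-agent testbed this gives a depth of 3 hops, and even a million agents would need only about 5, which is consistent with the slow growth of reduce times and maintenance traffic as the datacenter scales.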
Network overheads. Figures 13b and 13c show the additional network traffic imposed by ELF with varying update intervals, when running the Twitter application. We see that the number of packets and the number of bytes sent per agent increase only linearly with an exponential increase in the number of agents, at a rate less than the increase in update frequency (from 1/30 to 1/5). This is because most packets are ping-pong messages used for overlay and SRT maintenance (initialization and keep-alive), for which any agent pings only a limited set of neighboring agents. We estimate from Figure 13b that when scaling to millions of agents, the additional number of packets per agent remains bounded at about 10.

5 Related Work

Streaming Databases. Early systems for stream processing developed in the database community include Aurora [29], Borealis [6], and STREAM [10]. Here, a query is composed of fixed operators, and a global scheduler decides which tuples and which operators to prioritize in execution based on different policies, e.g., interesting tuple content, QoS values for tuples, etc. SPADE [14] provides a toolkit of built-in operators and a set of adapters, targeting the System S runtime. Unlike SPADE or STREAM, which use SQL-style declarative query interfaces, Aurora allows query activity to be interspersed with message processing. Borealis inherits its core stream processing functionality from Aurora.

MapReduce-style Systems. Recent work extends the batch-oriented MapReduce model to support continuous stream processing, using techniques like pipelined parallelism, incremental processing for map and reduce, etc. MapReduce Online [12] pipelines data between map and reduce operators, by calling reduce with partial data for early results. Nova [22] runs as a workflow manager on top of an unmodified Pig/Hadoop software stack, with data passed in a continuous fashion. Nova positions itself as a tool more suitable for large batched incremental processing than for small data increments. Incoop [11] applies memoization to the results of partial computations, so that subsequent computations can reuse previous results for unchanged inputs. One-Pass Analytics [17] optimizes MapReduce jobs by avoiding expensive I/O blocking operations such as reloading map output.

iMR [19] offers the MapReduce API for continuous log processing and, similar to ELF's agent, mines data locally first, so as to reduce the volume of data crossing the network. CBP [18] and Comet [15] run MapReduce jobs on new data every few minutes for "bulk incremental processing", with all states stored in on-disk filesystems, thus incurring latencies as high as tens of seconds. Spark Streaming [28] divides input data streams into batches and stores them in memory as RDDs [27]. By adopting a batch-computation model, it inherits powerful fault tolerance via parallel recovery, but any dataflow modification, e.g., from pipeline to cyclic, has to be done via the single master, thus introducing overheads avoided by ELF's decentralized approach. For example, it takes Spark Streaming seconds to iterate and perform incremental updates, but milliseconds for ELF.

All of the above systems inherit MapReduce's "single master" infrastructure, in which parallel jobs consist of hundreds of tasks, and each single task is a pipeline of map and reduce operators. The single master node places those tasks, launches them, maybe synchronizes them, and keeps track of their status for fault recovery or straggler mitigation. The approach works well when the number of parallel jobs is small, but does not scale to hundreds of concurrent jobs, particularly when these jobs differ in their execution models and/or require customized management.

Large-scale Streaming Systems. Streaming systems like S4 [21], Storm [4], Flume [1], and Muppet [16] use a message-passing model in which a stream computation is structured as a static dataflow graph, and vertices run stateful code to asynchronously process records as they traverse the graph. There are limited optimizations on how past states are stored and how new states are integrated with past data, thus incurring high overheads in memory usage and low throughput when operating over larger time windows. For example, Storm asks users to write code to implement sliding windows for trending topics, e.g., using Map or HashMap data structures. Muppet uses an in-memory hashtable-like data structure, termed a slate, to store past keys and their associated values. Each key-value entry has an update trigger that is run when new records arrive and aggressively inserts new values into the slate. This creates performance issues when the key space is large or when the historical window size is large. ELF, instead, structures sets of key-value pairs as compressed buffer trees (CBTs) in memory, and uses lazy aggregation, so as to achieve high memory efficiency and throughput.

Systems using persistent storage to provide full fault tolerance. MillWheel [7] writes all states contained in vertices to some distributed storage system like BigTable or Spanner. Percolator [23] structures a web-indexing computation as triggers that run when new values are written into a distributed key-value store, but does not offer consistency guarantees across nodes. TimeStream [24] runs the continuous, stateful operators in Microsoft's StreamInsight [8] on a cluster, scaling with load swings through