Hancock: A Language for Processing Very Large-Scale Data
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Hancock: A Language for Processing Very Large-Scale Data Dan Bonachea Kathleen Fisher Anne Rogers Frederick Smithy AT&T Labs Shannon Laboratory 180 Park Avenue Florham Park, NJ 07932, USA bonachea@cs.berkeley.edu fkfisher,amrg@research.att.com fms@cs.cornell.edu Abstract Mawl [ABB+ 97] for dynamic web servers, and A signature is an evolving customer pro- Teapot [CRL96, CDR+ 97] for writing coher- ence protocols. In each case, the domain- le computed from call records. AT&T uses speci c language moved the burden of writ- signatures to detect fraud and to target mar- ing intricate domain code from the program- keting. Code to compute signatures can be mer to the compiler (or interpreter). We dicult to write and maintain because of the have designed and built a domain-speci c lan- volume of data. We have designed and imple- guage, called Hancock, to handle the com- mented Hancock, a C-based domain-speci c plexity that arises from the scale of signature programming language for describing signa- computations [CP98]. As with the earlier lan- tures. Hancock provides data abstraction guages, Hancock programs are easier to write, mechanisms to manage the volume of data debug, read, and maintain than the equiva- and control abstractions to facilitate loop- lent programs written in general-purpose pro- ing over records. This paper describes the gramming languages. design and implementation of Hancock, dis- Signatures are pro les of customers that cusses early experiences with the language, are updated daily from information collected and describes our design process. on AT&T's network. They are used for a vari- ety of purposes, including fraud detection and marketing. For example, AT&T uses signa- 1 Introduction tures to track the typical calling pattern for each of its customers. If customers far exceed There are many families of programs whose their usual calling levels, the fraud detection members have a high degree of commonal- system raises alerts. In contrast, a sudden ity; such a family is called a domain. When drop in their calling levels signals that they the commonality is inherently complex, a may have chosen another long-distance car- domain-speci c programming language may rier, triggering a marketing alert. Cortes and help domain programmers develop better Pregibon describe signatures and their uses software more quickly by factoring the com- in more detail [CP98]. plexity into the language. Recent examples At a high level, the computation of a signa- of domain-speci c languages and their do- ture is straightforward: process a list of call mains include Envision [SF97] for computer records and update data in a le based on vision, Fran [Ell97] for computer animation, those records. Unfortunately, performance GAL [TMC97] for video card device drivers, requirements that arise from the volume of Now in the EECS department at the University call records and the amount of stored pro le of California at Berkeley. data complicate matters substantially. E- y Now in the CS department at Cornell University ciently coping with the data requires a more
complex system architecture.. Furthermore, tween the representation that we can a ord the scale pressures programmers to conserve to store and the representation that we want instructions, which tends to reduce program to compute with complicates the code. readability. The complexity of the architec- Hancock is a C-based domain-speci c lan- ture and the code complicates testing, which guage that reduces the coding e ort of writ- is already dicult because of long testing cy- ing signatures and improves the clarity of the cles. resulting code by factoring into the language The scale of signature computations arises all the issues that relate to scale. In par- from both the number of calls and the number ticular, Hancock programmers can focus on of customers. On a typical weekday, there are the signature they want to compute rather more than 250 million calls made on AT&T's than the sca olding necessary to compute it. network. The call records that are used for The language includes constructs for describ- signature computations contain only 32 of the ing the overall system architecture, for speci- more than 200 bytes of data that are col- fying work that should be done in response lected for each call. To get a sense of the to events in the call stream, and for spec- scale of this data, it takes about 15 minutes ifying the representation of signature data. to read through one weekday's call data (ap- The Hancock compiler uses these constructs prox 6.5GB) and about two hours to com- to generate the boiler-plate code that makes pute a typical signature on one processor of a hand-written signatures dicult to maintain, 16 processor SGI Origin 2000. A typical sig- without sacri cing eciency. The Hancock nature tracks several hundred million phone runtime system supports external sorting and numbers. Even if we store only two bytes of ecient access to signature data. Hancock data per customer, the les are very large, does not address directly the problem of test- requiring a minimum of 500MB to hold the ing; instead, it reduces opportunities for bugs data. Such les also require signi cant addi- by relieving programmers of the burden of tional space to provide appropriate indexing. writing intricate code. This scale complicates the architecture of This paper describes the design and imple- signature programs because at such levels, mentation of Hancock. Section 2 describes I/O presents a signi cant bottleneck. To the domain in more detail and gives an ex- reduce this bottleneck, signature programs ample signature. Sections 3-5 present Han- must be structured to minimize I/O. In cock's data and control abstractions. We dis- particular, signature programs sort the call cuss the implementation of Hancock's com- records that they process to improve locality piler and runtime system in Section 6, early of reference to their signature les, and they experiences with Hancock in Section 7, and cache parts of these les to reduce the number our design process in Section 8. We conclude of disk accesses. Both techniques trade im- in Section 9. proved performance for increased code com- plexity. In addition to a ecting the high-level ar- chitecture, the amount of data complicates 2 Signature domain the low-level code structure. Programmers Signatures are a way to associate infor- writing signature programs have tended to re- mation with individual telephone numbers. spond to performance pressures by avoiding For each phone number, a typical signature function calls and thereby writing less mod- contains information about the characteris- ular code. The deeply nested code that re- tics of outgoing calls from that number and sults is dicult to decipher, which is a serious incoming calls to it. The outgoing data is problem because it is often the only documen- often split into sub-categories based on the tation of the signature it implements. An- type of call (for example, toll-free, interna- other complication comes from the fact that tional, intra-state, and other). This sec- the size of the signature les limits the num- tion describes the process of computing sig- ber of bytes we can store per phone number. natures, discusses the representation of signa- As a result, we cannot store exact informa- ture data, and presents an example signature. tion for each phone number; instead we must We rst de ne some basic terminology. Sig- approximate. Managing the translation be- natures maintain information about phone
Outgoing phase Incoming phase Update Update Sort by Sort by Outgoing Incoming Call records Origin Dialed Data Data Copy Old Data New Data Figure 1: High-level architecture of signature computations numbers rather than customers. A second Signature database can be used to match signature data with customer information, but that process is outside of Hancock's domain. A telephone number (973 555 1212) contains an area code (973), an exchange (555), and a line num- Freeze Thaw ber (1212). We often use the term exchange to mean the rst six digits of a phone num- ber (973 555) and the term line to mean the whole phone number. Hancock supports a few basic types for phone numbers, dates, and A p p r o x i m a ti o n times: area code t, exchange t, line t, date t, and time t. It also supports a type for call records: Figure 2: Signature data representation typedef struct { line_t origin; line_t dialed; Figure 1 shows a graphical depiction of the date_t connectTime; typical architecture we use for such compu- time_t duration; tations. We rst sort the call records by the char isIncomplete; originating phone number and then update char isIntl; the outgoing portion of the signature for each char isTollFree; phone number that made a call. We then re- ... peat this process, sorting the call records by } callRec_t; the dialed number and updating the incoming Each call record contains two phone numbers: portion of the signature for each phone num- the originating number and the dialed num- ber that received a call. This architecture re- ber. Each record also stores the time the call duces the I/O needed to process a signature was connected (connectTime) and the dura- by ensuring good locality for references to the tion of the call in seconds (duration). Call stored signature data at the cost of a pair of records also contain boolean ags describ- external sorts. ing the type of call: isIncomplete, isIntl, isTollFree, etc. 2.2 Signature representation 2.1 Signature computation As mentioned earlier, the size of signature les limits the number of bytes we can keep We update signature data daily from the for each phone number. As a result, we can- records of the calls made the previous day. not keep exact information for each line. In-
stead, we need to approximate. Conceptually, outgoingUsage(origin, calls) in each signature we have two views for our int cumTollFree = 0; data: a precise form called the signature view int cumOut = 0; and an approximate form called the approx- uApprox = Get usage data for origin imation view. We compute with the signa- uSig = Convert uApprox from buckets ture, but store the approximation. The choice to seconds of these two views is application speci c, as is the method for converting between them. for c in calls do We call the process of converting from the if c.isTollFree then signature to the approximation view freezing; cumTollFree += c.duration the converse process, thawing. (See Figure 2.) else Note that thaw(freeze(v)) does not neces- cumOut += c.duration sarily equal v because freezing is often lossy. endif Having two views for the data allows pro- done grammers to compute with the natural rep- uSig.outTollFree = resentation while saving disk space, but it blend(cumTollFree, uSig.outTollFree) comes at the cost of having to manage con- uSig.out = blend(cumOut, uSig.out) versions between the two views. uApprox = Convert uSig from seconds 2.3 Usage signature to buckets. Record new uApprox data for origin This section describes the Usage signature, which we use as a running example. Usage approximates cumulative daily call duration Figure 3: Pseudo-code for computing part of for incoming calls, outgoing calls, and out- the Usage signature going calls to toll-free numbers. Usage uses the same structure to track all three types for a single line, called origin in the pseudo- of calls. In particular, the signature type code. A detailed version of Usage that tracks for each is seconds; the approximation type, some additional information can be found in a bucket number between zero and fteen. Appendix A. The buckets represent non-uniform ranges of durations. Bucket zero corresponds to very short calls and serves as the default value for 3 Data model lines with no recorded activity, while bucket fteen corresponds to long calls. Thawing This section describes Hancock's data converts bucket numbers to seconds by asso- model, which includes a model for collections ciating a default duration with each bucket. of call records and a model for pro le data. Freezing identi es the bucket with the appro- priate range of times. 3.1 Call stream There are two parts to the daily Usage com- putation. First, we accumulate the precise We model a collection of call records as usage in seconds for a particular phone num- a stream in Hancock. Programmers use the ber for a particular type of call, and then we stream type operator to declare a new stream blend that data with the existing signature for type. Such a declaration names the new that type of call. The code for blending is: type and speci es both the physical and the logical representations of the records in the blend(new, old) = new*lambda + stream. Intuitively, the physical representa- old*(1-lambda) tion describes the (highly encoded) structure of the records as they exist on disk, while the Blending of this form is common in signature logical representation describes an expanded computations. form convenient for programming. The dec- Pseudo-code for computing part of the laration speci es a function to convert from Usage signature appears in Figure 3. For encoded physical to expanded logical records. brevity, we include only the code for com- For example, the following code declares a puting the outgoing portion of the signature stream type callStream:
stream callStream { The ufSig(b) = ... portion of the record getvalidcall : PCallRec_t => declaration speci es how to thaw ufApprox b to produce a ufSig value. Similarly, callRec_t; } ufApprox(s) = ... speci es how to freeze For this stream, the physical type is a uSig s to obtain a uApprox value. In this PCallRec t, the logical type is callRec t, application, uApprox(s) uses the function and the conversion function getvalidcall secToBucket to convert the seconds stored in constructs a logical record from a physical integer s into a bucket number, and uSig(b) one. Function getvalidcall has type uses the array bucketToSec to convert a bucket stored in b into the corresponding char getvalidcall(PCallRec_t *pc callRec_t *c) mean number of seconds for that bucket. To- gether, these expressions are an example of This function checks that the record *pc is where thaw(freeze(s)) does not equal s. valid, and if so, unpacks *pc into *c and re- Records can have more than one eld (in turns true to indicate a successful conver- which case the elds are named), and they sion. Otherwise, getvalidcall simply re- can be included in other record declarations. turns false. Programmers can declare vari- For example, a second record declaration ables of type callStream using standard C that appears in the Usage signature, uLine, syntax (for example, callStream calls). has the following form: We represent streams on disk as a directory that contains binary les. Hancock's wiring- record uLine(uSig, uApprox){ diagram mechanism, which we discuss in Sec- uField in; uField out; tion 5, provides a way to match the name of uField outTF; a directory to a stream. } 3.2 Signature data As in our earlier example, this declaration introduces three types: uLine, uSig, and Hancock provides two mechanisms for de- uApprox. The type uLine is the type of the scribing signature data. Programmers use the record. The types uSig and uApprox are record declaration to specify the format of a equivalent to C structures constructed from pro le and the map declaration to specify the the left and right types of uField: mapping between phone numbers and pro- les. typedef struct { Records are designed to capture the rela- int in; tionship depicted in Figure 2. They specify int out; int outTF; the types for the signature and approxima- } uSig; tion views of a pro le, as well as the freeze and thaw expressions for converting between typedef struct { these types. We use the following simple char in; record to introduce the pieces of a record char out; declaration. char outTF; } uApprox; record uField(ufSig, ufApprox){ int char; Because the record uLine does not include ufSig(b) = bucketToSec[b]; explicit freeze and thaw expressions, Han- uApprox(s) = secToBucket(s); cock constructs them automatically from the } freeze and thaw expressions of the record's This declaration introduces three types: elds. For this record, the compiler con- structs the following freeze function: uField: the type of the record, uApprox freeze(uSig s){ ufSig: the type of the left-hand view uApprox a; a.in = secToBucket(s.in); (int), and a.out = secToBucket(s.out); ufApprox: the type of the right-hand a.outTF = secToBucket(s.outTF); return a; view (char). }
The thaw function is constructed similarly. line_t pn; Although this example record only contains u = usage; elds with record types, elds may also have ... regular C types. In this context, C types can usage = u; be thought of as records with the same left- and right-hand type and the identity function gives an example of reading from and writing for freezing and thawing. to a map. The usual idiom for accessing map To convert between views, Hancock pro- data combines the indexing and view opera- vides the view operator ($). The expression tors as in: ua$uSig converts uApprox ua to a uSig, us- ing the conversion speci ed implicitly in the us = usage$uSig; uLine declaration. Expression us$uApprox Hancock also provides an operator :=: to behaves analogously. copy maps. In particular, the statement Hancock's map declaration provides a way to associate data with keys. Typically a map new_usage :=: usage; does not contain data for every possible key. Consequently, Hancock supports the notion causes uMap map new usage to be initialized of a default value, which is returned when a with the data from usage. programmer requests data for a key that does not have a value stored in the map. For ex- 3.3 Discussion ample, in the following map declaration, the By providing the programmer with appro- keys have type line t, the data are struc- priate abstractions, Hancock reduces the in- tures of type uApprox, and the default value tellectual burden of writing signatures. Al- is the constant uApprox structure consisting though programmers were freezing and thaw- of all zeros. ing their data prior to Hancock, they had not abstracted this idea. The result was numer- map uMap { ous bugs caused by confusing the types of the key line_t; two views. The structure enforced by records value uApprox; eliminates many of these bugs by requiring } default {0,0,0}; programmers to document the relationship between the two views, and to apply the view Defaults may also be speci ed as functions operator to convert between them explicitly. that use the key in question to compute an As an added bene t, records simplify signa- appropriate default for that key. For ex- ture code by generating conversion functions ample, the map declaration below speci es a automatically from record elds when possi- function, lineToDefault, to call with the ble. line in question when a default record is Maps provide an ecient implementation needed. for the most performance critical part of sig- nature programs. The index operation is map uMapF { more convenient than a library interface, and key line_t; it provides stronger type-checking. value uApprox; default lineToDefault; } 4 Computation Model A common use for this mechanism is to Hancock's computation model is built construct defaults by querying another data around the notion of iterating over a sorted source. stream of calls. Sorting call records en- The identi er uMap names a new map type. sures good locality for references to the signa- Variables of this type can be declared us- ture data that follow the sorting order. O - ing the usual C syntax (for example, uMap direction references may not have good local- usage). Hancock provides an indexing oper- ity, however. For example, if we sort the call ator to access values in a map. The records by the originating number, then up- code: dating the usage for that number would have
good locality, while updating the usage for processing the stream simpli es the process- the dialed number would su er from bad lo- ing code. cality. Consequently, signature computations The withevents clause speci es which are typically done with multiple passes over events are relevant to this signature. We the data, each sorting the data in a di erent call the occurrence of a group of calls in the order and updating a di erent part of the pro- stream an event. Depending on the sorting le data. We call each such pass, represented order, di erent groups of calls can be identi- in Figure 1 as a dashed box, a phase. ed in the call stream. For example, if the A phase starts by specifying a name and calls are sorted by originating number, the a parameter list. The body of a phase has possible groups are: three pieces: an iterate clause, a list of vari- able declarations, and a list of event clauses. calls for the same area code, The following pseudo-code outlines the out- calls for the same exchange, going phase of the usage signature: phase out(callStream calls, uMap usage) { calls for the same line, or iterate clause variable declarations a single call. event clauses } Within an event there are two important sub- events: the beginning of the block of calls This code de nes a phase out that takes two and its ending. The possible events and parameters: a stream of calls and a usage sub-events are related hierarchically; calls are map. The iterate clause speci es an initial nested with lines, lines within exchanges, and stream and a set of transformations to pro- exchanges within area codes. duce a new stream. Variables declared at the Putting all these pieces together, the iter- level of a phase are visible throughout that ate clause for the outgoing phase of Usage has phase. The event clauses specify how to pro- the following form: cess the transformed stream. The next two subsections describe these clauses. iterate over calls 4.1 Iterate Clause sortedby origin filteredby noIncomplete Through the iterate clause, Hancock allows withevents line, call; programmers to transform a stream of logical call records into a stream of records tailored This code speci es that calls is the initial to the particular signature computation. The stream, that it should be sorted by the origi- iterate clause has the following form: nating number, that it should be ltered us- ing function noIncomplete, which removes iterate over stream variable incomplete calls from the stream, and that sortedby sorting order two events, line and call, are of interest. filteredby lter predicate withevents event list 4.2 Event Clauses We explain each of these pieces in turn. The iterate clause speci es the events of The over clause names an initial stream to interest in the stream, but it does not indi- transform. The sortedby clause speci es a cate what to do when an event is detected. sorting order for this stream. For example, The event clauses of a phase specify code to the calls could be sorted by the originating execute in response to a given event. We il- phone number or by the connect time. At lustrate this structure in Figure 4. Program- present, the allowable sorting orders are hard- mers supply event code for the boxes, while coded into Hancock. The filteredby clause Hancock generates the control- ow code that speci es a predicate that is used to remove corresponds to the arrows. unneeded records from the stream. For ex- Each event has a name that corresponds to ample, a call stream may include incomplete the event name listed in the iterate clause. calls, which are not used by the Usage signa- Each event takes as a parameter the portion ture. Removing unneeded call records before of the call record that triggered the event.
Begin End event line(line_t pn) { uSig cumSec; begin { Area Code cumSec.out = 0; cumSec.outTF = 0; } Exchange /* process calls */ end { Line uSig us; us = usage$uSig; us.outTF = Call blend(cumSec.outTF, us.outTF); us.out = blend(cumSec.out, us.out); usage = us$uApprox; } Programmer supplied code } Control flow managed by Hancock event call(callRec_t c) { uSig line::cumSec; Figure 4: Hierarchical event structure if (c.isTollFreeCall) cumSec.outTF += c.duration; For example, an area code event is passed else the area code shared by the block of calls cumSec.out += c.duration; that triggered the event. The body of each } event has three parts: a list of variable dec- larations, a begin sub-event, and an end sub- Figure 5: Event code for Usage signature event. Variables declared at the level of an event are shared by both its sub-events and are available to events lower down in the 4.3 Discussion hierarchy, using a scoping operator (event name::variable name). A common pattern Hancock's event speci cations have several is for the programmer to declare a variable in advantages. First, the speci cations have the code for a line event, to initialize it in the the avor of function de nitions with their begin-line sub-event, to accumulate informa- attendant modularity advantages, but with- tion using that variable while processing calls, out their usual cost because the Hancock and then to store the accumulated informa- compiler expands the event de nitions in-line tion during the end-line sub-event code. with the control- ow code.. Second, having A sub-event is a kind (begin or end) fol- the compiler generate the control ow re- lowed by a C-style block that contains a list moves a signi cant source of bugs and com- of variable declaration and Hancock and C plexity from Hancock programs. Finally, pro- statements. Variables declared in a sub-event grammers can use the variable-sharing mech- are visible only within that sub-event. The anism to share information across events. code in Figure 5 implements the line and call A common question when thinking about events for the outgoing phase of Usage. designing a domain-speci c language is Note that the call event does not have whether or not a library would suce. We sub-events. We call such events bottom-level rejected the library option largely because events. The lowest event in any list of events of Hancock's control- ow abstractions. In is a bottom-level event. For example, Fre- particular, expressing Hancock's event model quency, a signature that we discuss later, does and the information sharing it provides not collect summary information for a set of proved awkward in a call-back framework, the calls. It tracks only the existence of at least usual technique for implementing such ab- one call, so line is its bottom-level event. stractions.
5 Wiring Diagram signature data. A question that arises is: are these references referring to data computed In the previous section, we explained that in a previous phase or to data computed the computing a signature may require multiple previous day? This question can be answered passes over the data. Hancock provides the by looking at sig main. If the parameters sig main construct to express the data ow to the phase do not include the original in- between passes and to connect command-line put map, then all references must be to the input to the variables in the program. The partially computed map. arcs between the phase boxes in Figure 1 de- The automatic generation of argument pict this construct. The following code im- parsing code is convenient and removes a plements sig main for Usage: source of tedium, but its real bene t is that void sig_main( it connects Hancock variables to their on- const callStream calls , disk counterparts. It helps programmers pro- exists const uMap y_usage , tect valuable data through the const, new new uMap usage ) { and exists quali ers. The runtime system usage :=: y_usage; catches attempts to write to constant data out(calls, usage); and generates error messages.1 It detects in(calls, usage); when data annotated as new already exists } or when data tagged with exists is not on There are three parameters to the Usage disk, in each case reporting a run-time error. signature. The rst is a stream that contains These data-protection features are important the raw call data. The const keyword indi- when it is time-consuming or even impossi- cates that this data is read-only. The syn- ble to reconstruct an accidentally overwritten tax () after the variable name calls signature. speci es that this parameter will be supplied as a command-line option using the -c ag. The colon indicates that this ag takes an 6 Implementation argument, in this case the name of the direc- tory that holds the binary call les. The ab- Our implementation of Hancock consists of sence of a colon indicates that the parameter a compiler that translates Hancock code into is a boolean ag. The Hancock compiler gen- plain C code, which is then compiled and erates code to parse command-line options. linked with a runtime system to produce ex- The second parameter is a Usage map, the ecutable code. We modi ed ckit, a C-to-C name of which is speci ed using the -u ag. translator written in ML[SCHO99] to parse The const quali er indicates the map is read- Hancock and translate the resulting extended only, while the exists annotation indicates parse tree into abstract syntax for plain C. the map must exist on disk. The nal param- The compiler generates code for the various eter names the Usage map used to hold the Hancock operators and clauses and for the result of this signature computation; the -U main routine. The runtime system, which is ag speci es the le name for this map. The written in C, manages the representation of new quali er indicates that the map must not Hancock data on-disk and in memory. It con- exist on disk. verts between these representations as neces- In general, the body of sig main is a se- sary and it mediates all access to the data. quence of Hancock and C statements. In Us- We were able to build Hancock relatively age, sig main copies the data from y usage quickly by leveraging other people's soft- into usage and then invokes Usage's outgoing ware. In particular, we built the com- and incoming phases with the raw call stream piler on top of ckit, an existing C-to-C and the Usage map under construction as ar- translator[SCHO99] written for the purpose guments. of building compilers for C-based domain- speci c languages. We also used a col- 5.1 Discussion lection of libraries: msort[Lin99, MMB92], The wiring diagram clari es the data ow s o[KV91], and vmalloc[Vo96]. Msort pro- between phases. For example, some signa- 1 We intend to check for writes to const data at tures need to make o -direction references to compile time eventually.
Table 1: Example signatures. signature values to the largest value in the range and values below the range to the low- est value in the range. Signature Description In all four signatures, the amount of Han- Usage Average daily usage cock code needed to describe the data is Frequency Calling frequency small. The largest, Bizocity, takes fewer than Activity Days since last seen 30 lines. Bizocity \Business-likeness" In terms of control- ow, the example sig- natures share the same high-level structure, each containing two phases: one to compute information for outgoing calls and another vides an external sorter and s o supports 64- for incoming calls. The event structures for bit les, making these two libraries particu- the signatures are di erent, however. Fre- larly useful in addressing issues of scale. quency tracks only the existence of a call for a given number, so its bottom-level event is the line event. Activity and Usage do work 7 Early experiences at both call and line events. Bizocity uses these events and does signi cant computation Table 1 brie y describes four signatures: at the exchange level. Usage, Frequency, Activity, and Bizocity. The Hancock code that implements these These signatures are computed daily from call phases is small: the smallest, Frequency, records. In this section, we discuss how these takes 40 lines of code; the largest, Bizocity, signatures use Hancock's data and control- takes 300 lines, more than 100 of which are ow mechanisms. for processing exchange events. The maps used by these signatures all have In all, these examples indicate that Han- index types of line t and value types that cock data descriptions are compact and that are records. Usage, Frequency, and Activity the event processing code is modest in size. use constant defaults. Bizocity uses a default Ideally, we would like to compare the Han- function that queries a secondary map, which cock implementations with hand-written C indicates whether the phone is a known resi- implementations. Unfortunately, this com- dence, a known business, or unknown. parison is very hard to do fairly. The only The records used in these signatures vary C implementation of Usage, Activity, and Bi- based on the application, but they all have zocity available to us is a program that com- a common form: the desired pro le contains bines the computation of all three signatures several elds that have the same underlying and has code to manage the on-disk represen- structure. To express this structure in Han- tations of the signature les embedded in it. cock, we use two records: one to describe the This program is about 1500 lines of code. basic elds and a second to group these elds into a pro le. Table 2 describes the basic elds for the 8 Design Process sample signatures and indicates how many such elds are contained in the pro le record. This section describes the process that we In all these examples, the approximation type used in the design of Hancock and discusses is a range that can be represented with a C the lessons we learned from our experience. char type. The signatures use di erent ap- Designing a domain speci c language involves proximation techinques. Bucketing divides developing a model of the domain and then the range of signature values into disjoint embodying that model in a language. The buckets and associates a default value with language should capture the commonalities of each such bucket. With this technique, freez- the domain e ectively and let the domain ex- ing converts a signature value into the con- perts worry about the details speci c to an taining bucket, whereas thawing returns the individual application. Figure 6 depicts the default value for a bucket. Bucketing can use process we followed; it can be viewed as an either xed-width or variable-width buckets. elaboration of the rst box of the FAST pro- Clamping converts values above the range of cess [GJKW97].
Table 2: Record structure for example signatures Signature Signature Approximation Approximation Number of type type method elds Usage int (0-15) variable-width buckets 4 Frequency double (0-255) xed-width buckets 2 Activity int (0-39) clamping 3 Bizocity char (0-15) xed-width buckets 2 First, we alternated talking with domain 8.1 Lessons Learned experts about sample signatures and con- structing models of the domain that captured Our goal in this project was to design and the common elements of these signatures. By implement a language that would be adopted iterating, we got feedback on the models and by signature researchers. While this project developed a common vocabulary. is not yet nished and so we cannot claim Once we had a good model for signatures, complete success at this time, we believe that we applied this model by hand to construct a we are on the right path because the signa- few sample applications. This experience led ture researchers have become advocates for us to replace our model with one based on Hancock. We believe that three things that more primitive abstractions because the orig- we did were responsible for our success and inal abstractions were too high-level to serve could be useful to others who want to design as the basis for a programming language. We domain speci c languages. eventually used these hand-written signatures First, we were open to constantly revising to establish that the code we expected to our models based on many di erent kinds of generate automatically would perform ade- input. Figure 6 highlights the many sources quately and to evaluate whether a library- of feedback that we employed in our design based solution was sucient. process. In addition to consulting with the After we had revised our model based on domain experts at many points in the pro- the hand-written signatures, we translated it cess, we also built artifacts at several inter- into an actual language design. This trans- mediate stages. These artifacts proved useful lation was not straightforward because al- even though they are not the nal product of though the model revealed what concepts our work. needed to be expressible in the language, it Second, we worked closely with domain ex- did not give much guidance as to how they perts and valued their input. This point de- should be expressed. Then we wrote sample serves elaboration, as it seems quite obvious applications using our design, evaluated the that language designers should heed their ex- resulting programs, and revised the design as perts. The fact that domain experts may appropriate. We found that we needed to it- not be expert programmers leads to a temp- erate through this process many times. We tation to dismiss their comments. Sample discussed these sample applications with the implementations they provide may be writ- domain experts to con rm that we had cap- ten badly. Consequently, it is easy to leap tured the essential elements of the domain. to the erroneous conclusion that they do not Finally, we implemented a compiler and know what they are doing. For example, sig- runtime system for Hancock and evaluated nature researchers had developed a map rep- the signatures written in Hancock. This eval- resentation that drew a lot of criticism from uation led us to re ne many aspects of the others outside the domain. When we investi- original design, including the stream model, gated their representation, we learned that al- the view operator, the sig main annotations though their implementation had some prob- new and exists, support for user-supplied lems, the basic structure provided very good compression functions, the record construct, random access time. Such access is crucial and the implementation of maps. to their clients, but it had been ignored by other onlookers. As language designers, we
experts had designed, we demonstrated that domain Hancock programs could manipulate their ex- isting data without diculty. Most impor- experts tantly, by consulting with the domain experts at every point, we improved the design and got them excited about using Hancock. develop models 9 Conclusions Hancock handles the scale of the data used in signature computations, thereby reducing write programs coding e ort and improving the clarity of sig- by hand nature code. The language, compiler, and a la model runtime system provide the sca olding neces- sary to compute signatures, leaving program- mers free to focus on the signatures them- selves. In particular, Hancock's wiring dia- gram makes the relationships among phases design language clear. Its event model allows programmers to specify the work needed to compute a sig- nature without having to write complicated control- ow code by hand. Finally, Han- cock's data model makes writing signatures less error-prone by handling multiple data write programs representations automatically. We plan to extend Hancock in two ways. First, we intend to enrich the set of opera- tions that Hancock provides for streams. For example, we plan to add a reduction oper- ation that would allow programmers to pre- implement system process streams to combine related records. Second, we intend to broaden the class of data that can be processed using Hancock, Figure 6: Design Process for example, to include Internet protocol logs or billing records. To accomplish this goal, we need to provide a mechanism for describ- had to be very careful to separate crucial do- ing data streams. Such a description must main constraints from irrelevant details in the include how such streams can be sorted and existing implementations. how to detect events based on the sorting or- Finally, gaining credibility with domain ex- der. perts was essential because they are the po- tential users for the language. By designing concrete models in response to our discus- 10 Acknowledgments sions with them, we established that we were serious about helping them and that we un- We would like to thank Corinna Cortes and derstood their domain. It also gave us a fo- Daryl Pregibon for their help in understand- cus for our discussions. By handwriting sig- ing the signature domain, Glenn Fowler, John natures in C, we established that we could Linderman, and Phong Vo for their assistance satisfy their performance requirements. By with the run-time system, Nevin Heintze and choosing C as the basis for Hancock, we kept Dino Oliva for their help in using ckit, Dan Hancock close to their usual programming Suciu and Mary Fernandez for discussions environment. By using signature represen- which helped re ne the stream model, and tations consistent with the ones the domain John Reppy for producing the pictures.
References '97 Conference on Domain-Speci c Lan- guages, 1997. [ABB+ 97] Atkins, D., T. Ball, M. Benedikt, G. Bruns, K. Cox, P. Mataga, and K. Re- [Vo96] Vo, K.-P. Vmalloc: A general and ef- hor. Experience with a domain speci c cient memory allocator. Software| language for form-based services. In Pro- Practice and Experience, 26, 1996, pp. ceedings of the USENIX '97 Conference on Domain-Speci c Languages, 1997. 1{18. + [CDR 97] Chandra, S., M. Dahlin, B. Richards, R. Y. Wang, T. E. Anderson, and J. R. Larus. Experience with a language for A The Usage signature writing coherence protocols. In Proceed- ings of the USENIX '97 Conference on #define NUMBINS 16 Domain-Speci c Languages, 1997. int bucketToSec[NUMBINS]= { ... }; [CP98] Cortes, C. and D. Pregibon. Giga min- char secToBucket(int v) { ... }; ing. In Proceedings of the Fourth Inter- national Conference on Knowledge Dis- record uField(ufSig, ufApprox) { covery and Data Mining, 1998. int char; [CRL96] Chandra, S., B. Richards, and J. R. ufSig(b) = bucketToSec[b]; Larus. Teapot: Language support for ufApprox(s) = secToBucket(s); writing memory coherence protocols. In } Proceedings of the SIGPLAN '96 Confer- ence on Programming Language Design record uLine(uSig, uApprox) { and Implementation (PLDI), 1996. uField in; [Ell97] Elliott, C. Modeling interactive 3D and uField out; multimedia animation with an embedded uField outTF; language. In Proceedings of the USENIX uField outIntl; '97 Conference on Domain-Speci c Lan- } guages, 1997. [GJKW97] Gupta, N. K., L. J. Jagadeesan, E.. E. map uMap { Koutso os, and D. M. Weiss. Auditdraw: key line_t; Generating audits the fast way. In Pro- value uApprox; ceedings of the Third IEEE Symposium default {0,0,0}; on Requirements Engineering, 1997. } [KV91] Korn, D. G. and K.-P. Vo. SFIO: Safe/fast string/ le IO. In Proc. of the Summer '91 Usenix Conference. USENIX, 1991, pp. 235{256. #define LAMBDA .15 [Lin99] Linderman, J. Msort. Private communi- #define blend(new, old) \ cation, 1999. (((new) * LAMBDA) + \ ((old)*(1 - LAMBDA))) [MMB92] McIlroy, M. D., P. M. McIlroy, and K. Bostic. Engineering radix sort. Technical Memorandum 11260-920902- 23TMS, AT&T Bell Labs, Murray Hill, NJ, September 1992. #include calls.h [SCHO99] Si , M., S. Chandra, N. Heintze, and int getvalidcall(PCallRec_t *pc, D. Oliva. Pre-release of C-frontend callRec_t *c){...} library for SML/NJ. See http://cm.bell- stream callStream labs.com/cm/cs/what/smlnj/index.html., {getvalidcall: pCallRec_t => 1999. callRec_t} [SF97] Stevenson, D. E. and M. M. Fleck. Pro- gramming language support for digitized char noIntlOrIncomplete(callRec_t *c) { images or, The monsters in the closet. return !(c->isIncomplete) && In Proceedings of the USENIX '97 Con- !(c->isIntl); ference on Domain-Speci c Languages, } 1997. [TMC97] Thibault, S., R. Marlet, and C. Consel. char noIncomplete(callRec_t *c) { A domain speci c language for video de- return !(c->isIncomplete); vice drivers: From design to implemen- } tation. In Proceedings of the USENIX
phase out(callStream calls, phase in(callStream calls, uMap usage) { uMap usage){ iterate iterate over calls over calls sortedby origin sortedby dialed filteredby noIncomplete filteredby noIntlOrIncomplete withevents line, call; withevents line, call; event line(line_t pn) { event line(line_t pn) { uSig cumSec; uSig cumSec; begin { begin { cumSec.outTF = 0; cumSec.in = 0; cumSec.outIntl = 0; } cumSec.out = 0; } end { uSig us = usage$uSig; end { us.in = uSig us = usage$uSig; blend(cumSec.in, us.in); us.outTF = usage = us$uApprox; blend(cumSec.outTF, us.outTF); } us.outIntl = } blend(cumSec.outIntl, us.outIntl); us.out = event call(callRec_t c) { blend(cumSec.out, us.out); uSig line::cumSec; usage = us$uApprox; } cumSec.in += c.duration; } } /* end call event */ }/* end in phase */ event call(callRec_t c) { uSig line::cumSec; if (c.isTollFree) cumSec.outTF += c.duration; else if (c.isIntl) void sig_main( cumSec.outIntl += c.duration; const callStream calls ) { else exists const uMap y_usage , cumSec.out += c.duration; new uMap usage } /* end call event */ }/* end out phase */ usage :=: y_usage; out(calls, usage); in(calls, usage); }
You can also read