StaXML: Static Typing of XML Document Fragments for Imperative Web Scripting Languages
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
StaXML: Static Typing of XML Document Fragments for Imperative Web Scripting Languages ∗ Adam D. Bradley, Assaf J. Kfoury, and Azer Bestavros {artdodge,kfoury,best}@cs.bu.edu BUCS-TR-2004-xxx Abstract It is important to ensure that programs which are in- tended to produce correct XML indeed do so. Some We present a type system, S TA XML, which employs projects have done this using domain-specific languages the stacked type syntax to represent essential aspects of which transform valid XML into valid XML [9, 5]. For the potential roles of XML fragments to the structure of imperative languages, there have been two generations complete XML documents. The simplest application of of techniques for static verification: The first used spe- this system is to enforce well-formedness upon the con- cial template data types [8] which are populated in very struction of XML documents without requiring the use constrained ways with non-markup text; the second em- of templates or balanced “gap plugging” operators; this ployed a more lenient system in which typed gaps can allows it to be applied to programs written according to be plugged using fragments of correctly balanced-and- common imperative web scripting idioms, particularly nested XML (which may themselves contain additional the “echo”ing of unbalanced XML fragments to an out- gaps) [10, 2]. Work in functional languages has gener- put buffer. The system can be extended to verify partic- ally centered around manipulating an underlying typed ular XML applications such as XHTML and identifying tree data model which is then serialized as XML by pro- individual XML tags constructed from their lexical com- cesses beyond the reach of the programmer [16, 12, 13]; ponents. We also present S TA XML for PHP, a proto- some work in imperative languages has also followed type precompiler for the PHP4 scripting language which this model [6]. infers StaXML types for expressions without assistance The template approach restricts program output to the from the programmer. point of disallowing useful structures like recursive nest- ing with dynamic depth. While the gap-plugging ap- 1 Introduction proach is far more lenient, it also imposes upon the pro- grammer a strong parallelism between the structure of The XML textual media type [15] has garnered tremen- program’s codepaths and the tree structure of the XML dous attention from both the technical press and the sci- being produced; in effect, a subtree must be delineated at entific community. XML textually encodes structured a single point in the program, not with separate “begin” data; XML applications have been formulated which re- and “end” points. In practice, this means the program- flect a number of useful types of data, from traditional mer is forced to adopt a highly functional programming Internet objects (Web page markup, email messages) to style, in which she is in effect grafting together whole programming languages [9] to privacy contracts (P3P). trees rather than progressively writing the serial content XML is easily parsed and manipulated textual encod- of a document (which is the normal and natural impera- ing of tree-structured data. A well-formed XML doc- tive web scripting style). ument consists of an interleaving of text and markup; This paper proposes the S TA XML family of type sys- markup is structural meta-data encoded as tags, strings tems which employ the stacked type syntax to broaden which mark the beginning or end of embedded substruc- the range of XML-constructing imperative programs tures. To ensure the tree structure is unambiguous, tags which can be statically type-checked to include pro- must be balanced (each start-tag must be followed by an grams written in this idiom. Rather than using spe- end-tag) and nested (an end-tag must correspond with cialized gap plugging operators, our approach uses only the last unmatched start-tag preceding it). Thus, an typed tokenized strings and a special polymorphically XML parser can always reconstruct the data structure typed form of concatenation which tracks and reconciles encoded by an XML document, even if it does not un- structural imbalances within XML fragment strings. derstand the particular semantics of that structure. This type system is a component of i B ENCH (the In- ∗ This research was supported in part by the NSF (awards ternet Programming Workbench), a broad research ini- ANI-9986397, ANI-0095988, CCR-9988529, ITR-0113193, ANI- tiative in our department seeking to integrate useful re- 0205294, and EIA-0202067). sults from the programming language and internetwork-
if ($story) { τ1 header($story); τ2 slashDisplay($story, $next, $prev); $d = $db->getDiscussion($story->sid); τ3 printComments($d); τ4 } else { $message = story_not_found(); header($message); print $message; } footer(); Figure 2: Conceptual Diagram of [τ 1 [τ2 [τ3 [τ4 []]]]] Figure 1: Distilled fragment of article.pl from the Slash-2.2.6 source distribution document ::= Chardata? (element Chardata?)∗ 8 > Text containing no XML tags > > > > (STag, ETag, EmptyTag) > > or partial tags (unmatched or ing communities. The i B ENCH Initiative is particularly < Chardata ::= nested “” characters); focused upon the elevation of useful network constructs > > > Includes comments, processing > to first-class programming language citizens and the de- > > > : instructions, and unparsed velopment of accompanying type systems to enforce data blocks (CDSects). correctness and safety of operations upon and compo- element ::= STag document ETag | EmptyTag sitions of those constructs. STag ::= , , ..., , ..., and their parameterized forms This paper is organized as follows: Section 2 presents (e.g.:) the basics of the stacked type syntax. In Section 3, EmptyTag ::= , , ..., we present a simplified form of our system which illus- trates its premise for a simple XML-like markup lan- and their parameterized forms guage. Section 4 presents the complete S TA XML sys- ETag ::= , , ... tem; Section 5 then outlines extensions which allow the Bad ::= Text containing unmatched or system to recognize particular XML applications (such nested “” characters. as XHTML) and to more aggressively identify XML tags constructed from their lexical components. Section 6 sketches an implementation of the S TA XML system Figure 3: Grammar for S TA XML S ’s Pseudo-XML in a pre-compiler for the PHP4 scripting language, and Section 7 offers concluding remarks and future direc- tions for this work. a type syntax supporting higher-order functional types: τ ::= b|τ →τ 2 Stacked Type Syntax b ::= v|σ v ::= int | bool | real | string | ... The stacked type syntax is a notationally elegant way to σ ::= [] | [s σ] structure an infinite type space: rather than represent- ing the type of an expression with a single symbol (e.g., s ::= stackel1 | stackel2 | ... int or bool), the stacked type syntax expresses the type of an expression with a stack of such symbols. Con- Intuitively, a stacked type (σ) is a LIFO list of stack- ceptually, the element at the head of the stack usually able type elements (s), terminated by the empty-stack describes the most immediate type information needed symbol ([]). There is no reason in principle why ele- in order to interpret or operate upon the value of the ex- ments in the stack can’t be drawn from the set of flat pression, while elements further down the stack repre- primitive types (v). sent latent information about what can be done with the The particular semantics of stackel1, stackel2, etc, and value once the elements above them have been properly of the stacking itself, are defined by the primitive con- interpreted. This is illustrated by Figure 2; the value of stants and operators which can produce them; the syn- type τ2 can only be used if the value of type τ 1 can be tax is generic with respect to semantics. S TA XML is properly interpreted so as to “reach” it, and so on. one particular type system with corresponding seman- The stacked type syntax is a straightforward extension tics which is built using the stacked type syntax; other to most common syntaxes. Consider this toy example of type systems use the syntax in distinct ways [3]. 2
XmlTag: These correspond with the names of XML x ∈ XmlTag ::= a | b | ... | aa | ab | ... | tags; e.g., the name of the “” tag is body. StackEntries: These correspond with the simple XML- ba | bb | ... | aaa | ... oriented tokenization of strings described above: v ∈ StackEntries ::= STag-x | ETag-x | STag-x: For any XmlTag x, STag-x represents EmptyTag-x | Chardata | the set of strings of the form “”, generalstring where “(...)” can be any set of whitespace- σ ∈ Stacks ::= [] | [v σ] separated parameters. ETag-x: For any XmlTag x, ETag-x represents the string “” and its whitespace variants. Figure 4: Syntax of S TA XML S Types EmptyTag-x: For any XmlTag x, EmptyTag-x represents the set of strings of the form 3 Simplified S TA XML: S TA XML S “”, where “(...)” is as above. This section examines a technical precursor to the full Chardata: The set of all strings which do not in- S TA XML system called S TA XML S . This simplified clude tags or potential partial tags, as de- system allows strings to be typed by their conformance scribed for Chardata in Figure 3. to a simplified Pseudo-XML language; it illustrates the generalstring: The set of strings including poten- premises of S TA XML and its use of the stacked type tial incomplete tags, as described by the Bad syntax in an intuitively clear way not clouded by the production in Figure 3. We will also use this finer details of the XML syntax which must be captured entry to denote strings which cannot be a part by more precise XML type systems. of a well-formed Pseudo-XML document. Stacks: These represent strings of tokens. 3.1 Pseudo-XML []: The set of strings described by the document The S TA XML S system tries to determine whether string production in Figure 3; strings in which all expressions conform to the grammar in Figure 3. This tags are properly balanced and nested. This grammar describes the same language as the content includes strings the empty string. production in the full XML specification [4]. [v σ]: The set of all strings consisting of the con- Note that the well-behaved members of the gram- catenation of a string belonging to the set de- mar (Chardata, STag, EmptyTag, ETag) and the set of scribed by type σ, a balanced fragment (type all concatenations thereof do not exhaust the set of all []), a single token of type v, and another bal- strings. We therefore include the Bad terminal, which anced fragment (type []). describes strings which do not correspond to any termi- nal or token in our grammar. Whether the concatena- Figure 5: Semantics of the Types in Figure 4 tion of such strings may constitute a valid XML token (e.g., concatenating “” produces an STag) is beyond the power of S TA XML S Two examples of two strings (bracketed to the left) (see Section 5.2). and their S TA XML S types are presented in Figure 6. For the upper string, the type [STag-head [STag-html []]] 3.2 Syntax and Semantics of S TA XML S indicates that it consists of an “” tag and Any string can be unambiguously tokenized using a a “” with balanced XML fragments before, longest-match lexer and the Chardata, STag, EmptyTag, between, and after them (which are empty, empty, ETag, and Bad descriptions from Figure 3. Each token and “text”, respectively); in the is assigned a type value which is a valid stack element; lower string, the type [STag-p [STag-body []]] functions Chardata tokens have type chardata, STags with the tag similarly, describing a “” and “” with sur- name “x” have types STag-x, and likewise ETags and rounding and interposed balanced fragments (empty, EmptyTags. The syntax and semantics of S TA XML S empty, and “”, respectively). types are given in Figures 4 and 5, respectively. A string is simply a sequence of zero or more typed 3.3 The concatT operator tokens ti , each with type vi : For shorthand, we define a supertype, token, as all STag- x, ETag-x, EmptyTag-x, Chardata, and generalstring, t1 : v1 , t2 : v2 , . . . , tn : vn . and the type string as equivalent to the union of all Every such string is characterized by a single S TA XML S Stacks types. type (σ) which is a stack of such token types (v i s) repre- The type of a whole string is found by building it from senting the unbalanced structure of the string. its constituent tokens using a typed token-wise concate- 3
typeofconcatT (σ, Chardata) =σ (1) typeofconcatT (σ, EmptyTag-x) =σ (2) typeofconcatT (σ, STag-x) =[STag-x σ] (3) text typeofconcatT ([STag-x σ], ETag-x) =σ (4) substring type: [STag−head [STag−html []]] typeofconcatT ([], ETag-x) =[ETag-x []] (5) typeofconcatT ([ETag-x σ], ETag-y) =[ETag-y [ETag-x σ]] (6) typeofconcatT ([STag-x σ], ETag-y) =E where x = y (7) typeofconcatT (σ, generalstring) =E (8) substring type: [STag−p [STag−body []]] for all σ ∈ Stacks and all x, y ∈ XmlTag. text Figure 7: Definition of typeof concatT function is defined for typeof concatT ; however, all such types are Figure 6: Strings and Their S TA XML S Types values which will never be produced by the typeof concatT function, and therefore will not arise in the S TA XML S system. Because only rules 3, 5 and 6 ever shift ele- nation operator, concat T . This operator takes two typed ments onto stacks, a tighter statement of the values of σ arguments: a string (a list of zero or more tokens, whose for which typeof concatT must be defined is: type is a Stack) and a single token (whose type is a Stack- Entry), and returns a typed string in which the token is σ ∈ AppearingSTs ::= startstack (9) appended to the previous list (and whose type is another startstack ::= [STag-x startstack] | endstack Stack). concat T ’s type can be stated most generally as: endstack ::= [ETag-x endstack] | [] string × token → string Clearly, the definition of typeof concatT in Figure 7 ex- The S TA XML S system overloads the return type of hausts this set. concatT to provide a much more precise type for the re- 3.3.1 Type Errors turned value which depends upon the precise types of the two operands. The concat T operator’s type is defined by Rules 7 and 8 give rise to type errors (E), which are anal- a function on types, typeof concatT , which has the type: ogous to parsing errors for an XML parser trying to con- sume strings described by those type stacks; these cases Stacks × StackEntries → Stacks represent concatenations whose results cannot safely be used as part of any well-formed XML document. In other words, the type of the concat T function is the When building a practical S TA XML compiler there result of the typeof concatT function as follows: exists no reason such events must correspond with com- pile errors per se; these are “errors” in the composition concatT (s : σ, t : v) : typeofconcatT (σ, v) of strings which may be intended to constitute XML or We define typeofconcatT using the eight mutually exclu- may not; true (“hard”) type errors arise only if the pro- sive and exhaustive argument pairs in Figure 7. Intu- grammer attempts to use data of this type in a context itively, the value of typeof concatT is the result of pushing for which it is unacceptable (e.g., a function accepting the second argument (a StackEntry) onto the head of the only strings of type []). In common practice, we antici- first argument (a stack of StackEntries) and applying re- pate that the E cases in typeof concatT will be handled as a ductions (in the bottom-up parsing sense) which corre- warning with a twofold action: spond to the document and element productions stated 1. Since the structural problem giving rise to the er- in Figure 3. In essence, the typeof concatT function simu- ror is fairly self-evident from the argument pair lates an iteration of an LL(0) bottom-up parser, in which to typeofconcatT , the programmer is notified that an a single symbol is examined and either shifted onto the apparent illegal concatenation was performed and stack (e.g., Equation 3) or reduced (e.g., Equation 1). provided with the types and the location within While the absence of conflicts is obvious, that our the program where the erroneous concatenation ap- typeofconcatT definition is also exhaustive of the input pears. types (Stacks and StackEntries) may not be immediately 2. The expression is assigned a type which indicates clear. There are members of Stacks for which no value it can not be part of a well-formed XML document 4
(e.g., generalstring), or alternately, the problematic (1). Chardata is absorbed by either the leading Char- sequence could simply be left in place in the stack Data? or the trailing CharData? of the document to accomplish the same end. The latter also allows production; this holds because a single CharData the compiler to “limp forward” and attempt to iden- can capture an arbitrarily long sequence of Char- tify type errors in other expressions derived from data tokens. Therefore: the operation’s result. concatT (s : [], r : Chardata) ∈ document. A more resourceful compiler-writer could, at least in some of the more commonly-occurring cases, infer the (2). EmptyTag-x is absorbed by document as an element intended correct XML structure from the errant se- (one form of element is EmptyTag). Therefore: quence of StackEntries and inject additional code (or at concatT (s : [], (r : EmptyTag-x) ∈ document. least recommend it to the user, complete with line num- bers) which could remedy the imbalances. For example, (3, 4). STag-x and ETag-x tokens are absorbed by STag this type: and ETag, respectively, within element; this re- quires that the concatenation of all intermediate to- [ETag-p [STag-i [STag-p []]]] kens match document, i.e., have the type [], which is the case. Once an expression’s type has become a represents a common HTML mistake where an ap- non-empty stack via the application of 3, no con- propriate corrective measure (add an “” before catenation other than 4 can return the type to []. the “”) is obvious (although the proper place- Therefore: ment within the text between “” and ” is not; concatT ( hence, such a fix-up must not be done silently). Sim- concatT ( ilarly, HTML programmers making the transition to XHTML often do not properly close singleton tags like concatT (s : [], r : STag-x), “” or “”; if the compiler knows the par- s : []), ticular DTD or schema which the document is expected r : ETag-x) ∈ document. to conform to, such cases can also be easily identified and mechanically corrected. (5, 6). Neither of these signatures can ever give rise to a [] type, or to a type which is usable by subsequent 3.3.2 Correctness of concatT concatT s, to produce a result type of [], so these It is the definition of types with the form [v σ] that cases are irrelevant to the proof. supports the stack-reducing rules (1, 2, and 4) giving (7, 8). Neither of these signatures can ever give rise to a our type system sufficient expressive power to com- [] type, or to a type which is usable by subsequent pactly capture XML-like well-formedness while remain- concatT s, to produce a result type of [], so these ing compact and computationally efficient; the sys- cases are irrelevant to the proof. tem simply does not need to represent balanced sub- Whereas the empty string has type [] and conforms to fragments explicitly in the type of the expression be- document, and all concatenations which can give rise to cause they are fully represented by the implicit [] types strings of type [] correspond with strings also conform- within the stack. ing to document, all expressions of type [] conform to document. Theorem 3.1. In the S TA XML S type system, expres- sions of type [] constructed using concat T would be rec- The converse is not necessarily true because the gram- ognized by a parser as matching the “document” pro- mar (Figure 3) does not force STag and ETag to have duction in Figure 3. matching names, where S TA XML S does. Thus, some strings conform to the document production but are not Proof. We will refer to the language described by a pro- well-formed XML and thus have the type E. duction using that production’s name. 3.4 Generalized Concatenation We prove the parallelism between such a parser and A language with only the concat T operator to join strings the result of our type system by induction on the struc- is cumbersome to work with in practice. Consider con- ture of concatenation. The empty string by definition catenating the string “” with a second has type [] and conforms trivially to the document pro- string “”. The concat T operator al- duction. Now consider each of the cases in figure 7. As- lows us to construct each of these two strings individu- suming an initial empty string s with type [] (conforms ally, having the types to document) and a token argument r of the appropriate type: [STag-body [STag-html []]] 5
Normalization with rules 12, 14, and 15 is conflu- σ [ETag-x [STag-x σ ]] ⇒ σ σ (12) ent per Newman’s Lemma [1] because the rewrite sys- σ [ETag-y [STag-x σ ]] ⇒ E where x = y (13) tem will terminate (all rules are length-decreasing) and σ [EmptyTag-x σ ] ⇒ σ σ (14) each of the rules is locally confluent (the rules are non- σ [Chardata σ ] ⇒ σ σ (15) interfering). Because normalization is terminating, the result is canonical; for any given σ 1 and σ2 , there is σ [generalstring σ ] ⇒ E (16) exactly one value typeof concat (σ1 , σ2 ), i.e., concat(s 1 : for all x, y ∈ XmlTag and all σ, σ ∈ Stacks σ1 , s2 : σ : 2) has exactly one S TA XML S type. Normal- ization is performed immediately whenever inferring a S TA XML S type; only normal forms are ever used as the Figure 8: Definition of reduce for S TA XML types of expressions. Rules 13 and 16 identify “errors” in the same sense and as in concatT ; the compiler can terminate, or can re- [ETag-html [ETag-body []]], duce the offending patterns to generalstring, or can no- tify the user and simply leave the offending pattern in respectively, but will not allow us to then concatenate place in the stack, or could even (in some limited cases) the two already-assembled strings. attempt to “repair” the structure of the stack by in- We therefore define the general (less restrictively serting appropriately-placed additional string concatena- typed) concatenation operator, concat, which operates tions into the program. upon two typed strings as follows. In general terms, the type of the operator concat is: Agreement of concat with concat T The two concate- nation operators, concat T and concat, both recognize string × string → string the same sets of strings as having the same S TA XML S types, i.e., any string s constructed token-wise using The S TA XML S type of the resulting string is computed concatT will be judged to have the same type as if it had by the function typeof concat which takes as arguments two been constructed from other previously-concatenated members of Stacks and returns a member of Stacked strings using concat. Types: Theorem 3.2. Consider strings s1 : σ1 and s2 : σ2 , concat(s : σ, s : σ ) : typeofconcat(σ, σ ) where s2 is a list of typed tokens where the function typeof concat is defined as follows: t1 : v1 , . . . , tn : vn and σ2 is typeofconcat (σ, σ ) = fix(reduce, σ σ) (10) typeofconcatT (· · · (typeofconcatT (σ1 , v1 ), v2 ) · · · , vn ) . and σ σ is defined as: Token-wise concatenation (using concat T ) and general σ if σ = [] concatenation (using concat) of the two strings produce σσ= σ if σ = [] and σ = [] (11) identical results with identical types, i.e., [v σ σ] if σ = [] and σ = [v σ ] concat(s1 , s2 ) = The function reduce is a stack rewriting function which concatT (· · · concatT (concatT (s1 , t1 ), t2 ), · · · tn ) searches the stack for innermost sequences of tokens and corresponding with Pseudo-XML productions (Figure 3) and reduces (rewrites) them accordingly. Taking the typeofconcat (σ1 , σ2 ) = fixed-point of reduce ensures that all such patterns are typeofconcatT (· · · removed from the type of a concat result. This effec- typeofconcatT ( tively allows an arbitrary number of paired tags between typeofconcatT (σ1 , v1 ), v2 ) · · · , vn ) . σ and σ to be resolved by a single operation. We will state the function reduce in terms of type nor- malizations, rules which rewrite a stack into a normal form. Normal forms for S TA XML S types are precisely Proof. That the values are the same follows trivially the subset of Stacks which can be constructed using the from the definitions of the two operators. typeofconcatT operator (stated in Equation 9). The reduce In effect, both typeof concatT and typeofconcat function is defined by Figure 8. act as rewrite systems upon the type value 6
document::= prolog element miscs meanings to many types in order to preserve useful dis- prolog::= tinctions. We begin by defining the tokens which must XMLDecl miscs DocTypeDecl miscs | XMLDecl miscs | be recognized in static strings (precisely defined in [4]): miscs DocTypeDecl miscs | XMLDecl Any “” string. miscs miscs: misc miscs | /* nil */ DocTypeDecl Any “” string. misc: Comment | PI | S misc Any combination of XML comments (“”), PIs (“”), EmptyTag | and whitespace (spaces, tabs, end-of-line charac- STag content ETag content: ters, etc). Chardata morecontent | morecontent STag-x Any XML start-tag with the name x. morecontent: ETag-x Any XML end-tag with the name x. fillcontent Chardata morecontent | EmptyTag-x Any self-contained XML tag with the fillcontent morecontent | /* nil */ fillcontent: element | CDSect | misc name x. Chardata Any combination of plain text and unparsed data blocks (CDSects) of the form “”. S Any whitespace sequence (spaces, tabs, end-of-line [vn [ · · · [v1 σ1 ] · · · ]]; this can be shown from the characters, etc). The lexer should give precedence structure of the two functions and the correspondence to assigning strings this type over misc. between their constituent rules: 1 with 15, 2 with 14, 3 generalstring Any string other string (e.g., partial tags, with 11, 4 with 12, 5 with 11, 6 with 11, 7 with 13, and unopened/unclosed comments, etc). 8 with 16. [] The empty string. The only difference between the two as rewrite sys- As before, stacking represents concatenation; e.g., a tems is that typeofconcatT prescribes a particular strategy misc string concatenated with a Chardata string has the for the order in which rewrites are performed (first try basic type [Chardata [misc []]]. We have discarded the to rewrite [v1 σ1 ], then prepend the result thereof with S TA XML S definition of [] as “a valid block of balanced v2 and try again, etc) whereas typeof concat does not. The pseudo-XML”, and as such, there is no longer an im- two systems are therefore equivalent if all of the rewrites plicit block of balanced XML between elements of the described are invariant to the order in which they are stack. Instead, we progressively reduce stack elements applied. This follows directly from their mutual non- and sequences of stack elements to other elements repre- interference, which is evident in the structure of the rule senting productions in the conflict-free grammar (Figure subjects. 9), specifically: prolog Any valid XML prolog (the prolog produc- 4 Enforcing Full XML Well-Formedness tion). The S TA XML system we present in this section en- xmldec Any XMLDecl followed by zero or more miscs. forces many more of the basic grammatic correctness dtd Any DocTypeDecl preceded and followed by zero and well-formedness requirements of the XML specifi- or more miscs. cation [4] by more selectively reducing elements from element Any complete XML element (the element the stack. We will express this system entirely in terms production) preceded and followed by zero or more of the general concatenation operator and type normal- miscs. izations. content Any combination of Chardatas, miscs, and ele- We begin by considering the document and related ments. productions from the XML specification. Since conflict- document Any valid XML document (the document free grammars translate more easily into normalizations, production). we have translated the document EBNF into a conflict- Thus, the general syntax of S TA XML types is defined free BNF specification, shown in Figure 9. as in Figure 10. The types representable in this syntax Clearly, a more involved system than S TA XML S is must then be normalized using the rules of Figure 11. needed to handle this language; we must ensure that Perhaps counterintuitively, we never normalize any the document root consists of precisely one element (the stacks to the document type. While such an eager nor- element in the document production), and we must malization strategy works for left-to-right parsers (like allow for a variety of ways of constructing the document the concatT operator), in the presence of arbitrarily- prolog. To retain sufficient information in S TA XML ordered concatenation it does not preserve enough infor- types for these purposes, we must give more narrow mation for us to correctly identify late prefixing of cer- 7
x, y ∈ XmlTag ::= a | b | ... | aa | ab | ... | σ [XMLDecl σ ] ⇒ σ [xmldec σ ] (17) ba | bb | ... | aaa | ... σ [misc [xmldec σ ]] ⇒ σ [xmldec σ ] (18) v, w ∈ StackEntries ::= XMLDecl | DocTypeDecl | σ [S [xmldec σ ]] ⇒ σ [xmldec σ ] (19) misc | STag-x | ETag-x | σ [DocTypeDecl σ ] ⇒ σ [dtd σ ] (20) EmptyTag-x | Chardata | σ [misc [dtd σ ]] ⇒ σ [dtd σ ] (21) generalstring |prolog | σ [S [dtd σ ]] ⇒ σ [dtd σ ] (22) xmldec | dtd | element | σ [dtd [misc σ ]] ⇒ σ [dtd σ ] (23) content | document σ [dtd [S σ ]] ⇒ σ [dtd σ ] (24) σ ∈ Stacks ::= [] | [v σ] σ [dtd [xmldec σ ]] ⇒ σ [prolog σ ] (25) σ [misc [prolog σ ]] ⇒ σ [prolog σ ] (26) σ [S [prolog σ ]] ⇒ σ [prolog σ ] (27) Figure 10: S TA XML Syntax of Types σ [misc [element σ ]] ⇒ σ [element σ ] (28) σ [S [element σ ]] ⇒ σ [element σ ] (29) tain tokens to prologs. For example, a string of the type σ [element [misc σ ]] ⇒ σ [element σ ] (30) [element [dtd []]] is a document. If we later prepend this σ [element [S σ ]] ⇒ σ [element σ ] (31) string with an xmldec, it is still a document; however, a σ [Chardata σ ] ⇒ σ [content σ ] (32) document cannot be prepended by a xmldec because it σ [element [content σ ]] ⇒ σ [content σ ] (33) may already begin with one. To work around this prob- σ [misc [content σ ]] ⇒ σ [content σ ] (34) lem, we define document explicitly as the union of all σ [S [content σ ]] ⇒ σ [content σ ] (35) normalized S TA XML types which describe valid docu- σ [content [misc σ ]] ⇒ σ [content σ ] (36) ments, presented in Figure 12. Intuitively, an expression σ [content [S σ ]] ⇒ σ [content σ ] (37) with any of these four S TA XML types can be safely pro- σ [content [element σ ]] ⇒ σ [content σ ] (38) moted to type [document []]. σ [misc [misc σ ]] ⇒ σ [misc σ ] (39) This formulation supports a great deal of generality; σ [S [misc σ ]] ⇒ σ [misc σ ] (40) for example, if a string may be either a xmldec or a σ [misc [S σ ]] ⇒ σ [misc σ ] (41) dtd, it can still be prepended to an element to produce σ [misc [document σ ]] ⇒ σ [document σ ] (42) a document. For this reason, we define a set of super- σ [S [document σ ]] ⇒ σ [document σ ] (43) type relations (in addition to the document definition) σ [element [STag-x σ ]] ⇒ σ [STag-x σ ] (44) which allow such strings to preserve a sufficiently de- σ [content [STag-x σ ]] ⇒ σ [STag-x σ ] (45) scriptive type. These relationships are represented in σ [misc [STag-x σ ]] ⇒ σ [STag-x σ ] (46) Figure 13, where arrows point from supertypes to sub- σ [S [STag-x σ ]] ⇒ σ [STag-x σ ] (47) types. While these relations are never applied in the σ [ETag-x [element σ ]] ⇒ σ [ETag-x σ ] (48) course of type normalization, they may be used by the σ [ETag-x [content σ ]] ⇒ σ [ETag-x σ ] (49) type inference algorithm when trying to reconcile mul- σ [ETag-x [misc σ ]] ⇒ σ [ETag-x σ ] (50) tiple flows of execution which produce expressions or σ [ETag-x [S σ ]] ⇒ σ [ETag-x σ ] (51) variables with differing types (as discussed in Section σ [ETag-x [STag-x σ ]] ⇒ σ [element σ ] (52) 6). Notice that this is not a strict hierarchy, but in- σ [EmptyTag-x σ ] ⇒ σ [element σ ] (53) cludes multiple supertypes for misc (and, transitively, S) σ [ETag-y [STag-x σ ]] ⇒ E (where x = y) (54) and element; this is not problematic, as our reconcilia- tion algorithm depends only upon finding least common σ [generalstring σ ] ⇒ E (55) supertypes, which is straightforward so long as the re- lationships follow a DAG. These inter-element subtype relations are then used to reconcile full S TA XML types Figure 11: Normalization of S TA XML types using an algorithm given fully in [3]. Intuitively, this al- gorithm examines two S TA XML types starting at their normalized, and the algorithm repeats until both stacks tails; where the corresponding StackEntries are the same are exhausted. or have a non-generalstring common supertype, that is pushed onto the result stack ρ. Failing that, the algo- 5 S TA XML Variants rithm attempts to reconcile either of the tail elements with a []-typed string, deferring the other tail element The same techniques employed to develop the S TA XML until the next iteration. Failing that, the generalstring el- system can be used to further refine it to identify more ement is pushed onto the result stack ρ. After consuming restrictive languages and to account for more general an element from the argument types, the result type ρ is techniques for constructing XML fragments. Below we 8
[ [document []] = [element [prolog []]] σ [element-li [element-li σ ]] ⇒ σ [element-li σ ] [ σ [ETag-ol [element-li [STag-ol σ ]]] ⇒ σ [element-ol σ ] [element [xmldec []]] σ [ETag-ol [STag-ol σ ]] ⇒ E [ [element [dtd []]] [element []] Figure 14: Type normalizations for ordinal lists This can be easily handled by excluding from the Figure 12: The [document []] Type list of normalizations rules which reduce STag-x, ETag-x, or EmptyTag-x for any unknown x. prolog content document Nesting For each tag name, there is specified a set of tag names which may appear as immediate children in mldec dtd misc element the document tree. This is handled by replacing the element type with element-x types for each tag name x, replac- S [] ing the generic element consumption rules (44, 48) with tag-specific consumption rules for only the al- Figure 13: Sub/Supertype relationships among lowed child tags, and doing away with the content S TA XML element types rules which subsume elements. Variations on this theme are needed to handle specific content mod- briefly outline S TA XHTML and S TA XML A , which re- els; for example, the ol tag must have one or more flect each of these two possibilities, respectively; com- li children, so normalizations for element-li and plete specifications are omitted for want of space, but element-ol are performed per Figure 14. can be found in [3]. Exclusions Certain tags must not contain content which includes certain other tags, at any depth of nesting. 5.1 S TA XHTML This is implemented by augmenting every stack Certain classes of XML schemata (descriptions of par- element type with a vector of five bits, one for each ticular XML-based languages) can be enforced by re- of the above exclusions. All STag-a, ETag-a, and placing the general-purpose S TA XML reduction rules EmptyTag-a elements have the first bit set, all start presented above with specialized reduction rules which and end tags of the excluded tags within “” only successfully normalize the XML structures which have the second set, and so on. Whenever a type legally fit within the schema’s described XML language. normalization is performed, the values of the bit As an example, we have defined a syntax of types and set field are propagated even when the particular Stack- of normalization rules called S TA XHTML which en- Entries are rewritten to some other StackEntries; in forces correct construction of XHTML/1.0 [14] docu- essence, once the “taint” of an “” tag is intro- ments. duced to a type, all normalizations which may re- The S TA XHTML application works in principle in move the explicit presence of an “” element very much the same way S TA XML does, applying very in the stack preserve this taint. This is compli- similar type normalization rules; the difference is, in- mented by the requirement that, when reducing a stead of defining rules for all x (for all possible tag start-end pair corresponding to any of these ex- names), we will define rules for particular values of clusions, the contents must not have the excluded x, tailoring the normalization behaviors to the particu- taint bit set; i.e., when reducing a stack of the form lar requirements and content models corresponding with σ [ETag-button σ [STag-button σ ]], the type each tag name. We will also support exclusions (rules σ must not have the third flag (corresponding with prohibiting particular elements from appearing as de- “” exclusions) set, otherwise a type er- scendants of others at any depth) by augmenting each ror (E) is raised. These checks are performed in type stack element with a “taint” vector which preserves parallel with the basic normalizations derived as per exclusion behavior without preventing otherwise valid the previous section. 1 stack-reducing normalizations. The particular requirements of XHTML/1.0 fall into 5.2 S TA XML A three categories: Commonly used languages (such as PHP) will not nec- Tags The set of allowable XML tags is a finite, defined essarily provide us with a static tokenization/typing of set; no unknown attributes may appear. text fragments which will be concatenated to form XML 9
1: echo "Class: "; 6 S TA XML for PHP 2: foreach($article_class as $ack=>$acr) { 3: echo "" . htmlentities($acr) . sary to support all of the S TA XML variants discussed in 8: ""; this paper. In spite of being a highly dynamic language, 9: } a meaningful and very useful subset of PHP4 is amiable A: echo "\n"; to static analysis. This section outlines some of the principles and de- Figure 15: Example PHP4 Code tails of our implementation, “S TA XML for PHP”. A web interface to the current version of our implemen- tation is available, and we expect to release the complete documents. As such, we present here an extension to source code (including both the web and command-line our type model (based upon the concept of a finite-state versions) in the near future. 2 lexer) for inferring the presence of such tags in modestly PHP is not a well formalized language, in the sense generalized string concatenation procedures. that there does not exist a public document which for- Consider, as a motivating example, the snippet of mally defines its syntax, grammar, and semantics; these PHP4 code in Figure 15, taken from an existing web ap- are described informally by the ubiquitously available plication. Lines 1 and 8 include string constants which PHP Manual [11], but the formal specification is to be can obviously be recognized by a static string lexer as had only by understanding the lex, yacc, and C code whole XML start, end, and empty tags. However, the of its implementation. Our implementation’s lexer and output of lines 3-7 also represents a valid start-end tag parser were derived directly from the php-4.3.0 source; pair; since the construction of the open tag spans mul- we used the remainder of the Zend engine code to guide tiple expressions (not to mention multiple paths of ex- our interpretation of the syntax and grammar and inform ecution), a simple static string lexer is not sufficient to a few of the basics of our type system. identify the presence of the STag-option substring type. Our inference algorithm annotates all expressions Based upon the previous sections, the reader should with a root type: UNSET, NULL, FLOAT, INTEGER, be able to imagine how a range of type systems could be BOOL, ARRAY, OBJECT, UNKNOWN, or STAXML formulated to detect such situations. In essence, all such (all “string” values are assigned STAXML types). Each systems will use the lexer to identify partial tags and use root type has its own set of supplementary data to inform normalization rules to reduce concatenations thereof to the type; for example, an OBJECT might include an in- the S TA XML tag types discussed above. One simple dication of the class from which the object is derived. approach adds three more string types to the lexer: The STAXML type is supplemented by the actual in- OpenTag-x: Any string of the form “”. ting a warnings; the generalstring type is also used when and adds a few straightforward normalization rules: an expression may be any of a set of types which cannot be reconciled with each other. There is a simple subtype/supertype hierarchy among σ [plaintext [OpenTag-x σ ]] ⇒ σ [OpenTag-x σ ] (56) the root types as well as among S TA XML type stacks. The numeric root types have the expected relationship: σ [RBracket [OpenTag-x σ ]] ⇒ σ [STag-x σ ] (57) σ [EmptyTagClose [OpenTag-x σ ]] BOOL
engine; this allows the user to provide a configuration {type_env Gamma2, staxml_type t} = ast_inference(Gamma, seq->cmd1); file which specifies the subtype relations and normaliza- return ast_inference(Gamma2, seq->cmd2); tions to apply while checking a program. This makes } experimenting with and debugging new S TA XML vari- ants much easier than it would be were each system to 6.2 Conditionals be hard-coded into the compiler. Control forks must also be dealt with. Consider the 6.1 Type Inference for PHP IFTHENELSE abstract syntax node, which has three children: a predicate expression, an “if true” expres- Type inference is performed by structural inference upon sion/command (possibly empty), and an “if false” ex- an abstract syntax tree (AST) constructed from PHP pro- pression/command (possibly empty). The implementa- gram code. An inference algorithm is defined for ev- tion of type inference for IFTHENELSE then has the ery kind of abstract syntax node; that algorithm takes form: as input the node itself and a type environment (a map from variable names to the types associated with those ast_ite_inference(type_env Gamma, ast_ite ite) { {type_env Gamma2, staxml_type t2} = variable names before the execution of this element of ast_inference(Gamma, ite->predicate); syntax), and returns a new type environment (the one in {type_env Gamma3, staxml_type t3} = ast_inference(Gamma2, ite->iftrue); effect when this element of syntax has completed exe- {type_env Gamma4, staxml_type t4} = cution) and a type for the value of the block itself (e.g., ast_inference(Gamma2, ite->iffalse); type_env Gamma5 = reconcile(Gamma3, Gamma4); when inferring the type of expressions). The inference staxml_type t5 = type_lcs(t3, t4); algorithms for each kind of AST node also determine } the type of the string (if any) that block would place into the output buffer via echos and appends it to a run- Two helper functions do most of the interesting work: ning accumulator of the output buffer’s S TA XML type. type lcs The LCS of some set of types is their least (The output buffer can be treated like any other variable, common supertype. Pairwise least common super- having its type stored in the type environment.) The al- types are determined using the algorithm discussed gorithms for several important pieces of PHP4 syntax above. Naturally, if the only reconciliation between are presented below; these and others are presented with disparate stacks is to use generalstring, the appro- pseudo-code in greater depth in [3]. priate warning/error should be emitted by the com- At a high level, this inference algorithm is run on piler. the node representing the outermost execution block of reconcile This operation applies the LCS opera- a PHP script (code not enclosed within a function or tion across a pair of type environments. It es- class declaration). When the inference algorithm has sentially iterates over the two environments, tak- completed for the top-level program node, its “echo ing the LCS of their corresponding types for each type” represents the total type of the output stream of named variable. If either environment lacks a type the program; it is trivial to check if this type is accept- for a given variable, the reconciled environment able (e.g., any form of [document []] in S TA XML, [] in uses the LCS of its type in the other environment S TA XML S ). and an appropriately-chosen empty-string type. In For example, the ECHO abstract syntax node rep- S TA XML S , this is []; in S TA XML, we choose [S []] resents a command to print some value to the output to allow it to be more easily reconciled with other stream. The inference algorithm for ECHO looks like non-empty strings. this (presented in pseudo-code with a C-like syntax): This handles simple either-or control forks; constructs ast_echo_inference(type_env Gamma, ast_echo echo) { like SWITCH blocks demand significantly more book- {type_env Gamma2, staxml_type t} = ast_inference(echo->value); keeping (particularly when dealing with BREAKs), but Gamma2.echo_type = in principle work on much the same premise: the end- type_concat(Gamma2.echo_type, t); return {Gamma2, type_unset()}; ing type environment must represent the reconciliation } of the type environment at all possible exit points after all possible execution paths which lead to them. Similarly, the SEQ abstract syntax node represents a se- quence of two commands or expressions; as in many 6.3 Loops imperative languages, such sequences can (under some Loops can be challenging for type inference algorithms, conditions) themselves be treated as values, with the lat- particularly when their control flows can be further mod- ter expression determining the value and its type. Thus, ified by exceptions like BREAK and CONTINUE com- type inference for SEQ looks like this: mands. Ignoring such exceptions for the moment, con- ast_seq_inference(type_env Gamma, ast_seq seq) { sider the WHILE abstract syntax node; it contains two 11
pieces of abstract syntax, the predicate and the loop Start body. The predicate is executed once, after which there may be zero or more iterations of executing the body fol- Predicate lowed by the predicate. Our inference algorithm should Exit be able to provide us with a type environment in which Body Body Body every symbol’s type is the “tightest” (least) supertype of (Continued) (Complete) (Broken) all types that variable could take on whenever the while loop might exit. Conceptually, we want to find: Figure 16: Program flow of the WHILE control con- struct reconcile(ast_inference(predicate), ast_inference(SEQ(predicate, body,predicate)), Therefore, the type environment can be changed at ast_inference(SEQ(predicate, most nd times, where n is the number of symbols and body,predicate, d is the depth of the type tree. body,predicate)), ast_inference(SEQ(predicate, body,predicate, Support for Break and Continue The break and con- body,predicate, tinue constructs are dealt with using a more involved body, predicate)), inference algorithm and a pair of environment accumu- ... ); lators for break and continue, respectively. In prin- ciple, the break environment accumulator contain the and so on, ad infinitum; clearly, however, we cannot reconciliation of the environments at the points of all explicitly infer and reconcile an infinite set of environ- BREAKs within a block of code, with the continue envi- ments in finite time (let alone tractable time). Instead, ronment accumulator behaving similarly for CONTIN- it can be demonstrated that iterating the inference algo- UEs. These values are stored in global variables by the rithm over a finite number of repetitions of the (body, inference algorithm; control structures which rely upon predicate) sequence is always sufficient to infer the type them examine and modify them to capture the desired environment after an unbounded number of iterations, behavior. Thus, BREAK’s inference algorithm is essen- where the number of iterations scales linearly by the to- tially: tal execution length of the (body, predicate) sequence. More precisely, we perform reconciliations between ast_break_inference(type_env Gamma, ast_break b) { iterations of the loop until the result of the reconciliation atbreak_gamma_accumulator = operator has reached a fixed point, the point at which the reconcile(atbreak_gamma_accumulator, type environment has become as general as it needs to Gamma); return {Gamma, type_unset()}; be to account for any number of executions of the loop. } Ignoring BREAK and CONTINUE, our implementation of inference for WHILE looks like this: and similarly for CONTINUE, where if either argument ast_while_simple_inference(type_env Gamma, to reconcile is NULL, it simply returns the other ar- ast_while w) { {Gamma1, t1} = gument. ast_inference(Gamma, w->predicate); Our full inference algorithm for the WHILE loop is do { GammaLast = Gamma1; based upon the set of control paths that may be taken {Gamma1,t1} = ast_inference(GammaLast, through it, illustrated in Figure 16. In essence, we are SEQ(w->body,w->predicate)); Gamma1 = reconcile(GammaLast, Gamma1); simply maintaining a local accumulator of every type } while (Gamma1 != GammaLast) environment which could exist at a point in execution return {Gamma1, type_unset()}; } corresponding to an edge to the “Exit” node. A full algorithm for the WHILE loop is given in Fig- Theorem 6.1. The fixed-point type inference algorithm ure 17. The actual algorithm implemented in our com- for S TA XML types terminates. piler is slightly more complicated in that it also accounts for multi-level BREAKs and CONTINUEs (a single Proof. A sketch of the complete proof, presented in [3], BREAK can be used to exit multiple layers of nested is as follows: WHILEs, for example). This extension is handled us- There are finitely many symbols in any finite block of ing stacks of atcontinue gamma accumulators syntax. and atbreak gamma accumulators in place of the For any given symbol, its type after iteration i of the preserve cga and preserve bga variables, and algorithm must be a supertype of its type after iteration by adding an appropriate number of iterations up the ac- i − 1 of the algorithm. cumulator stacks to the BREAK and CONTINUE infer- The sub/supertype tree has finite depth. ence algorithms. 12
ast_while_inference(type_env Gamma, ast_while w) { {while_gamma_accumulator,t1} = mation from the map (if that function with those partic- ast_inference(Gamma, w->predicate); ular argument types has been encountered before) or it preserve_cga = atcontinue_gamma_accumulator; preserve_bga = atbreak_gamma_accumulator; will compute it using syntax-directed inference upon the atcontinue_gamma_accumulator = NULL; function body. Handling reads from and writes to global atbreak_gamma_accumulator = NULL; do { variables would requires that we also index results based hold_wga = while_gamma_accumulator; upon the types of those global variables at call-time and hold_bga = atbreak_gamma_accumulator; {Gamma2, t2} = ast_inference( track their types at return time; this extension is not yet reconcile(hold_wga, included in our implementation, although some of nec- atcontinue_gamma_accumulator), w->body); essary code is already present in the implementation. {Gamma3, t3} = ast_inference(Gamma2, The returned types and type environments are deter- w->predicate); while_gamma_accumulator = mined by identifying all possible exit points from the reconcile(hold_wga, Gamma2, body of the function and taking the reconciliation of the atbreak_gamma_accumulator); } while ((while_gamma_accumulator == hold_wga) && types and type environments at each of those points, i.e., (atbreak_gamma_accumulator == hold_bga)); at every RETURN and at the end of the function body. atcontinue_gamma_accumulator = preserve_cga; atbreak_gamma_accumulator = preserve_bga; Only the RETURN statement can give the function a re- return {reconcile(while_gamma_accumulator, turn value, so if a function contains any RETURNs of atbreak_gamma_accumulator), type_unset()}; values and can also reach the end of its body without } RETURNing, a warning is thrown (as the function might return a typed value but also might not return anything Figure 17: S TA XML for PHP inference algorithm for at all). WHILE loops Given the inherent difficulty of type inference for re- cursive polymorphic functions [7], we replace inference in cases of recursion with validation of a typing assump- The algorithm for DO is similar; the principal differ- tion: if ever a function is found to be able to recursively ences are that DO executes the body before the predicate call itself (whether directly or indirectly), that function is evaluated and that CONTINUE causes the body of the must be “pure” (it must contain no accesses to global DO block to be re-started without the predicate being or static variables and must have no by-reference ar- re-evaluated. guments) and its echo and return types must both cor- Because FOR and FOREACH loops (the two re- respond with some pre-defined S TA XML type which maining loop constructs in PHP) are transformed into varies with the particular system; for S TA XML S , the de- WHILE loops in the abstract syntax, they are handled faults for both are []; in S TA XML, the defaults for both by the above algorithm. are [content []]. If a type inference pass of the function (relying upon this assumption) produces a result which 6.4 Functions disagrees with the assumption (i.e., if our assumption is PHP functions are by their nature polymorphic even in disproven), a type error is thrown and we type the func- the native PHP type system; their declarations include no tion as returning and echoing generalstring. type annotations, so they can be called using any type 6.5 Limitations of arguments (and can consequently return any type of value and have any type of side-effect upon the output Our implementation has minimal or no support for sev- stream’s type and the global type environment via ac- eral PHP4 constructs: cess to global variables). What’s more, function argu- Arrays and Records The “array” mechanism in PHP ments can be passed either by value or by reference, and is used to implement both arrays (where all mem- the by-reference behavior can be activated by either or bers are of a single type) and records (where each both of the function’s declaration and the syntax of a keyed member may have a distinct type and that function call itself (although the forthcoming PHP5 re- key consistently corresponds with that type). Since lease appears to be omitting support for function calls the language includes no type annotations to indi- requesting the call-by-reference semantic). cate the programmer’s intent, our inference algo- In the common case in which there is no recursion and rithm assumes that the intended semantic is that of no global variables are used, the algorithm is straight- the array, i.e., the type of the array itself is “ARRAY forward. For each declared function, we construct on OF x” where x is the LCS of all values assigned as demand a map from argument types to a tuple of return members to that array. type, echo type, and types of arguments at return points Classes and Objects The implementation currently in- (in case any are called by reference); each time a func- cludes no support for classes, i.e., it does not in- tion call is encountered, it will either draw its type infor- fer types for object properties or methods, and sim- 13
You can also read