SQL Multimedia and Application Packages (SQL/MM)

Jim Melton, Oracle, Sandy, UT 84093, jim.melton@acm.org
Andrew Eisenberg, IBM, Westford, MA, andrew.eisenberg@us.ibm.com

Introduction

Regular readers of this column will have become familiar with database language SQL; indeed, most readers are already familiar with it. We have also discussed the fact that the SQL standard is being published in multiple parts and have even discussed one of those parts in some detail[1].

Another standard, based on SQL and its structured user-defined types[2], has been developed and published by the International Organization for Standardization (ISO). This standard, like SQL, is divided into multiple parts (more independent than the parts of SQL, in fact). Some parts of this other standard, known as SQL/MM, have already been published and are currently in revision, while others are still in preparation for initial publication.

In this issue, we introduce SQL/MM and review each of its parts, necessarily at a high level.

Jim Melton and Andrew Eisenberg

SQL Multimedia and Application Packages: SQL/MM

In late 1991 or early 1992, a small group of text search engine vendors, operating under the auspices of the IEEE, released a specification for a language called SFQL (Structured Full-text Query Language). The goal of SFQL was to define extensions to SQL that would be suitable for applying full-text searches to repositories of documents.

The proposal was given significant attention by the full-text community, but was immediately criticized by several other data management communities on the grounds that SFQL “hijacked” many useful keywords that were in common use by those other communities. For example, the keyword CONTAINS was proposed by SFQL to mean “the indicated unit of text contains the supplied word or phrase”, but the spatial data community used the same keyword to mean “one spatial entity contains a second spatial entity”. While the high-level semantics of the word may seem to be quite similar in each case, the actual code required to implement it is dramatically different.

This controversy was sufficiently generalized that the SQL standards organizations realized that many incompatible extensions to SQL would be defined by various data management communities, the end result being a situation in which no single product could possibly implement all of the extensions because of conflicts in keywords (and other related conflicts).

A summit meeting was held in Tokyo later in 1992 to seek a solution to the dilemma posed by the conflicting demands on SQL extensions. By that time, the SQL standards committees were in the process of adding object-oriented extensions to SQL and a number of SQL vendors had indicated their support for what is often called the “object-relational model”. Based on suggestions from several of those vendors, the Tokyo summit developed the notion of a second standard that would define several “class libraries” of SQL object types, one for each significant category of complex data.

The structured types defined in such libraries would naturally be first-class SQL types that could be accessed through ordinary SQL:1999 facilities, including expressions that invoke SQL-invoked routines associated with such types (that is, methods).

The proposed standard was immediately known as “SQL/MM” (MM for MultiMedia). A number of candidate data domains were suggested, including full-text data, spatial data, image data (still and moving), and others. Responsibility for SQL/MM’s development was given to the same ISO subcommittee as SQL (at that time, JTC1/SC21, but now JTC1/SC32), with the hope that domain experts would attend to develop the specifications for each data domain.

Like SQL, SQL/MM is a multi-part standard. Unlike SQL, the various parts of SQL/MM are quite independent from one another. However, there is one part that is common to the remainder of the work. Part 1, known as the Framework[3], provides definitions of common concepts used in the other parts and outlines the definitional approach used by those other parts. In particular, it describes the manner in which the other parts use SQL’s structured user-defined types to define the types required by the subject matter of each part.
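
The CONTAINS conflict described above, and the way a library of types with their own methods sidesteps it, can be illustrated outside SQL. What follows is only a minimal sketch in Python, not SQL/MM itself: the class names `TextDocument` and `Rectangle` and their `contains` methods are hypothetical stand-ins for a full-text type and a spatial type. Each type supplies its own meaning of "contains", so neither community needs to reserve a language keyword.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextDocument:
    """Hypothetical stand-in for a full-text type."""
    text: str

    def contains(self, word: str) -> bool:
        # Full-text meaning: the document includes the given word
        # (case-insensitive, words separated by white space).
        return word.lower() in self.text.lower().split()


@dataclass(frozen=True)
class Rectangle:
    """Hypothetical stand-in for a spatial type (axis-aligned rectangle)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, other: "Rectangle") -> bool:
        # Spatial meaning: the other rectangle lies entirely inside this one.
        return (self.x_min <= other.x_min and self.y_min <= other.y_min
                and self.x_max >= other.x_max and self.y_max >= other.y_max)


doc = TextDocument("SQL is a database language standard")
outer = Rectangle(0, 0, 10, 10)
inner = Rectangle(2, 2, 5, 5)

print(doc.contains("standard"))  # full-text test
print(outer.contains(inner))     # spatial test
```

The two methods share a name but nothing else; which one runs is decided by the type of the value it is invoked on, which is the property the "class library" approach relies on.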
Full-Text

The term “full-text” (or, if you prefer, “full text”) is normally applied to textual data that differs from ordinary character string data principally in its length, but also in database-specific operations that can be applied to it. Ordinary character strings are usually indexed by their entire values, but special types of indexes are defined for full-text data; such indexes might record information about the proximity of words and phrases to one another or about words that appear in a document and related words that do not appear in the same document. Full-text data is subject to search operations that are normally not applied to “simple” character strings. It’s worth pointing out that “full-text operations” are quite different from the sort of pattern matches (such as regular expressions) with which most computer software people are intimately familiar.

The SQL/MM Full-Text standard[4] defines a number of structured user-defined types (henceforth, just “UDTs”) to support the storage (presumably in an object-relational database) of textual data. One of these types is named FullText and it supports the construction of full-text data values, testing whether that data contains specified patterns, and conversion of that data to ordinary SQL character strings. The specification of the FullText type includes a number of methods that prepare the value associated with an instance of the type for the application of full-text searches, as well as Boolean methods that perform the searches themselves.

In addition to the FullText type, a number of additional types are defined to represent various sorts of patterns that can be used in full-text searches. Search patterns can be quite complex, including searching for text that includes specific words, words stemmed from specified words (such as the past tense of a verb or the plural of a noun), words with similar definitions, and even words that sound like a given word.

Linguists among our readers will know that some languages are much more amenable to computer identification of components of text than others. For example, most Western languages use white space to separate words from one another and use special punctuation (such as a period, or full stop) to separate sentences. Other languages, such as Japanese, do not separate words from one another by spaces, depending primarily on context to distinguish words. SQL/MM Full-Text is generally acknowledged to have better support for languages for which automatic distinction of language tokens (such as words) is relatively easy.

Consider the following SQL table:

    CREATE TABLE information (
        docno    INTEGER,
        document FULLTEXT )

in which the docno column contains a value that captures some document identifier and the document column contains a full-text document.

We could retrieve from that table the identifiers of documents about full-text searching that contain words closely related to “standard” in the same paragraph as words that sound like “sequel” by using a query like this:

    SELECT docno
    FROM information
    WHERE document.CONTAINS
        ('STEMMED FORM OF "standard"
          IN SAME PARAGRAPH AS
          SOUNDS LIKE "sequel"') = 1

That query retrieves the docno column from the information table for every document for which the value returned by the CONTAINS method applied to the document column is 1, meaning true. The parameter passed to that method uses three different full-text operations: STEMMED FORM OF will find any of several words derived from “standard”, such as “standards” and “standardization”; IN SAME PARAGRAPH AS requires that a second word (or phrase!) appear in the same paragraph as the stemmed word; and SOUNDS LIKE finds words that are pronounced (presumably in English, since we didn’t specify a different language) like “sequel” (of which “SQL” might be a case).

Spatial

Many enterprises need the ability to store, manage, and retrieve information based on aspects of spatial data, such as geometry, location, and topology. Applications making use of spatial data include automated mapping, facilities management, geographic systems, graphics, multimedia, and even integrated circuit design. The SQL/MM Spatial standard[5] defines SQL:1999 structured user-defined types and associated methods to provide the ability to support such applications.

By its very nature, spatial data often represents 2-dimensional and 3-dimensional data. SQL/MM Spatial currently supports 0-dimensional (point), 1-dimensional (line), and 2-dimensional (“flat” shape) data; future revisions might support 3-dimensional (volumetric shapes) and possibly data of even higher dimensions.

There is an astonishingly large number of spatial reference systems in common use, the vast majority of them used to describe geographic entities and concepts on the surface of our (relatively) spherical planet. Many of those spatial reference systems deal with large structures for which the curvature of the
planet is significant; as a result, various systems have evolved to describe structures in particular regions (e.g., countries, states, and provinces) for which the impacts of planet curvature vary from the impacts in other regions. (For example, lines of longitude converge towards one another as one moves close to the poles: seemingly parallel lines of longitude are in fact not parallel.)

Support for these spatial reference systems is economically critical to the design of SQL/MM Spatial, because the largest users of spatial data management systems are often governmental bodies and very large commercial enterprises that have to deal with geographic data. Such users include local governments (city planning, traffic management, accident investigation), state and provincial governments (highway planning, natural resource management), national governments (defense, border control), extractive industries (mineral and water location), and farming (plot allocation). Indeed, SQL/MM Spatial’s design seems to support geospatial data more naturally than smaller-scale data such as integrated circuit design and computer graphics.

SQL/MM Spatial defines several type hierarchies. One of those hierarchies has as its most generalized type (that is, its maximal supertype) a type called ST_Geometry. That type is not instantiable (meaning that no instances of it can be created; Spatial defines fewer than a half-dozen such types), but it has a number of (about a dozen) subtypes that are instantiable, such as ST_Point, ST_Curve, and ST_MultiPolygon.

A type (not a subtype of ST_Geometry) called ST_SpatialRefSys is used to describe spatial reference systems. Every spatial value that participates in a given query must be defined in the same spatial reference system, although a future version of the Spatial standard might relax that restriction.

In a future version of SQL/MM Spatial that is currently under development, another pair of types, ST_Angle and ST_Direction, is used to capture information about various angles and directions that are needed when storing and managing spatial information.

There are many operations that can be performed on Spatial data. Among the most common operations are construction of a straight line from two points or from one point, a direction, and a distance; and construction of a polygon from several lines, from several points, or from a point and a collection of directions and distances. Other important operations are detection of whether two lines intersect, whether two areas overlap or are adjacent to one another, whether a line is tangent to a curve, and whether two polygons share a boundary.

Most Spatial types have accessor methods that permit applications to extract fundamental information about instances of the type, such as determining the values of the X and Y coordinates of a point.

Consider the following table definition:

    CREATE TABLE CITY (
        NAME       VARCHAR(30),
        POPULATION INTEGER,
        CITY_PARKS VARCHAR(30) ARRAY[10],
        LOCATION   ST_GEOMETRY )

We can determine the area of San Francisco by executing a query like this:

    SELECT location.area
    FROM CITY
    WHERE name = 'San Francisco'

The expression location.area retrieves the area attribute of the ST_Geometry structured type value stored in the location column of the row corresponding to San Francisco. (Retrieving the value of an attribute of a structured type instance is equivalent to invoking the accessor method on that attribute.)

SQL/MM Spatial is closely related to, and fundamentally aligned with, other spatial standards being developed by another ISO Technical Committee, TC 211 (Geomatics), and by the Open GIS Consortium (“GIS” stands for “Geographic Information Systems”). Keeping the standards being developed in all three forums aligned has proved challenging, but all participants seem committed to doing so.

Still Image

One of the fastest growing applications of computers is storage and processing of visual images such as photographs. Many enterprises expend tremendous resources on the acquisition, storage, and management of collections of images, including graphics, paintings, and photographs. Such data has tremendous business value and represents large monetary outlays. One of the most challenging aspects of handling image data is that of locating an image already in your possession.

SQL/MM Still Image[6] represents a part of the solution to those problems. This part of the SQL/MM standard provides structured user-defined types that allow you to store new images into a database, retrieve them, modify them in various ways, and, most importantly, locate them by applying various “visual” predicates to your collections of images.

In SQL/MM Still Image, images are represented using an SQL:1999 structured type called SI_StillImage. This type stores collections of picture elements (pixels) representing 2-dimensional images.
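
To make the idea of an image as a 2-dimensional collection of pixels concrete, here is a small sketch in Python rather than SQL. It says nothing about how SI_StillImage is actually specified or implemented; the `TinyImage` class, its `rows` field, and its `average_color` method are all hypothetical names chosen for illustration. It shows that once pixels are stored in a grid, descriptive values such as the dimensions or an average color can be derived from them.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


@dataclass
class TinyImage:
    """Hypothetical in-memory image: a list of rows of RGB pixels."""
    rows: List[List[Pixel]]

    @property
    def height(self) -> int:
        return len(self.rows)

    @property
    def width(self) -> int:
        return len(self.rows[0]) if self.rows else 0

    def average_color(self) -> Pixel:
        # Mean of each channel over all pixels, rounded down to an integer.
        count = self.width * self.height
        if count == 0:
            raise ValueError("image has no pixels")
        totals = [0, 0, 0]
        for row in self.rows:
            for pixel in row:
                for channel in range(3):
                    totals[channel] += pixel[channel]
        return (totals[0] // count, totals[1] // count, totals[2] // count)


# A 2x2 image containing two shades of green.
image = TinyImage(rows=[
    [(0, 200, 0), (0, 100, 0)],
    [(0, 200, 0), (0, 100, 0)],
])
print(image.width, image.height)
print(image.average_color())
```

A real implementation would also have to deal with image formats and color spaces, which this sketch deliberately ignores.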
(Of course, images of 3-dimensional objects are very common, but the images themselves are 2-dimensional.) Images can be stored in any of several formats, depending on what the underlying implementation supports; for example, formats such as JPEG, TIFF, and GIF are commonly supported as input and output formats, as well as formats in which images are stored and manipulated. The SI_StillImage type also captures information about each image, such as its format, its dimensions (height and width in pixels), its color space, and so forth.

Methods applied to SI_StillImage instances include routines to scale an image (change its size proportionally), to crop an image (remove undesired parts), to rotate an image (such as changing its orientation from horizontal to vertical), and to create a “thumbnail” image (a lower resolution image used for quick display).

Another group of data types is used to describe various features of images. The SI_AverageColor type is used to represent the “average” color of a given image; this value may be used in locating images in collections (imagine wanting to find an image that is primarily green to be used in advertising outdoor furniture). The SI_ColorHistogram type provides information about the colors in an image at a finer level of granularity than the image’s average color; it indicates how much of each color is found in an image. The SI_PositionalColor type represents the location of specific colors in an image, supporting queries such as “since sunsets at sea have red and orange above dark blue, find me images with those color characteristics”. Finally, the SI_Texture type allows the recording of information such as coarseness, contrast, and direction of granularity. An SI_FeatureList type permits recording all of the features described in this paragraph for each image.

By combining several features of an image, it is possible to write queries that can retrieve from a very large image base a much smaller collection of images from which you can quickly select the exact image you want. It is also possible to screen collections of images to find images of potential interest for various reasons. For example, you might want to determine whether a new logo you’ve commissioned might conflict with other logos that have already been copyrighted. An SQL statement like this one:

    SELECT *
    FROM REGISTERED_LOGOS
    WHERE SI_findTexture(newLogo).SI_Score(Logo) > 1.2

would do just what you need.

Of course, not all images are “still”. Additional challenges are posed by moving images, such as digitized video. That sort of data is not addressed by SQL/MM Still Image, but it is possible that some future part of SQL/MM will be oriented towards moving images.

Data Mining

The parts of SQL/MM that we’ve presented so far in this column are all very reasonably described as oriented towards the handling of multimedia data. However, as you saw in the early sections of the column, the full name of the SQL/MM standard is SQL Multimedia and Application Packages. In fact, work was initiated in early 2000 on a new part of SQL/MM that does not address multimedia data, but instead defines an application package.

SQL/MM Data Mining[7] defines SQL structured user-defined types, including methods on the types, to address an important aspect of modern data management: the discovery of previously unknown, but important, information buried in large quantities of data that might have been collected for other, quite distinct reasons.

Data mining is not a new concept; indeed, companies have long wanted to use data collected in the ordinary course of business as a source of information about their customers or other resources. A number of relatively small, but important, companies were founded during the 1990s to provide enterprises with data mining products, some of them based on relational database systems, but most of them dedicated applications that require importing data stored in another repository and reorganizing it into structures unique to a particular data mining approach.

SQL/MM Data Mining takes a different view of the problem: it attempts to provide a standardized interface to data mining algorithms that can be layered atop any object-relational database system and even deployed as middleware when required.

In most data management environments, applications pose questions to the data repositories that retrieve information based on specific criteria. By contrast, in a data mining environment, applications often ask the repository to find out what criteria are most important.

For example, a data mining engine can discover, and report to its users, that (to use a famous, if apocryphal, example) about half of the customers who buy both disposable diapers and beer will buy an air freshener product as well. This is not the sort of question that most users would dream up by themselves (it certainly doesn’t come to our minds very often!), but it is precisely the kind of relationship that a data mining product will discover.

A popular question that a data mining product might be asked is “Who are my most important customers and what are the most significant attributes of
those customers and the trends in the values of those attributes?” The first part of the question may seem easy: it’s usually straightforward to find out which customers have bought your products or services recently. But “most important” may have other meanings than “recent purchases”. Profits are not always directly related to purchases, since growth rates, service demands, and other factors can significantly affect the meaning of importance.

Data mining tools are also used for predictive purposes, such as insurance companies mining data on existing customers to help evaluate the risks associated with new customers.

There are four different data mining techniques supported by this standard. One technique, the rule model, allows you to search for patterns (“rules”) in the relationships between different parts of your data. A second technique, the clustering model, helps you group together data records that share common characteristics and identify the most important of those characteristics. The third technique, the regression model, helps you predict the ranking of new data based on an analysis of existing data. The final technique, the classification model, is very similar to the regression model, but it is oriented towards predicting which grouping or class new data will best fit based on its relationship to existing data.

For each of those techniques, as with most data mining products, there are three distinct stages through which you can mine your data. First, you have to train a model; this means choosing the technique most appropriate to your goals, then setting a few parameters to orient the model, and finally training the model by applying it to a reasonably-sized data set (perhaps several times for improved validity). Second, if you’re using the classification or regression techniques, you can test the model by applying it to known data and comparing the model’s predictions with that known data’s classification or ranking. Finally, you apply the model to your business data and use the results to improve your enterprise.

The models are supported through the use of several broad categories of new structured user-defined types. For each model, a type known as DM_*Model (where the ‘*’ is replaced by ‘Clas’ for a classification model, ‘Rule’ for a rule model, ‘Clustering’ for a clustering model, and ‘Regression’ for a regression model) is used to define the model that you want to use when mining your data. The models are parameterized using instances of the DM_*Settings type (‘*’ is ‘Clas’, ‘Rule’, ‘Clus’, or ‘Reg’) and the models are trained using instances of the DM_ClassificationData type. The DM_*Settings type allows various parameters of a data mining model, such as the depth of a decision tree, to be set.

Once a model has been created and trained, it can be tested by building instances of the DM_MiningData type that hold test data, and instances of the DM_MiningMapping type that specify the different columns in a relational table that are to be used as a data source. The result of testing a model is one or more instances of the DM_*TestResult type (‘*’ can only be ‘Clas’ or ‘Reg’). When running your model against real data, you get the results in instances of the DM_*Result type (‘*’ can be ‘Clas’, ‘Clus’, or ‘Reg’, but not ‘Rule’).

In most cases, you also create and use instances of DM_*Task types to control the actual testing and running of your models.

At the time this column went to press, it seemed likely that final progression of the SQL/MM Data Mining standard might be slowed just a little bit to ensure that it is fully compatible with a “sister” data mining API being developed for Java by the Java Community Process.

Summary

The SQL/MM suite of standards includes a Framework that describes the conventions used to define each of the other parts. There are other parts used to manage full-text data, spatial data, and still images, and to perform data mining.

Careful inspection of the references below will reveal that there is no part 4 of this multi-part standard. That’s because an attempt to develop a set of classes for general mathematical operations was eventually determined to satisfy too few users at too great a cost; development of SQL/MM General Purpose Facilities was thus abandoned several years ago.

Not all parts of SQL/MM are yet commercially successful, but there seems to be growing support at least for both Full-Text and Spatial by several important players in those fields. Support for Still Image seems to be developing more slowly, and it’s far too soon to say about Data Mining since that part has not yet been published. Whether additional data types (such as moving image data) are ever supported depends on many factors, including interest from the technical community depending on such data. The recent surge in consolidation within the database industry causes some to think that there is a reduction in the need for such standards, but the greater attention being paid to the Internet and the World Wide Web proves that the need for portability of data and of code continues to increase.

If you’re interested in acquiring copies of the SQL/MM standard’s various parts, you can do so at ANSI’s electronic standards store cited below. Unfortunately, even in downloadable (PDF) form, these standards are a bit pricey. We expect that, once they
