The World-Wide Web (``web'') is arguably the world's most frequently used information resource. While current web data has little structure, most of it local, web data will likely have far more in the near future. Specifically, the eXtensible Markup Language (XML) is expected to replace the Hypertext Markup Language [12,4]. An XML web page can have a schema that describes how the data in the page is structured. XML will at best provide only some structure for data, since the page-level schemas may (and likely will) vary from page to page. The ability of semistructured data models to accommodate data that lacks a well-defined schema makes them attractive candidates for querying and managing XML data [14,26]. An XML-like representation of web meta-data has also been proposed, cf. the RDF standard. Somewhat unlike database meta-data, web meta-data is typically taken to mean additional information about a document, such as the author, subject, language, or URL. In this paper we use the term `meta-data' to encompass both database and web meta-data.
Semistructured data models organize data in graphs [8,14] where each node represents an object or a value, and each edge represents a relationship between the objects or values represented by the edge's nodes. Edges are both directed and labeled. The labels are important because they make nodes self-describing in the sense that a node is described by the sequences of labels on paths through the graph that lead to the node.
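As a rough illustration of what ``self-describing'' means here, the following Python sketch stores a small edge-labeled directed graph and lists the label sequences that lead to a node. It is not taken from any existing system: the class Graph, the method label_paths, the node &root, and the label company are assumptions made for the example; only &ACME, &joe, and the label employee come from the running example.

```python
from collections import defaultdict


class Graph:
    """A directed graph whose edges carry a single string label."""

    def __init__(self):
        # source node -> list of (label, target node) pairs
        self.out = defaultdict(list)

    def add_edge(self, source, label, target):
        self.out[source].append((label, target))

    def label_paths(self, root, target):
        """Return every cycle-free sequence of labels leading from root to target."""
        results = []

        def walk(node, labels, visited):
            if node == target and labels:
                results.append(tuple(labels))
            for label, nxt in self.out.get(node, []):
                if nxt not in visited:  # skip nodes already on this path
                    walk(nxt, labels + [label], visited | {nxt})

        walk(root, [], {root})
        return results


# Hypothetical data: "&root" and "company" are invented for this sketch.
g = Graph()
g.add_edge("&root", "company", "&ACME")
g.add_edge("&ACME", "employee", "&joe")

paths_to_joe = g.label_paths("&root", "&joe")
```

In this sketch, the node &joe is described by the single label sequence (company, employee), that is, by how it is reached rather than by a separately declared schema.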
This paper introduces an extensible, semistructured data model that generalizes existing semistructured models. In this model, each label is a set of descriptive properties. A property is a kind of meta-data. Typical properties are the name of the edge and the level of security that protects the edge, but any property can be used in a label to describe the nodes that are reachable through that edge.
To exemplify edge labels, consider Figure 1. Part (a) shows a conventional edge that is labeled employee and connects nodes &ACME and &joe. In contrast, part (b) shows the kind of label introduced in this paper. This label is a set of `property name: property value' pairs. Each pair is collectively referred to as a property. This label has two properties: name and transaction time. This generalizes existing semistructures since the label in part (a) can be assumed to specify an implicit name property, with the value employee.
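The relationship between the two kinds of label can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the representation used by the model or by AUCQL: the helper names as_label and matches are hypothetical, and the transaction time value is a placeholder, since the value shown in Figure 1(b) is not reproduced in the text.

```python
def as_label(label):
    """Normalize a label to a set of properties (a dict).

    A conventional string label is treated as an implicit name property.
    """
    if isinstance(label, str):
        return {"name": label}
    return dict(label)


def matches(label, required):
    """True if the label carries every required property with the given value."""
    props = as_label(label)
    return all(props.get(key) == value for key, value in required.items())


# Part (a): a conventional label on the edge from &ACME to &joe.
conventional = "employee"

# Part (b): a label with two properties; the time value is a made-up placeholder.
extended = {"name": "employee", "transaction time": "1999-01-01"}

same_name = matches(conventional, {"name": "employee"}) and matches(
    extended, {"name": "employee"}
)
```

Both labels match a request for the name employee, while only the extended label can answer a request that involves transaction time. Because properties are looked up by name, a label that lacks a property simply fails to match, which accommodates labels whose sets of properties differ.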
The paradigm of using labels with properties can be recursively applied. For instance, the property name in Figure 1(b) could itself be transformed into a label with two properties: name and language, e.g., English, indicating that name is an English word. While the recursive nature of labels with properties is theoretically appealing, it is of limited utility since meta-meta-data (and meta-meta-meta-data, etc.) is uncommon in the real world. So although this framework could capture and query recursively nested properties, we focus exclusively on a single level of meta-data in this paper.
Previous research in semistructured and unstructured data models has focussed on basic issues such as query language design [6,7,25,3,20], restructuring of query results [13,2], tools to help naive users query unknown semistructures [16,17], techniques for improving implementation efficiency [25,15,23], and methods for extracting semistructured data from the web [18,24]. Several well-designed languages have also been presented [6,3,20,13,14].
Our paper is different, in part, because it treats edge labels as something other than single words or strings. Buneman et al. also propose a semistructured model with complex labels. In their model, key information from objects in the database is added to labels, making each path in the database unique. We focus on adding meta-data rather than data to the labels and on the additional operations necessary to manipulate the meta-data in labels. Another paper with augmented labels presents the Chlorel query language for the DOEM data model. DOEM extends OEM with special annotations on edges to record information about updates; in particular, the (transaction) time and kind of update. This permits a history of changes to a semistructure to be maintained. We further extend the scope and power of the annotations on edge labels into a more general framework. Chlorel is a language for querying the extended data model. Chlorel supports a limited kind of temporal query, which lacks both coalescing and collapsing. We believe these operations are important for correctly supporting temporal semantics.
The paper is organized as follows. Section 2 motivates the extended semistructured model, arguing the utility of introducing a richer structure for labels. Section 3 presents the extended model. Initially, the format of a database is defined. An important feature is that the set of properties present may vary from label to label. Section 3.2 proceeds to introduce several new or extended query operators to contend with properties in labels. Section 4 incorporates the new query operations into a derivative of the SQL-like Lorel query language [25,22,3], called AUCQL, for querying semistructured data with properties. The last section covers future work and summarizes the paper.
The URL <www.cs.auc.dk/~curtis/AUCQL> provides an interactive query engine for the example database given in this paper, documentation and examples on using AUCQL, and a freely-available implementation package.