next up previous
Next: Backwards Compatibility Up: Retrieving Information From Semistructures Previous: Path Validity

   
Path Match

In this section, we first provide a means of determining whether a user-given descriptor, specified in a query, matches a label. The label matching operation is then incorporated into an Match operation to match a path regular expression to paths in the semistructure.

Label matching in existing semistructured query languages is straightforward. The descriptor is typically a single word or phrase that is compared, using string comparison, to the label. For example, in the regular expression (person | employee).name?, the descriptors, the basic building blocks of the regular expression, are person, employee, and name. During evaluation of this expression, the descriptor person would only match a label person on an edge. More flexible string comparisons between descriptors and labels are supported in some languages, such as Lorel [3], which reuse the wildcard operator `%' from SQL. The descriptor per% would match any label that starts with `per'.

The semantics of label matching is more involved in our model since each label is a set of properties. In addition, string comparison is insufficient because many properties are not strings. These complications are addressed in the label match operation LaMa , defined below. In general, operation LaMa succeeds if every individual property in the descriptor has a match in the label or is missing from the label. Extra properties in the label are ignored, and different ${\it PrMa\/}\xspace_p$ operations are used for different properties, p. Note that the descriptor is a label in the operator definition.

There are three cases to consider. (1) A required property in one label is missing from the other label. In this case, the match does not succeed. A required property must be present in both labels. (2) A non-required property in one label is missing from the other label. In this case, the match succeeds because missing properties are treated as don't care information. (3) The property is present in both labels. The predicate, ${\it PrMa\/}\xspace_p$ specific to the property is used to determine if the property values match. Required and non-required properties are treated the same.

[ ${{\it LaMa\/}\xspace}_{\Gamma}:
{\it LABEL} \times {\it LABEL} \rightarrow {\it BOOL}$]

Label ${\cal L}$ is matched against label ${\cal S}$ as follows. LaMa depends on the semantics of the properties as specified in $\Gamma$, since properties in the labels are individually matched.


		${\it LaMa\/}\xspace_{\Gamma}({\cal L},~{\cal S}) \stackrel{\scriptstyle \triangle}{=}~$

$\forall p, x [ (p{\rm !}~x) \in {\cal L} \Rightarrow \exists y [(p{\rm :}~y) \in {\cal S} \wedge {\it PrMa\/}\xspace_p(x, y)]]~\land$
$\forall p, y [ (p{\rm !}~y) \in {\cal S} \Rightarrow \exists x [(p{\rm :}~x) \in {\cal L} \wedge {\it PrMa\/}\xspace_p(x, y)]]~\land$
$\forall p, x, y [ (p{\rm :}~x) \in {\cal L} \wedge (p{\rm :}~y) \in {\cal S} \Rightarrow {\it PrMa\/}\xspace_p(x, y)]$`$\mathchoice{\raisebox{-.3ex}{\vbox{\hrule width5pt height2.5pt
\hrule width5pt...
...dth5pt height2.5pt
\hrule width5pt height2.5pt
\hrule width5pt height2.5pt}}}$2

The property-specific predicate ${\it PrMa\/}\xspace_p$ matches two property values. For example, equality may be used for name, and time interval overlaps may be used for transaction time. See Table 1.

label that follows requires a movie description.


		 ${\cal L}_{\it movie\/}$
:= {(name! movie)}
In Figure 2, there are two labels with a movie name property. One describes &Color of Night; the other, &Star Wars IV.

		 ${\cal S}_{\it c\/}$
:=   {(name: movie), (security! over 18)}

${\cal S}_{\it w\/}$ :=  {(name: movie), (trans. time: [31/Jul/1998 - uc])}
These labels are matched as follows.

Operation LaMa is the basis for interpreting regular expressions of descriptors. Generally, these regular expressions are interpreted exactly as in other semistructured query languages, and the usual regular expression operations (+, *, ?, |, and . for sequencing) have their usual meaning. The only essential difference between our language and standard semistructured query languages is that the matched path is checked to ensure that it is valid. The following operation extends a set of paths in a semistructure, if the sequence of labels on an extended path matches the regular expression and the entire path is valid.

[ ${\it Match\/}\xspace_{\it DB}:
2^{\it PATHS} \times {\it REG} \rightarrow 2^{\it PATHS}$]

Let S be a set of starting paths (typically the roots of the semistructure) and X be a regular expression over an alphabet of (extended) labels. Then X is said to match a path in ${\it DB\/} = (V,~E,~{\it\&root},~\Gamma)$ by extending a path in S as follows.


		${\it Match\/}\xspace_{\it DB}(S,~X) \stackrel{\scriptstyle \triangle}{=} \{ x ~\vert~x \in M(S, X) ~\wedge~ {\it Valid\/}\xspace_{\Gamma}(x) \}$,

where the matcher, M, is defined as follows.
$M(S, {\cal L}) = \{ v_1~{\mathop{\longrightarrow}\limits^{{\cal L}_1}}~\ldots~{\mathop{\longrightarrow}\limits^{{\cal L}_{m}}}~v_{m+1} ~\vert~$
$v_1~{\mathop{\longrightarrow}\limits^{{\cal L}_1}}~\ldots~{\mathop{\longrightarrow}\limits^{{\cal L}_{m - 1}}}~v_m \in S~\land$
$v_{m}~{\mathop{\longrightarrow}\limits^{{\cal L}_{m}}}~v_{m+1} \in {\it E}~\land~ {\it LaMa\/}\xspace_{\Gamma}({\cal L}, {\cal L}_{m}) \}$
M(S, X.Y) = M(M(S, X), Y)
$M(S, X*) = S~\cup~M(S, X+)$
M(S, X+) = M(S, X.X*)
$M(S, X?) = S~\cup~M(X)$
$M(S, X \vert Y) = M(S, X)~\cup~M(S, Y)$`$\mathchoice{\raisebox{-.3ex}{\vbox{\hrule width5pt height2.5pt
\hrule width5pt...
...dth5pt height2.5pt
\hrule width5pt height2.5pt
\hrule width5pt height2.5pt}}}$2
the definition, the matcher M extends a path in S by recursively decomposing a path regular expression (the expression unifies with the second argument). The matcher extends a standard semistructured database matcher to use ${\it LaMa\/}\xspace_\Gamma$ to match individual labels, as discussed above.

We note that the presence of cycles in the semistructure can lead to an infinite result set, just like matching in any semistructured query language. Consequently, when this operation is implemented, some strategy must be adopted to either break cycles (e.g., node marking is used for Lorel) or otherwise generate a finite result sets (e.g., stop after the first N matches). Which strategy to use is a decision best left to a language designer; AUCQL uses node marking to break cycles.

The cost of Match is essentially the same as path matching in a normal semistructured database: at worst the entire semistructure is explored. The path validity can be computed as each path is explored, although it costs an extra factor of ${\cal O}(m)$, where m is the number of properties in a label. ${\it LaMa\/}\xspace$ is also an ${\cal O}(m)$ operation, assuming that the properties in a label are sorted or hashed. So overall, the cost of matching in our framework grows by a factor of the size of each label.

Sometimes only the set of final nodes in a set of paths is desired.

[ ${\it Nodes\/}\xspace:
2^{\it PATHS} \rightarrow 2^{\it NODES}$]

Let P be a set of paths.


		${\it Nodes\/}\xspace(P) \stackrel{\scriptstyle \triangle}{=}\{ w ~\vert~ v~{\ma...
...s^{{\cal L}_1}}~\ldots~{\mathop{\longrightarrow}\limits^{{\cal L}_m}}~w\in P \}$
`$\mathchoice{\raisebox{-.3ex}{\vbox{\hrule width5pt height2.5pt
\hrule width5pt...
...dth5pt height2.5pt
\hrule width5pt height2.5pt
\hrule width5pt height2.5pt}}}$2


 A user is interested in retrieving information about movie stars as of 31/Jul/1998. That set of nodes can be obtained as follows.


		 ${\cal L}_{\it movie\/}$
:=   {(name! movie),

(trans. time: [31/Jul/1998 - 31/Jul/1998])}
${\cal L}_{\it stars\/}$ :=  {(name! stars),
(trans. time: [31/Jul/1998 - 31/Jul/1998])}
${\cal L}_{\it name\/}$ :=  {(name! name),
(trans. time: [31/Jul/1998 - 31/Jul/1998])}
${\it Nodes\/}\xspace({\it Match\/}\xspace_{\it DB}({\it ROOTS\/}, ~{\cal L}_{\it movie\/}.{\cal L}_{\it stars\/}.{\cal L}_{\it name\/}))$

Recall that ROOTS is the set of edges from $\mathit{\&root}$to roots in the semistructure. The regular expression in this example is a sequence of descriptors. In each descriptor, the name is required (so an edge without a name will not match), but the transaction time is not required (an edge that is missing a transaction time is presumed to exist at all transaction times). Properties not mentioned in the descriptor are ignored in the path matching, unless the property is required, in which case the label is not matched.

It is instructive to consider four paths in Figure 2. (1) The path through &Color of Night to the misspelled value Bruce Wilis is not matched since the required level of security (over 18) is missing from the descriptors. The user must have a digital certificate that authenticates her or him as being over 18, and must add that to the first descriptor to match that edge. (2) The path through &Color of Night to the value Bruce Willis is also not matched for the same reason. (3) The path through &Star Wars IV to the misspelled value Bruce Wilis matches the regular expression, but is not a valid path (see Example 3.7). (4) The path through &Star Wars IV to the value Bruce Willis is the only path that both matches the regular expression and is a valid path. $\mathchoice{\raisebox{-.3ex}{\vbox{\hrule width5pt height2.5pt
\hrule width5pt...
...dth5pt height2.5pt
\hrule width5pt height2.5pt
\hrule width5pt height2.5pt}}}$2



next up previous
Next: Backwards Compatibility Up: Retrieving Information From Semistructures Previous: Path Validity

Copyright © 1998. Curtis E. Dyreson, Michael H. B&. All rights reserved.