Types and patterns

In CDuce, a type denotes a set of values, and a pattern extracts sub-values from a value. Syntactically, types and patterns are very close. Indeed, any type can be seen as a pattern (which accepts exactly the values of that type and extracts nothing), and a pattern without any capture variable is nothing but a type.

Moreover, values also share a common syntax with types and patterns. This is motivated by the fact that basic and constructed values (that is, any values without functional values inside) are themselves singleton types. For instance, (1,2) is at once a value, a type, and a pattern. As a type, it can be interpreted as a singleton type, or as a pair type made of two singleton types. As a pattern, it can be interpreted as a type constraint, or as a pair pattern of two type constraints.

In this page, we present all the types and patterns that CDuce recognizes. It is also the occasion to present the CDuce values themselves, the corresponding expression constructions, and fundamental operations on them.

Capture variables and default patterns

A value identifier inside a pattern behaves as a capture variable: it accepts and binds any value.

Another form of capture variable is the default value pattern ( x := c ) where x is a capture variable (that is, an identifier), and c is a scalar constant. The semantics of this pattern is to bind the capture variable to the constant, disregarding the matched value (and accepting any value).

Such a pattern is useful in conjunction with the first match policy (see below) to define "default cases". For instance, the pattern ((x & Int) | (x := 0), (y & Int) | (y := 0)) accepts any pair and binds x to the left component if it is an integer (and to 0 otherwise), and similarly for y with the right component of the pair.
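As a minimal sketch of this "default case" idiom, the pattern above can be used as the single branch of a function. The name sum_ints is hypothetical, and the function-definition syntax (an interface followed by pattern branches) is assumed from the CDuce tutorial rather than defined on this page.

```cduce
(* Hypothetical helper: adds the integer components of a pair,
   counting any non-integer component as 0. *)
let sum_ints ((Any,Any) -> Int)
  ((x & Int) | (x := 0), (y & Int) | (y := 0)) -> x + y

let a = sum_ints (3, 4)      (* both components are integers: 7 *)
let b = sum_ints (3, 'c')    (* y falls back to the default 0: 3 *)
```

Because the left alternative of each | is tried first, the default is used only when the component is not an integer.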

Boolean connectives

CDuce recognizes the full set of boolean connectives, whose interpretation is purely set-theoretic.

  • Empty denotes the empty type (no value).
  • Any and _ denote the universal type (all the values); the preferred notation is Any for types and _ for patterns, but they are strictly equivalent.
  • & is the conjunction boolean connective. The type t1 & t2 has all the values that belong to both t1 and t2. Similarly, the pattern p1 & p2 accepts all the values accepted by both sub-patterns; a capture variable cannot appear on both sides of this pattern.
  • | is the disjunction boolean connective. The type t1 | t2 has all the values that belong either to t1 or to t2. Similarly, the pattern p1 | p2 accepts all the values accepted by either of the two sub-patterns; if both match, the first match policy applies, and p1 dictates how to capture sub-values. The two sub-patterns must have the same set of capture variables.
  • \ is the difference boolean connective. The left-hand side can be a type or a pattern, but the right-hand side is necessarily a type (no capture variable).
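The connectives above can be illustrated with a few declarations. This is only a sketch: all type and function names are hypothetical, and the function syntax is assumed from the CDuce tutorial.

```cduce
type Digit    = 0--9
type NonDigit = Int \ Digit          (* integers outside 0--9 *)
type Scalar   = Int | Char | Atom    (* union of three scalar families *)
type Same     = Scalar & Digit       (* denotes the same set as Digit *)

(* First match policy: a digit is accepted by all three alternatives,
   and the leftmost one decides the binding of x. *)
let classify (Any -> Int)
  ((x := 1) & Digit) | ((x := 2) & Int) | (x := 3) -> x

let d = classify 5       (* 1 *)
let i = classify 42      (* 2 *)
let o = classify 'a'     (* 3 *)
```

Each alternative of | binds the same capture variable x, as required, and each & combines a default pattern with a pure type constraint, so no variable appears on both sides of &.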

Recursive types and patterns

A set of mutually recursive types can be defined by toplevel type declarations, as in:

type T1 = <a>[ T2* ]
type T2 = <b>[ T1 T1 ]

It is also possible to use the syntax T where T1 = t1 and ... and Tn = tn where T and the Ti are type identifiers and the ti are type expressions. The same notation works for recursive patterns (for which there are no toplevel declarations).

There is an important restriction concerning recursive types: any cycle must cross a type constructor (pairs, records, XML elements, arrows). Boolean connectives do not count as type constructors! The code sample above is a correct definition. The one below is invalid, because there is an unguarded cycle between T and S.

type T = S | (S,S)  (* INVALID! *)
type S = T          (* INVALID! *)
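For contrast, here is a sketch of a guarded recursive definition. The name IntList is hypothetical; the recursive occurrence sits under the pair constructor, so the cycle crosses a type constructor as required.

```cduce
(* Valid: the only cycle goes through the pair constructor. *)
type IntList = (Int, IntList) | `nil
```

This type denotes the same values as the sequence type [ Int* ] described later on this page.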

Scalar types

CDuce has three kinds of atomic (scalar) values: integers, characters, and atoms. To each kind corresponds a family of types.

  • Integers.
    CDuce integers are arbitrarily large. An integer literal is a sequence of decimal digits, plus an optional leading unary minus (-) character.
    • Int: all the integers.
    • i--j (where i and j are integer literals, or * for infinity): integer interval. E.g.: 100--*, *--0[1] (note that * stands both for plus and minus infinity).
    • i (where i is an integer literal): integer singleton type.
  • Floats.
    CDuce provides minimal features for floats. The only way to construct a value of type Float is by the function float_of : String -> Float.
  • Characters.
    CDuce manipulates Unicode characters. A character literal is enclosed in single quotes, e.g. 'a', 'b', 'c'. The single quote and the backslash character must be escaped by a backslash: '\'', '\\'. The double quote can also be escaped, but this is not mandatory. The usual '\n', '\t', '\r' are recognized. Arbitrary Unicode codepoints can be written in decimal '\i;' (i is a decimal integer; note that the code is ended by a semicolon) or in hexadecimal '\xi;'. Any other occurrence of a backslash character is prohibited.
    • Char: all the Unicode character set.
    • c--d (where c and d are character literals): interval of Unicode character set. E.g.: 'a'--'z'.
    • c (where c is a character literal): character singleton type.
    • Byte: all the Latin1 character set (equivalent to '\0;'--'\255;').
  • Atoms.
    Atoms are symbolic elements. They are used in particular to denote XML tag names, and also to simulate ML sum type constructors and exception names. An atom is written `xxx where xxx follows the rules for CDuce identifiers. E.g.: `yes, `No, `my-name. The atom `nil is used to denote empty sequences.
    • Atom: all the atoms.
    • a (where a is an atom literal): atom singleton type.
    • Bool: the two atoms `true and `false.
    • See also: XML Namespaces.
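The scalar types and literals listed above might be used as follows. This is a sketch only; every name introduced here is hypothetical.

```cduce
type Positive = 1--*            (* integer interval, unbounded above *)
type Lower    = 'a'--'z'        (* character interval *)
type Answer   = `yes | `no      (* union of two atom singleton types *)

let big     = 123456789012345678901234567890   (* integers are unbounded *)
let capital = '\x41;'           (* hexadecimal code point 41, i.e. 'A' *)
let newline = '\n'
let flag    = `true             (* a value of type Bool *)
```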

Pairs

Pairs are a fundamental notion in CDuce, as they constitute the building block for sequences. Even if syntactic sugar somewhat hides pairs when you use sequences, it is good to know that they are there.

A pair expression is written (e1,e2) where e1 and e2 are expressions.

Similarly, pair types and patterns are written (t1,t2) where t1 and t2 are types or patterns. E.g.: (Int,Char).

When a capture variable x appears on both sides of a pair pattern p = (p1,p2), the semantics is the following: when a value matches p, if x is bound to v1 by p1 and to v2 by p2, then x is bound to the pair (v1,v2) by p.

Tuples are syntactic sugar for pairs. For instance, (1,2,3,4) denotes (1,(2,(3,4))).
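The following sketch illustrates the tuple encoding and the rule for a capture variable that occurs on both sides of a pair pattern. The variable names are hypothetical.

```cduce
let t    = (1, 2, 3)                        (* same value as (1,(2,3)) *)
let head = match t with (x, _) -> x         (* 1 *)
let tail = match t with (_, r) -> r         (* (2,3) *)

(* x occurs on both sides of the pair pattern, so it is bound to the
   pair made of the two sub-bindings. *)
let both = match (1, 'a') with (x & Int, x & Char) -> x    (* (1,'a') *)
```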

Sequences

Values and expressions

Sequences are fundamental in CDuce. They represent the content of XML elements, and also character strings. Actually, they are only syntactic sugar over pairs.

Sequence expressions are written inside square brackets; elements are simply separated by whitespace: [ e1 e2 ... en ]. Such an expression is syntactic sugar for: (e1,(e2, ... (en,`nil) ...)). E.g.: [ 1 2 3 4 ].

The binary operator @ denotes sequence concatenation. E.g.: [ 1 2 3 ] @ [ 4 5 6 ] evaluates to [ 1 2 3 4 5 6 ].

It is possible to specify a terminator different from `nil; for instance [ 1 2 3 4 ; q ] denotes (1,(2,(3,(4,q)))), and is equivalent to [ 1 2 3 4 ] @ q.

Inside the square brackets of a sequence expression, it is possible to have elements of the form ! e (which is not an expression by itself), where e is an expression which should evaluate to a sequence. The semantics is to "open" e. For instance: [ 1 2 ![ 3 4 ] 5 ] evaluates to [ 1 2 3 4 5 ]. Consequently, the concatenation of two sequences e1 @ e2 can also be written [ !e1 !e2 ] or [ !e1 ; e2 ].
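The constructions described above can be combined as in the following sketch (the variable names are hypothetical):

```cduce
let s = [ 1 2 3 ]              (* sugar for (1,(2,(3,`nil))) *)
let t = [ 0 !s 4 ]             (* [ 0 1 2 3 4 ] *)
let u = s @ [ 4 5 ]            (* [ 1 2 3 4 5 ] *)
let v = [ !s ; [ 4 5 ] ]       (* same value as u *)
```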

Types and patterns

In CDuce, a sequence can be heterogeneous: the elements can all have different types. Types and patterns for sequences are specified by regular expressions over types or patterns. The syntax is [ R ] where R is a regular expression, which can be:

  • A type or a pattern, which corresponds to a single element in the sequence (in particular, [ _ ] represents sequences of length 1, not arbitrary sequences).
  • A juxtaposition of regular expressions R1 R2 which represents concatenation.
  • A union of regular expressions R1|R2.
  • A postfix repetition operator; the greedy operators are R?, R+, R*, and the ungreedy operators are: R??, R+?, R*?. For types, there is no distinction in semantics between greedy and ungreedy.
  • A sequence capture variable x::R (only for patterns, of course). The semantics is to capture in x the subsequence matched by R. The same sequence capture variable can appear several times inside a regular expression, including under repetition operators; in that case, all the corresponding subsequences are concatenated together. Two instances of the same sequence capture variable cannot be nested, as in [x :: (1 x :: Int)].
    Note the difference between [ x::Int ] and [ (x & Int) ]. Both accept sequences made of a single integer, but the first one binds x to a sequence (of a single integer), whereas the second one binds it to the integer itself.
  • Grouping (R). E.g.: [ x::(Int Int) y ].
  • Tail predicate /p. The type/pattern p applies to the current tail of the sequence (the subsequence starting at the current position). E.g.: [ (Int /(x:=1) | /(x:=2)) _* ] will bind x to 1 if the sequence starts with an integer and 2 otherwise.
  • Repetition R ** n where n is a positive integer constant, which is just a shorthand for the concatenation of n copies of R.

Sequence types and patterns also accept the [ ...; ... ] notation. This is a convenient way to discard the tail of a sequence in a pattern, e.g.: [ x::Int* ; _ ], which is equivalent to [ x::Int* _* ].
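A few regular expression patterns applied to literal sequences, as a sketch (the variable names are hypothetical):

```cduce
(* x is bound to the longest integer prefix; the tail is discarded. *)
let prefix = match [ 1 2 'a' 3 ] with [ x::Int* ; _ ] -> x    (* [ 1 2 ] *)

(* Two sequence captures splitting the input in two parts. *)
let parts = match [ 1 2 'a' 'b' ] with [ x::Int* y::Char* ] -> (x, y)

(* A sequence capture binds a one-element sequence, whereas an ordinary
   capture variable binds the element itself. *)
let as_seq  = match [ 7 ] with [ x::Int ] -> x        (* [ 7 ] *)
let as_item = match [ 7 ] with [ (x & Int) ] -> x     (* 7 *)
```

In the second match, x is bound to [ 1 2 ] and y to the string "ab".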

It is possible to use the @ operator (sequence concatenation) on types, including in recursive definitions. E.g.:

type t = [ <a>(t @ t) ? ]    (* [s?] where s=<a>[ s? s? ] *)

type x = [ Int* ]
type y = x @ [ Char* ]       (* [ Int* Char* ] *)

type t = ([Int] @ t) | []    (* [ Int* ] *)

However, when used in recursive definitions, @ must be right-linear; for instance, the following definitions are not allowed:

type t = (t @ [Int]) | []      (* ERROR: Ill-formed concatenation loop *)
type t = t @ t               (* ERROR: Ill-formed concatenation loop *)

Strings

In CDuce, character strings are nothing but sequences of characters. The type String is pre-defined as [ Char* ]. This makes it possible to use the full power of regular expression pattern matching with strings.

Inside a regular expression type or pattern, it is possible to use PCDATA instead of Char* (note that neither is a type on its own; they only make sense inside square brackets, contrary to String).

The type Latin1 is the subtype of String defined as [ Byte* ]; it denotes strings that can be represented in the ISO-8859-1 encoding, that is, strings made only of characters from the Latin1 character set.

Several consecutive character literals in a sequence can be merged together between two single quotes: [ 'abc' ] instead of [ 'a' 'b' 'c' ]. It is also possible to avoid square brackets by using double quotes: "abc". The same escaping rules apply inside double quotes, except that single quotes may be escaped (but need not be), and double quotes must be.
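Since strings are sequences of characters, sequence operators and regular expression patterns apply to them directly. The following is a sketch with hypothetical variable names and sample strings:

```cduce
let a = "abc"
let b = [ 'abc' ]              (* same value as a *)
let c = [ 'a' 'b' 'c' ]        (* same value as a *)
let hello = "Hello, " @ "world"

(* Split a string at its first '=' character. *)
let kv = match "key=value" with
  [ k::(Char \ '=')* '=' v::Char* ] -> (k, v)    (* ("key","value") *)
```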

Records

Records are finite sets of (name,value) bindings. They are used in particular to represent XML attribute sets. Names are actually Qualified Names (see XML Namespaces).

The syntax of a record expression is { l1=e1; ...; ln=en } where the li are label names (same lexical conventions as for identifiers), and the ei are expressions. When an expression ei is simply a variable whose name matches the field label li, it is possible to omit it. E.g.: { x; y = 10; z } is equivalent to { x = x; y = 10; z = z }. The semicolons between fields are optional.

There are two kinds of record types. Open record types are written { l1=t1; ...; ln=tn; .. }, and closed record types are written { l1 = t1; ...; ln = tn }. Both denote all the record values where the labels li are present and the associated values are in the corresponding type. The distinction is that open types allow extra fields, whereas closed types give a strict enumeration of the possible fields. The semicolons between fields are optional.

Additionally, for both open and closed record types, it is possible to specify optional fields by using =? instead of = between a label and a type. For instance, { x =? Int; y = Bool } represents records with a y field of type Bool, an optional field x (which, when present, has type Int), and no other field.

The syntax is the same for patterns. Note that capture variables cannot appear in an optional field. A common idiom is to bind default values to replace missing optional fields: ({ x = a } | (a := 1)) & { y = b }. A special syntax makes this idiom more convenient: { x = a else (a:=1); y = b }.

As for record expressions, when the pattern is simply a capture variable whose name matches the field label, it is possible to omit it. E.g.: { x; y = b; z } is equivalent to { x = x; y = b; z = z }.
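The record patterns described above might be used as in this sketch (the record contents and variable names are hypothetical):

```cduce
let r = { x = 1; y = 2 }
let sum = match r with { x = a; y = b } -> a + b            (* 3 *)

(* The x field is missing here, so a takes the default value 1. *)
let with_default =
  match { y = 2 } with { x = a else (a := 1); y = b } -> a + b   (* 3 *)

(* Omitted sub-patterns: { x; y } stands for { x = x; y = y }. *)
let punned = match r with { x; y } -> (x, y)                (* (1,2) *)
```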

The + operator (record concatenation, with priority given to the right argument in case of overlapping) is available on record types and patterns. This operator can be used to make a closed record type/pattern open, or to add fields:

type t = { a=Int b=Char }
type s = t + {..}               (* { a=Int b=Char .. } *)
type u = s + { c=Float }        (* { a=Int b=Char c=Float .. } *)
type v = t + { c=Float }        (* { a=Int b=Char c=Float } *)

XML elements

In CDuce, the general form of an XML element is <(tag) (attr)>content where tag, attr and content are three expressions. Usually, tag is a tag literal `xxx, and in this case, instead of writing <(`tag)>, you can write: <tag>. Similarly, when attr is a record literal, you can omit the surrounding ({...}), and also the semicolons between attributes, e.g.: <a href="http://..." dir="ltr">[].

The syntax for XML elements types and patterns follows closely the syntax for expressions: <(tag) (attr)>content where tag, attr and content are three types or patterns. As for expressions, it is possible to simplify the notations for tags and attributes. For instance, <(`a) ({ href=String })>[] can be written: <a href=String>[].

The following sample shows several ways to write XML types.

type A = <a x=String y=String ..>[ A* ]
type B = <(`x | `y) ..>[ ]
type C = <c x = String; y = String>[ ]
type U = { x = String y =? String ..}
type V = [ W* ]
type W = <v (U)>V
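As a sketch of how an XML type, an XML expression and XML patterns fit together, consider the following. The type name Link, the variable names and the URL are all placeholders, not part of the original samples.

```cduce
type Link = <a href=String>[ PCDATA ]

(* The URL is a placeholder; (e : t) checks e against the type t. *)
let l = (<a href="http://example.org/">"home" : Link)

let target = match l with <a href=h>_ -> h     (* the href string *)
let text   = match l with <_ ..>t -> t         (* the content, "home" *)
```

The second pattern uses _ for the tag and an open attribute record, so it would accept any XML element.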

Functions

CDuce is a higher-order functional language: functions are first-class values, and can be passed as arguments, returned as results, stored in data structures, etc.

A functional type has the form t -> s where t and s are types. Intuitively, this type corresponds to functions that accept (at least) any argument of type t, and for such an argument, return a value of type s. For instance, the type ((Int,Int) -> Int) & ((Char,Char) -> Char) denotes functions that map any pair of integers to an integer, and any pair of characters to a character.

The explanation above gives the intuition behind the interpretation of functional types. It is sufficient to understand which subtyping relations and equivalences hold between boolean combinations of functional types. For instance, (Int -> Int) & (Char -> Char) is a subtype of (Int|Char) -> (Int|Char) because, with the intuition above, a function of the first type, when given a value of type Int|Char, returns a value of type Int or of type Char (depending on the argument).

Formally, the type t -> s denotes CDuce abstractions fun (t1 -> s1; ...; tn -> sn)... such that (t1 -> s1) & ... & (tn -> sn) is a subtype of t -> s.

Functional types have no counterpart in patterns.
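To make the intersection of arrow types concrete, here is a sketch of a function whose interface lists two arrows. The name next is hypothetical, and the definition syntax is assumed from the CDuce tutorial.

```cduce
(* Hypothetical function of type (Int -> Int) & (Char -> Char). *)
let next (Int -> Int; Char -> Char)
    x & Int -> x + 1
  | c -> c

let n = next 41        (* 42, of type Int *)
let c = next 'z'       (* 'z', of type Char *)
```

By the subtyping relation discussed above, next can also be used where a function of type (Int|Char) -> (Int|Char) is expected.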

References

References are mutable memory cells. CDuce has no built-in reference type. Instead, references are implemented in an object-oriented way. The type ref T denotes references to values of type T. It is only syntactic sugar for the type { get = [] -> T ; set = T -> [] }.
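A minimal sketch of reference usage follows. It assumes the constructions ref T e (creation), !e (dereference) and e1 := e2 (assignment), which are described in the expressions part of the CDuce manual and not on this page; the variable names are hypothetical.

```cduce
let counter = ref Int 0                  (* a cell holding an Int *)
let _ = counter := (!counter) + 1        (* read, increment, write back *)
let current = !counter                   (* 1 *)
```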

OCaml abstract types

The notation !t is used by the CDuce/OCaml interface to denote the OCaml abstract type t.

Complete syntax

Below we give the complete syntax of types and patterns, the former being patterns without capture variables.

TO BE DONE

[1] You should be careful when putting parentheses around a type of the form *--i. Indeed, (*--i) would be parsed as a comment. You have to put whitespace after the left parenthesis.