Specification of Grouping Preference queries

Specifying the grouping attributes

Grouping on distinct values of attributes A, B, ... is specified with "grouping A, B, ..." completely analogous to the "group by" clause in SQL. Grouping on equivalence classes is described with the construct "SV", short for "Substitutable Values". This is because the Grouping-SV functionality behaves similar to the SV relations on preferences. The equivalence classes forming the groups are from now on called (grouping) SV classes . The syntactic schema of an SV classes specification is as follows (where the [...] braces indicate optional elements):

GROUPING 
A SV ((A_1, A_2, ...) [AS 'class1_a'] ,
(A_3, A_4, ...) [AS 'class2_a'] , ...,
[OTHERS [AS 'others_class_a'] ] ) [AS sv_a],
B SV ((B_1, B_2, ...) [AS 'class1_b'] , ...,
[OTHERS [AS 'others_class_b'] ] ) [AS sv_b], ...

Thereby A_i and B_i are domain values according to the attributes A and B. The domain values (A_1, A_2, ...) are considered equal w.r.t. the given SV relation for attribute A. The names "class..." are aliases for the SV classes which occur in the projection. If not specified, (A_1, A_2, ...) is the default name for an SV class with attributes A_1, A_2, .... The keyword "others" puts all elements which were not be mentioned before, in one "default" SV class. If "others" is not given, all tuples which are not in one of the given SV classes form a SV class of its own.

The name of an SV-attribute (e.g. "sv_a") is also optional, but should be given to reference the SV grouping attributes in the projection. Note that the original attributes A, B, ... stay unchanged; they can still be referenced in the projection.

The entire grouped preference query

The schema of an entire Preference SQL query containing the grouping construct is depicted below, where agg(...) symbolizes an aggregation function, which is optional in the projection.

SELECT 	 	 <projection, aggregation> 
FROM <table_reference>
WHERE <hard_conditions>
PREFERRING <soft_conditions>
GROUPING <attribute_list>
TOP <number>
BUT ONLY <but_only_condition>
HAVING <aggregating_hard_conditions>
ORDER BY <attribute_list>

Such a query is evaluated as follows:

  • If no aggregation function (such as sum(...), count(...), etc.) is given, this query is evaluated as a grouped preference query, i.e., for every group the BMO-Set w.r.t. the preference is calculated. The additional specifier "top k" selects the k-best tuples according to the preference order per group, i.e., k*g tuples are maximally returned if g groups exist. With the "but only" construct it is possible to specify a hard post-selection.
  • If an aggregation function occurs in the projection the following process is done: At first, the groups are calculated as above. Afterwards the having condition is evaluated, i.e., some groups might be excluded. Finally the aggregation function of the projection is applied; hence only one line is returned per group in the result set.

Note that top, but only, having and the use of aggregation is optional. Even the "preferring" keyword is optional, which allows us to use the Grouping-SV syntax for conventional aggregations. Thus we are able to simplify "group by" queries with the lengthy "case ... then ... end" statements. These can be mostly be replaced by our Grouping-SV-syntax which we will illustrate in the following Example.

The order of execution is as follows, wherein all those steps are skipped which are not specified in the query:

  1. The result of the WHERE-clause is partitioned according to the grouping attributes or their SV classes, respectively.
  2. If a preference is given, the BMO-set is calculated separately for every group.
  3. The additional TOP-k specifier selects the k best tuples according to the preference order per group, i.e., k * g tuples are maximal returned if g groups exist.
  4. With the BUT ONLY construct it is possible to specify a hard selection on the BMO set of all groups.
  5. The HAVING condition is evaluated, i.e., those groups are excluded in which HAVING evaluated to false.
  6. If an aggregation function occurs in the projection, it is applied to any group separately; hence only one line is returned per group in the result set.

Example

Assume the following sample dataset with some cars:

idmakepowerprice
1 Smart 60 15000
2 Mercedes 200 38000
3 Audi 180 29000
4 VW 110 25000
5 Bugatti 1000 500000

We give an example to calculate the average price of the two cheapest cars of every car manufacturer. We assume that only an attribute "make" is given for every tuple in the relation, hence we have to retrieve the manufacturer by an appropriate SV grouping on "make".

SELECT manufacturer, avg(price) 
FROM car
PREFERRING price LOWEST
TOP 2
GROUPING make SV ( ('Mercedes', 'Smart', 'AMG') AS 'Daimler', ('VW', 'Audi', 'Bugatti') AS 'Volkswagen' ) AS manufacturer

The result of this query on the given sample dataset is:

manufactureravg (price)
Daimler26500
Volkswagen27000

Note that the Bugatti is not in the Top-2 set of the manufacturer "Volkswagen", hence not included in the average price calculation.

If we omit the preference, i.e., just consider the aggregating query:

SELECT manufacturer, avg(price) 
FROM car
GROUPING make SV ( ('Mercedes', 'Smart', 'AMG') AS 'Daimler', ('VW', 'Audi', 'Bugatti') AS 'Volkswagen' ) AS manufacturer

we could express this in standard SQL in a more lengthy way:

SELECT manufacturer, avg(price) 
FROM (
SELECT (CASE when make IN ('Mercedes', 'Smart', 'AMG') then 'Daimler' when make IN ('VW', 'Audi', 'Bugatti') then 'Volkswagen' ELSE make end) AS manufacturer, price
FROM car) tmp_car
GROUP BY manufacturer

Hence the Grouping-construct allows us to denote group-by-case statements in a more concise way and without subqueries.