Finding the Most Typical Record in a Group
I recently came across the following question: how can I find the most typical record in a group or cluster of records? For example, given a set of customer records, which customer best typifies the group or cluster? The answer to this question can be used for characterizing groups of records of all types. For example, it can be used for characterizing multimedia collections (e.g., text documents or images).
A general strategy to answer this question is:
- Prepare the data (usually required for distance measure computations)
- Compute the centroid (see definition below) for the group of records
- Calculate a distance or proximity measure between the records and the centroid
- Select the record with the smallest distance to the centroid
The strategy can be used in both of the scenarios covered in the examples below:
- Groups are created based on user-specified criteria
- Groups are automatically generated using clustering
A centroid is one of the pieces of information used in clustering models to describe a cluster. It can be seen as a prototype, a typical value for the records in a cluster or group. For example, in a group of people we can ask what the typical height, weight, and gender are. A reasonable answer would be the average height, the average weight, and the most frequent gender in the group. Taken together, this answer could be interpreted as a typical member of the group or the prototype of the group. It may be the case that nobody in the group matches the description though. This prototype is what we call the centroid. For a numerical attribute, like weight, the centroid is the average (mean) value in the cluster. For a categorical attribute, like gender, the centroid is the most frequent value, also known as the mode of the attribute in the cluster. Another way of describing the centroid is that it is the expected value for each attribute.
Because the centroid is an abstraction, there might not be any member of the group that actually looks like the centroid. Instead of using the centroid as a prototype or representative for the group, in many cases it is preferable to use the closest record to the centroid. This is also called the medoid. In the example above, this can be seen as the most typical or average person in the group. To make the point more concrete, suppose we have a set of documents. The centroid of a group of documents is not a document that we can read. If we use the centroid to describe what a typical document in the group looks like, we cannot readily grasp the types of documents in the group. We can find out the most frequent terms used in the documents in the group, but that is not as instructive as a representative text. If we use the medoid instead, it is very clear what a typical document looks like. The same idea can be applied to collections of images and groups of records (e.g., customer demographics).
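The centroid/medoid distinction can be sketched in a few lines of Python. This is only an illustration with made-up records; the data, the simple matching penalty for the categorical attribute, and the helper names are mine, not part of the post:

```python
from statistics import mean, mode

# Hypothetical group of records: (sex, age in months, height, weight)
group = [
    ("m", 144, 59.5, 88.0),
    ("f", 150, 60.0, 95.0),
    ("m", 140, 58.0, 85.0),
    ("f", 160, 62.0, 100.0),
    ("m", 148, 59.0, 90.0),
]

# Centroid: mode for the categorical attribute, mean for the numeric ones.
centroid = (
    mode(r[0] for r in group),
    mean(r[1] for r in group),
    mean(r[2] for r in group),
    mean(r[3] for r in group),
)

# Medoid: the actual record closest to the centroid. A categorical
# mismatch counts as a fixed penalty of 1 (a simple matching distance).
def distance(rec, cen):
    cat = 0.0 if rec[0] == cen[0] else 1.0
    num = sum((a - b) ** 2 for a, b in zip(rec[1:], cen[1:]))
    return (cat + num) ** 0.5

medoid = min(group, key=lambda r: distance(r, centroid))
print("centroid:", centroid)  # an abstraction, not an actual record
print("medoid:  ", medoid)    # an actual record from the group
```

Note that the centroid here is not equal to any record in the group, while the medoid is always one of the original records.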
Sample Data
For the discussion below, I will use a data set from Lewis and Taylor (1967) containing the sex, age (months), height (inches), and weight (pounds) of school children. A table (kids) with these data can be created using a script found here.
It is often necessary to prepare (transform) the data before computing a distance metric. Common transformations include:
- Normalization of numeric attributes
- "Explosion" of categorical attributes
- Missing value replacement
Centroid Computation
We can use the SQL functions AVG and STATS_MODE to compute the centroid for a group of records. This is illustrated in the following query:
SELECT stats_mode(sex) sex, avg(age) age, avg(height) height,
       avg(weight) weight
FROM kids;

The query implements the centroid computation using AVG for numeric attributes and STATS_MODE for categorical ones. In this example the centroid is computed for the whole data set. To compute the centroid for different groups of records we need to add a GROUP BY clause. For example, the following query computes the centroids for "young" (age <= 150 months) and "old" (age > 150 months) children:

SELECT age_group, stats_mode(sex) sex, avg(age) age,
       avg(height) height, avg(weight) weight
FROM (SELECT (CASE WHEN age<=150 THEN 0 ELSE 1 END) age_group, a.*
      FROM kids a)
GROUP BY age_group;

When the categorical attributes are "exploded," instead of the mode, it is common to use the mean value of the exploded columns as an alternative representation for the centroid. This facilitates the computation of some distance metrics. I follow this approach in the queries below (see Technical Note II for details).
Distance to the Centroid Computation
There are many types of distance metrics. The Euclidean distance is probably the most used distance metric. It is defined as the square root of the sum of squared differences between the attribute values for two records. For our problem it becomes the square root of the sum of squared differences between the attribute values for a record in the group and those for the centroid of the group.
Here are the pieces of SQL that we need to add for each attribute based on the attribute type. To compute the distance between a row and a centroid we only need to add these pieces together. The computations provided below take advantage of some simplification tricks. These computations do not actually compute the full distance metric. They only compute the relevant pieces of the distance metric required for ranking how close a record is to the centroid. Factors that are common for all records are not included. The details are provided in Technical Note II.
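The key point is that the reduced score preserves the ranking: since the centroid norm is the same for every row, dropping it (and the final square root) cannot change which record is closest. This can be checked numerically with a small Python sketch (the vectors are made-up numbers, not from the kids data):

```python
# Verify that the reduced score r.r - 2*(r.c) ranks rows exactly like the
# full Euclidean distance to the centroid c: the dropped term ||c||^2 is
# the same for every row, and the square root is monotone, so neither can
# change the ordering.
rows = [
    [1.0, 0.2, -0.5],
    [0.1, -1.3, 0.7],
    [-0.4, 0.9, 0.3],
    [0.0, 0.0, 0.1],
]
centroid = [x / len(rows) for x in map(sum, zip(*rows))]

def full_distance(r, c):
    return sum((ri - ci) ** 2 for ri, ci in zip(r, c)) ** 0.5

def reduced_score(r, c):
    # The piece kept in the SQL building blocks below: r.r - 2*(r.c)
    return sum(ri * (ri - 2 * ci) for ri, ci in zip(r, c))

by_distance = sorted(rows, key=lambda r: full_distance(r, centroid))
by_score = sorted(rows, key=lambda r: reduced_score(r, centroid))
assert by_distance == by_score  # same ordering, hence the same medoid
print("medoid:", by_score[0])
```

Because only the ordering matters for selecting the medoid, the cheaper score is enough.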
For each numeric attribute add the following:

DECODE(r.num, NULL, STDDEV(r.num) OVER() *
       (STDDEV(r.num) OVER() - 2 * c.num),
       r.num * (r.num - 2 * c.num))

where num is the name of the attribute, r aliases the source with the rows, and c the source with the centroid values.

For each categorical attribute add the following:

DECODE(r.cat, NULL, 0,
       -2 * COUNT(*) OVER (PARTITION BY r.cat, r.rec_group)/
       COUNT(*) OVER(PARTITION BY r.rec_group))

where cat is the name of the attribute, rec_group is a column that identifies the different groups, r aliases the source with the rows, and c the source with the centroid values.
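These building blocks can be mirrored in Python to make their behavior concrete, including the missing-value branches. This is a sketch with made-up data; the helper names are mine, not from the post:

```python
# Python rendering of the SQL building blocks. Numeric attribute: if the
# value is missing, use stddev*(stddev - 2*c); otherwise v*(v - 2*c).
# Categorical attribute: 0 if missing; otherwise -2 times the relative
# frequency of the value in the group, mirroring the ratio of the two
# COUNT(*) OVER (...) window counts.
from statistics import pstdev

def numeric_piece(value, centroid_value, stddev):
    if value is None:
        return stddev * (stddev - 2 * centroid_value)
    return value * (value - 2 * centroid_value)

def categorical_piece(value, group_values):
    if value is None:
        return 0.0
    freq = group_values.count(value) / len(group_values)
    return -2.0 * freq

# Example: score four rows of (age, sex), some with missing values.
ages = [140.0, 150.0, None, 160.0]
sexes = ["m", "f", "m", None]
known_ages = [a for a in ages if a is not None]
age_sd = pstdev(known_ages)
age_c = sum(known_ages) / len(known_ages)

scores = [
    numeric_piece(a, age_c, age_sd) + categorical_piece(s, sexes)
    for a, s in zip(ages, sexes)
]
print(scores)
```

The record with the smallest score is the one closest to the centroid.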
For example, for the age, weight, and sex attributes in the kids table we would compute the relevant piece of the Euclidean distance metric as:
DECODE(r.age, NULL,
       STDDEV(r.age) OVER() *
       (STDDEV(r.age) OVER() - 2 * c.age),
       r.age * (r.age - 2 * c.age)) +
DECODE(r.weight, NULL,
       STDDEV(r.weight) OVER() *
       (STDDEV(r.weight) OVER() - 2 * c.weight),
       r.weight * (r.weight - 2 * c.weight)) +
DECODE(r.sex, NULL, 0,
       -2 * COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/
       COUNT(*) OVER(PARTITION BY r.rec_group))

The next two sections illustrate how to put it all together with two examples.
Example 1 - Typical Records For User-Defined Partitions
In this example the goal is to find the most typical record for sets of records grouped based on user-defined criteria. More specifically, we want to find the most typical record for "young" (age <= 150 months) and "old" (age > 150 months) children in the kids table. This can be answered with the query below. I broke the query into subqueries that map to the four steps of the approach outlined above. The kids_n subquery prepares the data: it normalizes the numeric attributes using z-scores. The ce subquery computes the centroid for each partition; in this example there is a partition for each unique rec_group value. The dis subquery computes the Euclidean distance from each record in a partition to the centroid of the partition, constructing the distance computation from the building blocks described above. The med subquery computes the minimum distance to the centroid and selects, for each partition, the record with the smallest distance to the partition's centroid. The final query joins the results with the original data to obtain a description of the record without the normalization.

WITH
kids_n AS
(SELECT (CASE WHEN age<=150 THEN 0 ELSE 1 END) rec_group,
        cid, sex,
        (age-AVG(age) OVER())/STDDEV(age) OVER () age,
        (height-AVG(height) OVER())/STDDEV(height) OVER() height,
        (weight-AVG(weight) OVER())/STDDEV(weight) OVER() weight
 FROM kids),
ce AS
(SELECT rec_group,
STATS_MODE(sex) sex, AVG(age) age, AVG(height) height,
AVG(weight) weight
FROM kids_n
GROUP BY rec_group),
dis AS
(SELECT r.rec_group,
r.cid,
DECODE(r.age, NULL,
STDDEV(r.age) OVER() *
(STDDEV(r.age) OVER() - 2 * c.age),
r.age * (r.age - 2 * c.age)) +
DECODE(r.weight, NULL,
STDDEV(r.weight) OVER() *
(STDDEV(r.weight) OVER() - 2 * c.weight),
r.weight * (r.weight - 2 * c.weight)) +
DECODE(r.height, NULL,
STDDEV(r.height) OVER() *
(STDDEV(r.height) OVER() - 2 * c.height),
r.height * (r.height - 2 * c.height)) +
DECODE(r.sex, NULL, 0, -2 *
COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/
COUNT(*) OVER(PARTITION BY r.rec_group)) dist
FROM kids_n r, ce c
WHERE r.rec_group = c.rec_group),
med AS
(SELECT *
FROM (SELECT d.*, min(dist) OVER(PARTITION BY rec_group) mdist
FROM dis d)
WHERE dist = mdist)
SELECT m.rec_group, k.*
FROM med m, kids k
WHERE m.cid = k.cid
ORDER BY m.rec_group, m.dist;
The above query returns the following results:
REC_GROUP      CID SEX    AGE     HEIGHT     WEIGHT
--------- -------- --- ------ ---------- ----------
        0       40 m      144       59.5         88
        1       11 f      173       62.8      102.5

Compare these numbers with those for the group centroid computed on the non-normalized data (results for the explosion of the sex column are also shown below):

SELECT rec_group,
       STATS_MODE(sex) sex, AVG(f) f, AVG(m) m,
       AVG(age) age, AVG(height) height,
       AVG(weight) weight
FROM (SELECT (CASE WHEN age<=150 THEN 0 ELSE 1 END) rec_group,
             DECODE(sex,'f',1,0) f, DECODE(sex,'m',1,0) m, a.*
      FROM kids a)
GROUP BY rec_group;

REC_GROUP SEX      F      M      AGE     HEIGHT     WEIGHT
--------- --- ------ ------ -------- ---------- ----------
        0 m     0.35   0.65    144.8     58.875       88.9
        1 f    0.525  0.475  173.575     63.498    107.225

Example 2 - Typical Record For a Cluster
Let's assume that the data set from the previous example was clustered into three clusters using the kids_cl model (Technical Note III shows how to do it) and we want to identify the most typical record for each cluster. Can we use the clustering model to identify the closest record to a cluster centroid? Yes and no. We can use clustering scoring to partition the records into groups. However, we then need to apply the approach described above. Unfortunately, we cannot use the cluster probability to compute the closest record. Those interested can read Technical Note IV below for the details.
The following query returns the most typical record for each cluster. It is basically the same query as in the previous example; the only change is the definition of rec_group. In this example the rec_group column holds the cluster ID for the record, computed with the SQL data mining function CLUSTER_ID. The kids_n subquery now has two queries. The innermost query, as before, normalizes the data. The outer query adds the rec_group column. This is necessary because the clustering model kids_cl was built with normalized data; in order to score the model we also need to feed it normalized data, which the inner query provides.

WITH
kids_n AS
(SELECT CLUSTER_ID(kids_cl USING a.*) rec_group, a.*
FROM (SELECT cid, sex,
(age-AVG(age) OVER())/STDDEV(age) OVER () age,
(height-AVG(height) OVER())/STDDEV(height) OVER() height,
(weight-AVG(weight) OVER())/STDDEV(weight) OVER() weight
FROM kids) a),
ce AS
(SELECT rec_group,
STATS_MODE(sex) sex, AVG(age) age, AVG(height) height,
AVG(weight) weight
FROM kids_n
GROUP BY rec_group),
dis AS
(SELECT r.rec_group,
r.cid,
DECODE(r.age, NULL,
STDDEV(r.age) OVER() *
(STDDEV(r.age) OVER() - 2 * c.age),
r.age * (r.age - 2 * c.age)) +
DECODE(r.weight, NULL,
STDDEV(r.weight) OVER() *
(STDDEV(r.weight) OVER() - 2 * c.weight),
r.weight * (r.weight - 2 * c.weight)) +
DECODE(r.height, NULL,
STDDEV(r.height) OVER() *
(STDDEV(r.height) OVER() - 2 * c.height),
r.height * (r.height - 2 * c.height)) +
DECODE(r.sex, NULL, 0, -2 *
COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/
COUNT(*) OVER(PARTITION BY r.rec_group)) dist
FROM kids_n r, ce c
WHERE r.rec_group = c.rec_group),
med AS
(SELECT *
FROM (SELECT d.*, min(dist) OVER(PARTITION BY rec_group) mdist
FROM dis d)
WHERE dist = mdist)
SELECT m.rec_group, k.*
FROM med m, kids k
WHERE m.cid = k.cid
ORDER BY m.rec_group, m.dist;
The above query returns the following results:
REC_GROUP      CID SEX    AGE     HEIGHT     WEIGHT
--------- -------- --- ------ ---------- ----------
        3       60 m      151       59.3         87
        4       37 m      193       66.3        133
        5       10 f      169       62.3       99.5

From these results, cids 60, 37, and 10 can be used as typical cases to represent clusters 3, 4, and 5, respectively. Also, cluster 3 contains younger children, cluster 4 somewhat older children, and cluster 5 older ones.
Final Comments
The approach described here is very general and flexible. It can be implemented efficiently in a single query, and it can be used with groups defined by the user or automatically generated by a clustering model. However, the distance metric computations need to be tailored to the data. To address this, in a follow-up post I will provide the code for a procedure that wraps these steps in an easy-to-use interface.
Technical Note I - Data Preparation
It is usually a good idea to normalize numeric attributes before computing distance metrics, that is, to make each numeric attribute have the same range. This prevents a single attribute from dominating the distance metric. Suppose all but one of the numeric attributes in a data set have a range between 0 and 1, and the remaining attribute INCOME has a range between 10,000 and 100,000. In this example, the distance computation between two records will be completely dominated by the differences in the INCOME attribute. This is really an artifact of the scale in which the attribute is measured. One way to address it is to create a new attribute INCOME1 as INCOME/100,000. This is the idea behind normalization of numeric attributes.
A common normalization approach is called z-score normalization. In this case, we create new attributes by subtracting the mean from the original attributes and dividing by the standard deviation. The following query implements z-score normalization for the above kids table:
SELECT cid, sex,
       (age-AVG(age) OVER())/STDDEV(age) OVER () age,
       (height-AVG(height) OVER())/STDDEV(height) OVER() height,
       (weight-AVG(weight) OVER())/STDDEV(weight) OVER() weight
FROM kids;

Note that the categorical attribute SEX was not transformed.
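The same z-score computation can be sketched in Python with made-up height values (the sample standard deviation is used here, matching Oracle's STDDEV):

```python
# Z-score normalization: subtract the mean and divide by the standard
# deviation. Afterwards the attribute has mean 0 and standard deviation 1,
# so no attribute can dominate the distance on scale alone.
from statistics import mean, stdev

heights = [59.5, 60.0, 58.0, 62.0, 59.0]
mu, sigma = mean(heights), stdev(heights)
normalized = [(h - mu) / sigma for h in heights]
print(normalized)
```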
Because distance metrics are often defined for numeric attributes only, it is common to transform categorical attributes into a numeric representation. Usually this is done using the "explosion" transformation. In this case, we transform the categorical attribute into a set of binary numeric attributes, one new attribute for each distinct value of the categorical column. For example, transforming the above sex attribute would create two new columns M and F. M is 1 when sex='m' and 0 otherwise. F is 1 when sex='f' and 0 otherwise. The following query implements the explosion transformation for the sex column:
SELECT a.*,
       DECODE(sex,'m',1,0) M, DECODE(sex,'f',1,0) F
FROM kids a;

This approach is simple but labor intensive. It requires knowledge of the set of distinct values for each categorical column. The computations described in the examples above avoided this step by taking advantage of some simplification tricks (see Technical Note II).
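A generic version of the explosion transformation is easy to sketch in Python. This is an illustration only; the function name and data are mine:

```python
# "Explosion" (one-hot encoding): one binary column per distinct value of
# the categorical attribute.
def explode(values):
    distinct = sorted(set(values))
    return [{d: (1 if v == d else 0) for d in distinct} for v in values]

rows = ["m", "f", "m", "m"]
exploded = explode(rows)

# The mean of an exploded column is the fraction of records with that
# value -- here the "m" column averages to 0.75, i.e. 75% of the rows.
m_mean = sum(r["m"] for r in exploded) / len(exploded)
print(exploded, m_mean)
```

This also shows why the mean of the exploded columns is a natural centroid representation: each component is the relative frequency of a category in the group.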
Finally, the presence of missing values should also be taken into account when computing distance metrics. A simple approach is to replace missing numeric attributes by the mean and categorical ones by the mode. However, this is not desirable for the problem of finding the closest record to the centroid of a group of records. As discussed above, the centroid is computed as the mean and the mode of the attributes. Replacing missing values with the mean and the mode would make records with missing values look closer to the centroid. In the extreme case, a record with only missing values would be exactly like the centroid and would be selected as the typical record for the group. Not a very desirable result. An alternative approach to missing value treatment that works for our problem is as follows:
- Numeric attributes: replace missing values with the standard deviation for the group
- Categorical attributes: explode the attribute (missing values are mapped to 0)
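The danger of mean imputation for this problem can be seen with a quick Python check (made-up numbers):

```python
# A record with only missing values, imputed with the group means, lands
# exactly on the centroid (distance 0) and would "win" as the most
# typical record. Replacing missing numeric values with the standard
# deviation instead keeps such a record away from the centroid.
from statistics import mean, stdev

ages = [140.0, 150.0, 160.0, 145.0]
weights = [85.0, 95.0, 100.0, 88.0]
centroid = (mean(ages), mean(weights))

def dist(rec):
    return sum((v - c) ** 2 for v, c in zip(rec, centroid)) ** 0.5

all_missing_mean = (mean(ages), mean(weights))   # mean imputation
all_missing_sd = (stdev(ages), stdev(weights))   # stddev replacement

print(dist(all_missing_mean))  # exactly on the centroid
print(dist(all_missing_sd))    # stays away from the centroid
```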
The following query implements missing value treatment for the kids table:
SELECT r.cid,
       DECODE(r.age, NULL, STDDEV(r.age) OVER(), r.age) age,
       DECODE(r.height, NULL, STDDEV(r.height) OVER(), r.height) height,
       DECODE(r.weight, NULL, STDDEV(r.weight) OVER(), r.weight) weight,
       DECODE(r.sex,'m',1,0) M, DECODE(r.sex,'f',1,0) F
FROM kids r;

The queries in the two examples in the post already implement normalization of numeric attributes, explosion of categorical attributes, and missing value replacement.
Technical Note II - Euclidean Distance Metric Computation
Equation 1 gives an alternative expression for the Euclidean distance (dE) between a row (r) and a centroid (c):

dE(r, c) = sqrt( ||r||^2 - 2 (r . c) + ||c||^2 )    (Equation 1)

Both the row and the centroid can be seen as vectors where each attribute is a component of the vector. Arbitrary unique identifiers (e.g., those generated from a sequence) are left out. Also, categorical attributes are represented after applying the explosion transformation. For example, the following record

     CID SEX    AGE     HEIGHT     WEIGHT
-------- --- ------ ---------- ----------
      40 m      144       59.5         88

is represented as the vector (1, 0, 144, 59.5, 88). A couple of things to notice:

- The cid column was left out.
- The first two components of the vector are the exploded representation of sex='m'. A record with sex='f' would be represented by (0, 1, ...).

Equation 2 gives the dot product between a row and the centroid:

r . c = SUM over i of (r_i * c_i)    (Equation 2)

For the centroid

     F      M      AGE     HEIGHT     WEIGHT
------ ------ -------- ---------- ----------
  0.35   0.65    144.8     58.875       88.9

the dot product r.c for the cid=40 record above is:

144 * 144.8 + 59.5 * 58.875 + 88 * 88.9 + 0.65

Equation 3 shows how to compute the squared norm of the vector r representing a given row:

||r||^2 = SUM over numeric i of r_i^2 + nc    (Equation 3)

where nc is the number of categorical attributes: after explosion, each non-missing categorical attribute contributes exactly one component equal to 1. Because the norm of the centroid (||c||) is constant when computing the Euclidean distance between a row and the centroid, we can ignore it in the distance computation. Equation 4 shows that (the square root is monotone, so dropping it does not change the ordering either):

rank by dE(r, c)  =  rank by ( ||r||^2 - 2 (r . c) )    (Equation 4)

Finally, using the results from equations 2 and 3 in equation 4, we obtain the approximation for the Euclidean distance in equation 5. The term nc from equation 3 was left out because it is constant for all rows:

score(r) = SUM over numeric i of r_i * (r_i - 2 * c_i)  -  2 * SUM over categorical j of freq_j(r)    (Equation 5)

where freq_j(r) is the fraction of records in the group that share the row's value for categorical attribute j (this is exactly the centroid component selected by the row's value).
For the kids table, according to equation 5, we would compute the relevant piece of the Euclidean distance metric as:

r.age * (r.age - 2 * c.age) +
r.weight * (r.weight - 2 * c.weight) +
r.height * (r.height - 2 * c.height) +
-2 * COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/
     COUNT(*) OVER(PARTITION BY r.rec_group)

Adding missing value treatment as described in Technical Note I, the approximation becomes:

DECODE(r.age, NULL,
       STDDEV(r.age) OVER() *
       (STDDEV(r.age) OVER() - 2 * c.age),
       r.age * (r.age - 2 * c.age)) +
DECODE(r.weight, NULL,
       STDDEV(r.weight) OVER() *
       (STDDEV(r.weight) OVER() - 2 * c.weight),
       r.weight * (r.weight - 2 * c.weight)) +
DECODE(r.height, NULL,
       STDDEV(r.height) OVER() *
       (STDDEV(r.height) OVER() - 2 * c.height),
       r.height * (r.height - 2 * c.height)) +
DECODE(r.sex, NULL, 0,
       -2 * COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/
       COUNT(*) OVER(PARTITION BY r.rec_group))

Technical Note III - Clustering Model Creation
The steps for creating a cluster model (kids_cl), using the data mining PL/SQL API, to cluster the data in the kids table into three clusters are as follows:
- Normalize the numeric attributes
- Create a settings table specifying the number of clusters
- Create the clustering model
1. Normalize the numeric attributes

BEGIN
  -- Create a normalization table
  dbms_data_mining_transform.create_norm_lin (
    norm_table_name => 'cl_num');
  -- Indicate which columns to normalize and the type of
  -- normalization to use
  dbms_data_mining_transform.insert_norm_lin_zscore (
    norm_table_name => 'cl_num',
    data_table_name => 'kids',
    exclude_list    => dbms_data_mining_transform.column_list (
                       'CID'),
    round_num       => 0
  );
  -- Create the transformed view. Here is where the actual
  -- normalization computation takes place
  dbms_data_mining_transform.xform_norm_lin (
    norm_table_name => 'cl_num',
    data_table_name => 'kids',
    xform_view_name => 'kids_norm');
END;
/

The above normalization step uses z-score normalization (subtract the mean and divide by the standard deviation) and excludes the cid column, as it is a unique identifier column. The kids_norm view implements the same computations as those in the kids_n subquery described above.
2. Create settings table
CREATE TABLE cl_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));

3. Create the clustering model

BEGIN
  -- Populate settings table
  INSERT INTO cl_settings VALUES
    (dbms_data_mining.clus_num_clusters, 3);
  COMMIT;
  -- Create model
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'KIDS_CL',
    mining_function     => DBMS_DATA_MINING.CLUSTERING,
    data_table_name     => 'KIDS_NORM',
    case_id_column_name => 'CID',
    settings_table_name => 'cl_settings');
END;
/

Technical Note IV - Clustering Model Probabilities
For a given record, Oracle Data Mining clustering models return the probability of the record belonging to a given cluster. However, this probability cannot be used to identify the closest record to the cluster centroid. The reason is that a clustering model returns the probability of the cluster given the data record (P(cluster | data)). This probability is similar to what we obtain from a classification model. From this perspective, clustering is treated as a classification problem where first the class labels (clusters) are identified and then the records are classified into clusters from a pre-defined set of clusters.
In order to compute the closest record to the centroid we would need P(data | cluster). This probability indicates how likely a given row is to belong to a cluster. In this case, there is no requirement that the data row belongs to any known cluster. Oracle Data Mining clustering models, however, do not return this probability.
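Bayes' rule relates the two probabilities but does not make them interchangeable. A toy numeric illustration in Python (all numbers made up) of why a record with the highest P(cluster | data) need not be the most likely record under the cluster:

```python
# P(cluster | data) = P(data | cluster) * P(cluster) / P(data)
p_cluster = 0.3                    # prior P(cluster)

# record A: likely under the cluster, but also likely overall
p_data_a = 0.20                    # P(data_A)
p_a_given_cluster = 0.25           # P(data_A | cluster)

# record B: less likely under the cluster in absolute terms, but rare
p_data_b = 0.02                    # P(data_B)
p_b_given_cluster = 0.05           # P(data_B | cluster)

post_a = p_a_given_cluster * p_cluster / p_data_a
post_b = p_b_given_cluster * p_cluster / p_data_b

# B scores the higher P(cluster | data), yet A is more likely under the
# cluster (higher P(data | cluster)) -- so the scoring probability cannot
# be used to find the record closest to the cluster centroid.
print(post_a, post_b)
```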
References
Lewis, T. and Taylor, L.R. (1967), Introduction to Experimental Ecology, New York: Academic Press, Inc.
When the categorical attributes are "exploded," instead of the mode, it is common to use the mean value of the exploded columns as an alternative representation for the centroid. This facilitates the computation of some distance metrics.
I'm not sure it is only about "facilitating" the computation of the distance ... the fact that STATS_MODE is not deterministic makes this a mandatory step. With your example, representing the SEX categories as numeric values, let us say 0 and 1, would probably work too in determining the centroid. A 0.5 value for the centroid is indeed OK ... since, as you point out, the centroid is an abstraction and not a fact. More, the centroid is used to compute "distances" to real records, which themselves have no meaning ... except as a comparison mechanism. A centroid value of 0.5 would effectively cancel SEX from the calculation of the overall centroid ... where a non-deterministic STATS_MODE, on ties, will tilt the scale of the overall centroid one way or another.
Now, I do see later on you talk about data preparation and, in fact, show the translation to 0 and 1 …
Because distance metrics are often defined for numeric attributes only, it is common to transform categorical attributes into a numeric representation
My view is that this is a mandatory step. Unless ... ??
Very interesting the material you cover. Thank you.
Posted by Anonymous | 8/01/2006 02:31:00 PM
What I tried to say was that explosion is a good strategy to follow when working with some distance functions (e.g., Euclidean). However, not all distance functions require categorical attributes to be transformed into numeric attributes. Also, explosion is not the only strategy one can use. As you pointed out, we can also recode the categorical attribute into k-1 numeric attributes, where k is the number of distinct values in the categorical attribute. In the case of SEX, we have two distinct values (M, F), so we can recode it using a single numeric column. This type of coding becomes harder to interpret when k > 2.
As an addendum, the centroid components referring to the numeric representation of a categorical attribute can be seen as a probability or a frequency. For example, let's say we have exploded SEX into two columns (M and F) and the centroid values for these columns are 0.8 and 0.2 respectively. This would mean that 80% of the records in the group were male and 20% female. In the same way, if we had recoded SEX so that male became 1 and female 0, then the attribute SEX would be represented as 0.8 in the centroid.
An example of a distance function that would not require recoding categorical attributes into numeric ones would be to compute the Euclidean distance for the numerical attributes and use a match function for categoricals. Let the centroid reflect the mode and have a value of 'M' for SEX. Then for rows where SEX='M' we would add 0 to the result of the Euclidean distance computed for the numeric attributes. For a row with SEX='F' we would add 1.
Posted by Marcos | 8/04/2006 11:20:00 AM
By the way, thanks for the comments Gabe. They raised some good points and questions. The alternative encoding of categoricals was something I debated if I should describe. I ended up deciding it added complexity to the post. Your comments offered the perfect place to discuss the topic.
Posted by Marcos | 8/04/2006 01:46:00 PM