In previous paper we were considering the duplicate rows problem resolving, caused by lack of a primary key. Now let consider more difficult case, when key is present but it is synthetic. When design is improper it can lead to duplicate rows appearing, from a subject area view point.
It is strange, but even if I am tell to my students about synthetic primary key disadvantages, they are still use ones in their first data base projects. Probably all people need to enumerate everything :)
I don't want to discuss here trite synthetic key problem. Just would like to tell if you decided to use synthetic key as a primary, you should create natural unique key to avoid situation described below.
So, let we have a table with primary key id and column name. In accordance with subject area restrictions the column name must contain unique value. However, if the table structure is determined as
CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY,
name VARCHAR(50));
a duplicate rows can appear. It will be better to use table design as
CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY,
name VARCHAR(50) UNIQUE);
Everybody know which way is better but sometimes we need to deal with inherited structure and data, which is violate subject area restrictions. For example:
id name 1 John 2 Smith 3 John 4 Smith 5 Smith 6 Tom
You can ask: What is the difference between this problem and previous one? Probably here we have more easy solution. We just need to delete all rows from each groups with same value of a name except rows with minimum/maximum value of an id. For example so:
DELETE
FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);
It is right, but I still have not told you everything. Let we have the derived table T_details associated with the table T_pk on the foreign key:
CREATE TABLE T_details (id_pk INT FOREIGN KEY REFERENCES
T_pk ON DELETE CASCADE,
color VARCHAR(10),
PRIMARY KEY (id_pk, color);
This table can contain data like this:
id_pk color 1 blue 1 red 2 green 2 red 3 red 4 blue 6 red
For better visibility let use query
SELECT id, name, color FROM T_pk JOIN T_details ON id= id_pk;
to see the names
id name color 1 John blue 1 John red 2 Smith green 2 Smith red 3 John red 4 Smith blue 6 Tom red
It is shown that one person data is erroneously belong to different parent entries. Furthermore, duplicate rows are in this table as well:
1 John red 3 John red
Similar data will lead to erroneously data analysis. Furthermore, a cascade deleting will lead to data loosing. For example, if we will left only minimum identifier rows in each groups of the T_pk table, we will lost the row
4 Smith blue
in the table T_details. Consequently, during duplicate entries deleting we should take into account both tables - T_pk and T_details.
The data cleaning can be done in two stages:
Query
SELECT id_pk, name, color
, RANK() OVER(PARTITION BY name, color ORDER BY name, color, id_pk) dup
,(SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) min_id
FROM T_pk X JOIN T_details ON id=id_pk;
determines the number of a duplicate rows (value of dup value is greater than 1) and the minimum value of an id in a equal-name groups (min_id). Here is the result of that query:
id_pk name color dup min_id 1 John blue 1 1 1 John red 1 1 3 John red 2 1 4 Smith blue 1 2 2 Smith green 1 2 2 Smith red 1 2 6 Tom red 1 6
Now we need to replace the value of a id_pk to min_pk for each row, except the third one, because it is a duplicate copy of the second row. The value of a dup=2 is indicate on that. The query for updating can be as:
UPDATE T_details
SET id_pk=min_id
FROM T_details T_d JOIN (
SELECT id_pk, name, color
, RANK() OVER(PARTITION BY name, color ORDER BY name, color, id_pk) dup
,(SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) min_id
FROM T_pk X JOIN T_details ON id=id_pk
) Y ON Y.id_pk=T_d.id_pk
WHERE dup =1;
Updated table T_details will be like this:
id_pk color 1 blue 1 red 2 blue 2 green 2 red 3 red 6 red
It is shown that only one duplicate row is left:
3 red
But it is no need to worry about that row because it will be deleted after duplicate rows cascade deleting from table T_pk:
DELETE
FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);
The last query is a second stage of a deleting procedure. The result of that query looks like this:
Table T_pk id name 1 John 2 Smith 6 Tom Table T_details id_pk color 1 blue 1 red 2 blue 2 green 2 red 6 red
Only need to apply restriction to avoid duplicates in future:
ALTER TABLE T_pk
ADD CONSTRAINT unique_name UNIQUE(name);
Home | SELECT exercises (rating stages) | DML exercises | Developers |