Compare Data

This task will read a series of csv data files and scan them for relationships between one another.

OPTIONS

OPTION	DESCRIPTION
CSV Files	The files containing the schema and data for the tables we wish to compare.

Paths to 2 or more CSV files must be provided.

No special requirements.

A table containing the following columns will be produced.

COLUMN	DESCRIPTION
DATASOURCE1	The name of the first datasource or table to be compared.
DATASOURCE2	The name of the second datasource or table to be compared.
COLUMN1	The name of the column within DATASOURCE1 which is being compared.
COLUMN2	The name of the column within DATASOURCE2 which is being compared.
UNIQUE_MATCHES	The number of unique matches between DATASOURCE1 COLUMN1 and DATASOURCE2 COLUMN2.
MATCH_PERCENT	The percentage of matches between DATASOURCE1 COLUMN1 and DATASOURCE2 COLUMN2.
UNIQUE_TRIMMED_MATCHES	Prior to the comparison, Dex will trim both columns being compared. This is the number of unique matches between DATASOURCE1 COLUMN1 and DATASOURCE2 COLUMN2 after having performed that transformation.
TRIMMED_MATCH_PERCENT	The percentage of matches between DATASOURCE1 COLUMN1 and DATASOURCE2 COLUMN2 after having first trimmed both columns.

Given the schema described below, this example illustrates how we use this task to derive visual insight.

ID	NAME	AGE	GENDER
1	FRAN	32	FEMALE
2	JAN	22	FEMALE
3	STAN	31	MALE
4	DAN	28	MALE

ITEM	CUSTOMER	ITEM	QUANTITY
1	3	2	4
2	2	1	1
3	1	2	2
4	4	3	1
5	1	4	2
6	3	3	2
7	2	4	1

The raw output from Compare Data looks like:

A visualization of running Compare Data on these tables expressed in a Parallel Coordinates chart looks like:

In this visualization we have highlighted all relationships which have a 100% coorelation. We see the following:

RELATIONSHIP	COMMENT
INVENTORY.ITEM and CUSTOMER.ID	While there is a 100% coorelation between these two columns, it is purely accidental as the sequences happen to align. The lesson here is that sometimes coincidental relationships form naturally within a dataset. Do not trust the comparison data blindly.
INVENTORY.ITEM and ORDERS.ITEM	This appears to be a legitimate relationship.
CUSTOMER.ID and ORDERS.CUSTOMER	This appears to be a legitmate relationship.
CUSTOMER.ID and ORDERS.ITEM	Coincidental relationship.
CUSTOMER.ID and INVENTORY.ITEM	Coincidental relationship.