Data Mining Overview
Data mining is primarily concerned with identifying patterns in data. The data are assumed to be stored in data sets that resemble tables in a database (and very often are tables). Each row in the table is called an instance, and each instance has a number of attributes which describe it. A simple example is a data set of customer details, in which each instance has attributes such as name, address, phone number and so on.
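As a rough illustration of this vocabulary, the sketch below represents a tiny customer data set as a list of Python dictionaries, one per instance, with one key per attribute. The names, addresses and phone numbers are invented for the example.

```python
# Hypothetical customer data set: each dict is one instance (row),
# each key is an attribute (column). All values are made up.
customers = [
    {"name": "Ann Lee", "address": "12 High St", "phone": "555-0101"},
    {"name": "Bob Roy", "address": "7 Mill Rd", "phone": "555-0102"},
    {"name": "Cy Moss", "address": "3 Park Ave", "phone": "555-0103"},
]

# The attribute names are the keys shared by every instance.
attributes = list(customers[0].keys())
print(len(customers), "instances with attributes", attributes)

# A column view: the values of one attribute across all instances.
names = [customer["name"] for customer in customers]
print(names)
```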
There are two broad classes of data mining – supervised and unsupervised. In supervised data mining one attribute of an instance is treated as being determined by the other attributes. For example, in our customer data set we might have an attribute called ‘credit risk’, and this might be determined from demographics, payment history and so on. An attribute that is singled out in this way is often called a label and the data set is said to be labelled. The patterns in such data sets have to be established through a learning process: a training data set, in which the label is already known, is supplied so that the algorithms can find the patterns which determine the likely value of the nominated attribute.
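A minimal sketch of what a labelled data set looks like in practice, assuming made-up attribute names (`age`, `missed_payments`) and made-up values: the label is separated from the other attributes, and part of the data is kept aside as a training set.

```python
# Hypothetical labelled customer data; "credit_risk" is the label.
labelled = [
    {"age": 25, "missed_payments": 4, "credit_risk": "high"},
    {"age": 45, "missed_payments": 1, "credit_risk": "moderate"},
    {"age": 50, "missed_payments": 0, "credit_risk": "low"},
    {"age": 33, "missed_payments": 3, "credit_risk": "high"},
]
LABEL = "credit_risk"

# Separate the nominated attribute (the label) from the attributes
# that are assumed to determine it.
features = [{k: v for k, v in row.items() if k != LABEL} for row in labelled]
labels = [row[LABEL] for row in labelled]

# Use the first three instances for training and hold the last one back
# to check how well the learned patterns carry over to unseen data.
training_features, training_labels = features[:3], labels[:3]
held_out_features, held_out_labels = features[3:], labels[3:]

print(training_features)
print(training_labels)
```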
Unsupervised learning, on the other hand, simply presents the data mining algorithms with data sets and leaves it to the algorithms to determine any patterns that might exist within the data. In the customer data example, unsupervised learning might find a relationship between age and products purchased, using mechanisms such as clustering and association (to be discussed later).
In summary, we can say that supervised learning is largely concerned with predicting the value of a nominated attribute, while unsupervised learning is primarily concerned with finding patterns in data without any particular attribute being nominated as special.
Supervised learning can be broken down into two main types. Classification aims to place each instance within a data set into one of a number of categories. In the customer data example we might wish to categorise each customer as a high, moderate or low credit risk and be able to predict which category a new customer is likely to fit. Regression is the name given to supervised learning when the nominated attribute is a numerical value whose magnitude has meaning (someone’s age for example). Neural networks are often used in this type of application, but simpler statistical techniques can also be applied in some instances.
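The difference between the two types can be sketched with two deliberately simple techniques, chosen here only for illustration: a 1-nearest-neighbour classifier that predicts a category, and a least-squares straight line that predicts a number. The training values, the attribute choices (age, missed payments, annual spend) and the function names are all assumptions made for the example.

```python
import math

# Classification: hypothetical (age, missed_payments) -> credit risk category.
training = [
    ((25, 4), "high"),
    ((30, 3), "high"),
    ((45, 1), "moderate"),
    ((50, 0), "low"),
    ((60, 0), "low"),
]

def classify(instance):
    """1-nearest-neighbour: copy the label of the closest training instance."""
    nearest = min(training, key=lambda row: math.dist(row[0], instance))
    return nearest[1]

# Regression: hypothetical (age, annual_spend) pairs; the target is a number.
ages = [20, 30, 40, 50]
spend = [200.0, 300.0, 400.0, 500.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(ages, spend)

def predict_spend(age):
    """Predict a numerical value from the fitted line."""
    return intercept + slope * age

print(classify((28, 3)))   # a category
print(predict_spend(35))   # a number
```

The classifier can only ever answer with one of the categories seen in training, whereas the regression can return any value along the fitted line.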
Unsupervised learning can extract information which was not expected to exist, although many of the relationships that it uncovers might be accidents of the particular data set rather than real effects (see articles on over-fitting). Association rules are one example of an unsupervised learning technique, where correlations are established between attributes and attribute values. One of the best examples of this is basket analysis, where retail data is analysed to establish which products might be purchased together by customers. Another technique is clustering, where instances in a data set are grouped together based on the values of their attributes. For example, it might be found that customers with certain demographics tend to avoid certain products (or alternatively buy certain products).
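Both techniques can be sketched in a few lines. The first part computes the support and confidence of a candidate association rule over a handful of invented shopping baskets; the second groups invented customer ages with a bare-bones one-dimensional k-means. The data, the item names and the starting centroids are assumptions for illustration, and real tools use far more refined algorithms.

```python
# --- Association: basket analysis on hypothetical transactions ---
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)

print("support(bread, butter):", support({"bread", "butter"}))
print("confidence(bread -> butter):", confidence({"bread"}, {"butter"}))

# --- Clustering: one-dimensional k-means on hypothetical customer ages ---
def kmeans_1d(values, centroids, rounds=10):
    """Repeatedly assign each value to its nearest centroid, then move each
    centroid to the mean of the values assigned to it."""
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

ages = [22, 25, 27, 61, 64, 66]
centroids, clusters = kmeans_1d(ages, centroids=[20.0, 70.0])
print(clusters)
```

A rule with high confidence but very low support rests on only a few baskets, which is one way the over-fitting concern mentioned above shows up in practice.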
There are several types of variables that can be accommodated by data mining tools, the most important being:
- Nominal Variables – used for categories and may be numerical (but the numbers have no arithmetic meaning – e.g. a house number).
- Ordinal Variables – have an implied and meaningful order – e.g. hot, tepid, cold.
- Interval Variables – are arithmetically meaningful. Interval scaled variables have a linear scale with a zero that does not imply absence of a quantity (e.g. temperature in degrees Celsius). Ratio scaled variables have a zero which does imply a zero amount (money for example!).
Categorical attributes are typically nominal, binary (two valued nominal) or ordinal.
Continuous attributes are interval variables of either type.
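The practical consequence of these distinctions is that each type of variable supports different operations. The sketch below is one hedged way of showing this: the category lists, function names and sample values are invented for the example, and the encodings shown (one-hot for nominal, rank for ordinal) are common conventions rather than the only options.

```python
# Nominal: categories with no order, so encode as one-hot indicator flags.
REGIONS = ["north", "south", "east", "west"]  # hypothetical categories

def one_hot(value, categories):
    """One flag per category; no category is 'greater' than another."""
    return [1 if value == category else 0 for category in categories]

# Ordinal: categories with a meaningful order, so encode as a rank.
TEMPERATURE_ORDER = ["cold", "tepid", "hot"]

def rank(value):
    """Position in the ordering; comparisons are meaningful, sums are not."""
    return TEMPERATURE_ORDER.index(value)

print(one_hot("east", REGIONS))
print(rank("hot") > rank("tepid"))

# Linear scale with an arbitrary zero (temperature in degrees Celsius):
# differences are meaningful, ratios are not.
morning_c, noon_c = 10.0, 20.0
print("difference:", noon_c - morning_c)
# 20 C is not "twice as hot" as 10 C: on the Kelvin scale, whose zero
# does mean absence of thermal energy, the ratio is close to 1.
kelvin_ratio = (noon_c + 273.15) / (morning_c + 273.15)
print("ratio in kelvin:", round(kelvin_ratio, 3))

# Ratio scaled (money): zero means none, so ratios are meaningful.
price_a, price_b = 10.0, 20.0
print("price ratio:", price_b / price_a)
```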