Loads of mis-sold PPI, but WHO will claim? This man's paid to find out
Data mining to fathom the depths of banking's balls-up
Called to account
While all these banks seek to use big data both to harmonise accounts and to clarify their position regarding the PPI payouts, one of the major players has another big data task on its hands. The UK government bailed out Lloyds during the financial crisis of 2009, taking a 43.4 per cent stake in the ailing bank. However, European Union law regarded this as state aid and demanded a sell-off to comply with competition rules.
In a project codenamed Verde, Lloyds set about divesting some 630 branches. Its attempt to sell them to the Co-operative Bank failed recently, when the potential buyer got cold feet in the current economic climate. Yet Lloyds continues the work unabated and intends to offer this ready-made bank, branded TSB, as an IPO instead.
Lloyds has had its own PPI issues to deal with, but this is an entirely different project. It is nonetheless interesting because it is the reverse of merging – a necessary process in order to select the 630 branches for the sell-off and work out what to do with their customers. Lloyds has even set up its own bank transfer website to explain the situation to its various account-holders.
Cole has his own take on the issues that this task involves. “Customers don’t think of a branch, they think they are a customer of a bank. Now they’re going to be banking with a newly formed company. So there will be cases where you have a joint account, your wife can have an account in another company, but you have a joint mortgage and things like that, it’s massively complicated. From what I can see, it’s an equally difficult exercise to split up the data as it is to merge it.”
Regardless of whether you’re separating out the data or bringing it all together, data quality is the biggest issue that needs to be addressed before any major number-crunching begins. Cole also speaks of "holes" in the data, where information is missing – such as home address or date of birth. “You’d be surprised to see how many unknown genders there are. That’s interesting,” says Cole.
Determining different data classifications is another aspect that clarifies the picture that’s being built up around a customer. Cole says he usually distinguishes between two types of data: behavioural and profile. With behavioural data it’s typically an accumulation of transactions relating to customer activity, such as purchases or website visits. It ends up in a database and that remains unchanged, and simply builds up over time. According to Cole it’s probably the most valuable source of data that can be collected.
“You can ask someone how often they shop in that supermarket and they will say once a week or twice a month but behavioural data will show exactly how often they shop and what they buy.”
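As a rough illustration of that point, the sketch below derives shopping frequency from a transaction log rather than from what a customer says. The log layout, the customer IDs and the function name `visits_per_week` are all hypothetical, invented here for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount_in_pounds)
transactions = [
    ("c1", date(2013, 4, 1), 23.50),
    ("c1", date(2013, 4, 3), 8.20),
    ("c1", date(2013, 4, 10), 41.00),
    ("c2", date(2013, 4, 2), 15.75),
]

def visits_per_week(log):
    """Average number of distinct shopping days per week for each customer."""
    days = defaultdict(set)
    for customer, day, _amount in log:
        days[customer].add(day)
    result = {}
    for customer, visited in days.items():
        # Span of observed activity in days, counted as at least one week.
        span_days = (max(visited) - min(visited)).days + 1
        weeks = max(span_days / 7, 1)
        result[customer] = len(visited) / weeks
    return result

frequency = visits_per_week(transactions)
```

Because the log only ever grows, the same function can be rerun as new transactions arrive and the estimate sharpens over time.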
By contrast, profile data or research data is data that can change: marital status, where you live and what you do. Working on filling in these gaps is just one aspect of a data-mining project, as Cole explains.
“One part of the process is to try to make your data better. So where there is missing information, you try to guess. This includes what the gender would be or if you don’t know the income for that person, you make an estimate or you model it based on all the information [you have] on all the other customers.”
This goes beyond just using a post code and can draw on particular spending patterns. When it comes to filling in the gaps, nothing goes to waste. While there are exceptions, it’s far too time-consuming to laboriously go through every customer profile with missing income details to fathom out a likely figure.
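One very simple way to fill such a gap automatically is to borrow a typical value from similar customers. The sketch below is an assumption for illustration, not Cole's method: it fills a missing income with the median income of the customer's segment and falls back to the overall median. The `segment` field, the records and the function name are hypothetical.

```python
from statistics import median

# Hypothetical customer records; None marks a missing income.
customers = [
    {"id": 1, "segment": "frequent_shopper", "income": 42000},
    {"id": 2, "segment": "frequent_shopper", "income": 38000},
    {"id": 3, "segment": "frequent_shopper", "income": None},
    {"id": 4, "segment": "occasional", "income": 27000},
    {"id": 5, "segment": "new_customer", "income": None},
]

def impute_income(records, group_key="segment"):
    """Fill missing incomes with the group median, else the overall median."""
    known = [r["income"] for r in records if r["income"] is not None]
    overall = median(known)  # assumes at least one income is known
    by_group = {}
    for r in records:
        if r["income"] is not None:
            by_group.setdefault(r[group_key], []).append(r["income"])
    filled = []
    for r in records:
        row = dict(r)
        # Keep a flag so estimated values are never mistaken for real ones.
        row["income_imputed"] = r["income"] is None
        if r["income"] is None:
            group = by_group.get(r[group_key])
            row["income"] = median(group) if group else overall
        filled.append(row)
    return filled

filled = impute_income(customers)
```

A real project would more likely train a predictive model on many fields, but the shape is the same: learn from the customers whose income is known, then estimate it for the rest.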
“That’s where the data mining comes in,” says Cole. “You would then build algorithms that will use all the data to make that prediction. Alternatively, you can look at the average or examine a certain range of data – there are a lot of different ways to approach it. In your application you are cleaning the data. That means filling out the blanks and simply checking for errors. For example, a phone number typed into the age field, things like that. Looking for outliers. Again in the analytics you’re interested in the breadth, but you’re also interested in what is coming across.”
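The error checks Cole mentions can be sketched in a few lines. The example below is illustrative only: the plausible age range of 0 to 120 and the interquartile-range rule for outliers are assumptions chosen here, and the sample values are made up.

```python
from statistics import quantiles

def invalid_ages(ages, low=0, high=120):
    """Return values that cannot be real ages, e.g. a phone number typed into the age field."""
    return [a for a in ages if not (low <= a <= high)]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _q2, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]

# Hypothetical sample data.
ages = [34, 51, 27, 45, 7700900123, 62, 39, -3]
balances = [1200, 950, 1100, 1300, 1050, 98000, 1150]

bad_ages = invalid_ages(ages)
suspect_balances = iqr_outliers(balances)
```

Flagged values are candidates for review rather than automatic deletion: an extreme balance may be a typing error or a genuinely unusual customer.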
Meaningful relationships
"Data quality is the biggest issue when you start getting into your task and working with the data. You have a lot of data and you look for relationships, but if you then have something extreme [an outlier] appearing then that could change the whole relationship and create an inaccurate picture. So it’s all about cleaning. Then, by creating these other factors from thousands of data [fields], you’re creating a more manageable amount of factors. You know what you’re looking at in terms of data on the screen.
"The big exercise in the data prep is to get to understand the distribution in the data and the variation. You need variation between two things in order to assess if there is a relationship. If there is no variation, you can’t really say anything from that data. So there’s a lot of preparation going on and you’re also normalising data – you’re splitting it up – all sorts of statistical things. You have to massage the data to put it in a form that you can use to run your algorithms. All that you’re doing is programming, writing code.
"You then run your algorithms and select your best algorithms. You get statistics on your screen and you make decisions – it’s often a rigid process. The output could be a credit score or just a number. Or it can be a segment which you would then profile after that. You would send that segment to a marketeer who would then come up with a fancy name for it.
"There’s an operation, a commercial aspect and there’s an insight. And you always try to gain insights because that will help you next time you do the same exercise."
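The preparation Cole describes, checking for variation and normalising, might look something like the sketch below. It is an assumption for illustration: it rescales each field to z-scores (mean 0, standard deviation 1) and sets aside any field with no variation, since nothing can be learned from it. The field names and values are hypothetical.

```python
from statistics import mean, pstdev

def standardise(columns):
    """Return z-scores per field, plus the names of fields with zero variation."""
    scaled, dropped = {}, []
    for name, values in columns.items():
        spread = pstdev(values)
        if spread == 0:
            # Every value is identical, so the field cannot explain anything.
            dropped.append(name)
            continue
        centre = mean(values)
        scaled[name] = [(v - centre) / spread for v in values]
    return scaled, dropped

# Hypothetical fields for three customers.
columns = {
    "monthly_spend": [120.0, 80.0, 100.0],
    "country_code": [1, 1, 1],
}

scaled, dropped = standardise(columns)
```

Putting every field on the same scale stops one with large raw numbers, such as income, from swamping one with small numbers, such as visits per week, when the algorithms are run.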