MODELING AND DATA ANALYSIS IN THE CREDIT CARD INDUSTRY: BANKRUPTCY, FRAUD, AND COLLECTIONS
2002 IEEE Systems and Information Design Symposium • University of Virginia

Student team: Christopher Allred, Kathryn Hite, Stephen Fonzone, Jennifer Greenspan, Josh Larew

Faculty Advisor: William Scherer, Department of Systems and Information Engineering

Graduate Advisor: Thomas Pomroy, Department of Systems and Information Engineering

Client Advisor: Douglas Fuller, Providian Financial, San Francisco, CA, Douglas_Fuller@Providian.com

KEYWORDS: CART, Clustering, Distressed debt, Fraudster, Identity theft, Regression, Probabilistic modeling

ABSTRACT

In order to effectively produce quality decisions in the modern credit card industry, knowledge must be gained through effective data analysis and modeling. Through the use of dynamic data-driven decision-making tools and procedures, information can be gathered to successfully evaluate all aspects of credit card operations. Specifically, the areas of bankruptcy, fraud, and collections were examined to show the benefits that implementing such practices could provide. Methodologies ranging from Markov chains, to clustering, to rule-based decision theory were combined with tools such as CART, S+, Excel, and Access to yield such insights.

INTRODUCTION

San Francisco-based Providian Financial prides itself on the effective use of data-driven decision-making throughout its business practices. In particular, its lending strategies tend to target the underserved market segment of high-risk borrowers. As with any risk-oriented venture, Providian's business strategy requires the utmost degree of information quality and quantity. This necessitates the methodologies and tools discussed in the following sections.

CLASSIFYING FRAUDULENT TRANSACTIONS

Background

Providian suffers significant losses every year from fraudulent transactions on its credit cards. Three main types of fraud cause the most significant losses, adding up to millions of dollars each year. The three types the accompanying analysis focused upon were lost/stolen, forged response, and non-receipt. Lost/stolen fraud occurs when a customer loses their card or the card is stolen while in the customer's possession. Forged response occurs when a fraudster fills out an application pretending to be someone else with a better credit history. This is typically done after the fraudster steals someone's personal information, which is called identity theft. Non-receipt fraud occurs when the card is first sent to the good customer once the application is approved. The card is typically stolen in the mail, and the good customer never receives their card. The following figure depicts the lifetime of a credit card and pinpoints where each instance of fraud occurs. Since forged response involves identity theft, the figure also shows when identity theft can take place.
Figure 1: Fraudulent Transaction Depiction: This graphic shows a few common ways of perpetrating credit card fraud.

Though fraud causes significant loss, there are proportionally few cases of fraud each month compared to the total number of accounts. In the sample of data provided by Providian, only 0.34% of the accounts in the data set were fraudulent. The following table lists the number and percentage of accounts of each type in the data set.

General Breakdown of Accounts

Type of Account       Number of Accts    Percent of Accounts
Non-fraudulent (N)    305,688            99.66%
Fraudulent (L)        1,045              0.34%

Figure 2: Fraudulent Accounts: This chart shows the numeric and percentage values associated with fraudulent and non-fraudulent accounts in the data set.

Rule-Based Modeling Decisions

There were two main phases to approaching the problem of modeling the fraud risk of individual credit transactions: (1) gathering data for a series of transaction characteristics and comparing the fraud and non-fraud account averages; (2) incorporating the characteristics which differed significantly into a risk scorecard capable of predicting the likelihood of fraud in a given transaction.

The main achievement of this portion was the transformation of raw data into useful information, with the first step in this process being to gain an understanding of the data set as a whole. General descriptive statistics are important because they provide the basic framework from which all other conclusions are derived. Moreover, such information acted as a metric to judge whether later conclusions made sense and fit with the general data or whether those conclusions should be reevaluated for errors. Descriptive statistics were also used to check whether smaller sub-samples of the data set were representative of the data as a whole.

The first stage of the modeling process involved analyzing fraud and non-fraud transactions based on all available transaction data. Of the eleven variables analyzed, the following six were found to be significant: hours since transaction, number of declines, number of cash purchases, number of ATM purchases, merchant code, and transaction amount. A variable was considered significant whenever the percent difference between the average for the fraudulent population and the average for the non-fraudulent one exceeded 30%. In the case of merchant code, significance was determined whenever the rate of fraudulent transactions for a merchant significantly exceeded the overall average rate of fraud. These characteristics form the foundation of the model, due to their potential for flagging transactions as fraudulent.

The second phase of modeling developed a risk scorecard. This model used the significant characteristics established in phase one to generate a score for every transaction, based on that transaction's data. This score was then used to assess the likelihood of that transaction being fraudulent. The scorecard was implemented using Visual Basic scripts in Microsoft Access. Accuracy and performance were then analyzed using Microsoft Excel.

Each transaction was evaluated individually for each characteristic value. Points were awarded if a characteristic differed from the non-fraud average by more than 10% of the non-fraudulent standard deviation. However, points were only awarded for a deviation in the direction indicating fraud, as accounts that are statistically safer than average should not be punished. Through iteration, the 10% threshold was found to maximize the classification accuracy of the risk scorecard. In the case of merchant code, a point was simply awarded whenever the transaction occurred at a high-risk merchant code. Since there are six characteristics, any transaction could have a scorecard value from 0 to 6, depending on how many triggers that account satisfied.
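The scoring rule described above can be sketched in a few lines. This is only an illustrative Python sketch, not the team's Visual Basic/Access implementation: the field names, the non-fraud means and standard deviations, the assumed fraud direction of each characteristic, and the high-risk merchant codes are all hypothetical placeholders.

```python
# Hypothetical non-fraud statistics per characteristic:
# (mean, standard deviation, fraud direction).
# Direction +1: values above the non-fraud mean indicate fraud.
# Direction -1: values below the non-fraud mean indicate fraud.
NON_FRAUD_STATS = {
    "hours_since_transaction": (48.0, 30.0, -1),
    "num_declines": (0.2, 0.6, +1),
    "num_cash_purchases": (0.5, 1.0, +1),
    "num_atm_purchases": (0.3, 0.8, +1),
    "transaction_amount": (60.0, 80.0, -1),
}

# Hypothetical merchant codes with an above-average fraud rate.
HIGH_RISK_MERCHANT_CODES = {"M001", "M002"}

# A point is awarded when the deviation exceeds 10% of the
# non-fraud standard deviation, in the fraud direction only.
THRESHOLD = 0.10


def risk_score(txn):
    """Return the scorecard value (0 to 6) for one transaction."""
    score = 0
    for name, (mean, std, direction) in NON_FRAUD_STATS.items():
        # Positive only when the value moves in the fraud direction.
        deviation = (txn[name] - mean) * direction
        if deviation > THRESHOLD * std:
            score += 1
    if txn["merchant_code"] in HIGH_RISK_MERCHANT_CODES:
        score += 1
    return score


example = {
    "hours_since_transaction": 0.5,
    "num_declines": 0,
    "num_cash_purchases": 0,
    "num_atm_purchases": 0,
    "transaction_amount": 1.0,
    "merchant_code": "M001",
}
print(risk_score(example))
```

With these made-up statistics, the example transaction trips three triggers (short time since the last transaction, small amount, high-risk merchant) and none of the other three.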
For example, an account with very little time since the last transaction, making a $1 purchase at a high-risk merchant, but which had not recently made a cash purchase, made an ATM withdrawal, or been declined, would have a score of 3.

The following figure highlights the performance of the scorecard by breaking down the percent of fraudulent transactions which fell into each score category.

Score    % Fraud Transactions
0        15.38%
1        30.47%
2        47.25%
3        54.05%
4        71.90%
5        71.36%
6        76.92%

Figure 3: Score Fraud Frequencies: This table shows the percent of transactions that are predicted to be fraudulent at each risk scorecard value.

If all transactions with a risk score of 3 or greater are predicted to be fraudulent, the accuracy in predicting fraudulent transactions is 60.1%. If only those transactions with scores of 4 or greater are labeled fraudulent, the accuracy level increases to 71.8%. The tradeoff faced is that the higher the score cutoff used, the better the accuracy for that account segment, yet a smaller number of fraudulent accounts are actually captured. The accuracy level of 71.8% misclassifies fewer non-fraudulent accounts as fraudulent, but also misclassifies more fraudulent accounts as non-fraudulent. Providian would rather contact an account erroneously to check on suspicious purchases than let fraudulent transactions slip through. Since this second error, the false negative, is the more serious for Providian, the 60.1% measure was used to take advantage of the lower false negative rate. Therefore, any transaction with a scorecard value of 3 or greater is considered to be fraudulent, and this method identifies fraudulent transactions at 60% accuracy.

Clustering

A clustering technique to detect the fraudulent accounts was also applied to the database of credit card accounts. The clustering procedure groups accounts according to similar characteristics using rules. These rules use the values of account characteristics to determine to which cluster an account belongs. This system was applied to the database of fraudulent accounts in an effort to classify accounts according to how likely the account is to be fraudulent. For example, one large cluster of non-fraudulent accounts consists of accounts that make a few low charges and make a payment in the first month.

Five clusters of non-fraudulent accounts were identified with a significant degree of accuracy, 99.97% or better, and the five clusters contained 28% of the total number of accounts. The result reduced the list of suspected accounts by over a quarter. Providian can not only save significantly through lowered operating costs but also focus detection efforts on the remaining accounts, which have a higher probability of being fraudulent.

Though the clustering technique could not clearly establish which accounts are fraudulent, it did quickly split the accounts into suspicious and unsuspicious groups, allowing Providian to better concentrate resources, time, and money.

COLLECTIONS

Background

Providian's subsidiary, First Select Corporation (FSC), is the largest credit card debt collector in the United States, purchasing billions of dollars' worth of defaulted credit card debt each year for approximately six cents on the dollar. Accounts are collected through calls, letters, and in some cases, legal action. Throughout the payment process, FSC continually needs to decide what to do with an account: continue to attempt collections or sell the account. This makes knowing whether an account will continue to pay of the utmost importance.

Value Analysis for Distressed Credit Card Debt

Isolating key account attributes proved the most effective way to value Providian's distressed credit card debt portfolio. Initially, potential variables were examined relative to desired metrics, to visually identify relationships between predictor and target variables. Using this means of analysis, many account attributes were found to have either positive or negative correlations with the account's cash flow. The most important predictor variables identified were recency and frequency of past payments. If an account made a payment in any particular month, it was determined that the account had a 90% chance of making a payment in the next two months.
Correspondingly, a positive relationship between the number of past payments and the probability of future payments was also discovered. A larger number of past payments increased the probability of future payments.

After this step was completed, a regression model identified important characteristics that have a predictive nature. Using a regression software program, S+, the p-values of many predictor variables were generated. In predicting whether an account will pay again, the p-values showed that three predictor variables were highly significant with respect to the target variable: recency of past payments, frequency of payments over the last four months, and percentage of initial balance paid. Other variables showed significance at the 0.05 level: initial balance, balance remaining, frequency of calls made, frequency of right party contacts, status, and rollout.

Once the characteristics of accounts with a predictive nature were identified, both target and predictor variables were entered into CART. CART is "the most advanced decision-tree technology for data analysis, preprocessing and predictive modeling. CART is a robust data-analysis tool that automatically searches for important patterns and relationships and quickly uncovers hidden structure even in highly complex data" [Steinberg].

Figure 4: Classification CART Tree: This figure depicts the CART tree used to develop the classification rules for the model. Each splitting node shows the criteria for that splitter and the percentage of paying and non-paying accounts that made it to that path. Each terminal node shows the number of accounts of each type that were classified in that node and the percentage of paying and non-paying accounts that make up that node.

CART identified rules that would separate the data depending on different attributes. For example, in a model attempting to predict whether an account would pay again, CART determined that most accounts that have not made more than 1 payment in the last 5 months would not pay again. This is an all-encompassing rule; however, the rules changed with each month that Providian owned the accounts, depending on how long the accounts had been owned. In this way, monthly rules classifying the accounts were established. This effectively formulated a methodology that could be performed on a monthly basis to separate non-paying accounts from accounts that continued to pay.

The final methodology incorporated the rules given by CART for months 1-15. These rules, if applied every month, increase Providian's ability to distinguish accounts that will pay again (have worth) from accounts that have stopped paying (no worth).

ANALYZING BANKRUPT ACCOUNTS

The Providian bankruptcy data was grouped into 20 discrete states, allowing for a different form of analysis. In analyzing the bankruptcy data, the flow of an account from state to state provided a glimpse of the actual state transition process account holders went through. By tracing these paths, along with the expected income at each state, we are able to generate an estimate of the future value of each account.

Figure 5: One-Step Transition: This diagram depicts the probability, p, of going from state 1 to state 2; that is, given that the model was in state 1 in the first time period, p is the probability that the model is in state 2 in the next time period.

The first step in tracing these paths is to create a matrix of one-step transition probabilities. Following these states over the lifetime of Providian's bankruptcy process allows us to determine some underlying characteristics of their customers.

The transition matrix for the bankruptcy model showed high recurrence probabilities: the tendency of an account to stay in the same state after a transition period.
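A one-step transition matrix of this kind can be estimated by counting observed state-to-state moves and normalizing each row. The Python sketch below shows the idea on a toy example: the function names, the 0-based integer state labels, and the three-state paths are hypothetical (the paper's model has 20 states, and the tooling used for this step is not described). The recurrence probabilities are simply the diagonal of the estimated matrix.

```python
from collections import Counter


def transition_matrix(paths, n_states):
    """Estimate one-step transition probabilities from observed state paths.

    `paths` is a list of sequences of integer states (0 .. n_states - 1),
    one sequence per account, ordered by time period.
    """
    counts = Counter()
    for path in paths:
        # Pair each state with the state observed in the next period.
        for current, nxt in zip(path, path[1:]):
            counts[(current, nxt)] += 1

    matrix = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        if row_total == 0:
            # No observed departures from state i: leave the row at zero.
            matrix.append([0.0] * n_states)
        else:
            matrix.append([counts[(i, j)] / row_total for j in range(n_states)])
    return matrix


def recurrence_probabilities(matrix):
    """Return the probability of staying in each state (the diagonal)."""
    return [matrix[i][i] for i in range(len(matrix))]


# Toy paths over three states (made up for illustration).
toy_paths = [[0, 0, 1, 2, 2], [0, 1, 1, 2]]
P = transition_matrix(toy_paths, n_states=3)
print(recurrence_probabilities(P))
```

Each non-empty row of the estimated matrix sums to 1, so the off-diagonal entries give the probabilities of leaving a state for each other state.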
This is expected due to the slow nature of many of the bankruptcy stages. Figure 6 shows the probability of staying in each of the 20 states, as well as the corresponding expected stay in each state. This can be calculated by using the formula:

$$\sum_{n} n\,p^{n-1}(1-p) = \frac{1}{1-p} + p$$

Recurrence Probabilities

State    Transition Prob    Length of Stay
1        40.1%              2.0
2        65.3%              3.5
3        78.0%              5.3
4        64.3%              3.4
5        40.0%              2.0
6        71.5%              4.2
7        72.2%              4.3
8        10.7%              1.2
9        75.8%              4.9
10       6.1%               1.1
11       79.4%              5.6
12       81.9%              6.3
13       68.6%              3.8
14       0.0%               1.0
15       74.2%              4.6
16       93.3%              15.9
17       56.7%              2.8
18       89.6%              10.5
19       51.9%              2.6
20       80.0%              5.8

Figure 6: Transition Matrix Statistics: This chart quantifies the recurrence probabilities associated with the one-step probability matrix, including the estimated length of stay in each state.

Ultimately, this analysis shows us the important characteristics of the bankruptcy lifecycle. As one can see, the average consumer that enters state 16 (Bankruptcy) stays for 16 months, while other states, such as 10 and 1, do not have strong recurrent properties.

CONCLUSION

Providian is constantly modifying and updating its data-driven decision network to formulate strategies which best capitalize on the opportunities of this dynamic market. By effectively using various modeling and data analysis methods, much knowledge was gained about the various aspects of Providian's credit card operations. The insight gained on basic account operations is appreciable, because having accurate information influences everything from policy implementation to the bottom line. From bankruptcy, to fraud, to collections, our analysis proved highly beneficial to Providian.

REFERENCES

Breiman, Friedman, Olshen, Stone. Classification and Regression Trees. St. Louis: Wadsworth, 1984.

Dwyer, Robert. "Customer Lifetime Valuation to Support Marketing Decision Making." Journal of Direct Marketing. Volume 11, Number 4 (1997): 6-13.

Lucas, Peter. "Why Recoveries are on the Rise; Scoring Models and Databases are Helping Collectors Boost Recovery Rates." Collections & Recovery. Vol 13, No 7. October 2000. 14 October 2001. http://web.lexis-nexis.com/universe.

Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1998.

BIOGRAPHIES

Josh Larew is a fourth year Systems Engineer from Morgantown, West Virginia. When Josh is not cranking out SQL queries in Access, he can be found at the Birdwood Golf Course scrambling to make par. Next year Josh will either be working on a submarine (no joke) or be unemployed and waiting to go to law school.

Stephen Fonzone is a fourth year Systems Engineer from Allentown, Pennsylvania. When not using RTPs and PTPs to predict customer lifetime value, Steve can be found singing Springsteen and playing Super Tecmo Bowl (although not necessarily at the same time). Next year Steve will live in a van down by the river.

Kathryn Hite is a fourth year Systems Engineer from Houston, Texas. When not clustering transactions to catch fraudsters, Kathryn can be found extolling the virtues of her native state of Texas. Next year she will follow Josh wherever he may go.
Jennifer Greenspan is a fourth year Systems Engineer from Chicago, Illinois. She spends the majority of her time establishing and analyzing fraud triggers but can also be seen watching Office Space and running (but she usually watches Office Space while sitting). The only group member to actually get a real job prior to graduation, Jen will be working in DC for Capital One.

Christopher Allred is a fourth year Systems Engineer from Avon, Connecticut. He can usually be found taking any kind of data and turning it into a Markov Chain. He has also been known to drink a lot of cider and to be surly about staying in Charlottesville for another year, where he will be completing his master's degree.