The dataset ‘Credit data.xlsx’ contains data on 10,000 borrowers and whether they subsequently experienced serious delinquency (see variable ‘SeriousDlqin2yrs’). Assume the lender now wishes to use this data to build a credit scoring model that predicts serious delinquency based on the other variables. The dataset contains the following variables:
1.1 Carefully pre-process the dataset by considering the following activities:
• Exploratory data analysis.
• Missing value handling (if any), including a suitable analysis of missing values and justification of the chosen method.
• Outlier detection and treatment (if any), with appropriate analysis/justification.
• Binning the variables (if deemed useful).
• Coding the variable bins using Weights of Evidence.
• Splitting the data set into a training and test set.
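For orientation only, the Weights of Evidence coding step listed above can be sketched as follows. This is a minimal Python illustration, not the SAS Enterprise Miner workflow. It assumes the convention WoE = ln(share of goods in the bin / share of bads in the bin), with a target of 1 marking serious delinquency; some texts use the opposite sign. The function name `weights_of_evidence` and the toy bin labels are hypothetical.

```python
import math
from collections import defaultdict


def weights_of_evidence(bins, targets):
    """Return {bin label: WoE}, assuming target 1 = bad (delinquent), 0 = good.

    WoE(bin) = ln( (goods in bin / all goods) / (bads in bin / all bads) ).
    """
    goods = defaultdict(int)
    bads = defaultdict(int)
    for label, target in zip(bins, targets):
        if target == 1:
            bads[label] += 1
        else:
            goods[label] += 1

    total_goods = sum(goods.values())
    total_bads = sum(bads.values())

    woe = {}
    for label in set(goods) | set(bads):
        if goods[label] == 0 or bads[label] == 0:
            # WoE is undefined for a bin with no goods or no bads;
            # such bins are usually merged with a neighbouring bin first.
            raise ValueError(f"bin {label!r} has no goods or no bads")
        share_goods = goods[label] / total_goods
        share_bads = bads[label] / total_bads
        woe[label] = math.log(share_goods / share_bads)
    return woe


# Toy data (made up for illustration, not taken from the assignment dataset).
toy_bins = ["low", "low", "low", "low", "high", "high", "high", "high"]
toy_targets = [0, 0, 0, 1, 0, 1, 1, 1]
toy_woe = weights_of_evidence(toy_bins, toy_targets)
```

Under this convention a positive WoE marks a bin with relatively more good borrowers than the overall population, and a negative WoE marks a riskier bin.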
1.2 Build an intuitive and predictive scorecard using a logistic regression classifier and report the following:
• The most important variables
• The impact of the variables on the target
• The performance of the model. Use various performance metrics and discuss their relationship if any.
Compare this scorecard with the result of a Random Forest model run over the same data, and discuss your results. Why do banks often use Logistic Regression as their classifier? What do banks gain and lose by doing this?
In terms of software, you are expected to use SAS Enterprise Miner. Carefully report the various steps of your methodology and discuss your results rigorously.
NOTE: It is unlikely that different students will come up with the exact same parameter estimates. Special consideration will be given to submissions whose estimates are identical.
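As an illustration of how performance metrics can relate to one another, the sketch below computes three common credit scoring measures from the same scores: AUC, the Gini coefficient (which equals 2 × AUC − 1), and the Kolmogorov-Smirnov (KS) statistic. It is a plain Python sketch under stated assumptions, not SAS Enterprise Miner output: higher scores are assumed to mean higher predicted delinquency risk, target 1 marks a delinquent borrower, and the function names and toy data are hypothetical.

```python
def auc(scores, targets):
    """Probability that a random bad scores higher than a random good.

    Ties count as one half. Pairwise comparison is simple but quadratic.
    """
    bad_scores = [s for s, t in zip(scores, targets) if t == 1]
    good_scores = [s for s, t in zip(scores, targets) if t == 0]
    wins = 0.0
    for b in bad_scores:
        for g in good_scores:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad_scores) * len(good_scores))


def gini(scores, targets):
    """Gini coefficient, a linear rescaling of AUC."""
    return 2.0 * auc(scores, targets) - 1.0


def ks_statistic(scores, targets):
    """Largest gap between the cumulative score distributions of goods and bads."""
    bad_scores = [s for s, t in zip(scores, targets) if t == 1]
    good_scores = [s for s, t in zip(scores, targets) if t == 0]
    largest_gap = 0.0
    for cutoff in sorted(set(scores)):
        bad_share = sum(s <= cutoff for s in bad_scores) / len(bad_scores)
        good_share = sum(s <= cutoff for s in good_scores) / len(good_scores)
        largest_gap = max(largest_gap, abs(good_share - bad_share))
    return largest_gap


# Toy scores (made up for illustration, not taken from the assignment dataset).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_targets = [0, 0, 0, 1, 0, 1, 1, 1]
toy_auc = auc(toy_scores, toy_targets)
toy_gini = gini(toy_scores, toy_targets)
toy_ks = ks_statistic(toy_scores, toy_targets)
```

Because Gini is a fixed transformation of AUC, the two always rank models identically, whereas KS measures separation at a single cutoff and can disagree with them.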
Find an academic paper published in 2020 or later (based on online or print publication date) discussing a real-life application of data mining or credit scoring. It is important that the dataset analysed in the paper consists of real-life (not artificial) data. The suggested publication outlets in which to look for a suitable paper are:
• Management Science
• Operations Research
• INFORMS Journal on Computing
• INFORMS Journal on Applied Analytics
• Journal of Machine Learning Research
• European Journal of Operational Research
• ICDM (The IEEE International Conference on Data Mining)
• NeurIPS (Conference on Neural Information Processing Systems)
• KDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
Note: if you decide to select a paper from elsewhere, please ensure that it is of sufficiently high quality and makes a novel contribution to the area.
Once you have found an appropriate paper, report the following in separate subsections:
• Title, authors and complete citation (e.g. journal name, volume/issue, year, …)
• The data mining problem considered
• The data mining techniques used
• The results reported
• A critical discussion of the model and results (assumptions made, shortcomings, limitations, …).
Make sure you demonstrate that you understand what the article is all about and are able to provide a critical discussion.
Do not copy and paste from the article. Turnitin will easily detect this!