Machine learning models for better decision making (using CART and C5.0 algorithms in R)

Decision tree learners are powerful classifiers that utilize a tree structure to model the relationships among the features and the potential outcomes. This structure earned its name because it mirrors the way a literal tree begins at a wide trunk and splits into narrower and narrower branches as it is followed upward. In much the same way, a decision tree classifier uses a structure of branching decisions that channel examples into a final predicted class value. Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide-and-conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, until the algorithm determines that the data within the subsets are sufficiently homogeneous, or another stopping criterion has been met. There are numerous implementations of decision trees, but two of the best known are the C5.0 algorithm and the Classification and Regression Tree (CART) algorithm. The C5.0 algorithm has become an industry standard for producing decision trees because it performs well on most types of problems directly out of the box. There are various measures of purity that can be used to identify the best splitting candidate in a decision tree. C5.0 and CART use entropy and the Gini index, respectively, as impurity measures for attribute selection; each measure has its own advantages and disadvantages. Pruning a decision tree, that is, reducing its size so that it generalizes better to unseen data, is an important part of the process.
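The two impurity measures mentioned above can be illustrated in a few lines of base R. This is a sketch for illustration only; the function names `entropy` and `gini` are my own, and `p` is assumed to be a vector of class proportions at a candidate node.

```r
# Impurity measures used by C5.0 (entropy) and CART (Gini index).
# 'p' is a vector of class proportions at a node (they sum to 1).
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}
gini <- function(p) 1 - sum(p^2)

# A perfectly pure node has zero impurity under both measures:
entropy(c(1, 0))       # 0
gini(c(1, 0))          # 0

# A 50/50 two-class split is maximally impure:
entropy(c(0.5, 0.5))   # 1
gini(c(0.5, 0.5))      # 0.5
```

A split is chosen by comparing the impurity of the parent node with the weighted impurity of the candidate child nodes, and taking the split with the largest reduction.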
Credit risk assessment is a crucial issue faced by banks: it helps them evaluate whether a loan applicant is likely to default at a later stage, so that they can decide whether or not to grant the loan. This allows banks to minimize possible losses and increase the volume of credit they extend. The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit became limited, banks tightened their lending systems and turned to machine learning to identify risky loans more accurately. Decision trees are widely used in the banking industry due to their high accuracy and their ability to express a statistical model in plain language. Since governments in many countries carefully monitor the fairness of lending practices, executives must be able to explain why one applicant was rejected for a loan while another was approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory. It is likely that automated credit scoring models are used for credit card mailings and instant online approval processes. R is an excellent statistical and data mining tool that can handle large volumes of structured as well as unstructured data, produce results quickly, and present them in both textual and graphical form. This enables the decision maker to make better predictions and to analyse the findings.
The data used in this question was originally provided by Dr. Hans Hofmann of the University of Hamburg and is hosted by the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). The dataset contains information on loans obtained from a credit agency in Germany. The original dataset contains 1000 entries with 20 categorical/symbolic input attributes. Each entry represents a person who takes out a loan from a bank, and each person is classified as a good or bad credit risk according to the set of attributes. The idea behind this dataset is to identify factors that are predictive of a higher risk of loan default. The provided credit dataset (see attached file: loans.xls), however, is a truncated version of the original dataset with only 15 input attributes.
In this question, you need to develop two credit approval models using the C5.0 and CART decision trees respectively. The aim is to decide, at the end, which of these two models is preferable for this specific financial task. Both models also need to be optimised (via tuning/pruning-like methods) in order to obtain even better results. You need to perform the following tasks:
• Import the provided .csv file into your MySQL and check that it has been imported properly. The following tasks, however, need to be performed only via the R environment.
• Import the contents of this file from your MySQL into R (using R-related commands) and designate a specific name-variable to store them.
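A minimal sketch of the MySQL-to-R import, assuming the table was loaded into MySQL under the name `loans` in a schema called `credit` (the table, schema, host, user and password below are placeholders, not values given in the brief):

```r
# Connect to the local MySQL server and pull the table into a data frame.
# Connection details are placeholders; substitute your own.
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(),
                 dbname   = "credit",
                 host     = "localhost",
                 user     = "student",
                 password = "secret")

loans <- dbGetQuery(con, "SELECT * FROM loans")  # the name-variable
dbDisconnect(con)

str(loans)   # should report 1000 obs. of 16 variables
```

`loans` is then the single name-variable used throughout the remaining tasks.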
• Explore & prepare the data:
The name-variable where you stored the information contains 1000 observations (rows) and 16 features (columns). It includes information about the applicants such as checking and savings account details, the amount of the loan they plan to borrow, how many months they plan to take to repay it, and so on. The target feature is located in the last column and records the applicant's default status (yes or no). This column indicates whether the loan applicant ultimately went into default, i.e. failed to pay back the amount borrowed plus all the interest. As some of these features are non-numerical in nature, you may consider transforming them into numerical form if you find that more convenient. Otherwise, you can leave them in "factors" style.
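Both options mentioned above (keeping "factors" style or transforming to numerical form) can be sketched on a toy column; the values below are illustrative, not taken from the actual loans data:

```r
# A toy categorical column standing in for one of the non-numerical features.
job <- c("skilled", "unskilled", "management", "skilled")

job_factor  <- factor(job)            # option 1: keep as "factors" style
job_numeric <- as.numeric(job_factor) # option 2: integer codes per level

levels(job_factor)   # "management" "skilled" "unskilled"
job_numeric          # 2 3 1 2
```

Note that the integer codes follow the alphabetical ordering of the levels, so they carry no inherent meaning; for tree-based learners, leaving such columns as factors is usually the simpler choice.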
Investigate the relationships and discover the rough structure of the imported data. More specifically, find and display the frequency and proportion of the observations for checking_balance, credit_history, purpose, savings_balance, employment_duration, percent_of_income, job and default. Find and display the average value of months_loan_duration, amount, years_at_residence, age, existing_loans_count, and dependents.
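The frequencies, proportions and averages asked for above can all be obtained with base R. A sketch on toy vectors (the real calls would reference columns of the imported data, e.g. `loans$default` and `loans$amount`):

```r
# Toy stand-ins for one categorical and one numeric column.
default <- c("no", "no", "yes", "no", "yes")
amount  <- c(1169, 5951, 2096, 7882, 4870)

table(default)               # frequency of each level: no = 3, yes = 2
prop.table(table(default))   # proportion of each level: no = 0.6, yes = 0.4

mean(amount)                 # average value: 4393.6
```

Repeating `table()`/`prop.table()` over the listed categorical features and `mean()` over the listed numeric features covers the whole task.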
The data visualization process will focus on identifying patterns in key features that clearly distinguish an applicant's default status. Therefore, via the R environment, you need to:
Plot a histogram of months_loan_duration for each default class (i.e. two histograms). What can we derive from these plots? Produce a 3D plot of amount, age and months_loan_duration with colour separation of the two default classes (yes/no). Produce a scatter plot of amount against age, again with colour separation of the two default classes (yes/no).

You need to shuffle and re-order the provided data so that the rows are randomly sorted. Then split the dataset into training and testing sets, using 800 and 200 samples respectively. In this way, each student will have different training/testing sets. These specific training/testing sets will be used for both decision tree models.

• You need to create a decision tree model based on the C5.0 algorithm to predict whether a loan applicant will default. Use the training set for the creation of the model, then test it using the testing set you have already created. The evaluation of your model will be made through all of the following tools: confusion matrix (CM), Area Under the Curve (AUC) and F1 score.

• You need to create a decision tree model based on the CART algorithm to predict whether a loan applicant will default. Use exactly the same training/testing sets as in the C5.0 case. The evaluation of your CART model will again be made with the same tools as before.

• You need to provide a short discussion based on these results, and decide which model is more suitable for this specific case study.

• You need to improve your current models (C5.0 and CART) via Adaptive Boosting and pruning schemes respectively. Develop the necessary models, again in R, and perform another performance evaluation using the same testing dataset. Provide a short discussion of any improvement you may observe compared to your previous models.
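The shuffle/split step and the two baseline models can be sketched as follows. This assumes the `C50`, `rpart` and `pROC` packages are installed, that the data frame is named `loans`, and that its last column is a factor `default` with levels no/yes; the seed value is arbitrary (each student should pick their own):

```r
library(C50)    # C5.0 trees; set trials > 1 later for the adaptive-boosting step
library(rpart)  # CART trees; prune later via prune() and the cptable
library(pROC)   # ROC / AUC

set.seed(123)                            # arbitrary; choose your own
loans <- loans[sample(nrow(loans)), ]    # shuffle so rows are randomly sorted
train <- loans[1:800, ]                  # 800 training samples
test  <- loans[801:1000, ]               # 200 testing samples

# C5.0 model
c50_fit  <- C5.0(default ~ ., data = train)
c50_pred <- predict(c50_fit, test)

# CART model on exactly the same training/testing sets
cart_fit  <- rpart(default ~ ., data = train, method = "class")
cart_pred <- predict(cart_fit, test, type = "class")

# Evaluation for one model (repeat identically for the other):
cm <- table(predicted = c50_pred, actual = test$default)   # confusion matrix
c50_prob <- predict(c50_fit, test, type = "prob")[, "yes"]
auc_val  <- auc(roc(test$default, c50_prob))               # AUC

precision <- cm["yes", "yes"] / sum(cm["yes", ])           # F1 score
recall    <- cm["yes", "yes"] / sum(cm[, "yes"])
f1 <- 2 * precision * recall / (precision + recall)
```

For the improvement step, `C5.0(..., trials = 10)` enables adaptive boosting, while `prune(cart_fit, cp = ...)` with a `cp` value chosen from `cart_fit$cptable` prunes the CART tree; both improved models are then evaluated on the same `test` set with the same three tools.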
In your main text, you need to include the relevant segments of your R code together with your results/discussion. At the end of your report, you must include all of your code as an appendix.