Click here for Excel Data Set
Project Description :
This case study aims to illustrate the practical application of Exploratory Data Analysis (EDA) within a real-world business context. In addition to implementing EDA techniques covered in the module, this study also provides a foundational understanding of risk analytics in the banking and financial services sector, demonstrating how data is utilized to mitigate financial risks associated with lending to customers. This project has expanded my analytical skill set significantly, introducing me to several key concepts:
• Grasping the notion of data imbalance.
• Conducting both univariate and segmented variate analyses.
• Exploring relationships between variables through bivariate analysis.
• Identifying and extracting actionable insights by discerning correlations within the data.
Moreover, it provides valuable insights into the workings of the banking sector, enriching my understanding of this industry.
Approach :
To kickstart the project, I meticulously reviewed the dataset and outlined the tasks ahead. My initial step involved tidying up the dataset in Excel, where I systematically eliminated null columns with null values over 30 percent.
Link to Excel Sheets
Video Link
(Kindly download and view)
During the data cleaning process, several issues were identified:
• 50 columns with null values over 30 percent were dropped.
• The remaining columns with null values between 0 and 30 Percent were imputed and retained. These include EXT_Source_2,EXT_Source_3,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MONTH,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,NAME_TYPE_SUITE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE and AMT_GOODS_PRICE.
Quartiles, Range and Upper bound and Lower bound were calculated to spot and fix the presence of Outliers. Then Mean and Mode were calculated to fix the outliers.
• Column FLAG_MOBIL was dropped as all the values were 1 and it had only one value with 0.
Once the data was cleaned, I proceeded with analysing the data according to the required tasks. Data Imbalance was calculated. Univariate, Segmented Univariate and Bivariate analysis were performed to gain insights into the driving factors of loan defaults. Correlations between the variables and targets were looked at to provide further insights on indicators of loan defaults.
Bar graphs, correlation matrices have been used to assess the direction and strength of the relationships between various elements.
Data Analytics Tasks:
A. Identify Missing Data and Deal with it Appropriately:
As a data analyst, you come across missing data in the loan application dataset. It is essential to handle missing data effectively to ensure the accuracy of the analysis.
Your Task: Identify the missing data in the dataset and decide on an appropriate method to deal with it using Excel built-in functions and features.
• Insight – The analysis of the dataset shows there were 50 columns with null values. These columns were dropped from the table. The remaining columns with null values between 0 and 30 Percent were imputed. Column FLAG_MOBIL was dropped as all the values were 1 and it had only one value with 0. Presence of Outliers were tested in order to check the method of impunity. Null values of Name_type_suite was filled with Unaccompanied as it was in the case in the majority of the cells.
B. Identify Outliers in the Dataset:
Outliers can significantly impact the analysis and distort the results. You need to identify outliers in the loan application dataset.
Your Task: Detect and identify outliers in the dataset using Excel statistical functions and features, focusing on numerical variables.
• Insight – As the table shows, outliers were found in AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, REGION_POPULATION_RELATIVE, CNT_CHILDREN, AMT_GOODS_PRICE, DAYS_REGISTRATION, DAYS_LAST_PHONE CHANGE and DAYS_EMPLOYED. Mean was calculated for these columns and the outliers were dealt with.
C. Analyze Data Imbalance:
Data imbalance can affect the accuracy of the analysis, especially for binary classification problems. Understanding the data distribution is crucial for building reliable models.
Your Task: Determine if there is data imbalance in the loan application dataset and calculate the ratio of data imbalance using Excel functions.
• Insight – From the table, it's evident that 92% of the target are loan re-payers while 8% are defaulters. A total of 24825 amount for the 8% defaulters while 282686 are the 92% on loan re-payers. This is a clear indication of a high imbalance which is a positive sign for the bank.
D. Perform Univariate, Segmented Univariate, and Bivariate Analysis:
To gain insights into the driving factors of loan default, it is important to conduct various analyses on consumer and loan attributes.
Your Task: Perform univariate analysis to understand the distribution of individual variables, segmented univariate analysis to compare variable distributions for different scenarios, and bivariate analysis to explore relationships between variables and the target variable using Excel functions and features.
• Insight –
Based on the tables and graphs above – The Unviariate analysis shows that applicants with the salary of 9 Lacs and above are the ones with the highest numbers of loan approvals. The 2.5-3 Lacs is the second highest with 31759 loans being approved.
The Segmented Unviariate analysis shows the comparisons of those approved and those who are denied in the same income group. Applicants with the income group between 125 – 150K had the highest numbers of loans denied at 4053.
The Segmented Unviariate analysis shows the average amount of credit given on the basis of income slabs. Here the graph shows that Those with the income of 5 lacs and above were given the highest average credit of Rs. 895110. The lowest credit with an average of Rs47081 were given to applicants between the income slab of 25000 – 50000.
E. Identify Top Correlations for Different Scenarios:
Understanding the correlation between variables and the target variable can provide insights into strong indicators of loan default.
Your Task: Segment the dataset based on different scenarios (e.g., clients with payment difficulties and all other cases) and identify the top correlations for each segmented data using Excel functions.
Insight: The heatmap displayed above indicates the correlation between different variables related to the target. It employs a color scheme ranging from red to white to blue, with blue representing the strongest correlation and red signifying the weakest. Therefore, the most pertinent correlations include total income amount to credit amount, days since birth to days of employment, and days of employment to days since ID publication.
Result: During this project, I gained an appreciation for the significance of data analytics in bank loan analysis. It offers valuable insights that facilitate informed, data-driven decision-making processes. Throughout the project, I uncovered insights such as the significance of certain features in predicting loan defaulters, as well as correlations among factors like income, loan amount, and personal asset details. These findings can be effectively communicated to pertinent stakeholders as needed.