- Project Overview
- Key Objectives
- Key Performance Indicators
- Overview of Dataset Structure
- Tools Used
- Data Cleaning
- Feature Engineering
- Exploratory Data Analysis
- Statistical Analysis
- Logistic Regression Predicting Road Test Success
- Failure Reasons
- Data Analysis and Visualization
- Key Findings
- Recommendations
- Limitations
This project analyzes a simulated DMV road test dataset to uncover the key factors that influence whether applicants pass or fail their driving test. With high failure rates leading to delays and added costs, the study aims to generate actionable insights that can improve training programs, optimize resource allocation, and support applicants in being better prepared.
- Demographic Insights: Explore pass/fail trends across age, gender, and race.
- Training Effectiveness: Evaluate the impact of different training levels (Advanced, Basic, None) on outcomes.
- Performance Drivers: Identify the most significant road assessment indicators that predict success.
- Predictive Modeling: Build and test a logistic regression model to estimate pass likelihood and measure performance using accuracy, precision, recall, and F1-score.
- Threshold Analysis: Demonstrate how varying cutoff points affect predictions using confusion matrices.
- Interactive Dashboards: Design Power BI dashboards with dynamic filters (demographics, training, results) to provide intuitive, real-time insights.
- Contains data on applicants' demographics (age, gender, and race), training participation, and road assessment indicators.
- Excel: Feature Engineering, Exploratory Data Analysis, Statistical Analysis, and Data Visualization
- Power BI: Data Visualization using DAX formulas
- Accurate data types were assigned to all fields.
- The dataset was checked thoroughly for missing values and duplicate records; none were found.
- All numeric fields were converted from the General data type to numeric data types to ensure proper analysis.
- The Age column was derived from the Age Group column using Generative AI tools. Applicants in the Teenager group were assigned ages 16-19, Young Adults were assigned ages 20-29, and Middle Age applicants were assigned ages 30-50.
- For logistic regression, categorical variables were encoded into numeric format. The Gender column was converted to binary, with Male = 1 and Female = 0. From the Training_Type column, two dummy variables were created: Training_Advanced (Advanced = 1, Basic/None = 0) and Training_Basic (Basic = 1, Advanced/None = 0).
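The two feature engineering steps above can be sketched in Python. This is only an illustration, not the method used in the project: the project derived ages with Generative AI tools, whereas this sketch assumes a uniform random draw within each range, and the group labels and function names (`derive_age`, `encode_applicant`) are hypothetical.

```python
import random

# Age ranges per group, as described in the feature engineering notes.
# The exact label spellings are assumptions.
AGE_RANGES = {"Teenager": (16, 19), "Young Adult": (20, 29), "Middle Age": (30, 50)}


def derive_age(age_group, rng):
    """Draw an integer age uniformly from the range of the given age group."""
    low, high = AGE_RANGES[age_group]
    return rng.randint(low, high)


def encode_applicant(gender, training_type):
    """Return the binary gender flag and the two training dummy variables."""
    return {
        "Gender": 1 if gender == "Male" else 0,
        "Training_Advanced": 1 if training_type == "Advanced" else 0,
        "Training_Basic": 1 if training_type == "Basic" else 0,
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
print(derive_age("Teenager", rng))
print(encode_applicant("Female", "Basic"))
```

Applicants with no training get 0 in both dummy columns, so "None" acts as the reference category in the regression.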
EDA involved exploring the DMV road assessment data to answer key questions, such as:
- What is the total number of applicants, and among them, how many qualified versus how many did not?
- What is the qualification rate in the DMV road assessment?
- What is the distribution of applicants by gender (male vs. female), and within each group, how many passed and how many failed?
- What is the distribution of applicants across different age groups?
- What is the distribution of applicants across different races?
- What is the distribution of applicants across different training types, and within each type, how many passed, how many failed, and what is the qualification rate?
- What is the overall average age of applicants, and how does it compare between those who qualified and those who did not?
- What is the distribution of pass rates across gender, age groups, training types, and race?
- What are the pass rates of male and female applicants across different training types (Advanced, Basic, None)?
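Most of the questions above reduce to a grouped pass rate. The project answered them with Excel pivot tables; the following is a minimal Python sketch of the same calculation, assuming each row is a dictionary and the outcome column is named `Result` (a hypothetical name). The sample rows are made up and are not taken from the dataset.

```python
from collections import defaultdict


def qualification_rates(records, key):
    """Return {group: pass rate in percent} for the given grouping column."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        passes[row[key]] += row["Result"] == "Pass"
    return {group: round(100 * passes[group] / totals[group], 2) for group in totals}


# Made-up rows for illustration only, not taken from the project dataset.
sample = [
    {"Gender": "Male", "Training_Type": "Advanced", "Result": "Pass"},
    {"Gender": "Female", "Training_Type": "Advanced", "Result": "Pass"},
    {"Gender": "Male", "Training_Type": "None", "Result": "Fail"},
    {"Gender": "Female", "Training_Type": "Basic", "Result": "Fail"},
]
print(qualification_rates(sample, "Training_Type"))
print(qualification_rates(sample, "Gender"))
```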
A t-test comparing the ages of applicants who passed with the ages of those who failed showed that age significantly influences the outcome (p = 0.005). On average, candidates who passed the road test were older (28.7 years) than those who failed (26.3 years). This suggests that maturity or experience may improve success rates.
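For readers who want to reproduce a comparison like this outside Excel, here is a minimal sketch of a two-sample t statistic. It assumes Welch's unequal-variance form, which may differ from the variant used in the project, and the ages are made up. The p-value is omitted because the Python standard library has no t-distribution function.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each group mean (sample variance / n)
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Made-up ages for illustration only.
passed_ages = [30, 28, 29, 31]
failed_ages = [25, 27, 26, 24]
print(welch_t(passed_ages, failed_ages))
```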
A chi-square test of Gender against the pass/fail outcome found no significant relationship between gender and outcome.
A chi-square test of Training_Type against the pass/fail outcome found a strong, statistically significant relationship between training type and outcome.
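The chi-square test of independence behind both results can be sketched as follows. This is an illustration under stated assumptions rather than the project's Excel workflow: no continuity correction is applied, the counts are made up, and the p-value uses closed forms that are valid only for 1 or 2 degrees of freedom (enough for a 2x2 Gender table and a 3x2 Training_Type table).

```python
import math


def chi_square_independence(table):
    """Return (statistic, dof, p_value) for a contingency table.

    The p-value uses closed forms that hold only for 1 or 2 degrees of
    freedom; for any other table shape it is returned as None.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    if dof == 1:
        p_value = math.erfc(math.sqrt(statistic / 2))
    elif dof == 2:
        p_value = math.exp(-statistic / 2)
    else:
        p_value = None
    return statistic, dof, p_value


# Made-up counts: rows are Advanced, Basic, None; columns are Pass, Fail.
training_table = [[40, 10], [30, 20], [15, 35]]
print(chi_square_independence(training_table))
```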
- A logistic regression model was built to predict the Pass/Fail outcome.
- Predictor variables used: Gender, Age, Training_Advanced, Training_Basic, Theory Test, Reaction time, Signals, Speed Control, Road_signs, Mirror usage, Confidence, Parking, Night_Drive, Steer_Control.
- Categorical variables were converted to dummy variables: Gender (Male = 1, Female = 0), Reaction_Fast (Fast = 1, Slow/Average = 0), and Reaction_Slow (Slow = 1, Fast/Average = 0).
- The model assigns each candidate a probability of passing between 0 and 1.
- The cutoff that classifies "Pass" vs "Fail" is set dynamically in Excel.
- Fitting the model by maximum likelihood estimation (MLE) showed that Training_Advanced and Training_Basic are the main deciding factors in whether a person passes or fails the driving test.
- A confusion matrix counts true positives, true negatives, false positives, and false negatives, from which accuracy, precision, and F1 score are calculated. The cutoff was chosen to minimize false positives, because we cannot afford to mark unqualified drivers as qualified.
- Because the cutoff is dynamic, changing it updates TP, TN, FP, FN, accuracy, precision, and F1 score automatically once the pivot table is refreshed.
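The scoring and evaluation steps above can be sketched in Python as follows. This is a minimal illustration, not the project's Excel workbook: the function names are hypothetical, the probabilities and outcomes are made up, and any coefficients passed to `pass_probability` would need to come from the fitted model.

```python
import math


def pass_probability(intercept, coefficients, features):
    """Logistic model: sigmoid of the intercept plus the weighted features."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))


def confusion_counts(probabilities, actual, cutoff):
    """Count TP, TN, FP, FN, predicting Pass when probability >= cutoff."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for probability, passed in zip(probabilities, actual):
        predicted_pass = probability >= cutoff
        if predicted_pass:
            counts["TP" if passed else "FP"] += 1
        else:
            counts["FN" if passed else "TN"] += 1
    return counts


def metrics(counts):
    """Derive accuracy, precision, recall, and F1 from confusion counts."""
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up probabilities and outcomes for illustration only.
probabilities = [0.9, 0.7, 0.65, 0.4, 0.3, 0.8]
actual = [True, True, False, True, False, False]
for cutoff in (0.6, 0.85):
    counts = confusion_counts(probabilities, actual, cutoff)
    print(cutoff, counts, metrics(counts))
```

In this toy data, raising the cutoff from 0.6 to 0.85 removes both false positives at the cost of more false negatives, which is the trade-off the dynamic cutoff is meant to expose.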
After filtering by p-value, Training_Advanced and Training_Basic emerged as the most important predictors. The negative intercept coefficient shows that, without training or skills, the baseline chance of passing is low.
The strongest drivers of passing are Advanced Training (coef 0.45) and Basic Training (coef 0.31), followed by skill-based factors like Signals, Road Signs, and Mirror Usage. Demographics such as age and gender show no significant effect.
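One way to read these coefficients, assuming they are reported on the log-odds scale (the usual convention for logistic regression, though not stated explicitly here), is to exponentiate them into odds ratios. The helper name `odds_ratio` is hypothetical.

```python
import math


def odds_ratio(coefficient):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coefficient)


# Coefficients as reported above; the log-odds interpretation is an assumption.
for name, coef in {"Training_Advanced": 0.45, "Training_Basic": 0.31}.items():
    print(f"{name}: odds ratio = {odds_ratio(coef):.2f}")
```

Under that assumption, Advanced training multiplies the odds of passing by roughly 1.57 and Basic training by roughly 1.36, relative to no training and holding the other predictors fixed.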
After data cleaning, preprocessing, and statistical analysis, all the Excel files were loaded into Power BI. Three dashboards were then built using DAX formulas, visualization charts, and other Power BI features to address all the questions in the problem statement.
- Purpose: Highlights trends in qualification rates across age groups, gender, race, and training types to uncover demographic and behavioral patterns.
- Visualization:
- Purpose: Evaluates the impact of different training programs and demographic attributes on pass/fail outcomes, helping identify factors that enhance success rates.
- Visualization:
- Purpose: Applies predictive modeling through logistic regression to estimate the likelihood of qualification, enabling data-driven decision support.
- Visualization:
1. Training Impact: Applicants with Advanced training achieved the highest pass rate (88.16%). The Basic training group contained more candidates with slower reaction types.
2. Gender Insights: Males had the higher qualification rate (50.82%), while females had the higher fail rate (51.17%).
3. Age Trends: The Middle Age group (30-50) had the highest pass rate (39.36%), whereas Teenagers (16-19) experienced the highest fail rate.
4. Race Analysis: The Others race group had the highest qualification rate (34.06%), and White candidates had the highest fail rate.
5. Test Scores: Average written and theory test scores were highest among Young Adults (20-29) and those with Advanced training.
6. Correlation Insights: Age positively correlates with qualification, indicating maturity improves success likelihood.
7. Predictive Modeling: Logistic regression showed applicants with Advanced or Basic training are more likely to pass.
8. Cutoff Optimization: At a 60-88% cutoff, the model achieved 73% accuracy with minimal false positives, aligning with the goal of reducing candidates incorrectly predicted as passing.
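A cutoff comparison of this kind can be sketched as a simple sweep that reports accuracy and false positives at each candidate cutoff. The function name `sweep_cutoffs` and the data are hypothetical, so the numbers it prints are unrelated to the 73% figure above.

```python
def sweep_cutoffs(probabilities, actual, cutoffs):
    """Return one (cutoff, accuracy, false_positives) row per cutoff."""
    rows = []
    for cutoff in cutoffs:
        predicted = [p >= cutoff for p in probabilities]
        correct = sum(pred == truth for pred, truth in zip(predicted, actual))
        false_pos = sum(pred and not truth for pred, truth in zip(predicted, actual))
        rows.append((cutoff, correct / len(actual), false_pos))
    return rows


# Made-up predicted probabilities and true outcomes for illustration only.
probs = [0.9, 0.7, 0.65, 0.4, 0.3, 0.8]
truth = [True, True, False, True, False, False]
for row in sweep_cutoffs(probs, truth, [0.6, 0.7, 0.85]):
    print(row)
```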
- Applicants who took advanced training had the highest chance of passing. Increasing access to such programs will improve pass rates.
- Even basic training gives applicants a better chance of success compared to no training. Making this more widely available can reduce failures.
- Indicators like signals, road signs, and mirror usage showed significant impact on outcomes. Extra practice and testing in these areas can raise success rates.
- An interactive dashboard helps DMV officials track pass rates by gender, age, and training type. This makes it easier to spot trends and adjust training programs quickly.
- While gender showed no significant impact, age and training choices did. Monitoring subgroup performance ensures fairness and highlights where extra support is needed.
- We only looked at existing data. We have seen patterns (like training and pass rates), but we can't be 100% sure that training causes higher pass rates. There may be other reasons.
- The people who choose "advanced training" might already be more motivated or better prepared than those who don't. So, their higher pass rate may not be only because of the training.
- Some information might not be fully accurate (for example, if skills or training were recorded incorrectly, or if people reported it themselves instead of being observed).
- We didn't have all possible factors (like prior driving experience, confidence level, or family income). These could also influence pass/fail results but were not included in the model.