1. The answers for Question 1 are in A3Q1. Here are some explanations of my figures:
This density plot shows the distribution of tip amounts in the NYC taxi dataset. Most tips are concentrated in the lower range (0–5 dollars), with a long right tail indicating a few high-value tips. As part of exploratory data analysis (EDA), this plot helps reveal the skewness of the tip distribution, suggesting the potential need for outlier treatment or log transformation in modeling.
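As a rough illustration of why a log transformation may help with a right-skewed variable, the sketch below computes the skewness of a small list of tips before and after applying log1p. The tip values are made up for illustration and are not taken from the dataset.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical tip amounts (illustrative only): mostly small, one large value.
tips = [0.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0, 25.0]

# log1p(x) = log(1 + x) is defined at 0, so zero tips need no special handling.
log_tips = [math.log1p(t) for t in tips]

raw_skew = skewness(tips)
log_skew = skewness(log_tips)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A positive skewness indicates a long right tail; a smaller value after the transformation suggests the transformed variable is closer to symmetric.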
This heatmap visualizes the pairwise correlation coefficients between all numeric variables in the sampled NYC taxi dataset. The color scale ranges from deep blue (strong negative correlation) to deep red (strong positive correlation), with values close to 1 indicating a strong positive linear relationship and values close to -1 indicating a strong negative relationship. This plot is useful in EDA to identify which features are strongly correlated with the target variable (e.g., tip_amount) and with each other—helping guide feature selection, reduce multicollinearity, and inform model design.
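The values behind such a heatmap are pairwise Pearson coefficients. The following is a minimal sketch of how that matrix could be computed by hand; the three columns and their values are hypothetical and only serve to show the calculation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical numeric columns (illustrative values, not from the dataset).
columns = {
    "trip_distance": [1.0, 2.5, 3.0, 5.0, 8.0],
    "fare_amount": [6.0, 11.0, 12.5, 19.0, 30.0],
    "tip_amount": [1.0, 2.0, 0.0, 4.0, 5.0],
}

names = list(columns)
# Full pairwise matrix, stored as a nested dict: corr[a][b].
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

The diagonal is always 1 and the matrix is symmetric, which is why heatmaps of this kind are often drawn with only one triangle.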
This plot displays the density distribution of tip amounts by different payment types in the NYC taxi dataset. Each color represents a different payment method (labeled as categories '1' to '4'), and the distribution is shown only for tips between 0 and 2 dollars. The plot reveals how tip behavior varies by payment method. For example, some payment types may be associated with consistently higher or lower tips. In EDA, this helps uncover behavioral patterns that can inform feature engineering or model interpretation.
This histogram shows the distribution of fare amounts in the sampled NYC taxi dataset. Most fares fall below $20, with the x-axis limited to $0–$100 to focus on the majority of observations and exclude extreme outliers. As part of exploratory data analysis, this plot helps identify the central tendency and spread of fare values, revealing that the data is right-skewed. It also highlights the need to handle outliers or consider transformations for fare-related features in predictive modeling.
This line plot illustrates the average tip amount by hour of day based on the pickup time in the NYC taxi dataset. Each point represents the mean tip given during that hour, with the x-axis covering all 24 hours of the day. The plot helps reveal temporal patterns in tipping behavior—for example, peaks during evening or late-night hours may reflect nightlife activity or higher service appreciation. As part of EDA, this insight can inform feature engineering by emphasizing the importance of time-based variables in modeling tip behavior.
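The aggregation behind a plot like this can be sketched as a group-by on the pickup hour followed by a mean. The records below are hypothetical, and the timestamp format is an assumption.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (pickup timestamp, tip amount) records, for illustration only.
trips = [
    ("2023-01-06 08:15:00", 2.0),
    ("2023-01-06 08:45:00", 3.0),
    ("2023-01-06 19:05:00", 4.0),
    ("2023-01-06 19:30:00", 6.0),
    ("2023-01-06 23:50:00", 5.0),
]

totals = defaultdict(float)  # sum of tips per pickup hour
counts = defaultdict(int)    # number of trips per pickup hour
for ts, tip in trips:
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
    totals[hour] += tip
    counts[hour] += 1

# Mean tip for each hour that has at least one trip, ordered by hour.
mean_tip_by_hour = {h: totals[h] / counts[h] for h in sorted(totals)}
print(mean_tip_by_hour)
```

Hours with few trips produce noisy means, so it can be worth checking the per-hour counts alongside the averages.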
2. The answers for Q2 and Q3 are in A3Q2_and_Q3:
2.1 Here is the explanation of my variable choices:
weekend: A binary indicator capturing behavioral differences on weekends; useful for modeling context-driven tipping.
pickup_hour_bin: Bucketed time-of-day variable; reflects observed fluctuations in tipping patterns across hours.
trip_distance: A continuous variable with the strongest positive correlation to tip amount (+0.12); longer trips generally yield higher tips.
passenger_count: An integer feature with weak correlation to tip amount but potential interaction effects; it adds information that the other features do not capture.
payment_type_vec: One-hot encoded categorical variable with strong negative correlation to tip amount (–0.50); critical for modeling tipping behavior by payment method.
PULocationID_index: A spatial feature encoding pickup zones. Although its raw correlation with tips is near zero, location can still influence tipping. Because the index values carry no meaningful numeric order, its one-hot version (PULocationID_vec) is recommended for better modeling performance.
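The derived features above can be sketched in plain Python as follows. The hour-bin edges and labels are hypothetical (the actual bucketing is not stated here), and the helper names are made up for illustration; only the payment categories 1 to 4 come from the plots described earlier.

```python
from datetime import datetime

PAYMENT_TYPES = [1, 2, 3, 4]  # payment categories seen in the data

def hour_bin(hour):
    """Hypothetical four-bucket scheme; the real bin edges may differ."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def build_features(pickup_ts, payment_type):
    dt = datetime.strptime(pickup_ts, "%Y-%m-%d %H:%M:%S")
    return {
        "weekend": 1 if dt.weekday() >= 5 else 0,  # Saturday=5, Sunday=6
        "pickup_hour_bin": hour_bin(dt.hour),
        "payment_type_vec": one_hot(payment_type, PAYMENT_TYPES),
    }

# 2023-01-07 was a Saturday.
features = build_features("2023-01-07 19:30:00", 1)
print(features)
```

Note that Spark's OneHotEncoder drops the last category by default, so its output vectors would have one slot fewer than the full encoding shown in this sketch.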
2.2 Here is the answer for A3Q2C:
When you specify a series of transformations in a Spark pipeline (such as .filter(), .select(), .withColumn()), Spark does not immediately process the DataFrame. Instead, it uses lazy evaluation: it builds a logical execution plan, essentially a blueprint of the operations to be performed. The actual computation only occurs when an action is triggered (e.g., .show(), .collect(), .write.save()), at which point Spark optimizes the plan and executes it in a distributed manner across the cluster. Because Spark sees the whole plan before running anything, it can optimize the query (for example, by pushing filters down and pruning unused columns) and reduce unnecessary data shuffling before any cluster resources are consumed.
Dask is similar in this respect: it also uses lazy evaluation for many of its APIs (especially in dask.dataframe and dask.delayed), building a task graph to represent operations, and it does not execute computations until an explicit action such as .compute() is called. The difference lies in the execution model. Dask is more lightweight and flexible, often using threads or processes rather than a JVM-based cluster, and it is generally better suited for Python-native workflows and moderate-scale problems.
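To make the transformation/action distinction concrete, here is a toy sketch of lazy evaluation in plain Python. The LazyFrame class and its methods are hypothetical and are not the Spark or Dask API; they only mimic the idea that transformations record a plan and an action executes it.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not Spark or Dask)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet executed

    def filter(self, predicate):
        # Transformation: extend the plan, touch no data.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        # Transformation: extend the plan, touch no data.
        return LazyFrame(self._rows, self._plan + (("with_column", (name, fn)),))

    def collect(self):
        # Action: run every recorded step over every row.
        out = []
        for row in self._rows:
            row = dict(row)
            keep = True
            for op, arg in self._plan:
                if op == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:
                    name, fn = arg
                    row[name] = fn(row)
            if keep:
                out.append(row)
        return out

calls = []  # records each time the predicate actually runs

def positive_fare(row):
    calls.append(row["fare"])
    return row["fare"] > 0

rows = [{"fare": 10.0, "tip": 2.0}, {"fare": 0.0, "tip": 0.0}, {"fare": 20.0, "tip": 5.0}]
plan = (LazyFrame(rows)
        .filter(positive_fare)
        .with_column("tip_pct", lambda r: r["tip"] / r["fare"]))

calls_before_action = len(calls)  # nothing has run yet; only a plan exists
result = plan.collect()           # the action triggers execution
print(calls_before_action, len(calls), result)
```

In PySpark the analogous chain would be df.filter(...).withColumn(...) followed by an action such as .collect(); in Dask the deferred graph would be executed by .compute(). Unlike this toy, both systems also optimize the plan before running it.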
2.3 Here is the report for Q:
The optimal model achieved a Test RMSE (Root Mean Squared Error) of 2.11, meaning the typical prediction error is about $2.11, with large misses weighted more heavily than small ones. While this level of error may be acceptable for capturing general trends, it also indicates limitations in capturing the full complexity of tipping behavior, possibly due to unobserved factors like driver quality or rider mood.
The best model was obtained with regParam = 0.0 and elasticNetParam = 0.0, meaning it is essentially an unregularized linear regression model. This implies that adding regularization did not improve model generalization, likely because the model was not overfitting the training data.
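Since RMSE squares the errors before averaging, it is not the same as the average absolute deviation. The sketch below compares the two on a handful of hypothetical actual/predicted tips (made-up numbers, not model output).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

# Hypothetical values: three small misses and one large miss.
actual = [2.0, 3.0, 0.0, 10.0]
predicted = [2.5, 2.5, 0.5, 4.5]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE = {rmse_value:.3f}, MAE = {mae_value:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss badly, which is plausible for a right-skewed target such as tip amount.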
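For reference, and assuming the model was fit with Spark ML's LinearRegression (the parameter names regParam and elasticNetParam suggest so), the objective being minimized can be written as below, with \lambda standing for regParam and \alpha for elasticNetParam. The intercept is omitted for brevity.

```latex
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2
\;+\; \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
```

With \lambda = 0 the penalty term vanishes whatever the value of \alpha, so the fit reduces to ordinary least squares.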
Feature importance (as measured by coefficient magnitude; signs indicate the direction of the effect):
payment_type_vec: 2.73 → This is the most influential predictor, suggesting that the type of payment (e.g., credit card vs. cash) plays a major role in determining the tip amount. This makes sense because the recorded tip amount mostly reflects card transactions; cash tips are generally not captured in the TLC trip records.
PULocationID_index: 0.019 → Indicates that the pickup location modestly influences tipping, potentially reflecting differences in passenger demographics or trip purposes across neighborhoods.
pickup_hour_bin: 0.018 → Time of day has a small but relevant effect, perhaps linked to nightlife or commuting habits.
passenger_count: 0.008 → Number of passengers has a minimal positive effect, possibly because larger groups tip slightly more.
weekend: -0.107 → Surprisingly, tipping tends to decrease slightly on weekends, contrary to expectations of leisure-related generosity.
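trip_distance: ~0.00000004 → Despite intuitive expectations, and despite having the strongest positive raw correlation with tip amount (+0.12), trip distance has a coefficient that is virtually zero. This is likely due to scale differences (raw coefficients are only comparable across features that are on the same scale) or redundancy with other variables.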
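The scale issue can be illustrated with a one-feature least-squares fit: changing the unit of the feature changes the raw coefficient, while the coefficient on the standardized feature stays the same. The data and the miles-to-feet conversion below are hypothetical and only demonstrate the effect; they do not reproduce the model above.

```python
def ols_slope(xs, ys):
    """Slope of the one-feature least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical data, for illustration only.
distance_miles = [1.0, 2.0, 3.0, 4.0, 5.0]
tip = [1.0, 1.5, 2.5, 3.0, 4.0]

# Same information expressed in a much larger unit.
distance_feet = [d * 5280 for d in distance_miles]

slope_miles = ols_slope(distance_miles, tip)  # dollars per mile
slope_feet = ols_slope(distance_feet, tip)    # dollars per foot, far smaller

# After standardization the unit no longer matters.
std_slope_miles = ols_slope(standardize(distance_miles), tip)
std_slope_feet = ols_slope(standardize(distance_feet), tip)

print(slope_miles, slope_feet, std_slope_miles, std_slope_feet)
```

For this reason, comparing coefficient magnitudes as feature importance is most meaningful when the features have been standardized before fitting, or when the coefficients are rescaled by each feature's standard deviation.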