ETL Assignment#49

Open
rameshsolasa17 wants to merge 2 commits into HedgeApple:master from rameshsolasa17:master


rameshsolasa17 commented May 7, 2025

The data cleaning process involved standardizing data types, handling missing values, and ensuring consistent formatting across key fields. The item number was converted to a string to prevent potential data loss or truncation. The system creation date was reformatted to ISO 8601 format to maintain consistency and facilitate downstream processing. Currency values in wholesale ($) and map ($) were stripped of special characters and converted to a two-decimal string format to standardize monetary data. Numeric fields, including weight, dimensions, and voltage, were coerced to float data types, with invalid entries set to NaN. Integer fields, such as carton count and bulb count, were also cleaned and converted to a consistent integer data type.
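
As a rough illustration of the cleaning rules above, here is a minimal standard-library sketch. The helper names (`clean_currency`, `to_float`, `to_iso_date`) are hypothetical and are not taken from the PR's code. The input date format `%m/%d/%Y` is an assumption, since the source format of the system creation date is not stated here.

```python
import math
import re
from datetime import datetime


def clean_currency(value):
    """Strip currency symbols and separators; return a two-decimal string.

    Returns None when nothing numeric is left (missing or invalid input).
    """
    if value is None:
        return None
    text = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return f"{float(text):.2f}"
    except ValueError:
        return None


def to_float(value):
    """Coerce a value to float; invalid or missing entries become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def to_iso_date(value, input_format="%m/%d/%Y"):
    """Reformat a date string to ISO 8601 (YYYY-MM-DD).

    input_format is an assumed source format; adjust it to the real data.
    """
    try:
        return datetime.strptime(str(value).strip(), input_format).date().isoformat()
    except ValueError:
        return None
```

Returning `None` for unparseable currency and date values is a design choice of this sketch; the description above only specifies NaN for invalid numeric entries.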

The following fields were derived:

  1. Convert UPC to EAN13:
    The UPC values were first converted to strings and any decimal points were handled to ensure numeric integrity. Leading zeros were then added to ensure a consistent 13-digit length.

  2. Determine Prop 65:
    The item materials column was converted to lowercase, and each material was checked against a predefined list of Prop 65 chemicals. If a chemical was found, the prop_65 field was set to 'True'; otherwise, it was set to 'False'.

  3. Set Made to Order:
    The made_to_order field was set to 'False' for all entries, as a default value.

  4. Split Description into Bullet Points:
    The description field was split into at most 8 sentences with a regular expression that matches sentence-ending punctuation. The first sentence became the main description, and the remaining sentences (up to seven) became bullet points.

  5. Determine Product Configuration Codes:
    The number of cartons (carton count) was used to set the product__configuration__codes field. If only one carton was present, it was marked as 'finish'; otherwise, it was marked as 'not finish'.

  6. Convert Country to ISO Code:
    The country of origin field was converted to a 3-letter ISO country code using the pycountry library, and unmatched countries were labeled as 'UNK'.

  7. Identify Parent SKU:
    The SKU patterns are inconsistent, which makes it difficult to derive parent SKUs reliably from product names or item numbers. I also tried machine learning models that used the item number, the product name, and both together as features, but the irregular SKU formats prevented accurate parent SKU generation. A standardized SKU structure would simplify this process.

  8. Set Assembly Required:
    If the product configuration was marked as 'not finish', the attrib__assembly_required field was set to 'True'; otherwise, it was 'False'.

  9. Concatenate Finish Fields:
    The item finish field was created by concatenating item finish 1 and item finish 2, separated by a comma.

  10. Determine UL Certification:
    The safety rating field was checked, and if it was 'UL', the attrib__ul_certified field was set to 'True'; otherwise, it was 'False'.

  11. Set Warranty Years:
    The attrib__warranty_years field was set to a constant value of 2.

  12. Calculate Bulb Count:
    The bulb count was calculated as the sum of bulb 1 count and bulb 2 count, treating missing values as zeros.

  13. List Bulb Types:
    The bulb type field was created by concatenating bulb 1 type and bulb 2 type, separated by a comma.

  14. Determine Bulb Included:
    If either bulb 1 included or bulb 2 included contained 'no' or 'false', the bulb included field was set to 'No'; otherwise, it was 'Yes'.
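
Several of the derivations above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this PR: the function names are hypothetical, `PROP_65_CHEMICALS` is a small illustrative subset rather than the predefined list the PR uses, and dropping everything after a decimal point in the UPC is one possible reading of how decimal points were handled.

```python
import re

# Illustrative subset only; the PR's actual Prop 65 list is not shown here.
PROP_65_CHEMICALS = {"lead", "cadmium", "formaldehyde"}


def upc_to_ean13(upc):
    """Zero-pad a UPC to 13 digits, dropping a float-style suffix such as '.0'."""
    text = str(upc).strip().split(".", 1)[0]
    digits = re.sub(r"\D", "", text)
    return digits.zfill(13) if digits else None


def has_prop_65(materials):
    """Return 'True' if any listed chemical appears in the lowercased materials."""
    text = str(materials or "").lower()
    return "True" if any(chem in text for chem in PROP_65_CHEMICALS) else "False"


def split_description(description, max_sentences=8):
    """Split on sentence-ending punctuation; first sentence is the main description."""
    parts = re.split(r"(?<=[.!?])\s+", str(description or "").strip())
    sentences = [p.strip() for p in parts if p.strip()][:max_sentences]
    main = sentences[0] if sentences else ""
    return main, sentences[1:]


def bulb_count(count_1, count_2):
    """Sum both bulb counts, treating missing or invalid values as zero."""
    def as_int(value):
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0
    return as_int(count_1) + as_int(count_2)


def configuration_code(carton_count):
    """'finish' for a single carton, otherwise 'not finish'."""
    return "finish" if carton_count == 1 else "not finish"


def assembly_required(config_code):
    """'True' when the configuration is 'not finish', otherwise 'False'."""
    return "True" if config_code == "not finish" else "False"
```

For example, `upc_to_ean13("036000291452")` pads a 12-digit UPC with one leading zero, and `assembly_required(configuration_code(2))` yields `'True'` for a two-carton product.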
