The data cleaning process involved standardizing data types, handling missing values, and ensuring consistent formatting across key fields. The item number was converted to a string to prevent potential data loss or truncation. The system creation date was reformatted to ISO 8601 format to maintain consistency and facilitate downstream processing. Currency values in wholesale ($) and map ($ ) were stripped of special characters and converted to a two-decimal string format to standardize monetary data. Numeric fields, including weight, dimensions, and voltage, were coerced to float data types, with invalid entries set to NaN. Integer fields, such as carton count and bulb count, were also cleaned and converted to a consistent integer data type.
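As a rough illustration of the currency and float handling described above, here is a minimal sketch using only the standard library. The function names `clean_currency` and `to_float` are hypothetical, and returning `None` for an unparseable price is an assumption, since the description only specifies NaN for the numeric fields. In pandas, the float coercion is typically done with `pd.to_numeric(..., errors="coerce")`.

```python
import math
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def clean_currency(value) -> Optional[str]:
    """Strip currency symbols and separators; return a two-decimal string."""
    if value is None:
        return None
    # Keep only digits, the decimal point, and a minus sign.
    digits = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return f"{Decimal(digits):.2f}"
    except InvalidOperation:
        # Assumption: prices that cannot be parsed are treated as missing.
        return None


def to_float(value) -> float:
    """Coerce a value to float; invalid entries become NaN."""
    try:
        return float(str(value).strip())
    except ValueError:
        return math.nan
```

Both helpers work per value, so they could be applied to a column with `Series.map`.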
The following fields were derived:
Convert UPC to EAN13:
The UPC values were first converted to strings and any decimal points were handled to ensure numeric integrity. Leading zeros were then added to ensure a consistent 13-digit length.
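The padding step might look like the sketch below. The name `upc_to_ean13` is hypothetical. Dropping a trailing fractional part such as `.0` and returning an empty string for a value with no digits are both assumptions about how decimals and blanks were handled.

```python
import re


def upc_to_ean13(upc) -> str:
    """Convert a UPC value to a zero-padded 13-digit string."""
    text = str(upc).strip()
    # A UPC parsed as a float looks like "36000291452.0"; drop the fraction.
    if "." in text:
        text = text.split(".", 1)[0]
    digits = re.sub(r"[^0-9]", "", text)
    if not digits:
        # Assumption: values with no digits stay empty rather than all zeros.
        return ""
    return digits.zfill(13)
```

Prefixing a 12-digit UPC-A code with a zero leaves its check digit valid under the EAN-13 scheme, which is why plain left-padding is sufficient here.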
Determine Prop 65:
The item materials column was converted to lowercase, and each material was checked against a predefined list of Prop 65 chemicals. If a chemical was found, the prop_65 field was set to 'True'; otherwise, it was set to 'False'.
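A minimal sketch of this check follows. The chemical set is an illustrative subset, not the list used in this change. The delimiters used to split the materials column and the exact-match comparison are assumptions.

```python
import re

# Illustrative subset only; the real predefined list is not shown here.
PROP_65_CHEMICALS = {"lead", "cadmium", "formaldehyde"}


def has_prop_65(materials) -> str:
    """Return 'True' if any listed material is a Prop 65 chemical."""
    if materials is None:
        return "False"
    text = str(materials).lower()
    # Assumption: materials are separated by commas, semicolons, or slashes.
    parts = [part.strip() for part in re.split(r"[,;/]", text)]
    found = any(part in PROP_65_CHEMICALS for part in parts)
    return "True" if found else "False"
```

Exact matching keeps a material such as "leaded glass" from being flagged by accident. A substring match would be stricter but could produce false positives.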
Set Made to Order:
The made_to_order field was set to the default value 'False' for all entries.
Split Description into Bullet Points:
The description field was split into up to 8 sentences using regular expression matching for sentence-ending punctuation. The first sentence was assigned as the main description and the next seven as bullet points.
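The split could be sketched as below. The name `split_description` and the exact pattern are hypothetical. Discarding any sentences beyond the eighth, rather than folding them into the last bullet, is an assumption.

```python
import re


def split_description(description, max_parts=8):
    """Split text into a main description and up to seven bullet points."""
    text = "" if description is None else str(description).strip()
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Assumption: sentences beyond max_parts are dropped.
    sentences = sentences[:max_parts]
    main = sentences[0] if sentences else ""
    bullets = sentences[1:]
    return main, bullets
```

A punctuation-based split will also break on abbreviations such as "approx. 5 in.", so descriptions with many abbreviations may need a more careful pattern.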
Determine Product Configuration Codes:
The number of cartons (carton count) was used to set the product__configuration__codes field. If only one carton was present, it was marked as 'finish'; otherwise, it was marked as 'not finish'.
Convert Country to ISO Code:
The country of origin field was converted to a 3-letter ISO country code using the pycountry library, and unmatched countries were labeled as 'UNK'.
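A sketch of the lookup is shown below, assuming `pycountry.countries.lookup`, which raises `LookupError` when no country matches. The function name `country_to_iso3` and the optional `lookup` parameter are hypothetical. The parameter exists only so the function can be exercised without pycountry installed.

```python
from typing import Callable, Optional


def country_to_iso3(name, lookup: Optional[Callable] = None) -> str:
    """Return the ISO 3166-1 alpha-3 code for a country name, or 'UNK'."""
    if name is None or not str(name).strip():
        return "UNK"
    if lookup is None:
        # Third-party dependency, imported lazily.
        import pycountry
        lookup = pycountry.countries.lookup
    try:
        return lookup(str(name).strip()).alpha_3
    except LookupError:
        # Unmatched countries are labeled 'UNK'.
        return "UNK"
```

The lookup matches official names and codes, so informal spellings in the source data may still fall through to 'UNK' and need a manual alias table.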
Identify Parent SKU:
The SKU patterns are inconsistent, which makes it difficult to derive parent SKUs reliably from product names or item numbers. I also tried machine learning models using item number, product name, and both together as features, but the irregular SKU formats prevented accurate parent SKU generation. A standardized SKU structure would simplify this process.
Set Assembly Required:
If the product configuration was marked as 'not finish', the attrib__assembly_required field was set to 'True'; otherwise, it was 'False'.
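The carton-count rule and the assembly flag that depends on it can be sketched together. The function names are hypothetical. Treating a missing carton count as 'not finish' follows from the "otherwise" wording and is an assumption.

```python
def configuration_code(carton_count) -> str:
    """Return 'finish' for a single carton, otherwise 'not finish'."""
    # Assumption: a missing or non-numeric count falls into 'not finish'.
    return "finish" if carton_count == 1 else "not finish"


def assembly_required(config_code: str) -> str:
    """Return 'True' when the configuration code is 'not finish'."""
    return "True" if config_code == "not finish" else "False"
```

Because the second field is derived from the first, the configuration code has to be computed before the assembly flag.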
Concatenate Finish Fields:
The item finish field was created by concatenating item finish 1 and item finish 2, separated by a comma.
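A minimal sketch of the concatenation is shown below. The helper name `join_values` is hypothetical. Skipping missing or blank values, so that no stray comma appears, and using a comma followed by a space as the separator are both assumptions.

```python
import math


def join_values(*values, sep=", "):
    """Join non-missing values with a separator."""
    parts = []
    for value in values:
        # Skip None and NaN, which is how pandas represents missing cells.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return sep.join(parts)
```

For example, `join_values("Brass", None)` yields "Brass" rather than "Brass, None".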
Determine UL Certification:
The safety rating field was checked, and if it was 'UL', the attrib__ul_certified field was set to 'True'; otherwise, it was 'False'.
Set Warranty Years:
The attrib__warranty_years field was set to a constant value of 2.
Calculate Bulb Count:
The bulb count was calculated as the sum of bulb 1 count and bulb 2 count, treating missing values as zeros.
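The sum can be sketched as follows. The name `total_bulb_count` is hypothetical. Treating unparseable values the same way as missing ones is an assumption.

```python
import math


def total_bulb_count(bulb_1_count, bulb_2_count) -> int:
    """Sum two bulb counts, treating missing values as zero."""

    def as_int(value) -> int:
        try:
            number = float(value)
        except (TypeError, ValueError):
            # Assumption: None and unparseable text count as zero.
            return 0
        return int(number) if math.isfinite(number) else 0

    return as_int(bulb_1_count) + as_int(bulb_2_count)
```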
List Bulb Types:
The bulb type field was created by concatenating bulb 1 type and bulb 2 type, separated by a comma.
Determine Bulb Included:
If either bulb 1 included or bulb 2 included contained 'no' or 'false', the bulb included field was set to 'No'; otherwise, it was 'Yes'.
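A sketch of this rule follows. The name `bulb_included` is hypothetical. Missing values are skipped explicitly, because the text form of `None` is "None", which would otherwise match 'no'. That handling is an assumption.

```python
def bulb_included(bulb_1_included, bulb_2_included) -> str:
    """Return 'No' if either flag contains 'no' or 'false', else 'Yes'."""
    negatives = ("no", "false")
    for value in (bulb_1_included, bulb_2_included):
        if value is None:
            # Assumption: a missing flag does not count as a negative.
            continue
        text = str(value).lower()
        if any(negative in text for negative in negatives):
            return "No"
    return "Yes"
```

A substring match also flags values such as "unknown". If those appear in the data, an exact comparison after trimming would be the safer choice.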