
Scraper cleaning #5

@Rich-T-kid

Description


Implement Data Cleaning for Scraped Data (Dates and Default Values)

Objective

Develop a robust data cleaning mechanism for the scraped data, focusing on transforming date fields into proper SQL- and Go-compatible values. Additionally, ensure all other fields are cleaned and that default values are sensible and consistent.

Tasks

  1. Date Transformation:

    • Parse date fields from the scraped data into proper SQL DATETIME format.
    • Convert dates into Go-compatible time.Time objects for internal processing.
    • Handle various date formats gracefully (e.g., "November 23 · 10 PM" or "December 7 · 10am - December 15 · 6pm EST").
    • Add logging for date fields that fail to parse and ensure such records are skipped or marked for manual review.
  2. Cleaning Other Fields:

    • Strip unnecessary whitespace, special characters, or HTML tags from text fields.
    • Ensure all numerical fields (e.g., price, event capacity) are converted to appropriate data types (e.g., float64 or int).
    • Normalize location fields (e.g., city, state, zip code) to ensure consistent formatting.
  3. Default Values:

    • Set sensible default values for missing or invalid fields:
      • Dates: Use NULL or a placeholder value (e.g., 0001-01-01 00:00:00 for Go's zero value).
      • Text fields: Use "Unknown" or "Not Provided".
      • Numerical fields: Use 0 or an appropriate minimum value.
  4. Integration:

    • Integrate the cleaning logic into the scraping workflow so that all data is cleaned before being saved to the database.
    • Add a separate function or module to handle data cleaning to keep the scraper modular and maintainable.
  5. Testing:

    • Write test cases to ensure:
      • Date fields are parsed correctly for various formats.
      • Fields are properly cleaned and transformed into their target formats.
      • Default values are applied for missing or invalid fields.
    • Include edge cases for invalid or unexpected data formats.

Acceptance Criteria

  • Dates are consistently transformed into SQL DATETIME and Go time.Time formats.
  • All fields are cleaned and normalized, removing any invalid or extraneous data.
  • Default values are applied where data is missing or invalid.
  • The data cleaning process is integrated into the scraping workflow without significant performance degradation.
  • All code is well-documented and tested.

Additional Notes

  • Consider using Go libraries such as time.Parse for date handling and strings.TrimSpace for cleaning string fields.
  • For edge cases or unexpected formats, log the issue and skip the record instead of breaking the pipeline.
  • Maintain modularity by separating the cleaning logic into its own package or function.
