- Added `ignore_missing_null_keys` argument for running tests. Defaults to `False`, which is the same behavior as before. If set to `True`, it will ignore any keys that are missing from the test data and `None` in the extracted data. This is useful when you add a new field that does not exist in the older test files, so you do not need to update older test files unless it is actually needed.
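A minimal sketch of the comparison this flag enables. The function and parameter names here are illustrative, not the library's actual internals:

```python
def compare_extracted(expected: dict, extracted: dict,
                      ignore_missing_null_keys: bool = False) -> list:
    """Return the keys whose values differ between the saved test
    data and a fresh extraction."""
    mismatches = []
    for key, value in extracted.items():
        if key not in expected:
            # New field not present in the older test file: skip it
            # when it is None and the flag is set.
            if ignore_missing_null_keys and value is None:
                continue
            mismatches.append(key)
        elif expected[key] != value:
            mismatches.append(key)
    return mismatches
```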
- Allow a scraper's `extract_task` callback to return a list of dicts instead of just a single dict. This lets a scraper extract multiple items from a single listing if needed, treating them as separate results.
- Warning: If your scraper currently returns a list of extracts within its `extract_task` callback, the `post_extract` task now runs on each item and not on the list as a whole.
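A rough sketch of how this normalization might behave, with all names hypothetical:

```python
def run_extraction(extract_task, post_extract, raw):
    """Normalize the extract_task result and run post_extract per item.

    extract_task may return a single dict or a list of dicts; either
    way, post_extract is applied to each item individually.
    """
    result = extract_task(raw)
    items = result if isinstance(result, list) else [result]
    return [post_extract(item) for item in items]
```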
- Always use `utf-8` for reading and writing files
- Fixes comparing test data (#1)
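Passing the encoding explicitly is what makes this deterministic across platforms; a small sketch (helper names are illustrative):

```python
from pathlib import Path

def write_test_data(path, text):
    # Explicit utf-8 avoids the platform-dependent default
    # (e.g. cp1252 on Windows) that made test files differ.
    Path(path).write_text(text, encoding="utf-8")

def read_test_data(path):
    return Path(path).read_text(encoding="utf-8")
```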
- Fixed JSON encoding error in tests when displaying the `type` of a value
- Support for multiple scrapers in a single file; the passed-in `scraper_name` will now be used for the config settings. It still defaults to the file name if not supplied
- Fixed bug when reading in files to compare when running the scraper's unit tests
- Fixed error when checking the file's encoding when using the `create-test` sub command
- Fixed encoding detection to work when extracting files from S3
- Added some testing around file encoding and rate limiting
- When reading & writing files, use the `cchardet` library to detect the correct file encoding
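`cchardet.detect()` returns a dict with an `encoding` guess and a confidence score. A sketch of how such detection might be wrapped (the wrapper is hypothetical, and the naive fallback below is only there so the sketch runs even without `cchardet` installed):

```python
def detect_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess a byte string's encoding, preferring cchardet when available."""
    try:
        import cchardet  # fast C-based detector; optional here
        guess = cchardet.detect(raw).get("encoding")
        if guess:
            return guess
    except ImportError:
        pass
    # Naive fallback: try the default first; latin-1 can decode any bytes.
    try:
        raw.decode(default)
        return default
    except UnicodeDecodeError:
        return "latin-1"
```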
- Fixed reading the test sample data directory on Windows now that pathlib is used
- Revert of 0.5.3; do NOT ignore the unicode errors. Will need to find another solution for creating tests on both Windows and Mac/Linux
- Ignore unicode errors when reading a file
- Updated file paths in the scraper `create-test` command. On Windows it will now save the path with forward slashes `/`, and not `\\` (supported by pathlib)
- Fixed outdated examples to use a new site
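The forward-slash behavior can be seen in pathlib itself: `PurePath.as_posix()` renders any path with `/` separators, which both pathlib and Windows accept. The sample path below is just an example:

```python
from pathlib import PureWindowsPath

# PureWindowsPath lets this run on any OS; as_posix() converts the
# backslash separators to forward slashes.
path = PureWindowsPath(r"tests\sample_data\search_results.json")
print(path.as_posix())  # tests/sample_data/search_results.json
```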
- Added 'default' as a QA option. If the key is not set in the dict returned by the extractor, the default value will be used
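A sketch of how such a default could be applied during QA. The rule structure and function name are assumptions, not the library's actual schema:

```python
def apply_qa_defaults(item: dict, qa_rules: dict) -> dict:
    """Fill in missing keys from each QA rule's 'default', if one is set."""
    for key, rule in qa_rules.items():
        if key not in item and "default" in rule:
            item[key] = rule["default"]
    return item
```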
- Started this change log
- Added `run_id` to all scraper logs, as well as to the scraper's config values
- Have all scraper logs pull their extras from `scraper.log_extras()`
- Extraction error logs will have the scraper's correct filename and line number rather than where the library threw the exception
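A sketch of the centralized-extras pattern using the standard library's `logging` `extra` parameter; the class and method bodies here are illustrative, not the library's actual code:

```python
import logging

class Scraper:
    def __init__(self, run_id: str):
        self.run_id = run_id
        self.logger = logging.getLogger("scraper")

    def log_extras(self) -> dict:
        # Single place every log call pulls its extra fields from,
        # so run_id shows up on all records consistently.
        return {"run_id": self.run_id, "scraper_name": type(self).__name__}

    def log(self, msg: str):
        self.logger.info(msg, extra=self.log_extras())
```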
- Fixed bug of the S3 endpoint not always getting set correctly for custom endpoints
- Added `pre_extract()` method to the extract class; it runs after `__init__` and lets the user set up class-wide variables
- Added AWS access key id & secret override for the `DOWNLOADER` and `EXTRACTOR`; see the config section in the readme
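The hook pattern for `pre_extract()` might look like the following sketch; the class names and attributes are hypothetical:

```python
class BaseExtract:
    def __init__(self, raw):
        self.raw = raw
        # Hook runs right after __init__ so subclasses can set up
        # class-wide values without overriding the constructor.
        self.pre_extract()

    def pre_extract(self):
        pass  # no-op by default

class ProductExtract(BaseExtract):
    def pre_extract(self):
        self.currency = "USD"  # example class-wide variable
```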