+ Module: MongoDB #85
Open
Thank you for the great opportunity to join the development of flyscrape.
Preamble
When writing the module, I followed the existing architecture and API. Please do not consider the module a finished product; I still need to review all the steps and logic.
Config
The module has 4 parameters:
Note that there are more parameters that I did not expose in the settings, so as not to complicate the script. The maxpoolsize parameter can also be left out, since it has a default value.
DefaultMaxPoolSize
DefaultBatchSize
DefaultFlushInterval
DefaultTimeout
DefaultMaxRetries
In most cases, the default settings should be sufficient.
Note: The module also reuses the already declared concurrency setting. Since MongoDB runs as a separate process, this seemed logical to me.
What to look out for
By default (unless it is configured explicitly), the json module prints the scraped data to the command line. When the mongodb module is used, this output cannot be hidden; the only way to suppress it is to disable the json module.
One option would be to disable the json module automatically when, for example, the MongoDB module is in use.
I did not do it, because I wanted to limit myself to writing a module that does not change the behavior of the program, but works independently. I do not think it is correct to change the API of the program myself.
I don't like how limited the data output template is. I would like to be able to control it.
For example, it doesn't suit me that data is inserted into the database or into the json module's file in exactly this shape:
{ "url": "https://example.com/", "data": { "title": "Example Domain" }, "timestamp": { "$date": "2025-02-18T23:24:31.333Z" } }
I would like to insert the data into the database directly, for example, like this:
{ "url": "https://example.com/", "title": "Example Domain", "timestamp": { "$date": "2025-02-18T23:24:31.333Z" } }
or like this:
{ "title": "Example Domain" }
Thank you so much for giving me the opportunity to share my thoughts. I like this project very much, and I am trying to contribute as much as I can to its further development. I apologize if my code is not good enough; I am not a confident Go expert.
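As a rough sketch of the flattened output shapes described above (not part of this PR), a small transform step could lift the scraped fields to the top level before a document is written. The `Record` type, the `Flatten` function, and the `withMeta` flag are hypothetical names for illustration and are not flyscrape's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Record mirrors the nested shape shown above. The type and its field names
// are assumptions for this example, not flyscrape's real types.
type Record struct {
	URL       string
	Data      map[string]any
	Timestamp time.Time
}

// Flatten lifts the keys of Data to the top level of a new document.
// When withMeta is true, url and timestamp are kept alongside them; on a key
// clash the metadata value overwrites the scraped one.
func Flatten(r Record, withMeta bool) map[string]any {
	doc := make(map[string]any, len(r.Data)+2)
	for k, v := range r.Data {
		doc[k] = v
	}
	if withMeta {
		doc["url"] = r.URL
		doc["timestamp"] = r.Timestamp
	}
	return doc
}

func main() {
	r := Record{
		URL:       "https://example.com/",
		Data:      map[string]any{"title": "Example Domain"},
		Timestamp: time.Date(2025, 2, 18, 23, 24, 31, 333000000, time.UTC),
	}
	// Print both variants: with url/timestamp, and scraped fields only.
	for _, withMeta := range []bool{true, false} {
		out, err := json.Marshal(Flatten(r, withMeta))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

Here the timestamp is printed as a plain JSON string; the `{ "$date": ... }` form in the examples above is MongoDB's extended JSON rendering of a stored date. Whether the flattened keys should be allowed to collide with `url` and `timestamp` is a design decision this sketch does not settle.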