@dejurin
Contributor

@dejurin dejurin commented Feb 19, 2025

Thank you for the great opportunity to join the development of flyscrape.

Preamble

When writing the module, I followed the existing architecture and API. Please treat the module as a work in progress rather than a finished product: I still need to review all the steps and the logic.

Config

The module has 4 parameters:

  --output.mongodb.uri="mongodb://localhost:27017"
  --output.mongodb.database="db"
  --output.mongodb.collection="posts"
  --output.mongodb.maxpoolsize=10

There are more MongoDB parameters that I deliberately left out of the settings to keep the script simple. The maxpoolsize parameter can also be omitted, since it has a default value.
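To illustrate how an omitted maxpoolsize could fall back to its default, here is a minimal sketch. The `Config` struct, `withDefaults`, and `validate` are hypothetical names used only for illustration, not the module's actual code, and treating the other three options as required is an assumption. The default of 100 is the `DefaultMaxPoolSize` value from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// DefaultMaxPoolSize matches the default declared in this PR.
const DefaultMaxPoolSize = 100

// Config is a hypothetical holder for the four output.mongodb.* options.
type Config struct {
	URI         string
	Database    string
	Collection  string
	MaxPoolSize uint64
}

// withDefaults returns a copy of c with MaxPoolSize filled in when it was left out.
func (c Config) withDefaults() Config {
	if c.MaxPoolSize == 0 {
		c.MaxPoolSize = DefaultMaxPoolSize
	}
	return c
}

// validate reports the first missing option.
// Assumption: uri, database and collection are all required.
func (c Config) validate() error {
	switch {
	case c.URI == "":
		return errors.New("output.mongodb.uri is required")
	case c.Database == "":
		return errors.New("output.mongodb.database is required")
	case c.Collection == "":
		return errors.New("output.mongodb.collection is required")
	}
	return nil
}

func main() {
	cfg := Config{
		URI:        "mongodb://localhost:27017",
		Database:   "db",
		Collection: "posts",
	}.withDefaults()
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

An explicit value such as `--output.mongodb.maxpoolsize=10` would be kept as-is; only a missing (zero) value is replaced.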

var (
	DefaultMaxPoolSize   = 100
	DefaultBatchSize     = 100
	DefaultFlushInterval = 15 * time.Second
	DefaultTimeout       = 30 * time.Second
	DefaultMaxRetries    = 3
)
| Parameter | Description |
| --- | --- |
| DefaultMaxPoolSize | Maximum number of connections in the MongoDB connection pool; set it higher (e.g., 200) for high concurrency or lower (e.g., 50) for low traffic. |
| DefaultBatchSize | Number of documents to buffer before bulk insert; increase (e.g., 500) for more efficient writes or decrease (e.g., 50) for lower memory usage. |
| DefaultFlushInterval | Time interval to automatically flush the buffer; reduce (e.g., 5s) for low-latency or increase (e.g., 30s) for less frequent write operations. |
| DefaultTimeout | Maximum time allowed for MongoDB operations; increase (e.g., 60s) for slow networks or reduce (e.g., 15s) for faster, more reliable setups. |
| DefaultMaxRetries | Number of retry attempts for failed inserts; increase (e.g., 5) for unstable networks or decrease (e.g., 1) if errors are rare. |
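
The interplay of DefaultBatchSize and DefaultFlushInterval can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Doc`, `InsertFunc`, and `Buffer` are hypothetical names, and the insert callback stands in for a real MongoDB bulk insert.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Doc is a stand-in for one scraped document.
type Doc map[string]any

// InsertFunc is a hypothetical bulk-insert callback.
type InsertFunc func(batch []Doc) error

// Buffer collects documents and hands them to insert in batches.
type Buffer struct {
	mu        sync.Mutex
	docs      []Doc
	batchSize int
	insert    InsertFunc
}

func NewBuffer(batchSize int, insert InsertFunc) *Buffer {
	return &Buffer{batchSize: batchSize, insert: insert}
}

// Add appends a document and flushes once batchSize documents are buffered.
func (b *Buffer) Add(d Doc) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.docs = append(b.docs, d)
	if len(b.docs) >= b.batchSize {
		return b.flushLocked()
	}
	return nil
}

// Flush writes out whatever is buffered, even a partial batch.
func (b *Buffer) Flush() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.flushLocked()
}

// flushLocked must be called with b.mu held. For simplicity a failed
// batch is dropped here; a real module would retry or keep it.
func (b *Buffer) flushLocked() error {
	if len(b.docs) == 0 {
		return nil
	}
	batch := b.docs
	b.docs = nil
	return b.insert(batch)
}

// Run flushes on every tick until stop is closed, then flushes one last time.
func (b *Buffer) Run(interval time.Duration, stop <-chan struct{}) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := b.Flush(); err != nil {
				return err
			}
		case <-stop:
			return b.Flush()
		}
	}
}

func main() {
	buf := NewBuffer(2, func(batch []Doc) error {
		fmt.Printf("inserting %d documents\n", len(batch))
		return nil
	})
	for i := 0; i < 5; i++ {
		if err := buf.Add(Doc{"n": i}); err != nil {
			fmt.Println("insert failed:", err)
		}
	}
	// The interval-based Run would normally pick up the remaining document.
	_ = buf.Flush()
}
```

A larger batch size means fewer round trips but more documents held in memory; a shorter interval bounds how long a partial batch can wait before it is written.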

In most cases, the default settings should be sufficient.
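
For readers wondering how DefaultTimeout and DefaultMaxRetries might work together, the following is one possible sketch. It assumes that "retries" means additional attempts after the first one, so 3 retries allow up to 4 attempts; the PR does not state this. `Op`, `insertWithRetry`, and `failNTimes` are hypothetical names, and the fixed backoff between attempts is an added assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Defaults as listed in this PR.
const (
	DefaultTimeout    = 30 * time.Second
	DefaultMaxRetries = 3
)

// Op is a hypothetical insert operation that honours the context deadline.
type Op func(ctx context.Context) error

// insertWithRetry runs op with a per-attempt timeout and retries on failure.
// It returns the number of attempts made and the last error, if any.
func insertWithRetry(maxRetries int, timeout, backoff time.Duration, op Op) (int, error) {
	if maxRetries < 0 {
		maxRetries = 0
	}
	var err error
	for attempt := 1; attempt <= maxRetries+1; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err = op(ctx)
		cancel()
		if err == nil {
			return attempt, nil
		}
		if attempt <= maxRetries {
			time.Sleep(backoff) // wait before the next attempt
		}
	}
	return maxRetries + 1, fmt.Errorf("insert failed after %d attempts: %w", maxRetries+1, err)
}

// failNTimes returns an Op that fails n times and then succeeds.
// It only simulates an unstable network for this example.
func failNTimes(n int) Op {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if calls <= n {
			return errors.New("temporary network error")
		}
		return nil
	}
}

func main() {
	attempts, err := insertWithRetry(DefaultMaxRetries, DefaultTimeout, 10*time.Millisecond, failNTimes(2))
	fmt.Println("attempts:", attempts, "err:", err)
}
```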

Note: The module also reuses the already declared concurrency setting. Since MongoDB runs as a separate process, this seemed logical to me.

What to look out for

  1. By default (unless configured otherwise), the json module prints the scraped data to the command line. When the mongodb module is used, this output cannot be hidden except by disabling the json module.
    One option would be to disable the json module automatically when, for example, the MongoDB module is in use.
    I did not do this because I wanted to limit myself to writing a module that works independently and does not change the behavior of the program. I do not think it is right for me to change the program's API on my own.

  2. I don't like how limited the data output template is. I would like to control the process.
    For example, it doesn't suit me that data is always inserted into the database, or written to the json module's file, in this shape:

{
  "url": "https://example.com/",
  "data": {
    "title": "Example Domain"
  },
  "timestamp": {
    "$date": "2025-02-18T23:24:31.333Z"
  }
}

I would like to insert the data into the database directly, without the nested "data" object, for example like this:

{
  "url": "https://example.com/",
  "title": "Example Domain",
  "timestamp": {
    "$date": "2025-02-18T23:24:31.333Z"
  }
}

or like this:

{
  "title": "Example Domain"
}
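
As an illustration of the flattened shape, a transformation along these lines could lift the keys of "data" to the top level. This is only a sketch of the idea, not part of the PR: `flatten` is a hypothetical helper, and letting existing top-level keys (such as "url") win over same-named keys in "data" is an arbitrary choice.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatten lifts the keys of the nested "data" object to the top level.
// Keys that already exist at the top level are not overwritten.
func flatten(doc map[string]any) map[string]any {
	out := make(map[string]any, len(doc))
	for k, v := range doc {
		if k != "data" {
			out[k] = v
		}
	}
	data, ok := doc["data"].(map[string]any)
	if !ok {
		// "data" is missing or not an object: keep it unchanged if present.
		if v, exists := doc["data"]; exists {
			out["data"] = v
		}
		return out
	}
	for k, v := range data {
		if _, taken := out[k]; !taken {
			out[k] = v
		}
	}
	return out
}

func main() {
	raw := `{
	  "url": "https://example.com/",
	  "data": {"title": "Example Domain"},
	  "timestamp": {"$date": "2025-02-18T23:24:31.333Z"}
	}`
	var doc map[string]any
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		fmt.Println("decode:", err)
		return
	}
	pretty, err := json.MarshalIndent(flatten(doc), "", "  ")
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	fmt.Println(string(pretty))
}
```

The title-only variant would additionally drop "url" and "timestamp", which suggests the output shape should be configurable rather than fixed.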

Thank you so much for giving me the opportunity to share my thoughts. I like this project very much, and I will try to contribute as much as I can to its further development. I apologize if my code is not good enough; I am not yet a confident Go developer.

Remove "Error", 
The "return" already implies an error.
Error strings should not be capitalized (unless beginning with proper nouns or acronyms) or end with punctuation, since they are usually printed following other context.
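
A small sketch of that convention, using hypothetical `connect` and `start` functions rather than the module's real ones: the error strings are lowercase, carry no "Error" prefix and no trailing punctuation, so they read naturally when wrapped with more context.

```go
package main

import (
	"errors"
	"fmt"
)

// connect is a hypothetical stand-in for the module's connection setup.
func connect(uri string) error {
	if uri == "" {
		// Avoid: errors.New("Error: URI is empty.")
		return errors.New("mongodb uri is empty")
	}
	return nil
}

// start wraps the underlying error with context using %w.
func start(uri string) error {
	if err := connect(uri); err != nil {
		return fmt.Errorf("start mongodb output: %w", err)
	}
	return nil
}

func main() {
	if err := start(""); err != nil {
		// prints "flyscrape: start mongodb output: mongodb uri is empty"
		fmt.Println("flyscrape:", err)
	}
}
```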
+ mongodb included (module)
+ example of scraper (js)
+ mongo-driver (require)
@philippta
Owner

Thank you for putting in all this effort and writing down all the details!

I will take a deeper look as soon as possible and test it locally.
I just wanted to ensure your effort doesn't go unnoticed!

@dejurin
Contributor Author

dejurin commented Feb 21, 2025

Anyway, I like your project and intend to keep benefiting from your contributions to open source.
Thank you. 😊
