feat: Add Discord Scraper #901
base: dev
Conversation
Refactor the scraper, add the ability to scrape all of a user's servers, fix an issue with datetime parsing, and adjust the request limiter.
This also needs some way to start the scraping at a specific post or date.
Discord servers can have thousands of messages, and always starting from the newest one will make every new scrape really slow.
cyberdrop_dl/crawlers/discord.py
Outdated
```python
async with self.request_limiter:
    servers_url = self.api_url / "v9" / "users" / "@me" / "guilds"
    data = await self.client.get_json(  # diff excerpt truncated here
```
You should add a check for the token and immediately throw a LoginError if the user did not provide one.
Throw the error once and then use an internal attribute to basically disable the crawler for the entire run.
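A minimal sketch of what that could look like; the `LoginError` name and the `_disabled` attribute are illustrative, not taken from the actual cyberdrop_dl code base:

```python
class LoginError(Exception):
    """Stand-in for the crawler's login error."""


class DiscordCrawler:
    def __init__(self, token: str | None = None) -> None:
        self.token = token
        self._disabled = False  # internal flag that disables the crawler for the whole run

    async def fetch(self, url: str) -> None:
        if self._disabled:
            return  # the missing token was already reported once; skip quietly
        if not self.token:
            self._disabled = True
            raise LoginError("No Discord token provided")
        # ... proceed with the authenticated request ...
```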
cyberdrop_dl/crawlers/discord.py
Outdated
```python
server_id = server.get("id")
server_name = server.get("name")
if server_id:
    new_url = scrape_item.url / server_id
    new_scrape_item = scrape_item.create_new(new_url, new_title_part=server_name, add_parent=True)
    self.manager.task_group.create_task(self.run(new_scrape_item))
```
If you didn't get a `server_id`, throw a 422 error:
Suggested change:

```diff
- server_id = server.get("id")
- server_name = server.get("name")
- if server_id:
-     new_url = scrape_item.url / server_id
-     new_scrape_item = scrape_item.create_new(new_url, new_title_part=server_name, add_parent=True)
-     self.manager.task_group.create_task(self.run(new_scrape_item))
+ if server_id := server.get("id"):
+     server_name = server["name"]
+     new_url = scrape_item.url / server_id
+     new_scrape_item = scrape_item.create_child(new_url, new_title_part=server_name)
+     self.manager.task_group.create_task(self.run(new_scrape_item))
+     scrape_item.add_children()
+ else:
+     raise ScrapeError(422)
```
If some servers return an id and some don't, you may want to move the logic to its own method to make sure an error on a single server doesn't cancel all of them.
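A rough sketch of that split, reusing the names from the diff above; `ScrapeError`, `scrape_item`, and the logging call are illustrative rather than verified against the code base:

```python
import logging


async def _process_server(self, scrape_item, server: dict) -> None:
    # One coroutine per server: an error here only skips this server instead
    # of bubbling up and cancelling the sibling tasks in the task group.
    try:
        server_id = server.get("id")
        if not server_id:
            raise ScrapeError(422)
        new_url = scrape_item.url / server_id
        new_scrape_item = scrape_item.create_child(new_url, new_title_part=server["name"])
        await self.run(new_scrape_item)
        scrape_item.add_children()
    except ScrapeError as exc:
        logging.warning("Skipping server %s: %s", server.get("name"), exc)


# in the loop over servers:
# self.manager.task_group.create_task(self._process_server(scrape_item, server))
```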
All servers return an ID.
Co-authored-by: NTFSvolume <172021377+NTFSvolume@users.noreply.github.com>
I was thinking of how to achieve that, and the only thing I could come up with at the time was a new db field or a workaround using request parameters. I will take another look at options for this later.
For query params they accept message IDs (snowflakes). Their API docs explain how to get a timestamp from an id: https://discord.com/developers/docs/reference#convert-snowflake-to-datetime. You can apply the reverse logic to a timestamp and then pad the result with zeros to get a valid id for the query. In CDL you can add 2 custom query params for the user; these params would take a valid ISO date as a string.
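A small sketch of that conversion, based on the snowflake layout in the linked docs (the epoch constant and the 22-bit shift come from that page; the function name and the CDL-side param names are not specified here):

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1_420_070_400_000  # 2015-01-01T00:00:00Z, per the Discord docs


def iso_date_to_snowflake(iso_date: str) -> int:
    """Turn an ISO date into the smallest snowflake created at that instant.

    The low 22 bits (worker, process, increment) are left as zeros, i.e. the
    timestamp is "padded" into a valid ID usable wherever the query expects
    a message id.
    """
    dt = datetime.fromisoformat(iso_date)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    timestamp_ms = int(dt.timestamp() * 1000)
    return (timestamp_ms - DISCORD_EPOCH_MS) << 22


print(iso_date_to_snowflake("2024-01-01"))  # usable as a "messages after this date" cursor
```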
I don't believe the search API we are using supports that. It uses a cursor in the request JSON, which takes a message ID. The API we are using is unfortunately undocumented, but it allows us to filter so that we only receive messages that have media, or a specific media type.
The issue I have with implementing this is that the only viable way I see for us to have a cache, so that we aren't re-scraping whole channels, is another database table or a custom cache system. A custom cache system would be great and would save us from the headaches the current caching setup has caused. It would also give us a better understanding of it, let us optimize it for CDL, and allow more customization. Of course, that's a bit of scope creep, jumping from a new table to entirely redesigning the caching system. In any case, I can't think of any way to keep track of the last scraped message effectively. I've thought of using query parameters, but then we need a unique URL or a custom cache key.
I'm +0 on the idea of using a custom cache. I'm not opposed to it, but it will be a lot of work. I'm -1 on adding it to the existing database. If added, it should be completely independent, with an entirely different database, especially taking into account that we will probably have to iterate a few times over the schema until we get it "right". The existing database is complicated enough. It has about 8 different pre-startup functions to make sure it has all the required tables and columns, because the schema has been modified several times.
I will open an issue to eventually switch the core database to an ORM like SQLAlchemy. That will be trivial to do since the current schema for the history db is more or less "concrete" and has not changed for a while. It will simplify and remove a lot of the code, and it will allow CDL to support other databases, not just SQLite. But we would have to handle a few edge cases for incompatibility with older CDL versions, so I don't think that can be done right now. And we already have some breaking changes for v7, so I don't think it's a good idea to mix them with a database change. Maybe for v8.
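For illustration, a minimal SQLAlchemy 2.0 declarative model of roughly what part of the history table could look like; the column names here are guesses, not CDL's actual schema:

```python
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class MediaItem(Base):
    __tablename__ = "media"

    id: Mapped[int] = mapped_column(primary_key=True)
    domain: Mapped[str]
    url: Mapped[str]                       # original media URL
    download_filename: Mapped[str | None]  # None until the file has been downloaded
    completed: Mapped[bool] = mapped_column(default=False)
```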
For the schema modification checks, we can probably drop most of those and allow a sequential upgrade like we did for upgrading from v4 to v6. I previously made a project that scrapes Discord messages to a db and also has a function to export to a CDL URLs.txt. My database stores a lot of extra information, but for CDL, off the top of my head, we would just need to store the last scraped message id for each channel or server. Media URLs expire after 24 hours, so there's no point in storing them in anything but the history database. Not 100% sure about those columns though; I'll have to look into that a little more if we create a database for it. As for a custom cache, I totally agree it would be a lot of work, but I do believe it would be very beneficial. In my experience with our current solution, I don't notice any speed increase and I feel like it has only caused headaches. Actually, thinking about it right now, the lack of a noticeable speed increase might be because of the request limiter.
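Going back to the last-scraped-message-id idea, a tiny standalone SQLite sketch of the kind of table that would be enough for resuming; the column names are illustrative, not an agreed-upon schema:

```python
import sqlite3

conn = sqlite3.connect("discord_scrape_state.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS channel_progress (
        channel_id      TEXT PRIMARY KEY,
        last_message_id TEXT NOT NULL,
        updated_at      TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def save_progress(channel_id: str, last_message_id: str) -> None:
    # Upsert so each channel keeps exactly one row with its newest scraped message id.
    with conn:
        conn.execute(
            "INSERT INTO channel_progress (channel_id, last_message_id) VALUES (?, ?) "
            "ON CONFLICT(channel_id) DO UPDATE SET "
            "last_message_id = excluded.last_message_id, updated_at = CURRENT_TIMESTAMP",
            (channel_id, last_message_id),
        )
```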
The request limiter is applied to every request, even ones that end up being served from the cache. But we can add logic to manually check the cache before making the request with the limiter.
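A rough sketch of that ordering; the `_is_cached` helper is hypothetical, and the exact lookup depends on the caching backend CDL actually uses:

```python
async def get_json_rate_limited(self, url):
    # Serve cached responses without touching the limiter; only uncached
    # requests pay the rate-limit cost.
    if await self._is_cached(url):  # hypothetical helper around the cache backend
        return await self.client.get_json(url)
    async with self.request_limiter:
        return await self.client.get_json(url)
```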
I thought so. No wonder it's always so slow. Will have to implement that; it should significantly boost performance.
Ruff fixes and some migration to newer code
Added a scraper for Discord. It scrapes media from servers, direct messages, group chats, and channels.
The user can use links to specify what they want to scrape. Here is a table specifying how it works: