Conversation

@jbsparrow (Owner)

Added a scraper for Discord. It scrapes media from servers, direct messages, group chats, and channels.

Users can pass links to specify exactly what they want to scrape. The table below shows how each URL form is handled:

| URL | Behavior |
| --- | --- |
| discord.com/channels/ | Scrape all servers |
| discord.com/channels/<server_id>/ | Scrape a server |
| discord.com/channels/<server_id>/<channel_id>/ | Scrape a channel |
| discord.com/channels/@me/ | Scrape all direct messages & group chats |
| discord.com/channels/@me/<channel_id>/ | Scrape a direct message channel or group chat |
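
Internally this maps to a simple dispatch on the path segments. A rough sketch of the idea (the helper method names below are placeholders, not the exact names in the diff):

```python
# Simplified dispatch sketch; helper names are placeholders, not the PR's actual methods.
async def fetch(self, scrape_item) -> None:
    # e.g. discord.com/channels/<server_id>/<channel_id>/ -> ["<server_id>", "<channel_id>"]
    parts = [p for p in scrape_item.url.parts if p and p not in ("/", "channels")]
    if not parts:
        await self.all_servers(scrape_item)                      # discord.com/channels/
    elif parts[0] == "@me":
        if len(parts) == 1:
            await self.all_direct_messages(scrape_item)          # /channels/@me/
        else:
            await self.direct_message(scrape_item, parts[1])     # /channels/@me/<channel_id>/
    elif len(parts) == 1:
        await self.server(scrape_item, parts[0])                 # /channels/<server_id>/
    else:
        await self.channel(scrape_item, parts[0], parts[1])      # /channels/<server_id>/<channel_id>/
```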

@jbsparrow requested a review from NTFSvolume on April 16, 2025 01:42
@jbsparrow changed the title from "FEAT: Add Discord Scraper" to "feat: Add Discord Scraper" on Apr 16, 2025
@NTFSvolume (Collaborator) left a comment

This also needs some kind of way to start the scraping at a specific post or date.

Discord servers can have thousands of messages, and always starting from the newest one will make every new scrape really slow.

Comment on lines 65 to 67
async with self.request_limiter:
    servers_url = self.api_url / "v9" / "users" / "@me" / "guilds"
    data = await self.client.get_json(
Collaborator

You should add a check for the token and immediately throw a LoginError if the user did not provide one.

Throw the error once and then use an internal attribute to basically disable the crawler for the entire run.
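
Something along these lines, as a rough sketch (the attribute and method names here are just illustrative, not what the PR uses):

```python
# Illustrative sketch only; `self._token` and `self.disabled` are placeholder names.
async def async_startup(self) -> None:
    if not self._token:
        self.disabled = True  # internal flag so the crawler is skipped for the rest of the run
        raise LoginError("No Discord token provided")  # exact LoginError signature may differ

async def fetch(self, scrape_item) -> None:
    if self.disabled:
        return
    ...
```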

Comment on lines 71 to 76
server_id = server.get("id")
server_name = server.get("name")
if server_id:
    new_url = scrape_item.url / server_id
    new_scrape_item = scrape_item.create_new(new_url, new_title_part=server_name, add_parent=True)
    self.manager.task_group.create_task(self.run(new_scrape_item))
Collaborator

If you didn't get a server_id, throw a 422 error

Suggested change:

-server_id = server.get("id")
-server_name = server.get("name")
-if server_id:
-    new_url = scrape_item.url / server_id
-    new_scrape_item = scrape_item.create_new(new_url, new_title_part=server_name, add_parent=True)
-    self.manager.task_group.create_task(self.run(new_scrape_item))
+if server_id := server.get("id"):
+    server_name = server["name"]
+    new_url = scrape_item.url / server_id
+    new_scrape_item = scrape_item.create_child(new_url, new_title_part=server_name)
+    self.manager.task_group.create_task(self.run(new_scrape_item))
+    scrape_item.add_children()
+else:
+    raise ScrapeError(422)

Collaborator

If some servers return an id and some don't, you may want to move the logic into its own method to make sure an error on a single server doesn't cancel all of them.
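
Roughly something like this, just to illustrate the isolation (the helper and logging names are made up):

```python
# Rough sketch only; `_handle_server` and `log` are placeholder names.
async def _handle_server(self, scrape_item, server: dict) -> None:
    try:
        if not (server_id := server.get("id")):
            raise ScrapeError(422)
        new_url = scrape_item.url / server_id
        new_scrape_item = scrape_item.create_child(new_url, new_title_part=server["name"])
        self.manager.task_group.create_task(self.run(new_scrape_item))
        scrape_item.add_children()
    except ScrapeError as e:
        log(f"Skipping server {server.get('name', '?')}: {e}")  # assumed logging helper
```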

Owner Author

All servers return an ID.

jbsparrow and others added 11 commits May 20, 2025 10:08
Co-authored-by: NTFSvolume <172021377+NTFSvolume@users.noreply.github.com>
@jbsparrow (Owner, Author)

> This also needs some kind of way to start the scraping at a specific post or date.
>
> Discord servers can have thousands of messages, and always starting from the newest one will make every new scrape really slow.

I was thinking of how to achieve that and the only thing I could come up with at the time was a new db field or a workaround using request parameters. I will take another look at options for this later.

@NTFSvolume self-assigned this on Jun 11, 2025
@NTFSvolume (Collaborator)

For query params, they accept min_id and max_id to filter search results.

Their API docs explain how to get a timestamp from an id: https://discord.com/developers/docs/reference#convert-snowflake-to-datetime

You can apply the reverse logic to a timestamp and then pad the result with zeros to get a valid id for the query.

In CDL you can add 2 custom query params for the user: before and after.

These params will take a valid ISO date as a string like 2025-06-01. The crawler should parse those query params if present and generate valid min_id and max_id params for the results.
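
For reference, a minimal sketch of that conversion (the function name is made up; the epoch value and the 22-bit shift come from the linked docs):

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1_420_070_400_000  # first millisecond of 2015, per the Discord docs

def snowflake_from_iso_date(date_str: str) -> int:
    """Smallest valid snowflake for a given ISO date, e.g. '2025-06-01'.

    The docs define: timestamp_ms = (snowflake >> 22) + DISCORD_EPOCH_MS.
    Reversing that and leaving the low 22 bits zero-padded gives an id usable
    as min_id / max_id.
    """
    dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
    ms_since_discord_epoch = int(dt.timestamp() * 1000) - DISCORD_EPOCH_MS
    return ms_since_discord_epoch << 22

# after=2025-06-01  -> min_id = snowflake_from_iso_date("2025-06-01")
# before=2025-07-01 -> max_id = snowflake_from_iso_date("2025-07-01")
```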

@jbsparrow (Owner, Author)

I don't believe the search API we are using supports that. It uses a cursor in the request JSON, which takes a message ID.

The API we are using is unfortunately undocumented, but it allows us to filter so that we only receive messages that have media, or a specific media type.

@jbsparrow (Owner, Author)

The issue I have with implementing this is that the only viable way I see for us to have a cache, so that we aren't re-scraping whole channels, is another database table or a custom cache system. A custom cache system would be great and would save us from the headaches that aiohttp-client-cache has caused us. Honestly, I don't notice a huge difference with aiohttp-client-cache, and I have no clue how well it's working.

A custom cache system would also be great because we would have a better understanding of it, we could optimize it better for CDL, and we can have more customization.

Of course that's a bit of scope creep jumping from a new table to entirely redesigning the caching system.

In any case, I can't think of any way to keep track of the last scraped message effectively. I've thought of using query parameters, but then we need a unique URL or a custom cache key.

@NTFSvolume (Collaborator)

I'm +0 on the idea of using a custom cache. I'm not opposed to it, but it will be a lot of work.

I'm -1 on adding it to the existing database. If added, it should be completely independent, with an entirely different database, especially taking into account that we'll probably have to iterate over the schema a few times until we get it "right".

The existing database is complicated enough. It has like 8 different pre-startup functions to make sure it has all the required tables and columns because the schema has been modified several times.

@NTFSvolume (Collaborator)

I will open an issue to eventually switch the core database to an ORM like SQLAlchemy.

That will be trivial to do since the current schema for the history db is more or less "concrete" and has not changed for a while. It will simplify and remove a lot of the code and it will allow CDL to support other databases, not just SQLite.

But we would have to handle a few edge cases for incompatibility with older CDL versions, so I don't think that can be done right now. And we already have some breaking changes for v7, so I don't think it's a good idea to mix it with a database change. Maybe for v8.

@jbsparrow (Owner, Author) commented Jun 20, 2025

For the schema modification checks, we can probably drop most of those and allow a sequential upgrade like we did for upgrading from v4 to v6.

I previously made a project that scrapes Discord messages to a db and also has a function to export a CDL URLs.txt. My database stores a lot of extra information, but for CDL, off the top of my head, we would just need to store the last scraped message id for each channel or server. Media URLs expire after 24 hours, so there's no point in storing them in anything but the history database. Not 100% sure on those columns though; I'll have to look into that a little more if we create a database for it.
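
As a rough placeholder, the table could look something like this (the table and column names are illustrative, not a real proposal):

```python
# Placeholder schema sketch only; table/column names are illustrative.
CREATE_DISCORD_HISTORY = """
CREATE TABLE IF NOT EXISTS discord_last_message (
    channel_id      TEXT PRIMARY KEY,   -- channel, DM, or group chat id
    server_id       TEXT,               -- NULL for DMs / group chats
    last_message_id TEXT NOT NULL,      -- newest snowflake already scraped
    scraped_at      INTEGER NOT NULL    -- unix timestamp of the last run
);
"""
```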

As for a custom cache, I totally agree, it would be a lot of work, but I do believe it would be very beneficial. In my experience with our current solution, I don't notice any speed increase and I feel like it has only caused headaches.

Actually, thinking about it right now, the lack of a noticeable speed increase might be because of the AsyncLimiter. Do you think it's limiting the speed at which we can fetch cached requests?

@NTFSvolume (Collaborator)

The AsyncLimiter does limit the database access. We have to always use it because we don't know if the response will come from the cache or from a new request.

But we can add logic to manually check the cache before making the request with the limiter.
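
Something like this, as a sketch (the cache-lookup helper is hypothetical; the real aiohttp-client-cache interface may differ):

```python
# Sketch only: skip the rate limiter when the response is already cached.
# `check_cache` and `_request_json` are hypothetical helpers, not existing
# CDL or aiohttp-client-cache methods.
async def get_json(self, domain: str, url):
    cached = await self.check_cache(url)
    if cached is not None:
        return cached
    async with self.request_limiter:
        return await self._request_json(domain, url)
```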

@jbsparrow (Owner, Author)

I thought so. No wonder it's always so slow. Will have to implement that, should significantly boost performance.

@NTFSvolume marked this pull request as draft on October 17, 2025 21:51
@jbsparrow added the "new website" label (Request or PR to add support for a new website) on Nov 3, 2025