Conversation

@jbsparrow (Owner)

Added a scraper for Discord. It scrapes media from servers, direct messages, group chats, and channels.

Users can pass links to specify exactly what they want to scrape. The table below shows how each URL form is handled:

| URL | Behavior |
| --- | --- |
| discord.com/channels/ | Scrape all servers |
| discord.com/channels/<server_id>/ | Scrape a server |
| discord.com/channels/<server_id>/<channel_id>/ | Scrape a channel |
| discord.com/channels/@me/ | Scrape all direct messages & group chats |
| discord.com/channels/@me/<channel_id>/ | Scrape a direct message channel or group chat |
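
Internally this maps to a simple dispatch on the path segments. A rough sketch of the idea (the helper method names below are placeholders, not the exact names in the diff):

```python
# Simplified dispatch sketch; helper names are placeholders, not the PR's actual methods.
async def fetch(self, scrape_item) -> None:
    # e.g. discord.com/channels/<server_id>/<channel_id>/ -> ["<server_id>", "<channel_id>"]
    parts = [p for p in scrape_item.url.parts if p and p not in ("/", "channels")]
    if not parts:
        await self.all_servers(scrape_item)                      # discord.com/channels/
    elif parts[0] == "@me":
        if len(parts) == 1:
            await self.all_direct_messages(scrape_item)          # /channels/@me/
        else:
            await self.direct_message(scrape_item, parts[1])     # /channels/@me/<channel_id>/
    elif len(parts) == 1:
        await self.server(scrape_item, parts[0])                 # /channels/<server_id>/
    else:
        await self.channel(scrape_item, parts[0], parts[1])      # /channels/<server_id>/<channel_id>/
```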

@jbsparrow requested a review from NTFSvolume on April 16, 2025 01:42
@jbsparrow changed the title from "FEAT: Add Discord Scraper" to "feat: Add Discord Scraper" on Apr 16, 2025
@NTFSvolume (Collaborator) left a comment

This also needs some kind of way to start the scraping at a specific post or date.

Discord servers can have thousands of messages, and always starting from the newest one will make every new scrape really slow.

Comment on lines 65 to 67
async with self.request_limiter:
    servers_url = self.api_url / "v9" / "users" / "@me" / "guilds"
    data = await self.client.get_json(
Collaborator

You should add a check for the token and immediately throw a LoginError if the user did not provide one.

Throw the error once and then use an internal attribute to basically disable the crawler for the entire run.
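
Something along these lines, as a rough sketch (the attribute and method names here are just illustrative, not what the PR uses):

```python
# Illustrative sketch only; `self._token` and `self.disabled` are placeholder names.
async def async_startup(self) -> None:
    if not self._token:
        self.disabled = True  # internal flag so the crawler is skipped for the rest of the run
        raise LoginError("No Discord token provided")  # exact LoginError signature may differ

async def fetch(self, scrape_item) -> None:
    if self.disabled:
        return
    ...
```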

Comment on lines 71 to 76
server_id = server.get("id")
server_name = server.get("name")
if server_id:
    new_url = scrape_item.url / server_id
    new_scrape_item = scrape_item.create_new(new_url, new_title_part=server_name, add_parent=True)
    self.manager.task_group.create_task(self.run(new_scrape_item))
Collaborator

If you didn't get a server_id, throw a 422 error

Suggested change:

-server_id = server.get("id")
-server_name = server.get("name")
-if server_id:
-    new_url = scrape_item.url / server_id
-    new_scrape_item = scrape_item.create_new(new_url, new_title_part=server_name, add_parent=True)
-    self.manager.task_group.create_task(self.run(new_scrape_item))
+if server_id := server.get("id"):
+    server_name = server["name"]
+    new_url = scrape_item.url / server_id
+    new_scrape_item = scrape_item.create_child(new_url, new_title_part=server_name)
+    self.manager.task_group.create_task(self.run(new_scrape_item))
+    scrape_item.add_children()
+else:
+    raise ScrapeError(422)

Collaborator

If some servers return an id and some don't, you may want to move the logic into its own method to make sure an error on a single server doesn't cancel all of them.
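
Roughly something like this, just to illustrate the isolation (the helper and logging names are made up):

```python
# Rough sketch only; `_handle_server` and `log` are placeholder names.
async def _handle_server(self, scrape_item, server: dict) -> None:
    try:
        if not (server_id := server.get("id")):
            raise ScrapeError(422)
        new_url = scrape_item.url / server_id
        new_scrape_item = scrape_item.create_child(new_url, new_title_part=server["name"])
        self.manager.task_group.create_task(self.run(new_scrape_item))
        scrape_item.add_children()
    except ScrapeError as e:
        log(f"Skipping server {server.get('name', '?')}: {e}")  # assumed logging helper
```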

Owner Author

All servers return an ID.

jbsparrow and others added 11 commits May 20, 2025 10:08
Co-authored-by: NTFSvolume <172021377+NTFSvolume@users.noreply.github.com>
@jbsparrow (Owner, Author)

> This also needs some kind of way to start the scraping at a specific post or date.
>
> Discord servers can have thousands of messages, and always starting from the newest one will make every new scrape really slow.

I was thinking of how to achieve that and the only thing I could come up with at the time was a new db field or a workaround using request parameters. I will take another look at options for this later.

@NTFSvolume self-assigned this on Jun 11, 2025
@NTFSvolume (Collaborator)

For query params, they accept min_id and max_id to filter search results.

Their API docs explain how to get a timestamp from an id: https://discord.com/developers/docs/reference#convert-snowflake-to-datetime

You can apply the reverse logic to a timestamp and then pad the result with zeros to get a valid id for the query.

In CDL you can add 2 custom query params for the user: before and after.

These params will take a valid ISO date as a string like 2025-06-01. The crawler should parse those query params if present and generate valid min_id and max_id params for the results.
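
For reference, a minimal sketch of that conversion (the function name is made up; the epoch value and the 22-bit shift come from the linked docs):

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1_420_070_400_000  # first millisecond of 2015, per the Discord docs

def snowflake_from_iso_date(date_str: str) -> int:
    """Smallest valid snowflake for a given ISO date, e.g. '2025-06-01'.

    The docs define: timestamp_ms = (snowflake >> 22) + DISCORD_EPOCH_MS.
    Reversing that and leaving the low 22 bits zero-padded gives an id usable
    as min_id / max_id.
    """
    dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
    ms_since_discord_epoch = int(dt.timestamp() * 1000) - DISCORD_EPOCH_MS
    return ms_since_discord_epoch << 22

# after=2025-06-01  -> min_id = snowflake_from_iso_date("2025-06-01")
# before=2025-07-01 -> max_id = snowflake_from_iso_date("2025-07-01")
```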

@jbsparrow (Owner, Author)

I don't believe the search API we are using supports that. It uses a cursor in the request JSON, which takes a message ID.

The API we are using is unfortunately undocumented, but it allows us to filter so that we only receive messages that have media, or a specific media type.

@jbsparrow (Owner, Author)

The issue I have with implementing this is that the only viable way I see for us to have a cache, so that we aren't re-scraping whole channels, is another database table or a custom cache system. A custom cache system would be great and would save us from the headaches that aiohttp-client-cache has caused us. Honestly, I don't notice a huge difference with aiohttp-client-cache, and I have no clue how well it's working.

A custom cache system would also be great because we would have a better understanding of it, we could optimize it better for CDL, and we can have more customization.

Of course that's a bit of scope creep jumping from a new table to entirely redesigning the caching system.

In any case, I can't think of any way to keep track of the last scraped message effectively. I've thought of using query parameters, but then we need a unique URL or a custom cache key.

@NTFSvolume (Collaborator)

I'm +0 on the idea of using a custom cache. I'm not opposed to it, but it will be a lot of work.

I'm -1 on adding it to the existing database. If added, it should be completely independent, with an entirely different database, especially taking into account that we'll probably have to iterate over the schema a few times until we get it "right".

The existing database is complicated enough. It has like 8 different pre-startup functions to make sure it has all the required tables and columns because the schema has been modified several times.

@NTFSvolume (Collaborator)

I will open an issue to eventually switch the core database to an ORM like SQLAlchemy.

That will be trivial to do since the current schema for the history db is more or less "concrete" and has not changed for a while. It will simplify and remove a lot of the code and it will allow CDL to support other databases, not just SQLite.

But we would have to handle a few edge cases for incompatibility with older CDL versions, so I don't think that can be done right now. And we already have some breaking changes for v7, so I don't think it's a good idea to mix it with a database change. Maybe for v8.

@jbsparrow (Owner, Author) commented Jun 20, 2025

For the schema modification checks, we can probably drop most of those and allow a sequential upgrade like we did for upgrading from v4 to v6.

I previously made a project that scrapes Discord messages to a db and also has a function to export a CDL URLs.txt. My database stores a lot of extra information, but for CDL, off the top of my head, we would just need to store the last scraped message id for each channel or server. Media URLs expire after 24 hours, so there's no point in storing them in anything but the history database. Not 100% sure on those columns though; I'll have to look into that a little more if we create a database for it.
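
As a rough placeholder, the table could look something like this (the table and column names are illustrative, not a real proposal):

```python
# Placeholder schema sketch only; table/column names are illustrative.
CREATE_DISCORD_HISTORY = """
CREATE TABLE IF NOT EXISTS discord_last_message (
    channel_id      TEXT PRIMARY KEY,   -- channel, DM, or group chat id
    server_id       TEXT,               -- NULL for DMs / group chats
    last_message_id TEXT NOT NULL,      -- newest snowflake already scraped
    scraped_at      INTEGER NOT NULL    -- unix timestamp of the last run
);
"""
```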

As for a custom cache, I totally agree, it would be a lot of work, but I do believe it would be very beneficial. In my experience with our current solution, I don't notice any speed increase and I feel like it has only caused headaches.

Actually, thinking about it right now, the lack of a noticeable speed increase might be because of the AsyncLimiter. Do you think it's limiting the speed at which we can fetch cached requests?

@NTFSvolume (Collaborator)

The AsyncLimiter does limit the database access. We have to always use it because we don't know if the response will come from the cache or from a new request.

But we can add logic to manually check the cache before making the request with the limiter.
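
Something like this, as a sketch (the cache-lookup helper is hypothetical; the real aiohttp-client-cache interface may differ):

```python
# Sketch only: skip the rate limiter when the response is already cached.
# `check_cache` and `_request_json` are hypothetical helpers, not existing
# CDL or aiohttp-client-cache methods.
async def get_json(self, domain: str, url):
    cached = await self.check_cache(url)
    if cached is not None:
        return cached
    async with self.request_limiter:
        return await self._request_json(domain, url)
```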

@jbsparrow (Owner, Author)

I thought so. No wonder it's always so slow. Will have to implement that, should significantly boost performance.

@NTFSvolume marked this pull request as draft on October 17, 2025 21:51
@jbsparrow added the "new website" label (Request or PR to add support for a new website) on Nov 3, 2025