Skip to content

Conversation

@Tadusko
Copy link
Collaborator

@Tadusko Tadusko commented Mar 31, 2025

Hi! Sorry that this is bit of a mess, since these changes came over time as I kept working on it locally.

This branch has multiple changes:

  1. Two checks for input that seemed to crash threads (in my runs, the threads were dying over time until none were left).
  • Some titles and descriptions included this null character '\x00', which postgres did not like. Titles and descriptions are now checked for it and it's removed (clean_string)
  • Sometimes, the result object had no photos. That's now checked for now and that execution is broken if there's nothing. if "photos" not in results or "photo" not in results["photos"]:. I don't know if this is the best approach.
  1. Added three new columns for the photo object that seemed useful (or, that I needed at least.)
  • Tags: tags added by the user (string)
  • License: the user-defined license, if any (int, 1–10 explanations here)
  • Accuracy: geographic scale of the photo (int, 1–16 rough scale here
  1. Reduced the number of photos needed to trigger DownloadBatchIsTooLargeError from 4000 to 3000. A more conservative estimate, may return more photos.

@Tadusko Tadusko requested a review from christophfink March 31, 2025 13:51
@christophfink
Copy link
Contributor

Hi @Tadusko,

this looks kinda good, I have a few suggestions, though:

  1. If you change the database schema (adding columns, for instance), add a transition mechanism. I once wrote something like that for twitter history, see here: gitlab.com/christophfink/twitterhistory/.../databaseschemaupdater.py You can copy that file and simply adapt it to flickrhistory (e.g., redefine SCHEMA_UPDATES), and call it from somewhere sensible (for twitterhistory, that‘s here)
  2. Why do you have the conditional in the clean_string() def? (... if isinstance(input_string, str) else input_string) - is there any case when the keys exist but something else than a string is returned? If so, would catching an additional TypeError be more pythonic?
  3. Could you implement the license and accuracy tags using a lookup-table and foreign keys there? So that it is self-documenting?
  4. the 4000 limit was documented somewhere, but now when I google for it, I only find second-hand references that read ‘about 4000 photos‘, so lowering the limit definitely makes sense
  5. I ran the black linter to check whether it would be huge changes, but they were tiny formatting things, so I simply pushed them to your repo - hope that‘s ok :)

Copy link
Contributor

@christophfink christophfink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments in this reply: #13 (comment)

@Tadusko
Copy link
Collaborator Author

Tadusko commented Apr 10, 2025

Hi Chris! Thank you for reviewing this and the comments! I'll get back to it soon – I need to dig a bit to check what were the exact errors I got & why I ended up with the current checks (ChatGPT assisted fixes, for full disclosure).

@christophfink
Copy link
Contributor

@Tadusko sorry for making so massive changes, in the end - once you start touching one thing, a million other things pop up

This is now working quite fine, I still would like to add another worker to fetch missing tags, license, and geo-accuracy information for existing photos

@christophfink christophfink dismissed their stale review April 15, 2025 15:35

Addressed all of my own concerns, requesting @Tadusko‘s review

@christophfink
Copy link
Contributor

@Tadusko I took all your changes in, but restructured a lot of things in general. I‘m going to merge this and release a dev-release. Could you then test it on a ‘real’ installation? (I‘ll write you in Slack about it)

@christophfink christophfink merged commit 9aa670b into DigitalGeographyLab:main Apr 15, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants