Skip to content

Let estimated rows work with public data#889

Draft
mccalluc wants to merge 4 commits intomainfrom
861-estimated-rows-and-public-data
Draft

Let estimated rows work with public data#889
mccalluc wants to merge 4 commits intomainfrom
861-estimated-rows-and-public-data

Conversation

@mccalluc
Copy link
Contributor

@mccalluc mccalluc commented Mar 5, 2026

  • Fix Estimated rows isn't used with public data #861
  • Simplify UI code in analysis_panel/__init__.py: Minimize the code inside the if-else, and move the return outside.
  • Use polars sample to generate a Lazyframe of the requested size, either smaller or larger than the original. (lazyframe -> dataframe -> lazyframe is kludgy, but I don't have a better alternative.)
  • Inside make_accuracy_histogram, rename row_count parameter to max_length, since that's what it is used for.

For reviewer:

  • Adding more computation will slow down the interface a little bit. With a thousand line file I don't notice a difference, but is this the direction we want to go in? (See Latency is too high for epsilon slider #863)
  • The only change in the UI is in the plot, so I don't have good idea for an automated test. Let me know if you have ideas?

@ekraffmiller
Copy link
Member

@mccalluc I tried this with a 1000 row file and it worked fine, but when I tried a 10,000 row file the app seemed to freeze up. Is this too big for DP Wizard?
PUMS5extract10000.csv

@mccalluc
Copy link
Contributor Author

mccalluc commented Mar 9, 2026

@ekraffmiller , thanks for the failing example! Which step are you stuck at?

(We are reading the entire file to infer schema, and then the sampling and preview time could also slow things down. Both of those could be done with just the first n rows, at the risk of being surprised if later rows aren't like the earlier rows.)

@ekraffmiller
Copy link
Member

@mccalluc screen-capture (3).webm
Here is a video, basically what happens is after I select the file, the page becomes unresponsive - I'm trying to update the other input fields and I can't. Eventually I get an error from Chrome that the page is unresponsive.

@mccalluc
Copy link
Contributor Author

mccalluc commented Mar 9, 2026

@ekraffmiller : The video isn't working for me, though the errors are different in FF and Chrome, but that's just FYI: I have enough information: Thank you!

  • The naive approach to sampling is almost certainly O(n^2). I think we could instead pick row indexes by taking some large-ish prime, modulo the number of rows.
  • Filed another issue for more general profiling: Profile large CSVs... and warn on upload #901. In my court.

@ekraffmiller
Copy link
Member

@ekraffmiller : The video isn't working for me, though the errors are different in FF and Chrome, but that's just FYI: I have enough information: Thank you!

  • The naive approach to sampling is almost certainly O(n^2). I think we could instead pick row indexes by taking some large-ish prime, modulo the number of rows.
  • Filed another issue for more general profiling: Profile large CSVs #901. In my court.

Ok, do you want to work more on this, or consider the size issue separately?

@mccalluc
Copy link
Contributor Author

mccalluc commented Mar 9, 2026

@ekraffmiller : Let me move this back to draft: I think the sampling does fill a gap, but with it probably being O(n^2), it's not something to merge now. Thanks for catching this.

@mccalluc mccalluc marked this pull request as draft March 9, 2026 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Pending

Development

Successfully merging this pull request may close these issues.

Estimated rows isn't used with public data

2 participants