Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 16 additions & 2 deletions tests/quick_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
import logging
from unittest.mock import patch

import pandas as pd
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split

Expand All @@ -18,18 +19,27 @@

logging.basicConfig(level=logging.INFO)

def embiggen(x):
df = pd.DataFrame(x)
print(f"shape before: {df.shape}")
big = pd.concat([df] * 50, ignore_index=True)
print(f"shape after: {big.shape}")
return big
Comment on lines +22 to +27
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The embiggen function can be improved for style, consistency, and readability:

  • Indentation: The function body is incorrectly indented with 8 spaces. According to PEP 8, it should be 4 spaces.
  • Logging: The function uses print statements, but the script is configured to use the logging module. It's better to use logging.info for consistent output.
  • Magic Number: The number 50 is a magic number. For better readability and maintainability, consider defining it as a constant with a descriptive name (e.g., REPLICATION_FACTOR).

I've applied the indentation and logging fixes in the suggestion.

Suggested change
def embiggen(x):
df = pd.DataFrame(x)
print(f"shape before: {df.shape}")
big = pd.concat([df] * 50, ignore_index=True)
print(f"shape after: {big.shape}")
return big
def embiggen(x):
df = pd.DataFrame(x)
logging.info(f"shape before: {df.shape}")
big = pd.concat([df] * 50, ignore_index=True)
logging.info(f"shape after: {big.shape}")
return big


if __name__ == "__main__":
# Patch webbrowser.open to prevent browser login
with patch("webbrowser.open", return_value=False):
use_server = True
# use_server = False
trigger_multi_gpu_threshold = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The trigger_multi_gpu_threshold flag is hardcoded to True. This makes it difficult to run the script without the data-enlarging logic. For better flexibility, consider making this controllable via a command-line argument, for instance, using Python's argparse module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will always trigger a multi GPU setup. Can we change this such that we test both - single & multi GPU?


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42
)

if trigger_multi_gpu_threshold:
X_train = embiggen(X_train)
y_train = embiggen(y_train).values.ravel()

tabpfn = TabPFNClassifier.create_default_for_version(
ModelVersion.V2_5, n_estimators=3
)
Expand All @@ -49,6 +59,10 @@
X, y, test_size=0.33, random_state=42
)

if trigger_multi_gpu_threshold:
X_train = embiggen(X_train)
y_train = embiggen(y_train).values.ravel()
Comment on lines +62 to +64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code is a duplicate of lines 39-41. To improve maintainability and follow the Don't Repeat Yourself (DRY) principle, you could extract the data preparation logic (loading, splitting, and 'embiggening') into a separate function. This function could then be called for both the classification and regression tests.


tabpfn = TabPFNRegressor.create_default_for_version(
ModelVersion.V2_5, n_estimators=3
)
Expand Down