@JakobWong

Description

This PR introduces a new data collector for TuShare (daily frequency) under qlib/scripts/data_collector/tushare/collector.py.
It provides a robust ETL pipeline similar to the Yahoo collector but tailored for TuShare's API and A-share market features.

Key features:

  1. Incremental Update & Resume:
    • Supports "resume from breakpoint" by checking existing CSVs and only downloading data newer than the local max date.
    • update_data_to_bin dumps only the newly added dates (via a temporary directory) instead of performing a full re-dump, which improves performance.
  2. Data Consistency:
    • Includes listed, delisted, and paused stocks (status L,D,P) to avoid survivorship bias.
    • Normalizes output to Qlib standard: date, open, high, low, close, volume, [amount], factor, symbol. amount is optional.
    • Explicitly handles duplicates and ensures monotonic dates.
  3. Robustness:
    • Prefers TuShare's trade_cal for calendar acquisition, with a fallback to Qlib's default.
    • Enforces a baseline requirement for incremental updates (raises error instead of auto-downloading incompatible Yahoo sample data).
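
To illustrate the resume and consistency behaviour listed above, here is a minimal sketch using only the standard library. It is not the code in collector.py: the function names next_start_date and dedupe_and_sort are hypothetical, and it assumes each per-symbol CSV has a date column in ISO format (YYYY-MM-DD).

```python
import csv
import io
from datetime import date, timedelta


def next_start_date(csv_text: str, default_start: date) -> date:
    """Return the day after the latest 'date' in an existing CSV.

    Falls back to default_start when the CSV holds no rows yet.
    (Hypothetical helper; assumes ISO-formatted dates.)
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    dates = [date.fromisoformat(r["date"]) for r in rows if r.get("date")]
    return max(dates) + timedelta(days=1) if dates else default_start


def dedupe_and_sort(rows):
    """Keep the last row seen for each date, then sort ascending by date.

    ISO date strings sort lexicographically in chronological order,
    so the result has strictly increasing dates.
    """
    latest = {}
    for row in rows:
        latest[row["date"]] = row
    return [latest[d] for d in sorted(latest)]


existing = "date,close\n2024-01-02,10.0\n2024-01-03,10.5\n"
start = next_start_date(existing, date(2000, 1, 1))  # only fetch from here on
```

Keeping the last row per date is one possible policy for duplicates; the collector may resolve them differently.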

Motivation and Context

The existing collectors (Yahoo) are less stable for CN market data. Users often need a production-ready TuShare collector that supports large-scale historical fetch (with rate limits) and daily incremental updates without redownloading entire history. This implementation fills that gap with a structure consistent with Qlib's existing collectors.

How Has This Been Tested?

  • Pass the test by running pytest qlib/tests/test_all_pipeline.py from the parent directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Test Details:
Verified with local unit/integration tests (pytest tests/test_tushare_collector.py). Note: the test file is not included in this PR to keep it minimal, but the following logic was verified:

  1. Normalization: Validated against fixed CSV fixtures (ensuring correct column mapping, date parsing).
  2. Incremental Logic: Verified update_data_to_bin correctly identifies the incremental window and creates temp storage.
  3. Baseline Check: Confirmed that it raises RuntimeError if qlib_data_1d_dir is missing or invalid during an update.
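
The baseline check and the incremental window from points 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: require_baseline and incremental_dates are hypothetical names, and the sketch treats a directory as a valid baseline when it contains a non-empty calendars/day.txt, which is where Qlib stores its daily calendar.

```python
from pathlib import Path
from typing import Iterable, List


def require_baseline(qlib_dir: str) -> List[str]:
    """Return the baseline calendar, or raise RuntimeError if none exists.

    Assumption: a usable baseline has a non-empty calendars/day.txt.
    """
    calendar_file = Path(qlib_dir) / "calendars" / "day.txt"
    if not calendar_file.is_file():
        raise RuntimeError(
            f"No Qlib baseline found in {qlib_dir!r}; build one from TuShare data first."
        )
    dates = [ln.strip() for ln in calendar_file.read_text().splitlines() if ln.strip()]
    if not dates:
        raise RuntimeError(f"Baseline calendar in {qlib_dir!r} is empty.")
    return dates


def incremental_dates(baseline: List[str], available: Iterable[str]) -> List[str]:
    """Return the dates in 'available' that are newer than the baseline's last date.

    Dates are ISO strings (YYYY-MM-DD), so string comparison is chronological.
    """
    last = max(baseline)
    return sorted({d for d in available if d > last})
```

Only the dates returned by incremental_dates would then need to be dumped, which is the idea behind avoiding a full re-dump.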

Screenshots of Test Results (if appropriate):

  1. Pipeline test: (Skipped as strict environment required)
  2. Your own tests:
    (All passed locally)
    tests/test_tushare_collector.py ..... [100%]
    5 passed in 2.43s
    

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

- Add TuShare day-frequency collector under qlib/scripts/data_collector/tushare/collector.py.
- Supports resume/incremental download (start from last date in existing CSV).
- Incremental update only dumps newly added dates; requires an existing TuShare-based qlib baseline.
- Includes delisted/paused stocks, optional amount, date-only output.
- Calendar prefers TuShare trade_cal, falls back to qlib default.
- Verified with local tests.
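
As a rough illustration of the calendar fallback mentioned above, the sketch below takes the two calendar sources as plain callables so it runs without a TuShare token. The name get_calendar and both callables are hypothetical; in the collector the primary source would wrap TuShare's trade_cal and the fallback would wrap Qlib's default calendar.

```python
from typing import Callable, List


def get_calendar(
    primary: Callable[[], List[str]],
    fallback: Callable[[], List[str]],
) -> List[str]:
    """Return sorted, de-duplicated trading dates, preferring the primary source.

    The fallback is used when the primary source raises or returns nothing.
    """
    try:
        dates = primary()
        if dates:
            return sorted(set(dates))
    except Exception:
        # Any failure of the primary source (network, quota, token) triggers the fallback.
        pass
    return sorted(set(fallback()))


def failing_source() -> List[str]:
    # Stand-in for a TuShare call that fails, e.g. due to a rate limit.
    raise ConnectionError("primary calendar source unavailable")


calendar = get_calendar(failing_source, lambda: ["2024-01-03", "2024-01-02"])
```

Catching a broad Exception keeps the sketch short; a real collector would probably log the failure and narrow the exception types.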
@SunsetWolf
Collaborator

Hi @JakobWong, first of all, thank you for contributing the code. The description of this pull request covers a lot of functionality, but it is not clear how to use these features. Would it be possible to add documentation or docstrings to help people understand how to use them?

@JakobWong
Author

Hi @SunsetWolf, thanks for reviewing! The collector I added lets users fetch data through the TuShare API easily (it currently supports only daily CN data).

I added documentation for the TuShare daily collector in qlib/scripts/data_collector/tushare/README.md, covering prerequisites (TUSHARE_TOKEN), a one-shot pipeline command, step-by-step download/normalize/dump, incremental updates, and validation. I also listed the TuShare collector in the data_collector overview.

Please let me know if you’d like further details or more examples.
