rpscrape

Horse racing data has been hoarded by a few companies, enabling them to effectively extort the public for access to any worthwhile amount of historical data. In other sports, historical data is easily and freely available to use and query as you please. Racing data in most countries is far harder to come by and is often only available through subscriptions to expensive software.

The aim of this tool is to provide a way of gathering large amounts of historical data at no cost.

Requirements

You must have Python 3.13 or greater and Git installed. You can download the latest Python release here. You can download Git here.

The following Python modules are required. They can be installed using pip (included with Python):

pip3 install curl_cffi jarowinkler lxml orjson python-dotenv tomli tqdm

Install

git clone https://github.com/joenano/rpscrape.git

Command-Line Options

-d, --date	Single date (YYYY/MM/DD) or date range (YYYY/MM/DD-YYYY/MM/DD).
-y, --year	Year or year range (YYYY or YYYY-YYYY).
-r, --region	Region code (e.g., gb, ire).
-c, --course	Numeric course code.
-t, --type	Race type: flat or jumps.

--date-file	File containing dates (one per line, YYYY/MM/DD).

--regions	List or search regions.
--courses	List/search courses or list courses in a region.

Notes

--date and --year are mutually exclusive.

You cannot specify both --region and --course at the same time.
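
The two exclusivity rules above can be expressed with argparse's mutually exclusive groups. The following is a minimal sketch of that idea, not rpscrape's actual parser; the `build_parser` name and the `rpscrape-sketch` program name are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented option rules."""
    parser = argparse.ArgumentParser(prog="rpscrape-sketch")

    # --date and --year are mutually exclusive.
    when = parser.add_mutually_exclusive_group()
    when.add_argument("-d", "--date")
    when.add_argument("-y", "--year")

    # --region and --course cannot be combined.
    where = parser.add_mutually_exclusive_group()
    where.add_argument("-r", "--region")
    where.add_argument("-c", "--course")

    parser.add_argument("-t", "--type", choices=["flat", "jumps"])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-r", "ire", "-y", "2019", "-t", "flat"])
    print(args.region, args.year, args.type)
```

With this sketch, passing both -d and -y (or both -r and -c) makes argparse print an error and exit instead of running.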

When scraping jumps data, the year refers to the season start. For example, the 2019 Cheltenham Festival is in the 2018-2019 season: use 2018.
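
As a rough illustration of the season rule, the helper below maps a race date to the year you would pass to -y for jumps racing. The function name and the May cut-over month are assumptions made for this sketch; check the actual season boundary for the region you are scraping.

```python
from datetime import date


def jumps_season_year(race_date: date, season_start_month: int = 5) -> int:
    """Return the season start year for a jumps race date.

    season_start_month is an ASSUMPTION (May); dates before that month
    are treated as belonging to the season that began the previous year.
    """
    if race_date.month >= season_start_month:
        return race_date.year
    return race_date.year - 1


if __name__ == "__main__":
    # A March 2019 meeting falls in the 2018-2019 season, so use 2018.
    print(jumps_season_year(date(2019, 3, 12)))
```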

Examples

All races on a specific date:

./rpscrape.py -d 2020/10/01

Only races from Great Britain:

./rpscrape.py -d 2020/10/01 -r gb

Date range:

./rpscrape.py -d 2019/12/15-2019/12/18

Flat races in Ireland (2019):

./rpscrape.py -r ire -y 2019 -t flat

Jump races at Ascot (1999–2018):

./rpscrape.py -c 2 -y 1999-2018 -t jumps

Date File Mode

Scrape using a file with dates:

./rpscrape.py --date-file dates.txt

The file must contain one date per line, in YYYY/MM/DD format:

2020/10/01
2020/11/02
2020/12/03

Searching

List all regions:

./rpscrape.py --regions

Search regions:

./rpscrape.py --regions gb

List all courses:

./rpscrape.py --courses

Search courses:

./rpscrape.py --courses Ascot

List courses in a region:

./rpscrape.py --courses gb

Settings

The user_settings.toml file contains the data fields that can be scraped. You can turn fields on and off by setting them to true or false. The order of fields in that file is maintained in the output CSV. The default_settings.toml file should not be edited; it exists as a backup and as a way to introduce new fields without changing user settings.

Scrape Racecards

You can scrape racecards using racecards.py, which saves a file containing a JSON object of racecard information.

There are only three options: --day N and --days N, where N is 1 or 2, and --region CODE, where CODE is a region code (gb, ire, etc.).

Examples

Scrape today's racecards:

./racecards.py --day 1

Scrape tomorrow's racecards:

./racecards.py --day 2

Scrape today's and tomorrow's racecards:

./racecards.py --days 2

Scrape today's and tomorrow's racecards for a single region:

./racecards.py --days 2 --region gb

Settings

You can customize which data is included in racecards using the settings file. The scraper uses settings/user_racecard_settings.toml if it exists, otherwise falls back to settings/default_racecard_settings.toml.

To customize:

1. Copy default_racecard_settings.toml to user_racecard_settings.toml.
2. Edit the settings to enable or disable field groups and data collection options.

The settings file lets you control:

Data Collection: Whether to fetch stats and profiles
Field Groups: Which groups of runner fields to include (core, basic_info, performance, jockey, trainer, etc.)
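
The fallback between the user and default racecard settings files can be sketched as follows. This shows the described behaviour under the file names given above; `pick_settings_file` is a hypothetical name, and the scraper's own code may differ.

```python
from pathlib import Path


def pick_settings_file(settings_dir: Path) -> Path:
    """Prefer user_racecard_settings.toml, else fall back to the default."""
    user = settings_dir / "user_racecard_settings.toml"
    default = settings_dir / "default_racecard_settings.toml"
    return user if user.is_file() else default


if __name__ == "__main__":
    print(pick_settings_file(Path("settings")))
```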

Authentication

Credentials are stored in a .env file in the root directory. Make sure .env is added to .gitignore.

EMAIL=your@email.com
AUTH_STATE=your_auth_state
ACCESS_TOKEN=your_access_token
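
Before running the scraper, it can help to confirm that all three keys are present and non-empty. The check below is a standard-library sketch with hypothetical helper names; it is not how rpscrape itself loads the file (python-dotenv is listed among the requirements for that).

```python
from pathlib import Path

REQUIRED_KEYS = ("EMAIL", "AUTH_STATE", "ACCESS_TOKEN")


def read_env(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.is_file():
        print("missing:", missing_keys(read_env(env_path)))
    else:
        print("no .env file found")
```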

To find your tokens, log in to the site and open the cookies section in the storage tab of your browser's developer tools.

You need the values for auth_state and the Cognito access token (not to be confused with the AccessToken cookie).

There will be multiple keys beginning with CognitoIdentityServiceProvider; you want the value for the one that ends with .accessToken. It should be directly under email if keys are sorted by name.
