CCASS-Plotter

Introduction

This is a project to visualize CCASS Shareholding data from HKEX. The data is updated daily at T+1 and 1-year rolling data is available.
There are two main parts:

Scraping
Web application
- SQLite Link: http://54.202.152.22:8000/
  It is hosted in an Amazon EC2 Free Tier instance with limited cpu resources and 1GB RAM. So do expect a bit of slowness, compared to a local host. There may be downtime between 01:00 HKT and 01:30 HKT due to a daily restart
- QuestDB (Productionalization) Link: http://210.6.212.233:8000/
  It is hosted in a machine with 16GB RAM and 6 cores.
  Also, QuestDB is a time-series database which is much more efficient in storing / querying time series data.
  By comparing the two site, the one using QuestDB is return data more swiftly due to faster query.

Scraping

To scrape data from https://www3.hkexnews.hk/sdw/search/searchsdw.aspx

Tools used:

Selenium
For switching dates and stock code
BeautifulSoup
For parsing the data table
SQLite3 (A simple database to store the shareholding data)
- Table 1: CCASS Shareholding by date and stock code
- Table 2: Stock List by date
QuestDB (A time series database)
- Table 1: CCASS Shareholding by date and stock code
- Table 2: Stock List by date

How to run:

QuestDB setup (Skip this if you are not using QuestDB) Please pull the latest QuestDB docker image from and follow the instruction here: https://hub.docker.com/r/questdb/questdb.
I ran this in order to persist the data:

$ docker run -p 9000:9000  \
      -p 8812:8812 \
      -v local/dir:/root/.questdb/db \
      questdb/questdb

Set up .env file
- DB: QUEST|SQLITE

DB=QUEST

Download chromedriver and save it to project folder
Please check your chrome version (Settings -> About Chrome) and look for the corresponding driver https://chromedriver.chromium.org/downloads
Run scraper.py
Multithreading is available at date level. For each day, there are >5000 stocks to scrape.

def main():
    # This is single thread
    scrape_task(0, DATE_RANGE_LIST[::-1][0])
    
    # This is multi thread
    # executor = ThreadPoolExecutor(max_workers=12)
    # jobs = [executor.submit(scrape_task, i, DATE_RANGE_LIST[::-1][i]) for i in range(0, len(DATE_RANGE_LIST))]

Data Scraped and Stored for the web application:

I only scrape records with shareholdiing > 0.1% for first 2000 stocks (refer to config.py).
Therefore in total 16.67M rows for CCASS shareholding dy date and stock code. In terms of storage space, it takes ~3.4GB for SQLite and 4.3GB for QuestDB.

Web application

A simple interactive dashboard that has two tabs:

CCASS Shareholding trend visualization
Transaction finder based on change in CCASS shareholding

Tools used:

Dash
A python library which provides an integrated back and fron end framework for rapid web app development, built on top of plotly.js and react.js.
Dash is selected because it is my first time trying a python based front end.
Waitress
A production-quality pure-Python WSGI server.
python-dotenv
To load the .env file and get the environment type (DEV|PROD)

How to run:

QuestDB setup (Skip this if you are not using QuestDB) Please pull the latest QuestDB docker image from and follow the instruction here: https://hub.docker.com/r/questdb/questdb.
I ran this in order to persist the data:

$ docker run -p 9000:9000  \
      -p 8812:8812 \
      -v local/dir:/root/.questdb/db \
      questdb/questdb

Set up .env file
- BASE_ENV: DEV|PROD
- DB: QUEST|SQLITE

BASE_ENV=DEV
DB=QUEST

Run scrape.py so that table exists in the DB. As discussed above, the data is in the size of 3-4GB so it is not uploaded.
Run app.py Dash development server will be used if BASE_ENV is DEV, which supports features like hot-reloading, error messages etc.
Waitress server will be used if BASE_ENV is PROD, which gives much better performance than development server.

if __name__ == "__main__":
    if env == 'DEV':
        app.run_server(host=HOST, port=PORT, debug=True)
    elif env == 'PROD':
        serve(app.server, host=HOST, port=PORT,)

How to host:

Get a AWS Free tier (t2.micro) Instance
- It works for SQLite only
- QuestDB needs at least 2-3 GB Memory from my experience so I am using a local machine to host the Productionized Appliaction
Setting the inbound rule and open firewall in the instance
- Allow traffic from 0.0.0.0/0 for tcp port 8000, 443, 80 and rdp port 3389
- You may also need to open port 9000 / 8812 if you want to interact with QuestDB
Setting up the local environment in the instance
- Anaconda
- Chrome
- Git
- Docker (QuestDB only)
Schedule daily restart / kill job for the web app / container
- Start at 0100 and Kill at 0059

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.gitignore		.gitignore
README.md		README.md
app.py		app.py
config.py		config.py
kill_app.py		kill_app.py
kill_server.bat		kill_server.bat
requirements.txt		requirements.txt
run_server.bat		run_server.bat
scraper.py		scraper.py
ui_components.py		ui_components.py
util.py		util.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CCASS-Plotter

Introduction

Scraping

Tools used:

How to run:

Data Scraped and Stored for the web application:

Web application

Tools used:

How to run:

How to host:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CCASS-Plotter

Introduction

Scraping

Tools used:

How to run:

Data Scraped and Stored for the web application:

Web application

Tools used:

How to run:

How to host:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages