
cp-finder 🦄

Ultra-fast search and analytics engine purpose-built for the Žejn GROUP Codemania (TL - Hack) hackathon in January 2021.

The concept

In order to achieve incredible speed, performance, responsiveness and scalability, I'm proposing that the system for this challenge use the concepts of CQRS (Command Query Responsibility Segregation): a clear separation between the read and write sides of the business domain.

With this project I was aiming to address the following challenges:

  • How to have incredibly fast writes?

    The Akka HTTP framework routes requests to an internal PeopleActor; that actor instantly spawns new child actors, which have enough time to process the requests (in the form of commands) before they are written to LevelDB storage.

  • How to have incredibly fast updates?

    PeopleActor keeps N persistent PersonActors alive, so any update, change or delete will either hit a live actor or spawn a new one to process the designated command. PersonActor is the domain actor that represents one core entity of the system.

  • How to have fast endpoints for analytics?

    AggregateActor is the actor that effectively represents the "read side". Internally it runs a "Persistence Query": a query that pulls events/snapshots from LevelDB storage and in parallel generates "aggregates" for the user to consume. These aggregates are continuously updated with changes from the "write side", like a live feed. So whenever a user fetches statistics from the analytical endpoints, they are reading an in-memory representation.

  • Trade-offs. In order to achieve these results, some trade-offs had to be made. There is a delay between the moment a user writes on the "write" side and the moment the "analytics" side reflects that change. The system follows the patterns of so-called "eventual consistency" and event sourcing.
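The write/read split described above can be illustrated with a minimal, self-contained Scala sketch. This is not the actual Akka implementation: the `Person` fields, the `journal` buffer and the `positiveByGender` projection are simplified stand-ins for the persistent actors, LevelDB journal and aggregates.

```scala
import scala.collection.mutable

// Simplified domain: only the fields needed for this illustration
final case class Person(id: String, gender: String, testResult: String)

sealed trait Event
final case class PersonCreated(person: Person) extends Event

// Stand-in for the LevelDB journal
val journal = mutable.ArrayBuffer.empty[Event]

// Write side: a command handler persists an event
def handleCreate(p: Person): Unit = journal += PersonCreated(p)

// Read side: a projection folded from the journal; in the real system this
// runs continuously via Persistence Query and lags slightly behind the writes
def positiveByGender: Map[String, Int] =
  journal
    .collect { case PersonCreated(p) if p.testResult == "P" => p.gender }
    .groupBy(identity)
    .view.mapValues(_.size).toMap
```

Queries never touch the write path directly; they read the pre-computed aggregate, which is what keeps the analytics endpoints fast at the cost of eventual consistency.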

[Architecture diagram: cp-finder.png]

Benchmarks ⚡️

In the real world these tests would be executed with something like Gatling to stress-test the whole application; here I've resorted to curl to get some measurements after the system had already loaded the whole test CSV file (11,000 records).

Creating a person with POST on /people endpoint:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"id":"11000","gender":"M","birthDate":"23.06.1953","isoCountry":"gb","testDate":"1.03.2021","testResult":"P","intervention":"quarantine"}' \
  -w "@curl-format.txt" \
  -o /dev/null \
  -s http://localhost:8080/people
     time_namelookup:  0.011883s
        time_connect:  0.012124s
     time_appconnect:  0.000000s
    time_pretransfer:  0.012156s
       time_redirect:  0.000000s
  time_starttransfer:  0.000000s
                     ----------
          time_total:  0.017672s

Fetching the person with GET:

curl --header "Content-Type: application/json" \
  -w "@curl-format.txt" \
  -o /dev/null \
  -s http://localhost:8080/people/11006
     time_namelookup:  0.004118s
        time_connect:  0.004303s
     time_appconnect:  0.000000s
    time_pretransfer:  0.004331s
       time_redirect:  0.000000s
  time_starttransfer:  0.005596s
                     ----------
          time_total:  0.005677s

Updating a person with PATCH via id and JSON payload:

curl --header "Content-Type: application/json" \
  --request PATCH \
  --data '{"gender":"F","birthDate":"23.06.1953","isoCountry":"gb","testDate":"1.03.2021","testResult":"P","intervention":"quarantine"}' \
  -w "@curl-format.txt" \
  -o /dev/null \
  -s http://localhost:8080/people/11006
     time_namelookup:  0.004437s
        time_connect:  0.004721s
     time_appconnect:  0.000000s
    time_pretransfer:  0.004759s
       time_redirect:  0.000000s
  time_starttransfer:  0.000000s
                     ----------
          time_total:  0.006930s

... and now the impressive part: the numbers and stats endpoints...

  • GET /analytics/positiveByDates - all positive cases, grouped by date.

       time_namelookup:  0.004067s
          time_connect:  0.004297s
       time_appconnect:  0.000000s
      time_pretransfer:  0.004332s
         time_redirect:  0.000000s
    time_starttransfer:  0.006747s
                       ----------
            time_total:  0.006836s
    
  • GET /analytics/positiveByGenderAndState - positive cases grouped by gender and "state".

       time_namelookup:  0.004433s
          time_connect:  0.004663s
       time_appconnect:  0.000000s
      time_pretransfer:  0.004700s
         time_redirect:  0.000000s
    time_starttransfer:  0.006081s
                       ----------
            time_total:  0.006171s
    
  • GET /analytics/quarantine - number of cases, grouped by country and by days left in quarantine.

       time_namelookup:  0.004133s
          time_connect:  0.004374s
       time_appconnect:  0.000000s
      time_pretransfer:  0.004409s
         time_redirect:  0.000000s
    time_starttransfer:  0.005778s
                       ----------
            time_total:  0.005875s
    

Fast,... 🐇

Usage 🚀

Requirements

The service can run as a fat JAR on top of any modern JVM or via the pre-packaged Docker image.

🐇 Although out of the scope of the assignment, this project can easily be compiled with GraalVM to run as a "native-image"; that would further reduce the memory footprint and improve boot-up time and possibly performance.

Development 🏗

Requirements

  • Install any modern JDK; it is suggested to use SDKMAN with OpenJDK (14).

    • Development was done on openjdk version "14.0.2" 2020-07-14
  • Install Scala with the Scala Build Tool (SBT)

    • Scala version used: 2.13.4
    • SBT version used: 1.4.6

Running 🏃‍

To run the server, use the following SBT command; it will start the server on http://127.0.0.1:8080 and bring everything online.

$ sbt run

Compilation

To compile the project into a fat JAR, invoke the assembly task:

$ sbt assembly

The assembly task will compile everything and build a JAR that can be invoked from the root of the project like so:

$ java -jar cp-finder.jar

The server also supports the following environment variables:

  • PORT=8080 - HTTP port the service will listen on.
  • JOURNAL_LEVELDB_DIR=tmp/journal - File path for the LevelDB embedded storage directory.
  • SNAPSHOT_DIR=tmp/snapshots - File path for the snapshots storage directory.

They can also be passed before the java command invocation, e.g.:

$ PORT=3030 java -jar cp-finder.jar
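Inside the application, such variables would typically be read with fallbacks to the defaults above; a sketch (the actual configuration code may differ):

```scala
// Read configuration from the environment, falling back to the documented defaults
val port        = sys.env.getOrElse("PORT", "8080").toInt
val journalDir  = sys.env.getOrElse("JOURNAL_LEVELDB_DIR", "tmp/journal")
val snapshotDir = sys.env.getOrElse("SNAPSHOT_DIR", "tmp/snapshots")
```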

Testsuite

To run the full test suite, including integration tests, run:

$ sbt test

Tooling 🛠

To populate the service with seed data from a CSV file, the bin/feed-csv.rb script can be used like so:

./bin/feed-csv.rb data/covidPeople.csv

REST API

CRUD 🚜

Creating a person
POST http://127.0.0.1:8080/people
Content-Type: application/json

{
  "id": 42,
  "gender": "M",
  "isoCountry": "svn",
  "testResult": "P",
  "intervention": "quarantine",
  "birthDate": "23.06.1953",
  "testDate": "31.12.2020"
}
Reading a person
GET http://127.0.0.1:8080/people/42
Updating a person
PATCH http://127.0.0.1:8080/people/42
Content-Type: application/json

{
  "gender": "M",
  "isoCountry": "svn",
  "testResult": "N",
  "intervention": "quarantine",
  "birthDate": "23.06.1953",
  "testDate": "31.12.2020"
}
Deleting a person
DELETE http://127.0.0.1:8080/people/42

Analytics 📈

Number of positive test cases
GET http://127.0.0.1:8080/analytics/positive
{
  "count": 5457
}
Number of positive cases grouped by gender
GET http://127.0.0.1:8080/analytics/positiveByGender
{
  "female": 2779,
  "male": 2678
}
Number of positive test cases grouped by gender and current state (quarantine, medical care, hospitalized)
GET http://127.0.0.1:8080/analytics/positiveByGenderAndState
{
  "female": {
    "hospitalized": 919,
    "medical care": 865,
    "quarantine": 995
  },
  "male": {
    "hospitalized": 898,
    "medical care": 916,
    "quarantine": 864
  }
}
Number of positive cases by date for all data
GET http://127.0.0.1:8080/analytics/positiveByDates
{
  "dates": {
    "8.12.2020": 20,
    "9.01.2021": 12,
    "9.03.2020": 14,
    "9.04.2020": 18,
    "9.05.2020": 20,
    "9.06.2020": 14,
    "9.07.2020": 15,
    "9.08.2020": 14,
    "9.09.2020": 16,
    "9.10.2020": 19,
    "9.11.2020": 12,
    "9.12.2020": 14
    ...
Number of positive cases by country and date

This endpoint is to be used for the following requirements:

  • The number of positive cases by date for all data.
  • Filtering a subset of countries and returning the total number of cases for those countries.
  • The endpoint groups results per country.
GET http://127.0.0.1:8080/analytics/positiveByCountryAndDates
{
  "countries": {
    "aus": {
      "1.03.2020": 3,
      "2.03.2020": 3,
      "3.03.2020": 3,
      "4.03.2020": 9,
      "5.03.2020": 5,
      "6.03.2020": 2,
      "7.03.2020": 9,
      "8.03.2020": 4,
      "9.03.2020": 2,
      "10.03.2020": 6,
      "11.03.2020": 5,
      "12.03.2020": 6,
      "13.03.2020": 1,
      ...
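Narrowing such a per-country aggregate to a subset of countries could look roughly like this (a sketch; `filterCountries` is a hypothetical helper, not taken from the code base):

```scala
// Keep only the requested countries; an empty filter returns everything
def filterCountries(
    agg: Map[String, Map[String, Int]],
    wanted: Set[String]
): Map[String, Map[String, Int]] =
  if (wanted.isEmpty) agg else agg.view.filterKeys(wanted).toMap
```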
Number of individuals currently in quarantine

This endpoint is to be used for the following requirements:

  • The number of individuals currently in quarantine.
  • Results are grouped by country.
  • Filtering is set to only show cases that are quarantined.
  • Time of quarantine is set to 14 days.
  • The response is "live" and changes as time passes.

With the help of the country query parameter, the results can be further filtered.

GET http://127.0.0.1:8080/analytics/quarantine
{
  "countries": {
    "aus": {
      "0": 3,
      "1": 2,
      "2": 1,
      "3": 2,
      "4": 3,
      "5": 1,
      "6": 1,
      "7": 3,
      "8": 1,
      ...
  }
}
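The days-left bucketing above could be computed roughly like this (a sketch under the assumption that quarantine starts on testDate in d.MM.yyyy format and lasts 14 days; daysLeft is a hypothetical helper):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

val fmt = DateTimeFormatter.ofPattern("d.MM.yyyy")
val quarantineDays = 14L

// Days remaining in quarantine, or None if the quarantine window has passed
def daysLeft(testDate: String, today: LocalDate): Option[Long] = {
  val elapsed = ChronoUnit.DAYS.between(LocalDate.parse(testDate, fmt), today)
  if (elapsed >= 0 && elapsed < quarantineDays) Some(quarantineDays - elapsed)
  else None
}
```

Because the result depends on today's date, the response is "live" and shifts as time passes.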
