
Testing


Quick Start

yarn test            # unit and jsdom tests
yarn test:puppeteer  # puppeteer (docker required)
yarn test:ios        # webdriverio (browserstack account required)

Stack

  • Vitest
  • JSDOM
  • React Testing Library
  • Puppeteer
  • Browserless
  • Docker
  • WebdriverIO
  • GitHub Actions

Reporting Bugs

Issue Titles

If a bug is platform specific, put the platform in brackets at the beginning of the title. If the bug is on all platforms, the prefix can be omitted.

Prefix           Meaning
[Mobile]         iOS / Mobile Safari / Android
[iOS]            iOS / Mobile Safari
[iOS Capacitor]  iOS Capacitor build, but not Mobile Safari
[Android]        Android
[Chrome]         Desktop Chrome
(no prefix)      Issue present on all platforms

Headings

When reporting a bug, use these standard three headings: Steps to Reproduce, Current Behavior, and Expected Behavior. Describing something as "wrong", "not working", "broken", etc, is not sufficient. Broken behavior can only be understood in terms of the difference between current and expected behavior.

These headings should be populated as follows:

Steps to Reproduce

Describe the exact steps needed for someone else to trigger the unexpected behavior.

Current Behavior

The current (wrong) behavior that is observed when the steps are followed. Typically this refers to the main branch. (When describing a regression in a PR, this can refer to the PR branch and should be accompanied by a commit hash for clarity.)

This should only describe the result of following the steps. Any conditions required to observe the behavior should go in Steps to Reproduce.

Expected Behavior

The expected (intended) behavior that should occur when the steps are followed. Typically this refers to the behavior that has not yet been implemented. (When describing a regression on a PR branch, this can refer to the existing, correct behavior on main.)

Be specific.

e.g.

  • NO: Should work correctly.
  • NO: Thought should be expanded.
  • YES: b should be expanded.

Often the best approach is to state the expected specific behavior followed by the expected general behavior:

  • b should be expanded.
  • Subthoughts with no siblings should be expanded.

Here's a real example from #2733:

Steps to Reproduce

- x
  - b
  - a
  - =sort
    - Alphabetical
      - Desc
  1. Set the cursor on x.
  2. Activate New Subthought Above (Meta + Shift + Enter).
  3. Move cursor up/down.

Current Behavior

  • Cursor up moves the cursor from the empty thought to a.
  • Cursor down: Nothing happens.

Expected Behavior

  • Cursor up should move the cursor from the empty thought to x.
  • Cursor down should move the cursor from the empty thought to b.

Test Levels

The project has multiple levels of automated testing, from single function unit tests up to realistic end-to-end (E2E) tests that run tests against an actual device or browser.

Use the lowest level that is sufficient for your test case. If your test case does not require a DOM, use a unit test. If it requires a DOM but is not browser or device-specific, use an RTL test. Higher-level tests may provide a more realistic testing environment, but they are slower and, in the case of webdriverio on browserstack, incur a cost per minute of usage.

You can find the test files spread throughout the project in __tests__ directories.

1. Unit Tests

⚡️⚡️⚡️ 1–20ms

Basic unit tests are great for testing pure functions directly.
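
For example, a pure-function test with Vitest looks roughly like this (the capitalize util and its import path are hypothetical, shown only to illustrate the pattern — substitute any pure function from the project):

import { expect, it } from 'vitest'
// hypothetical pure util, used only to illustrate the pattern
import capitalize from '../../util/capitalize'

it('capitalizes the first letter', () => {
  expect(capitalize('hello')).toBe('Hello')
})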

Related tests: actions, selectors, util

2. Store Tests

⚡️⚡️⚡️ 1–20ms

The shortcut tests require dispatching Redux actions but do not need a DOM. You can use the helpers createTestStore and executeShortcut to operate directly on a Redux store, then make assertions about store.getState(). This allows shortcuts to be tested independently of the user device.
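
A rough sketch of the pattern (the import paths and exact signatures of createTestStore and executeShortcut are assumptions; check the existing shortcut tests for the real usage):

import { expect, it } from 'vitest'
// import paths are assumptions; these are the project helpers named above
import createTestStore from '../../test-helpers/createTestStore'
import executeShortcut from '../../test-helpers/executeShortcut'

it('updates the store when a shortcut is executed', () => {
  const store = createTestStore()

  // execute the shortcut under test directly against the store (no DOM required)
  executeShortcut('newThought', { store })

  // assert on the resulting Redux state
  expect(store.getState().cursor).not.toBeNull()
})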

Related tests: shortcuts

3. JSDOM Tests

⚡️⚡️ 1–1000ms

Anything that tests a rendered component requires a DOM. If there are no browser or device quirks, you can get away with testing against an emulated DOM (jsdom) which is cheaper and faster than a real browser.
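
A minimal React Testing Library sketch (the component and its props are hypothetical; real component tests typically render through the app's own test helpers):

import { render, screen } from '@testing-library/react'
import { expect, it } from 'vitest'
// hypothetical component, used only to illustrate the pattern
import Breadcrumbs from '../Breadcrumbs'

it('renders the home breadcrumb', () => {
  render(<Breadcrumbs />)
  // query by visible text rather than implementation details
  expect(screen.getByText('Home')).toBeTruthy()
})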

Related tests: components

4. E2E Tests

⚡️ 1–2s

yarn test:puppeteer

E2E, or End-to-End, tests involve running a real browser or device and controlling it with an automation driver. You can perform common user actions like touch, click, and type. These tests are the slowest and most expensive to run.

  • puppeteer (Chrome) - Requires docker
  • webdriverio (Mobile devices) - Requires a browserstack account

To run WebdriverIO tests, add BROWSERSTACK_USERNAME=your_username and BROWSERSTACK_ACCESS_KEY=your_access_key to .env.test.local in the project root and run yarn test:e2e:ios. (under construction)

Related tests: e2e
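
A Puppeteer test drives a real browser page; a rough sketch (the page fixture and the selectors are assumptions — the real e2e helpers launch the browser and provide higher-level actions):

import { it } from 'vitest'
// hypothetical page fixture; the real e2e setup launches the browser and exposes the page
import { page } from '../setup'

it('creates a thought by typing', async () => {
  await page.click('[aria-label="new thought"]') // hypothetical selector
  await page.keyboard.type('Hello')
  // waitForFunction throws on timeout, so it doubles as the assertion
  await page.waitForFunction(() => document.body.innerText.includes('Hello'))
})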

5. Visual snapshot tests

⚡️ 1–2s

Snapshot tests are a specific type of puppeteer test used to prevent visual regressions. They automate taking a screenshot on your PR branch and then comparing it to a reference screenshot in main. If the screenshot differs by a certain number of pixels, then it is considered a regression and the test will fail. In the case of a failed snapshot test, a visual diff will be generated that allows you to see why it failed.

Do not use snapshot tests for testing behavior (such as the result of a user action). Instead, select DOM elements by aria label or data-testid. Use snapshot tests for covering visual regressions such as positioning, layout, svg rendering, and general appearance of components.
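
The shape of a snapshot test, as a sketch only (both imports are hypothetical stand-ins for the project's actual e2e and snapshot helpers):

import { it } from 'vitest'
// hypothetical helpers, shown only to illustrate the shape of a snapshot test
import { page } from '../setup'
import expectToMatchScreenshot from '../helpers/expectToMatchScreenshot'

it('renders superscripts at font size 22', async () => {
  // capture the rendered page and compare it to the reference screenshot from main
  const screenshot = await page.screenshot()
  await expectToMatchScreenshot(screenshot)
})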

In the following example, the superscript position broke so the snapshot test failed. The expected snapshot is on the left; the current snapshot is on the right.

(Visual diff image: font-size-22-superscript-1-diff)

When running the tests locally, a link to the visual diff will be output in your shell. When running the tests in GitHub Actions, the visual diff can be downloaded from the artifact link added to the test output under "Upload snapshot diff artifact":

(Screenshot: the "Upload snapshot diff artifact" step in the GitHub Actions test output)

If you are absolutely sure that the change is desired, and your PR was supposed to change the visual appearance of em, then run the snapshot test with -u to update the reference snapshot.

Test Flags

testFlags are used to alter runtime behavior of the app during tests. This is generally forbidden, as the automated test environment should be as close as possible to production so that it is testing the same behavior the end user sees. But runtime alteration is warranted for conditions that are difficult or impossible to create through normal user behavior (e.g. network latency) or that enhance test readability (e.g. visualizations).

Drag-and-drop visualization

You can enable drop target visualization boxes by running em.testFlags.simulateDrop = true in the JS console or setting testFlags.simulateDrop to true in https://github.com/cybersemics/em/blob/ad173daa1d01c12003e33973f863072fdc852023/src/e2e/testFlags.ts#L18-L19.

(Screenshot: drop target visualization boxes enabled)
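
The same flag can also be flipped at runtime from a Puppeteer test via page.evaluate (a sketch; the page object is whatever your e2e setup provides):

// inside an e2e test: `page` comes from the e2e setup, `em` is the app's window global
await page.evaluate(() => {
  (window as any).em.testFlags.simulateDrop = true
})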

Manual Test Cases

Various test cases that may need to be tested manually.

Touch Events

  • Enter edit mode (#1208)
  • Preserve editing: true (#1209)
  • Preserve editing: false (#1210)
  • No uncle loop (#908)
  • Tap hidden root thought (#1029)
  • Tap hidden uncle (#1128-1)
  • Tap empty Content (#1128-2)
  • Scroll (#1054)
  • Swipe over cursor (#1029-1)
  • Swipe over hidden thought (#1147)
  • Preserve editing on switch app (#940)
  • Preserve editing clicking on child edge (#946)
  • Auto-Capitalization on Enter (#999)

Render

Test enter and leave on each of the following actions:

  1. New Thought

  2. New Subthought

  3. Move Thought Up/Down

  4. Indent/Outdent

  5. SubcategorizeOne/All

  6. Toggle Pin Children

  7. Basic Navigation

    - x
      - y
        - z
          - r
            - o
        - m
          - o
        - n
    
  8. Word Wrap

    - a
      - This is a long thought that after enough typing will break into multiple lines.
      - forcebreakkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
      - c
    
  9. Toggle Table View

    - a
      - =view
        - Table
      - b
        - b1
      - c
        - c1
    
  10. Table View - Column 2 Descendants

    - a
      - =view
        - Table
      - c
        - c1
          - c2
            - c3
    
  11. Table View - Vertical Alignment

    - a
      - =view
        - Table
      - b
        - b1
        - b2
        - b3
      - c
        - c1
        - c2
        - c3
    
    - a
      - =view
        - Table
      - b
        - This is a long thought that after enough typing will break into multiple lines.
      - c
        - c1
    
    - a
      - =view
        - Table
      - This is a long thought that after enough typing will break into multiple lines.
        - b1
        - b2
      - c
        - c1
    
    - a
      - =view
        - Table
      - This is a long thought that after enough typing will break into multiple lines.
        - b1
        - b2
      - c
        - c1
    
  12. Expand/collapse large number of thoughts at once

    - one
      - =pinChildren
        - true
      - a
        - =view
          - Table
        - c
          - c1
            - c2
              - c3
                - c4
        - This is a long thought that after enough typing will break into multiple lines.
          - b1
          - b2
        - oof
          - woof
      - x
        - =pinChildren
          - true
        - y
          - y1
        - z
    
  13. Nested Tables

    - a
      - =view
        - Table
      - b
        - =view
          - Table
        - b1
          - x
        - b2
          - y
    

Tips and Tricks

Database operations and fake timers

It looks like we must use fake timers if we want the store state to be updated based on database operations (e.g., if we use initialize() to reload the state). I think this is because the thoughtspace operations are asynchronous and don't call the store operations prior to the test ending. (I'm not sure why we didn't get other errors that made this clear.)

https://github.com/cybersemics/em/pull/2741

// Use fake timers here to ensure that the store operations run after loading into the db
vi.useFakeTimers()
await initialize()
await vi.runAllTimersAsync()
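
In context, a test that reloads state from the database looks roughly like this (the initialize import path is an assumption):

import { it, vi } from 'vitest'
// import path is an assumption; initialize reloads the store state from the database
import initialize from '../../initialize'

it('restores thoughts from the database', async () => {
  // fake timers let the async thoughtspace operations flush before the test ends
  vi.useFakeTimers()
  await initialize()
  await vi.runAllTimersAsync()

  // ...assert on the restored store state here

  vi.useRealTimers()
})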

Triggering GitHub Actions workflows manually

In the event of a flaky GitHub Actions workflow, it can be useful to manually trigger multiple runs to flush out failures. The following shell function can be used to automate this process:

ghworkflow() {
  # get repo url
  repo_default=$(git remote get-url origin)
  workflow_default="puppeteer.yml"
  branch_default=$(git rev-parse --abbrev-ref HEAD)

  # prompt user for the repo
  read -p "Repository: ($repo_default) " input_repo
  repo=${input_repo:-$repo_default}

  # prompt the user for the workflow
  read -p "Workflow: ($workflow_default) " input_workflow
  workflow=${input_workflow:-$workflow_default}

  # prompt the user for the branch
  read -p "Branch: ($branch_default) " input_branch
  branch=${input_branch:-$branch_default}

  # prompt the user for the number of runs
  read -p "Number of runs: (10) " input_runs
  runs=${input_runs:-10}

  # To trigger the workflow on a PR from a fork, we need to push it to a repo we control.
  git push origin "$branch"

  for i in $(seq 1 $runs); do
    echo "Triggering workflow run #$i..."

    gh workflow run "$workflow" \
      --repo "$repo" \
      --ref "$branch" \
      --field rerun_id="run_$i"

    # avoid flooding GitHub API
    sleep 1
  done
}

Aside: workflow_dispatch must be enabled to allow manual workflow triggers.

This is already set on all the em workflows, so you shouldn't need to worry about it.

on:
  workflow_dispatch:
    inputs:
      rerun_id:
        description: 'Optional ID for tracking repeated runs'
        required: false

Identifying regressions with git bisect

git bisect performs a binary search over a range of commits between a known good state (no bug) and a known bad state (bug) to efficiently find the first commit that introduced a regression. Identifying the exact commit will provide a vital clue about the cause of the bug and will inform the solution.

Finding the beginning of the search range is somewhat arbitrary. If you know that a regression was introduced very recently, sometimes you can just go back a few weeks. Otherwise you should go back far enough to ensure that you find a good commit (before the regression was introduced); I recommend 1–2 years. The range pares down quickly since the search space is cut in half at each step (i.e. log2 of n steps, where n is the number of commits). Any longer than a couple of years and the codebase will have changed so much that it becomes slow and difficult to install old versions of everything and recreate the environment. If the regression is that old, it probably needs to be approached as a novel bug anyway; the code has changed so much that a simple git revert would be impossible.

Once you identify the good commit (hopefully on the first attempt), run git bisect start, mark the current commit with git bisect bad and the old commit with git bisect good, and git will take over from there, automatically checking out the next commit until it has narrowed down the source of the problem.

Your only job at each step is:

  1. yarn install
  2. Restart dev server if halted.
  3. Test for the regression.
  4. Run git bisect bad if the regression is still present and git bisect good if it is gone.

Record the commit hash it gives you at the very end and you’ve found the source of the regression! Often I take one more step of testing the bad commit again and the commit right before it (should be good) just to be extra sure. If any good/bad determination was mistaken along the way then it will throw off the whole process and the final result will not be accurate. But if you are precise and methodical, you can search through hundreds of commits in a matter of minutes to find the offending commit.

Best Practices

  • Avoid coupling Puppeteer tests to Redux state or other implementation details. e.g.

    The use of em.testHelpers.getState is tightly coupling the test to various parts of the Redux state (implementation details), which we really want to avoid. It's important that integration tests behave like a normal user and do not have access to what is "under the hood."

    The few times we add a backdoor in existing tests are as last resorts, when there is no other way to test something. Now that we have dedicated test engineers, we need to maintain high standards and work hard to promote separation of concerns and maintainability.

    https://github.com/cybersemics/em/pull/3172#discussion_r2274819907

  • No arbitrary sleeps; instead, wait for a specific condition (see the sketch below).
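
For example, in a Puppeteer test, rather than sleeping for a fixed duration after an action, wait for the UI to reflect it (the selector is hypothetical):

// avoid: an arbitrary sleep is both slow and flaky
// await sleep(1000)

// prefer: wait for the specific condition the test depends on
await page.waitForSelector('[data-testid="editable"]', { visible: true })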