Data Engine datasource scanner fails for branches with '/' in name #648

@mcbeaker

Description

The Data Engine datasource scanner returns 0 files for repository-type datasources when the branch name contains a / character (e.g., feature/my-branch). The same datasource path is scanned correctly on a branch whose name does not contain /.

Reproduction Steps

  1. Create two branches from the same commit:

    • test/slash-branch (with /)
    • test-no-slash (without /)
  2. Create datasources pointing to the same DVC-tracked path on each
    branch:

    from dagshub.data_engine.datasources import create_datasource

    # Branch WITH slash
    ds_slash = create_datasource(
        repo='<owner>/<repo>',
        name='test-with-slash',
        path='data/my-folder',
        revision='test/slash-branch'
    )

    # Branch WITHOUT slash
    ds_no_slash = create_datasource(
        repo='<owner>/<repo>',
        name='test-without-slash',
        path='data/my-folder',
        revision='test-no-slash'
    )
  3. Wait for the scanner to complete, then check the results:

    from dagshub.data_engine import datasources

    ds_slash = datasources.get_datasource('<owner>/<repo>', 'test-with-slash')
    ds_no_slash = datasources.get_datasource('<owner>/<repo>', 'test-without-slash')

    print(f'test/slash-branch: {len(ds_slash.all())} files')
    print(f'test-no-slash: {len(ds_no_slash.all())} files')

Expected Behavior

Both datasources should return the same file count (the actual count in
that directory).

Actual Behavior

Branch             Files Found
test/slash-branch  0
test-no-slash      100

Additional Evidence

Tested with multiple branch naming patterns:

Branch Name                    Datasource Files
feature/add-events-ts-support  0
test/slash-branch              0
events-ts-test                 100
test-no-slash                  100

All branches point to the same commit with identical data.dvc and
.dvc/config.
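
The pattern in these results can be checked mechanically. The snippet below is only an illustrative sketch: the counts are copied from the results listed above, and `split_by_slash` is a hypothetical helper, not part of the dagshub client.

```python
# File counts copied by hand from the results reported in this issue.
observed = {
    "feature/add-events-ts-support": 0,
    "test/slash-branch": 0,
    "events-ts-test": 100,
    "test-no-slash": 100,
}


def split_by_slash(counts):
    """Hypothetical helper: partition branch -> count by whether the name has '/'."""
    with_slash = {b: n for b, n in counts.items() if "/" in b}
    without_slash = {b: n for b, n in counts.items() if "/" not in b}
    return with_slash, without_slash


with_slash, without_slash = split_by_slash(observed)

# Every branch containing '/' reports 0 files; every other branch reports 100.
print("all slash branches empty:", all(n == 0 for n in with_slash.values()))
print("all other branches full:", all(n == 100 for n in without_slash.values()))
```

In the reported data the only variable that separates the two groups is the presence of / in the branch name.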

Environment

  • DVC remote: External S3 bucket connected to DagsHub
  • Files are DVC-tracked and accessible via DagsHub content API (verified)
  • Datasource type: REPOSITORY
  • preprocessing_status: PreprocessingStatus.READY (scanner completes,
    just returns 0 files)

Workaround

Use branch names without / characters (e.g., feature-my-branch
instead of feature/my-branch).
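
A minimal sketch of applying this workaround when branch names are generated by a script. `slash_free_branch_name` is a hypothetical helper introduced here for illustration; it is not part of dagshub or Git.

```python
def slash_free_branch_name(branch: str, replacement: str = "-") -> str:
    """Hypothetical helper: return the branch name with every '/' replaced.

    Only the name is rewritten; creating or renaming the actual Git branch
    is left to the caller.
    """
    if "/" in replacement:
        raise ValueError("replacement must not contain '/'")
    return branch.replace("/", replacement)


print(slash_free_branch_name("feature/my-branch"))
print(slash_free_branch_name("test-no-slash"))
```

Note that the mapping is not one-to-one: feature/a-b and feature-a/b would both become feature-a-b, so a repository that relies on such names may need a different replacement string.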

Labels: bug
