
LMS DWH ETL Pipeline

SAP SuccessFactors Learning → SQL Server Dimensional Model

An automated ETL pipeline that ingests SAP SuccessFactors Learning CSV exports into a SQL Server dimensional data warehouse (star schema), powering downstream Power BI reporting.


Architecture Overview

SAP SuccessFactors (CSV exports)
        │
        ▼
 ┌─────────────────────────────────────────────┐
 │  Python ETL  (lms_dwh/)                     │
 │  ┌─────────────────────────────────────┐    │
 │  │  Loaders (5 CSV sources)            │    │
 │  │  Category / HeadCount / Item /      │    │
 │  │  Learning Plan / Learning History   │    │
 │  └────────────────┬────────────────────┘    │
 │                   │ bulk insert              │
 └───────────────────┼──────────────────────────┘
                     ▼
 ┌─────────────────────────────────────────────┐
 │  SQL Server  (LMS_DWH)                      │
 │                                             │
 │  staging.*          →   dwh.*               │
 │  stg_category       →   d_category (SCD1)  │
 │  stg_headcount      →   d_user (SCD1)      │
 │                     →   d_user_hierarchy   │
 │  stg_item_data      →   d_item (SCD2)      │
 │  stg_learning_plan  →   f_learning_        │
 │                         assignment         │
 │  stg_learning_      →   f_learning_        │
 │  history                completion         │
 └─────────────────────────────────────────────┘
                     │
                     ▼
            Power BI Reports

Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | 3.9+ | |
| SQL Server | 2017+ or SQL Server Express | Tested on INLT3178\SQLEXPRESS |
| ODBC Driver 17 for SQL Server | 17.x | [Download](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server) |
| Windows Authentication | Required | Pipeline uses Trusted_Connection=yes by default |

Installation

1. Clone the repository

git clone https://github.com/OmprakashShyamalan/Successfactor_Report_Rebuilt.git
cd Successfactor_Report_Rebuilt

2. Install Python dependencies

pip install -r requirements.txt

Only one external dependency: pyodbc for SQL Server connectivity.

3. Deploy the SQL Server schema

Run the four SQL scripts in order against your LMS_DWH database (create the database first if it does not already exist). Open SSMS or use sqlcmd:

# Step 1: Schemas, audit tables, staging tables
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 01_ddl_schemas_staging_audit.sql

# Step 2: Dimension and fact table DDL
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 02_ddl_dwh_dimensions_facts.sql

# Step 3: Mart views
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 03_transformations_and_views.sql

# Step 4: Stored procedures (transform SPs called by the pipeline)
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 04_stored_procedures.sql

Note: Scripts must be run in order. Each script depends on objects from the previous one.

4. Configure the connection

The pipeline reads configuration from environment variables with sensible defaults.

| Variable | Default | Description |
| --- | --- | --- |
| LMS_DB_SERVER | INLT3178\SQLEXPRESS | SQL Server instance |
| LMS_DB_NAME | LMS_DWH | Database name |
| LMS_DB_DRIVER | ODBC Driver 17 for SQL Server | ODBC driver name |
| LMS_DB_TRUSTED | yes | Use Windows Auth (yes) or SQL Auth (no) |

Option A — Set environment variables (recommended for automation):

# Windows Command Prompt
set LMS_DB_SERVER=YOUR_SERVER\INSTANCE
set LMS_DB_NAME=LMS_DWH

# Windows PowerShell
$env:LMS_DB_SERVER = "YOUR_SERVER\INSTANCE"
$env:LMS_DB_NAME   = "LMS_DWH"

Option B — Edit lms_dwh/config.py directly (for local dev):

_server   = r"YOUR_SERVER\INSTANCE"
_database = "LMS_DWH"

Option C — Pre-configure CSV file paths (for --use-config mode):

set LMS_FILE_CATEGORY=C:\exports\Category.csv
set LMS_FILE_HEADCOUNT=C:\exports\HeadCount.csv
set LMS_FILE_ITEM=C:\exports\ItemDataCSV.csv
set LMS_FILE_LEARNING_PLAN=C:\exports\LearningPlan.csv
set LMS_FILE_LEARNING_HISTORY=C:\exports\Learning_History.csv
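The precedence above (environment variable if set, otherwise the documented default) can be sketched in Python. The function and dictionary names here are illustrative, not the actual contents of lms_dwh/config.py:

```python
import os

# Documented defaults from the configuration table above.
DEFAULTS = {
    "LMS_DB_SERVER": r"INLT3178\SQLEXPRESS",
    "LMS_DB_NAME": "LMS_DWH",
    "LMS_DB_DRIVER": "ODBC Driver 17 for SQL Server",
    "LMS_DB_TRUSTED": "yes",
}

def setting(name: str) -> str:
    """Return the environment override for `name`, or its default."""
    return os.environ.get(name, DEFAULTS[name])

def connection_string() -> str:
    """Assemble a pyodbc connection string from the resolved settings."""
    return (
        f"DRIVER={{{setting('LMS_DB_DRIVER')}}};"
        f"SERVER={setting('LMS_DB_SERVER')};"
        f"DATABASE={setting('LMS_DB_NAME')};"
        f"Trusted_Connection={setting('LMS_DB_TRUSTED')};"
    )
```

Setting an environment variable before launching the pipeline overrides the default without any code change.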

Running the Pipeline

All commands are run from inside the lms_dwh/ directory:

cd lms_dwh

Full Pipeline (all files)

python main.py \
    --category         "C:/exports/Category.csv" \
    --headcount        "C:/exports/HeadCount.csv" \
    --item             "C:/exports/ItemDataCSV.csv" \
    --learning-plan    "C:/exports/LearningPlan.csv" \
    --learning-history "C:/exports/Learning_History.csv" \
    --triggered-by     "Omprakash.G"

The backslash line continuations above are bash/Git Bash syntax; in cmd use ^ and in PowerShell use a backtick instead.

Partial Run (any subset of files)

You can pass any combination of file arguments. Only stages with a valid file path will execute.

# HeadCount + Item only (e.g. org re-org with no learning changes)
python main.py \
    --headcount "C:/exports/HeadCount.csv" \
    --item      "C:/exports/ItemDataCSV.csv"

Category Only (incremental, dedup-safe)

Category is exported separately from SharePoint and is incremental — duplicates are automatically skipped.

python main.py --category "C:/exports/Category.csv" --category-only
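The dedup-safe behaviour can be sketched as a streaming loader that skips any category key it has already seen, whether the duplicate occurs within the file or in previously staged data. Names and structure here are hypothetical, not the project's actual category_loader.py:

```python
import csv

def load_category(path, existing_keys):
    """Stream a Category.csv export and return only not-yet-loaded rows.

    `existing_keys` stands in for the keys already present in
    staging.stg_category; here it is any set of
    (group, subgroup, subject_area) tuples. Illustrative sketch only.
    """
    new_rows = []
    seen = set(existing_keys)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row["Group"], row["Sub-group"], row["Subject Area"])
            if key in seen:
                continue  # duplicate in the file or already staged
            seen.add(key)
            new_rows.append(row)
    return new_rows
```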

Use Pre-configured Paths

If you have set LMS_FILE_* environment variables:

python main.py --use-config --triggered-by "Scheduled_Job"

Load Order & What Each Step Does

Stage 0  : Truncate staging tables (headcount/item/lp/lh)
           → Ensures staging always reflects only the current CSV export

Stage 1  : Category CSV  → staging.stg_category
           → usp_load_d_category   (MERGE into d_category, SCD1)

Stage 2  : HeadCount CSV → staging.stg_headcount
           → Two-pass: Pass 1 builds supervisor name→emp_id lookup
                       Pass 2 resolves each row's supervisor
           → usp_load_d_user          (MERGE into d_user, SCD1)
           → usp_rebuild_user_hierarchy (TRUNCATE + rebuild org tree, 12 levels)

Stage 3  : Item Data CSV → staging.stg_item_data
           → usp_load_d_item  (SCD Type 2: new revision = new row)

Stage 4  : Learning Plan CSV → staging.stg_learning_plan
           → Pre-scan: resolve unknown user IDs (insert as External)
           → usp_load_f_assignment  (MERGE into f_learning_assignment)

Stage 5  : Learning History CSV → staging.stg_learning_history
           → Pre-scan: resolve unknown user IDs
           → usp_load_f_completion_new         (new completion records)
           → usp_load_f_completion_corrections (supersede older records)
           → usp_quarantine_completion          (quarantine bad refs)
           → usp_refresh_completion_flags       (update is_superseded flags)
           → usp_resolve_quarantine             (retry quarantine on current data)
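The two-pass supervisor resolution in Stage 2 can be sketched as below. This is a simplified illustration using the HeadCount CSV header names; the real headcount_loader.py also writes to staging and tracks unresolved counts:

```python
def resolve_supervisors(rows):
    """Resolve each row's Supervisor name to an employee number.

    Pass 1 indexes every employee by "Lastname, Firstname"; pass 2
    replaces each row's Supervisor name with the matching
    Global Employee Number, or None when the name cannot be resolved.
    """
    # Pass 1: build the name -> employee-number lookup
    lookup = {}
    for row in rows:
        name = f'{row["Last Name"]}, {row["First Name"]}'
        lookup[name] = row["Global Employee Number"]

    # Pass 2: resolve each supervisor against the lookup
    for row in rows:
        row["supervisor_emp_id"] = lookup.get(row["Supervisor"])
    return rows
```

Names that never appear as an employee in the same file (for example top-level executives) resolve to None, matching the unresolved-supervisor count reported in the terminal output.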

Terminal Output Example

============================================================
LMS DWH Pipeline Starting
============================================================
  [OK] Staging tables truncated

[Step 1] Category
  [########--------------------]  40.0%       636/1,590 read        636 inserted  Category.csv
  [DONE] 159 unique rows -> staging.stg_category

[Step 2] HeadCount
  Pass 1: building supervisor name -> emp_id lookup ...
  Lookup built: 14,078 employees indexed
  [############################] 100.0%    14,078/14,078 read     14,078 inserted
  [DONE] 14,078/14,078 rows  |  1,764 supervisor names unresolved (NULL)

...
Pipeline Complete.
============================================================

CSV Source Files

Place CSV exports from SAP SuccessFactors / SharePoint in any folder and pass the path via CLI arguments.

| File | Source | ~Size | Frequency |
| --- | --- | --- | --- |
| Category.csv | SharePoint list export | ~6 KB | On change |
| HeadCount.csv | SAP SuccessFactors People | ~5 MB | Monthly |
| ItemDataCSV.csv | SAP LMS Item Data report | ~27 MB | Monthly |
| LearningPlan.csv | SAP LMS Learning Plan report | ~177 MB | Monthly |
| Learning_History.csv | SAP LMS Learning History report | ~8 MB | Monthly |

CSV files are excluded from this repository via .gitignore. Store them on a network share or SharePoint.

Expected CSV Headers

Category.csv: Group, Sub-group, Subject Area

HeadCount.csv: Global Employee Number, First Name, Last Name, Department, Division, Group, Job Family, Job Function, Job Name, Job Title (Local), Legal Entity, Location (Office), Product, Job Level, Supervisor, Business Email Information Email Address, Location (Office) Country, Employment Details Hire Date (Sapiens), Employment Details Original Start Date (Merge/Acquisition/Relocation), Gender, Home Phone Information Phone Number, Employee Type, Employee Status, Private Cell Formatted Phone Number, Other Formatted Phone Number

ItemDataCSV.csv: Item Type, Item ID, Item Revision Date, Revision Number, Item Title, Active, Duration (hours), Category ID, Category Description (+ 52 more columns)

LearningPlan.csv: User ID, Item Type, Entity ID, Revision Number, Completion Status, Assignment Type, Assigned By, Assigned By (ID), Assigned Date, Complete Date, Failure Date, Curriculum ID (+ 26 more columns)

Learning_History.csv: User, Item Type, Entity ID, Revision Number, Completion Status (ID), Completion Status, Completion Date, Score (+ 35 more columns)


Dimensional Model Reference

| Table | Type | Description |
| --- | --- | --- |
| dwh.d_date | Dimension | Date spine 2000–2040, INT key YYYYMMDD |
| dwh.d_category | SCD1 | Training category hierarchy (Group → Subgroup → Subject Area) |
| dwh.d_user | SCD1 | Employees + auto-inserted external users |
| dwh.d_user_hierarchy | Snapshot | 12-level flattened org chart, rebuilt each HeadCount load |
| dwh.d_item | SCD2 | Course/item catalogue; each revision = new row |
| dwh.f_learning_assignment | Fact | One row per user × item × assignment date (from Learning Plan) |
| dwh.f_learning_completion | Fact | Completion records with correction model (is_superseded flag) |

Querying the Hierarchy

-- All completions for everyone under a specific manager
SELECT u.first_name + ' ' + u.last_name AS employee, i.item_title, f.completion_date_sk
FROM dwh.f_learning_completion f
JOIN dwh.d_user           u   ON u.user_sk  = f.user_sk
JOIN dwh.d_user_hierarchy h   ON h.user_sk  = f.user_sk
JOIN dwh.d_user           mgr ON mgr.user_id = '706620'   -- manager emp_id
JOIN dwh.d_item           i   ON i.item_sk   = f.item_sk
WHERE mgr.user_sk IN (h.level_1_sk, h.level_2_sk, h.level_3_sk,
                      h.level_4_sk, h.level_5_sk, h.level_6_sk);
-- d_user_hierarchy is 12 levels deep; extend the IN list with the
-- deeper level columns to search the full org tree

Audit & Monitoring

Every pipeline run writes to audit.batch_control:

SELECT batch_id, source_file_name, source_table, status,
       rows_read, rows_inserted, rows_quarantined,
       batch_start_ts, batch_end_ts, triggered_by
FROM audit.batch_control
ORDER BY batch_id DESC;

User ID mismatches (IDs in learning files not found in HeadCount) are logged to audit.user_id_mismatch. These users are automatically inserted into d_user as employee_type = 'External'.
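The split into known employees versus auto-created externals can be sketched as follows. This is an illustration of the pre-scan described above, not the project's actual normalizer.py:

```python
def find_external_user_ids(fact_user_ids, known_user_ids):
    """Return the user IDs present in a learning fact file but absent
    from HeadCount.

    In the real pipeline these IDs are logged to audit.user_id_mismatch
    and inserted into d_user with employee_type = 'External'.
    """
    known = set(known_user_ids)
    # De-duplicate and sort so repeated fact rows yield one external user.
    return sorted({uid for uid in fact_user_ids if uid not in known})
```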


Known Data Notes

| Item | Detail |
| --- | --- |
| SAP date format | Datetime fields include a Universal suffix (e.g. 4/24/2025 11:30:00 Universal); it is stripped in the stored procedure before casting. |
| Supervisor column | The HeadCount CSV stores supervisors as "Lastname, Firstname", resolved to an employee ID via a two-pass lookup against the same file. ~12.5% may remain NULL (top-level executives or name mismatches). |
| Category gaps | Items with Category Description = P&C, L&P, Sapiens Israel, etc. have no entry in the Category CSV. Add them to the category export to resolve. |
| External users | User IDs found in Learning Plan/History but absent from HeadCount are auto-created as External users in d_user. |
| Progress bar | Chunked loading shows a live #--- progress bar and percentage in the terminal. Chunk size = 10,000 rows. |
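The Universal-suffix handling happens in T-SQL in the real pipeline; the equivalent logic in Python would look like this (an illustrative sketch, assuming the month/day/year format shown above):

```python
from datetime import datetime

def parse_sap_datetime(value: str) -> datetime:
    """Parse an SAP LMS datetime such as '4/24/2025 11:30:00 Universal'.

    The ' Universal' timezone suffix is stripped before parsing, mirroring
    what the stored procedure does before CASTing to a datetime.
    """
    cleaned = value.replace(" Universal", "").strip()
    return datetime.strptime(cleaned, "%m/%d/%Y %H:%M:%S")
```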

Project Structure

Successfactor_Report_Rebuilt/
├── README.md                              # This file
├── requirements.txt                       # Python dependencies
├── .gitignore                             # Excludes CSV files and pycache
│
├── 01_ddl_schemas_staging_audit.sql       # Schemas, staging tables, audit tables
├── 02_ddl_dwh_dimensions_facts.sql        # Dimension & fact table DDL, d_date spine
├── 03_transformations_and_views.sql       # Mart views for Power BI
├── 04_stored_procedures.sql               # All transform stored procedures
│
└── lms_dwh/                               # Python ETL package
    ├── main.py                            # CLI entry point (argparse)
    ├── config.py                          # DB connection & file path config
    ├── db.py                              # pyodbc helpers (connect, bulk_insert)
    ├── batch.py                           # Audit batch context manager
    ├── normalizer.py                      # User ID normalization & external user logic
    ├── pipeline.py                        # Pipeline orchestration
    ├── loaders/
    │   ├── base_loader.py                 # Chunked CSV loader base class
    │   ├── category_loader.py             # Category (dedup, incremental)
    │   ├── headcount_loader.py            # HeadCount (two-pass supervisor resolve)
    │   ├── item_loader.py                 # Item Data (61 columns, SCD2 feed)
    │   ├── learning_plan_loader.py        # Learning Plan (39 columns)
    │   └── learning_history_loader.py     # Learning History (43 columns)
    └── transforms/
        └── executor.py                    # Calls EXEC dwh.usp_load_* in order

Troubleshooting

pyodbc.Error: IM002 "Data source name not found"
→ Install ODBC Driver 17 for SQL Server: https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server

Login failed for user
→ Ensure your Windows account has the db_datareader, db_datawriter, and db_ddladmin roles on LMS_DWH, or is sysadmin on the instance.

Invalid column name 'item_id' in the quarantine SP
→ Known bug in 01_ddl_schemas_staging_audit.sql: the index ix_quarantine_item references item_id, but the quarantine table uses entity_id. Drop and recreate the index if it exists.

Progress bar shows the wrong total (e.g. 75,265 instead of 39,024 for Item Data)
→ Fixed in the current version: _count_rows() now uses csv.reader, which correctly handles quoted fields containing embedded newlines.
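The corrected counting logic can be sketched as below. This is an assumed reconstruction of what _count_rows() does, not the actual source: csv.reader treats a quoted field with embedded newlines as part of one record, whereas counting raw lines over-reports.

```python
import csv

def count_rows(path: str) -> int:
    """Count logical CSV records (excluding the header), not raw newlines,
    so quoted fields containing embedded newlines are not over-counted.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)
```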

UPDATE failed because SET options have incorrect settings: QUOTED_IDENTIFIER
→ The usp_load_d_item stored procedure must be created with SET QUOTED_IDENTIFIER ON in effect. 04_stored_procedures.sql already includes this; re-run the file if you see this error.
