An automated ETL pipeline that ingests SAP SuccessFactors Learning CSV exports into a star-schema SQL Server data warehouse, powering downstream Power BI reporting.
```
SAP SuccessFactors (CSV exports)
        │
        ▼
┌─────────────────────────────────────────────┐
│ Python ETL (lms_dwh/)                       │
│ ┌─────────────────────────────────────┐     │
│ │ Loaders (5 CSV sources)             │     │
│ │ Category / HeadCount / Item /       │     │
│ │ Learning Plan / Learning History    │     │
│ └────────────────┬────────────────────┘     │
│                  │ bulk insert              │
└──────────────────┼──────────────────────────┘
                   ▼
┌─────────────────────────────────────────────┐
│ SQL Server (LMS_DWH)                        │
│                                             │
│ staging.*          →  dwh.*                 │
│ stg_category       →  d_category (SCD1)     │
│ stg_headcount      →  d_user (SCD1)         │
│                    →  d_user_hierarchy      │
│ stg_item_data      →  d_item (SCD2)         │
│ stg_learning_plan  →  f_learning_           │
│                       assignment            │
│ stg_learning_      →  f_learning_           │
│   history             completion            │
└─────────────────────────────────────────────┘
        │
        ▼
  Power BI Reports
```
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.9+ | |
| SQL Server | 2017+ or SQL Server Express | Tested on INLT3178\SQLEXPRESS |
| ODBC Driver 17 for SQL Server | 17.x | https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server |
| Windows Authentication | Required | Pipeline uses Trusted_Connection=yes by default |
```shell
git clone https://github.com/OmprakashShyamalan/Successfactor_Report_Rebuilt.git
cd Successfactor_Report_Rebuilt
pip install -r requirements.txt
```

Only one external dependency: pyodbc, for SQL Server connectivity.
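To illustrate how the pyodbc side of a bulk load is typically wired up, here is a hedged sketch; `insert_sql` is a hypothetical helper, not necessarily what `lms_dwh/db.py` exposes:

```python
def insert_sql(table, columns):
    """Build an INSERT statement with one ? placeholder per column,
    suitable for cursor.executemany on a pyodbc cursor."""
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

# Usage (requires a live connection; shown as comments only):
#   cur = conn.cursor()
#   cur.fast_executemany = True   # pyodbc setting that batches parameter sets
#   cur.executemany(insert_sql("staging.stg_category", cols), rows)
```

`fast_executemany` is the standard pyodbc switch for fast batched inserts against SQL Server; without it, `executemany` issues one round-trip per row.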
Run the four SQL scripts in order against your LMS_DWH database.
Open SSMS or use sqlcmd:

```shell
# Step 1: Schemas, audit tables, staging tables
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 01_ddl_schemas_staging_audit.sql

# Step 2: Dimension and fact table DDL
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 02_ddl_dwh_dimensions_facts.sql

# Step 3: Mart views
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 03_transformations_and_views.sql

# Step 4: Stored procedures (transform SPs called by the pipeline)
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 04_stored_procedures.sql
```

Note: Scripts must be run in order. Each script depends on objects created by the previous one.
The pipeline reads configuration from environment variables with sensible defaults.
| Variable | Default | Description |
|---|---|---|
| LMS_DB_SERVER | INLT3178\SQLEXPRESS | SQL Server instance |
| LMS_DB_NAME | LMS_DWH | Database name |
| LMS_DB_DRIVER | ODBC Driver 17 for SQL Server | ODBC driver name |
| LMS_DB_TRUSTED | yes | Use Windows Auth (yes) or SQL Auth (no) |
Option A — Set environment variables (recommended for automation):

```shell
# Windows Command Prompt
set LMS_DB_SERVER=YOUR_SERVER\INSTANCE
set LMS_DB_NAME=LMS_DWH

# Windows PowerShell
$env:LMS_DB_SERVER = "YOUR_SERVER\INSTANCE"
$env:LMS_DB_NAME = "LMS_DWH"
```

Option B — Edit lms_dwh/config.py directly (for local dev):

```python
_server = r"YOUR_SERVER\INSTANCE"
_database = "LMS_DWH"
```

Option C — Pre-configure CSV file paths (for --use-config mode):

```shell
set LMS_FILE_CATEGORY=C:\exports\Category.csv
set LMS_FILE_HEADCOUNT=C:\exports\HeadCount.csv
set LMS_FILE_ITEM=C:\exports\ItemDataCSV.csv
set LMS_FILE_LEARNING_PLAN=C:\exports\LearningPlan.csv
set LMS_FILE_LEARNING_HISTORY=C:\exports\Learning_History.csv
```

All commands are run from inside the lms_dwh/ directory:
```shell
cd lms_dwh

python main.py \
  --category "C:/exports/Category.csv" \
  --headcount "C:/exports/HeadCount.csv" \
  --item "C:/exports/ItemDataCSV.csv" \
  --learning-plan "C:/exports/LearningPlan.csv" \
  --learning-history "C:/exports/Learning_History.csv" \
  --triggered-by "Omprakash.G"
```

You can pass any combination of file arguments. Only stages with a valid file path will execute.
```shell
# HeadCount + Item only (e.g. org re-org with no learning changes)
python main.py \
  --headcount "C:/exports/HeadCount.csv" \
  --item "C:/exports/ItemDataCSV.csv"
```

Category is exported separately from SharePoint and is incremental — duplicates are automatically skipped.
```shell
python main.py --category "C:/exports/Category.csv" --category-only
```

If you have set LMS_FILE_* environment variables:
```shell
python main.py --use-config --triggered-by "Scheduled_Job"
```

```
Stage 0 : Truncate staging tables (headcount/item/lp/lh)
          → Ensures staging always reflects only the current CSV export
Stage 1 : Category CSV → staging.stg_category
          → usp_load_d_category (MERGE into d_category, SCD1)
Stage 2 : HeadCount CSV → staging.stg_headcount
          → Two-pass: Pass 1 builds supervisor name→emp_id lookup
                      Pass 2 resolves each row's supervisor
          → usp_load_d_user (MERGE into d_user, SCD1)
          → usp_rebuild_user_hierarchy (TRUNCATE + rebuild org tree, 12 levels)
Stage 3 : Item Data CSV → staging.stg_item_data
          → usp_load_d_item (SCD Type 2: new revision = new row)
Stage 4 : Learning Plan CSV → staging.stg_learning_plan
          → Pre-scan: resolve unknown user IDs (insert as External)
          → usp_load_f_assignment (MERGE into f_learning_assignment)
Stage 5 : Learning History CSV → staging.stg_learning_history
          → Pre-scan: resolve unknown user IDs
          → usp_load_f_completion_new (new completion records)
          → usp_load_f_completion_corrections (supersede older records)
          → usp_quarantine_completion (quarantine bad refs)
          → usp_refresh_completion_flags (update is_superseded flags)
          → usp_resolve_quarantine (retry quarantine on current data)
```
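The Stage 2 two-pass supervisor resolution can be sketched as follows. Column names follow the HeadCount layout documented below; the real loader also chunks rows and bulk-inserts them:

```python
def resolve_supervisors(rows):
    """Two-pass supervisor resolution over HeadCount rows (list of dicts).

    Pass 1 builds a "Lastname, Firstname" -> employee-id lookup from the
    same file; Pass 2 resolves each row's Supervisor against it.
    Unresolved names (top-level execs, name mismatches) stay None.
    """
    # Pass 1: name -> emp_id lookup
    lookup = {
        f'{r["Last Name"]}, {r["First Name"]}': r["Global Employee Number"]
        for r in rows
    }
    # Pass 2: resolve each row's supervisor
    for r in rows:
        r["supervisor_emp_id"] = lookup.get(r["Supervisor"])
    return rows
```

Because both passes read the same export, a supervisor can be resolved even when their row appears after their reports' rows.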
```
============================================================
LMS DWH Pipeline Starting
============================================================
[OK] Staging tables truncated
[Step 1] Category
[########--------------------] 40.0% 636/1,590 read 636 inserted Category.csv
[DONE] 159 unique rows -> staging.stg_category
[Step 2] HeadCount
Pass 1: building supervisor name -> emp_id lookup ...
Lookup built: 14,078 employees indexed
[############################] 100.0% 14,078/14,078 read 14,078 inserted
[DONE] 14,078/14,078 rows | 1,764 supervisor names unresolved (NULL)
...
Pipeline Complete.
============================================================
```
Place CSV exports from SAP SuccessFactors / SharePoint in any folder and pass the path via CLI arguments.
| File | Source | ~Size | Frequency |
|---|---|---|---|
| Category.csv | SharePoint list export | ~6 KB | On change |
| HeadCount.csv | SAP SuccessFactors People | ~5 MB | Monthly |
| ItemDataCSV.csv | SAP LMS Item Data report | ~27 MB | Monthly |
| LearningPlan.csv | SAP LMS Learning Plan report | ~177 MB | Monthly |
| Learning_History.csv | SAP LMS Learning History report | ~8 MB | Monthly |
CSV files are excluded from this repository via .gitignore. Store them on a network share or SharePoint.
Category.csv: Group, Sub-group, Subject Area
HeadCount.csv: Global Employee Number, First Name, Last Name, Department, Division, Group, Job Family, Job Function, Job Name, Job Title (Local), Legal Entity, Location (Office), Product, Job Level, Supervisor, Business Email Information Email Address, Location (Office) Country, Employment Details Hire Date (Sapiens), Employment Details Original Start Date (Merge/Acquisition/Relocation), Gender, Home Phone Information Phone Number, Employee Type, Employee Status, Private Cell Formatted Phone Number, Other Formatted Phone Number
ItemDataCSV.csv: Item Type, Item ID, Item Revision Date, Revision Number, Item Title, Active, Duration (hours), Category ID, Category Description (+ 52 more columns)
LearningPlan.csv: User ID, Item Type, Entity ID, Revision Number, Completion Status, Assignment Type, Assigned By, Assigned By (ID), Assigned Date, Complete Date, Failure Date, Curriculum ID (+ 26 more columns)
Learning_History.csv: User, Item Type, Entity ID, Revision Number, Completion Status (ID), Completion Status, Completion Date, Score (+ 35 more columns)
| Table | Type | Description |
|---|---|---|
| dwh.d_date | Dimension | Date spine 2000–2040, INT key YYYYMMDD |
| dwh.d_category | SCD1 | Training category hierarchy (Group → Subgroup → Subject Area) |
| dwh.d_user | SCD1 | Employees + auto-inserted external users |
| dwh.d_user_hierarchy | Snapshot | 12-level flattened org chart, rebuilt each HeadCount load |
| dwh.d_item | SCD2 | Course/item catalogue; each revision = new row |
| dwh.f_learning_assignment | Fact | One row per user × item × assignment date (from Learning Plan) |
| dwh.f_learning_completion | Fact | Completion records with correction model (is_superseded flag) |
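The SCD Type 2 behavior on d_item (new revision = new row, old row closed rather than updated) can be illustrated with a toy in-memory version. The real logic lives in usp_load_d_item; names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ItemRow:
    item_id: str
    revision: str
    is_current: bool = True

def scd2_apply(history, item_id, revision):
    """Append-only SCD2: a changed revision closes the current row and
    appends a new current row; an unchanged revision is a no-op."""
    current = next(
        (r for r in history if r.item_id == item_id and r.is_current), None
    )
    if current and current.revision == revision:
        return history            # unchanged revision: nothing to do
    if current:
        current.is_current = False  # close out the old revision
    history.append(ItemRow(item_id, revision))  # new revision = new row
    return history
```

Contrast this with the SCD1 dimensions (d_category, d_user), where a MERGE overwrites attributes in place and no history is kept.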
```sql
-- All completions for everyone under a specific manager
SELECT u.first_name + ' ' + u.last_name AS employee, i.item_title, f.completion_date_sk
FROM dwh.f_learning_completion f
JOIN dwh.d_user u ON u.user_sk = f.user_sk
JOIN dwh.d_user_hierarchy h ON h.user_sk = f.user_sk
JOIN dwh.d_user mgr ON mgr.user_id = '706620' -- manager emp_id
JOIN dwh.d_item i ON i.item_sk = f.item_sk
WHERE mgr.user_sk IN (h.level_1_sk, h.level_2_sk, h.level_3_sk,
                      h.level_4_sk, h.level_5_sk, h.level_6_sk);
```

Every pipeline run writes to audit.batch_control:
```sql
SELECT batch_id, source_file_name, source_table, status,
       rows_read, rows_inserted, rows_quarantined,
       batch_start_ts, batch_end_ts, triggered_by
FROM audit.batch_control
ORDER BY batch_id DESC;
```

User ID mismatches (IDs present in the learning files but missing from HeadCount) are logged to audit.user_id_mismatch. These users are automatically inserted into d_user with employee_type = 'External'.
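The pre-scan that backfills external users reduces to a set difference. This is an illustrative sketch with hypothetical names; the pipeline then inserts the missing IDs into dwh.d_user and logs them to audit.user_id_mismatch:

```python
def prescan_user_ids(staging_ids, known_ids):
    """Return user IDs that appear in a learning file but not in d_user.

    These are the candidates the pipeline auto-creates with
    employee_type = 'External' before loading the fact table, so that
    every fact row can resolve to a user surrogate key.
    """
    return sorted(set(staging_ids) - set(known_ids))
```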
| Item | Detail |
|---|---|
| SAP date format | Datetime fields include Universal suffix (e.g. 4/24/2025 11:30:00 Universal). Stripped in SP before casting. |
| Supervisor column | HeadCount CSV contains "Lastname, Firstname" — resolved to employee ID via two-pass lookup against same file. ~12.5% may remain NULL (top-level executives or name mismatches). |
| Category gaps | Items with Category Description = P&C, L&P, Sapiens Israel etc. have no entry in the Category CSV. Add them to the category export to resolve. |
| External users | User IDs found in Learning Plan/History but absent from HeadCount are auto-created as External users in d_user. |
| Progress bar | Chunked loading shows live #--- progress bar and % in terminal. Chunk size = 10,000 rows. |
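The chunked loading described above can be sketched like this. It is a simplified, hypothetical stand-in for loaders/base_loader.py; the real loader also drives the progress bar and bulk-inserts each chunk via db.py:

```python
import csv

CHUNK_SIZE = 10_000  # matches the chunk size documented above

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield lists of up to chunk_size dict rows from a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk   # hand a full chunk to the bulk inserter
                chunk = []
        if chunk:
            yield chunk       # flush the final partial chunk
```

Chunking keeps memory flat on the 177 MB LearningPlan export and gives natural checkpoints for updating the progress bar.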
```
Successfactor_Report_Rebuilt/
├── README.md                            # This file
├── requirements.txt                     # Python dependencies
├── .gitignore                           # Excludes CSV files and pycache
│
├── 01_ddl_schemas_staging_audit.sql     # Schemas, staging tables, audit tables
├── 02_ddl_dwh_dimensions_facts.sql      # Dimension & fact table DDL, d_date spine
├── 03_transformations_and_views.sql     # Mart views for Power BI
├── 04_stored_procedures.sql             # All transform stored procedures
│
└── lms_dwh/                             # Python ETL package
    ├── main.py                          # CLI entry point (argparse)
    ├── config.py                        # DB connection & file path config
    ├── db.py                            # pyodbc helpers (connect, bulk_insert)
    ├── batch.py                         # Audit batch context manager
    ├── normalizer.py                    # User ID normalization & external user logic
    ├── pipeline.py                      # Pipeline orchestration
    ├── loaders/
    │   ├── base_loader.py               # Chunked CSV loader base class
    │   ├── category_loader.py           # Category (dedup, incremental)
    │   ├── headcount_loader.py          # HeadCount (two-pass supervisor resolve)
    │   ├── item_loader.py               # Item Data (61 columns, SCD2 feed)
    │   ├── learning_plan_loader.py      # Learning Plan (39 columns)
    │   └── learning_history_loader.py   # Learning History (43 columns)
    └── transforms/
        └── executor.py                  # Calls EXEC dwh.usp_load_* in order
```
| Symptom | Fix |
|---|---|
| pyodbc.Error: IM002 Data source name not found | Install ODBC Driver 17: https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server |
| Login failed for user | Ensure your Windows account has the db_datareader, db_datawriter, and db_ddladmin roles on LMS_DWH, or is sysadmin on the instance. |
| Invalid column name 'item_id' in quarantine SP | Known bug in 01_ddl_schemas_staging_audit.sql: the index ix_quarantine_item references item_id, but the quarantine table uses entity_id. Drop and recreate the index if it exists. |
| Progress bar shows the wrong total (e.g. 75,265 instead of 39,024 for Item Data) | Fixed in the current version: _count_rows() now uses csv.reader, which correctly handles quoted fields containing embedded newlines. |
| UPDATE failed because SET options have incorrect settings: QUOTED_IDENTIFIER | The usp_load_d_item stored procedure must be created with SET QUOTED_IDENTIFIER ON in effect. 04_stored_procedures.sql already includes this; re-run the file if you see the error. |
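The progress-bar row-count fix above hinges on csv.reader treating a quoted field with an embedded newline as part of one record, where a naive newline count sees two lines. A minimal demonstration:

```python
import csv
import io

def count_rows(text):
    # csv.reader parses quoting, so a newline inside a quoted field
    # does not start a new record.
    return sum(1 for _ in csv.reader(io.StringIO(text)))

sample = 'a,b\n"line one\nline two",2\n'
# csv-aware count: 2 records (header + 1 data row);
# a raw newline count would report 3 "lines".
```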