
LMS DWH ETL Pipeline

SAP SuccessFactors Learning → SQL Server Dimensional Model

An automated ETL pipeline that ingests SAP SuccessFactors Learning CSV exports into a SQL Server dimensional data warehouse (star schema), powering downstream Power BI reporting.


Architecture Overview

SAP SuccessFactors (CSV exports)
        │
        ▼
 ┌─────────────────────────────────────────────┐
 │  Python ETL  (lms_dwh/)                     │
 │  ┌─────────────────────────────────────┐    │
 │  │  Loaders (5 CSV sources)            │    │
 │  │  Category / HeadCount / Item /      │    │
 │  │  Learning Plan / Learning History   │    │
 │  └────────────────┬────────────────────┘    │
 │                   │ bulk insert              │
 └───────────────────┼──────────────────────────┘
                     ▼
 ┌─────────────────────────────────────────────┐
 │  SQL Server  (LMS_DWH)                      │
 │                                             │
 │  staging.*          →   dwh.*               │
 │  stg_category       →   d_category (SCD1)  │
 │  stg_headcount      →   d_user (SCD1)      │
 │                     →   d_user_hierarchy   │
 │  stg_item_data      →   d_item (SCD2)      │
 │  stg_learning_plan  →   f_learning_        │
 │                         assignment         │
 │  stg_learning_      →   f_learning_        │
 │  history                completion         │
 └─────────────────────────────────────────────┘
                     │
                     ▼
            Power BI Reports

Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | 3.9+ | |
| SQL Server | 2017+ or SQL Server Express | Tested on INLT3178\SQLEXPRESS |
| ODBC Driver 17 for SQL Server | 17.x | [Download](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server) |
| Windows Authentication | Required | Pipeline uses Trusted_Connection=yes by default |

Installation

1. Clone the repository

git clone https://github.com/OmprakashShyamalan/Successfactor_Report_Rebuilt.git
cd Successfactor_Report_Rebuilt

2. Install Python dependencies

pip install -r requirements.txt

Only one external dependency: pyodbc for SQL Server connectivity.

3. Deploy the SQL Server schema

Run the four SQL scripts in order against your LMS_DWH database (create the database first if it does not already exist). Open SSMS or use sqlcmd:

# Step 1: Schemas, audit tables, staging tables
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 01_ddl_schemas_staging_audit.sql

# Step 2: Dimension and fact table DDL
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 02_ddl_dwh_dimensions_facts.sql

# Step 3: Mart views
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 03_transformations_and_views.sql

# Step 4: Stored procedures (transform SPs called by the pipeline)
sqlcmd -S "YOUR_SERVER\INSTANCE" -d LMS_DWH -E -i 04_stored_procedures.sql

Note: Scripts must be run in order. Each script depends on objects from the previous one.

4. Configure the connection

The pipeline reads configuration from environment variables with sensible defaults.

| Variable | Default | Description |
| --- | --- | --- |
| LMS_DB_SERVER | INLT3178\SQLEXPRESS | SQL Server instance |
| LMS_DB_NAME | LMS_DWH | Database name |
| LMS_DB_DRIVER | ODBC Driver 17 for SQL Server | ODBC driver name |
| LMS_DB_TRUSTED | yes | Use Windows Auth (yes) or SQL Auth (no) |

Option A — Set environment variables (recommended for automation):

# Windows Command Prompt
set LMS_DB_SERVER=YOUR_SERVER\INSTANCE
set LMS_DB_NAME=LMS_DWH

# Windows PowerShell
$env:LMS_DB_SERVER = "YOUR_SERVER\INSTANCE"
$env:LMS_DB_NAME   = "LMS_DWH"

Option B — Edit lms_dwh/config.py directly (for local dev):

_server   = r"YOUR_SERVER\INSTANCE"
_database = "LMS_DWH"

Option C — Pre-configure CSV file paths (for --use-config mode):

set LMS_FILE_CATEGORY=C:\exports\Category.csv
set LMS_FILE_HEADCOUNT=C:\exports\HeadCount.csv
set LMS_FILE_ITEM=C:\exports\ItemDataCSV.csv
set LMS_FILE_LEARNING_PLAN=C:\exports\LearningPlan.csv
set LMS_FILE_LEARNING_HISTORY=C:\exports\Learning_History.csv
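The precedence above (environment variable if set, otherwise the documented default) can be sketched in Python. The function and dictionary names here are illustrative, not the actual contents of lms_dwh/config.py:

```python
import os

# Documented defaults from the configuration table above.
DEFAULTS = {
    "LMS_DB_SERVER": r"INLT3178\SQLEXPRESS",
    "LMS_DB_NAME": "LMS_DWH",
    "LMS_DB_DRIVER": "ODBC Driver 17 for SQL Server",
    "LMS_DB_TRUSTED": "yes",
}

def setting(name: str) -> str:
    """Return the environment override for `name`, or its default."""
    return os.environ.get(name, DEFAULTS[name])

def connection_string() -> str:
    """Assemble a pyodbc connection string from the resolved settings."""
    return (
        f"DRIVER={{{setting('LMS_DB_DRIVER')}}};"
        f"SERVER={setting('LMS_DB_SERVER')};"
        f"DATABASE={setting('LMS_DB_NAME')};"
        f"Trusted_Connection={setting('LMS_DB_TRUSTED')};"
    )
```

Setting an environment variable before launching the pipeline overrides the default without any code change.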

Running the Pipeline

All commands are run from inside the lms_dwh/ directory:

cd lms_dwh

Full Pipeline (all files)

python main.py \
    --category         "C:/exports/Category.csv" \
    --headcount        "C:/exports/HeadCount.csv" \
    --item             "C:/exports/ItemDataCSV.csv" \
    --learning-plan    "C:/exports/LearningPlan.csv" \
    --learning-history "C:/exports/Learning_History.csv" \
    --triggered-by     "Omprakash.G"

The backslash line continuations above are bash/Git Bash syntax; in cmd use ^ and in PowerShell use a backtick instead.

Partial Run (any subset of files)

You can pass any combination of file arguments. Only stages with a valid file path will execute.

# HeadCount + Item only (e.g. org re-org with no learning changes)
python main.py \
    --headcount "C:/exports/HeadCount.csv" \
    --item      "C:/exports/ItemDataCSV.csv"

Category Only (incremental, dedup-safe)

Category is exported separately from SharePoint and is incremental — duplicates are automatically skipped.

python main.py --category "C:/exports/Category.csv" --category-only
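The dedup-safe behaviour can be sketched as a streaming loader that skips any category key it has already seen, whether the duplicate occurs within the file or in previously staged data. Names and structure here are hypothetical, not the project's actual category_loader.py:

```python
import csv

def load_category(path, existing_keys):
    """Stream a Category.csv export and return only not-yet-loaded rows.

    `existing_keys` stands in for the keys already present in
    staging.stg_category; here it is any set of
    (group, subgroup, subject_area) tuples. Illustrative sketch only.
    """
    new_rows = []
    seen = set(existing_keys)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row["Group"], row["Sub-group"], row["Subject Area"])
            if key in seen:
                continue  # duplicate in the file or already staged
            seen.add(key)
            new_rows.append(row)
    return new_rows
```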

Use Pre-configured Paths

If you have set LMS_FILE_* environment variables:

python main.py --use-config --triggered-by "Scheduled_Job"

Load Order & What Each Step Does

Stage 0  : Truncate staging tables (headcount/item/lp/lh)
           → Ensures staging always reflects only the current CSV export

Stage 1  : Category CSV  → staging.stg_category
           → usp_load_d_category   (MERGE into d_category, SCD1)

Stage 2  : HeadCount CSV → staging.stg_headcount
           → Two-pass: Pass 1 builds supervisor name→emp_id lookup
                       Pass 2 resolves each row's supervisor
           → usp_load_d_user          (MERGE into d_user, SCD1)
           → usp_rebuild_user_hierarchy (TRUNCATE + rebuild org tree, 12 levels)

Stage 3  : Item Data CSV → staging.stg_item_data
           → usp_load_d_item  (SCD Type 2: new revision = new row)

Stage 4  : Learning Plan CSV → staging.stg_learning_plan
           → Pre-scan: resolve unknown user IDs (insert as External)
           → usp_load_f_assignment  (MERGE into f_learning_assignment)

Stage 5  : Learning History CSV → staging.stg_learning_history
           → Pre-scan: resolve unknown user IDs
           → usp_load_f_completion_new         (new completion records)
           → usp_load_f_completion_corrections (supersede older records)
           → usp_quarantine_completion          (quarantine bad refs)
           → usp_refresh_completion_flags       (update is_superseded flags)
           → usp_resolve_quarantine             (retry quarantine on current data)
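The two-pass supervisor resolution in Stage 2 can be sketched as below. This is a simplified illustration using the HeadCount CSV header names; the real headcount_loader.py also writes to staging and tracks unresolved counts:

```python
def resolve_supervisors(rows):
    """Resolve each row's Supervisor name to an employee number.

    Pass 1 indexes every employee by "Lastname, Firstname"; pass 2
    replaces each row's Supervisor name with the matching
    Global Employee Number, or None when the name cannot be resolved.
    """
    # Pass 1: build the name -> employee-number lookup
    lookup = {}
    for row in rows:
        name = f'{row["Last Name"]}, {row["First Name"]}'
        lookup[name] = row["Global Employee Number"]

    # Pass 2: resolve each supervisor against the lookup
    for row in rows:
        row["supervisor_emp_id"] = lookup.get(row["Supervisor"])
    return rows
```

Names that never appear as an employee in the same file (for example top-level executives) resolve to None, matching the unresolved-supervisor count reported in the terminal output.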

Terminal Output Example

============================================================
LMS DWH Pipeline Starting
============================================================
  [OK] Staging tables truncated

[Step 1] Category
  [########--------------------]  40.0%       636/1,590 read        636 inserted  Category.csv
  [DONE] 159 unique rows -> staging.stg_category

[Step 2] HeadCount
  Pass 1: building supervisor name -> emp_id lookup ...
  Lookup built: 14,078 employees indexed
  [############################] 100.0%    14,078/14,078 read     14,078 inserted
  [DONE] 14,078/14,078 rows  |  1,764 supervisor names unresolved (NULL)

...
Pipeline Complete.
============================================================

CSV Source Files

Place CSV exports from SAP SuccessFactors / SharePoint in any folder and pass the path via CLI arguments.

| File | Source | ~Size | Frequency |
| --- | --- | --- | --- |
| Category.csv | SharePoint list export | ~6 KB | On change |
| HeadCount.csv | SAP SuccessFactors People | ~5 MB | Monthly |
| ItemDataCSV.csv | SAP LMS Item Data report | ~27 MB | Monthly |
| LearningPlan.csv | SAP LMS Learning Plan report | ~177 MB | Monthly |
| Learning_History.csv | SAP LMS Learning History report | ~8 MB | Monthly |

CSV files are excluded from this repository via .gitignore. Store them on a network share or SharePoint.

Expected CSV Headers

Category.csv: Group, Sub-group, Subject Area

HeadCount.csv: Global Employee Number, First Name, Last Name, Department, Division, Group, Job Family, Job Function, Job Name, Job Title (Local), Legal Entity, Location (Office), Product, Job Level, Supervisor, Business Email Information Email Address, Location (Office) Country, Employment Details Hire Date (Sapiens), Employment Details Original Start Date (Merge/Acquisition/Relocation), Gender, Home Phone Information Phone Number, Employee Type, Employee Status, Private Cell Formatted Phone Number, Other Formatted Phone Number

ItemDataCSV.csv: Item Type, Item ID, Item Revision Date, Revision Number, Item Title, Active, Duration (hours), Category ID, Category Description (+ 52 more columns)

LearningPlan.csv: User ID, Item Type, Entity ID, Revision Number, Completion Status, Assignment Type, Assigned By, Assigned By (ID), Assigned Date, Complete Date, Failure Date, Curriculum ID (+ 26 more columns)

Learning_History.csv: User, Item Type, Entity ID, Revision Number, Completion Status (ID), Completion Status, Completion Date, Score (+ 35 more columns)


Dimensional Model Reference

| Table | Type | Description |
| --- | --- | --- |
| dwh.d_date | Dimension | Date spine 2000–2040, INT key YYYYMMDD |
| dwh.d_category | SCD1 | Training category hierarchy (Group → Subgroup → Subject Area) |
| dwh.d_user | SCD1 | Employees + auto-inserted external users |
| dwh.d_user_hierarchy | Snapshot | 12-level flattened org chart, rebuilt each HeadCount load |
| dwh.d_item | SCD2 | Course/item catalogue; each revision = new row |
| dwh.f_learning_assignment | Fact | One row per user × item × assignment date (from Learning Plan) |
| dwh.f_learning_completion | Fact | Completion records with correction model (is_superseded flag) |

Querying the Hierarchy

-- All completions for everyone under a specific manager
SELECT u.first_name + ' ' + u.last_name AS employee, i.item_title, f.completion_date_sk
FROM dwh.f_learning_completion f
JOIN dwh.d_user           u   ON u.user_sk  = f.user_sk
JOIN dwh.d_user_hierarchy h   ON h.user_sk  = f.user_sk
JOIN dwh.d_user           mgr ON mgr.user_id = '706620'   -- manager emp_id
JOIN dwh.d_item           i   ON i.item_sk   = f.item_sk
WHERE mgr.user_sk IN (h.level_1_sk, h.level_2_sk, h.level_3_sk,
                      h.level_4_sk, h.level_5_sk, h.level_6_sk);
-- d_user_hierarchy is 12 levels deep; extend the IN list with the
-- deeper level columns to search the full org tree

Audit & Monitoring

Every pipeline run writes to audit.batch_control:

SELECT batch_id, source_file_name, source_table, status,
       rows_read, rows_inserted, rows_quarantined,
       batch_start_ts, batch_end_ts, triggered_by
FROM audit.batch_control
ORDER BY batch_id DESC;

User ID mismatches (IDs in learning files not found in HeadCount) are logged to audit.user_id_mismatch. These users are automatically inserted into d_user as employee_type = 'External'.
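The split into known employees versus auto-created externals can be sketched as follows. This is an illustration of the pre-scan described above, not the project's actual normalizer.py:

```python
def find_external_user_ids(fact_user_ids, known_user_ids):
    """Return the user IDs present in a learning fact file but absent
    from HeadCount.

    In the real pipeline these IDs are logged to audit.user_id_mismatch
    and inserted into d_user with employee_type = 'External'.
    """
    known = set(known_user_ids)
    # De-duplicate and sort so repeated fact rows yield one external user.
    return sorted({uid for uid in fact_user_ids if uid not in known})
```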


Known Data Notes

| Item | Detail |
| --- | --- |
| SAP date format | Datetime fields include a Universal suffix (e.g. 4/24/2025 11:30:00 Universal); it is stripped in the stored procedure before casting. |
| Supervisor column | The HeadCount CSV stores supervisors as "Lastname, Firstname", resolved to an employee ID via a two-pass lookup against the same file. ~12.5% may remain NULL (top-level executives or name mismatches). |
| Category gaps | Items with Category Description = P&C, L&P, Sapiens Israel, etc. have no entry in the Category CSV. Add them to the category export to resolve. |
| External users | User IDs found in Learning Plan/History but absent from HeadCount are auto-created as External users in d_user. |
| Progress bar | Chunked loading shows a live #--- progress bar and percentage in the terminal. Chunk size = 10,000 rows. |
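The Universal-suffix handling happens in T-SQL in the real pipeline; the equivalent logic in Python would look like this (an illustrative sketch, assuming the month/day/year format shown above):

```python
from datetime import datetime

def parse_sap_datetime(value: str) -> datetime:
    """Parse an SAP LMS datetime such as '4/24/2025 11:30:00 Universal'.

    The ' Universal' timezone suffix is stripped before parsing, mirroring
    what the stored procedure does before CASTing to a datetime.
    """
    cleaned = value.replace(" Universal", "").strip()
    return datetime.strptime(cleaned, "%m/%d/%Y %H:%M:%S")
```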

Project Structure

Successfactor_Report_Rebuilt/
├── README.md                              # This file
├── requirements.txt                       # Python dependencies
├── .gitignore                             # Excludes CSV files and pycache
│
├── 01_ddl_schemas_staging_audit.sql       # Schemas, staging tables, audit tables
├── 02_ddl_dwh_dimensions_facts.sql        # Dimension & fact table DDL, d_date spine
├── 03_transformations_and_views.sql       # Mart views for Power BI
├── 04_stored_procedures.sql               # All transform stored procedures
│
└── lms_dwh/                               # Python ETL package
    ├── main.py                            # CLI entry point (argparse)
    ├── config.py                          # DB connection & file path config
    ├── db.py                              # pyodbc helpers (connect, bulk_insert)
    ├── batch.py                           # Audit batch context manager
    ├── normalizer.py                      # User ID normalization & external user logic
    ├── pipeline.py                        # Pipeline orchestration
    ├── loaders/
    │   ├── base_loader.py                 # Chunked CSV loader base class
    │   ├── category_loader.py             # Category (dedup, incremental)
    │   ├── headcount_loader.py            # HeadCount (two-pass supervisor resolve)
    │   ├── item_loader.py                 # Item Data (61 columns, SCD2 feed)
    │   ├── learning_plan_loader.py        # Learning Plan (39 columns)
    │   └── learning_history_loader.py     # Learning History (43 columns)
    └── transforms/
        └── executor.py                    # Calls EXEC dwh.usp_load_* in order

Troubleshooting

pyodbc.Error: IM002 "Data source name not found"
→ Install ODBC Driver 17 for SQL Server: https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server

Login failed for user
→ Ensure your Windows account has the db_datareader, db_datawriter, and db_ddladmin roles on LMS_DWH, or is sysadmin on the instance.

Invalid column name 'item_id' in the quarantine SP
→ Known bug in 01_ddl_schemas_staging_audit.sql: the index ix_quarantine_item references item_id, but the quarantine table uses entity_id. Drop and recreate the index if it exists.

Progress bar shows the wrong total (e.g. 75,265 instead of 39,024 for Item Data)
→ Fixed in the current version: _count_rows() now uses csv.reader, which correctly handles quoted fields containing embedded newlines.
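The corrected counting logic can be sketched as below. This is an assumed reconstruction of what _count_rows() does, not the actual source: csv.reader treats a quoted field with embedded newlines as part of one record, whereas counting raw lines over-reports.

```python
import csv

def count_rows(path: str) -> int:
    """Count logical CSV records (excluding the header), not raw newlines,
    so quoted fields containing embedded newlines are not over-counted.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)
```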

UPDATE failed because SET options have incorrect settings: QUOTED_IDENTIFIER
→ The usp_load_d_item stored procedure must be created with SET QUOTED_IDENTIFIER ON in effect. 04_stored_procedures.sql already includes this; re-run the file if you see this error.
