Description
This FXT trace, prof_file_1.zip, generates the following error:
```
StarVZ Phase 1 - Start of /home/hall/Projects/CMP30(X)/mandelbrot/fxt2/20
Sat 19 Oct 2024 23:03:18 -03
~/Projects/CMP30(X)/mandelbrot/fxt2/20 ~/R/x86_64-pc-linux-gnu-library/4.4/starvz/tools
Convert from FXT to paje.sorted.trace
Execute stapu_fxt_tool
Sat 19 Oct 2024 23:03:18 -03
Sort paje.trace
Sat 19 Oct 2024 23:03:18 -03
Execute pmtool
Sat 19 Oct 2024 23:03:18 -03
Lionel's pmtool or platform_file.rec file are not available, skipping it.
Convert Rec files
Sat 19 Oct 2024 23:03:18 -03
Convert from paje.sorted.trace to paje.csv
Sat 19 Oct 2024 23:03:18 -03
Wait
Sat 19 Oct 2024 23:03:18 -03
Get states, links and variables in CSV
Sat 19 Oct 2024 23:03:18 -03
Convert (DAG) DOT to CSV
Sat 19 Oct 2024 23:03:18 -03
Convert (ATREE) DOT to CSV
Sat 19 Oct 2024 23:03:18 -03
Post-processing CSV files
Sat 19 Oct 2024 23:03:18 -03
23:03:19 Reading ./entities.csv
23:03:19 Starting the tree filtering to create Y coordinates
23:03:19 Starting y_coordinates
23:03:19 Reading ./paje.worker_state.csv.gz
23:03:19 Selecting application states based on runtime states.
23:03:19 This is a single-node trace...
23:03:19 Outlier detection using standard model
23:03:20 Reading ./paje.variable.csv.gz
23:03:20 This is a single-node trace...
23:03:20 Saving as parquet
23:03:20 ./variable.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Files ./pmtool.feather or ./pmtool.csv do not exist.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading ./rec.data_handles.csv.gz
23:03:20 Saving as parquet
23:03:20 ./data_handles.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 File ./rec.papi.csv.gz do not exist.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading ./paje.memory_state.csv.gz
23:03:20 Saving as parquet
23:03:20 ./memory_state.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading ./paje.comm_state.csv.gz
23:03:20 After reading Comm States, number of rows is zero.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading ./paje.other_state.csv.gz
23:03:20 Saving as parquet
23:03:20 ./other_state.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading ./paje.events.csv.gz
23:03:20 Saving as parquet
23:03:20 ./events.parquet
23:03:20 ./events_data.parquet
23:03:20 ./events_memory.parquet
23:03:20 Data for ./events_memory.parquet has not been feathered because is empty.
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:21 Reading ./rec.tasks.csv.gz
23:03:21 Saving as parquet
23:03:21 ./tasks.parquet
23:03:21 ./task_handles.parquet
23:03:21 ./origin.parquet
23:03:21 Data for ./origin.parquet has not been feathered because is empty.
23:03:21 Reading ./paje.link.csv.gz
23:03:21 File ./atree.csv do not exist.
23:03:21 Reading ./dag.csv.gz
23:03:21 Merge state data with the DAG
23:03:21 Get MPI tasks (links) to enrich the DAG
Error: stoi
In addition: Warning messages:
1: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
2: The following named parsers don't match the column names: Control, Model
3: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
4: The following named parsers don't match the column names: Node, DependsOn
Execution halted
Error when executing phase1-workflow.R (exit status: 1)
Phase 1 Failed (exit status: 2) stopping
```
The issue arises in the read_dag function: dfl has the MPIType column, and the levels of its Dest column contain the CPU and CUDA resources.
Lines 970 to 1040 in 553b972
```r
read_dag <- function(where = ".", Application = NULL, dfl = NULL) {
  dag.csv <- paste0(where, "/dag.csv.gz")
  if (file.exists(dag.csv)) {
    starvz_log(paste("Reading ", dag.csv))
    dfdag <- starvz_suppressWarnings(read_csv(dag.csv,
      trim_ws = TRUE,
      progress = FALSE,
      col_types = cols(
        Node = col_integer(),
        DependsOn = col_integer()
      )
    ))
  } else {
    starvz_warn(paste("File", dag.csv, "do not exist"))
    return(NULL)
  }
  # Read the DAG in the CSV format, do some clean-ups
  dfdag <- dfdag %>%
    # Put in the right order
    select("JobId", "Dependent") %>%
    # Communication task ids have too much information, clean-up both columns (JobId, Dependent)
    mutate(JobId = gsub("mpi_.*_", "mpicom_", .data$JobId)) %>%
    mutate(Dependent = gsub("mpi_.*_", "mpicom_", .data$Dependent))
  # Check Application existence
  stopifnot(!is.null(Application))
  starvz_log("Merge state data with the DAG")
  # Do the two merges (states and links)
  dfdags <- dfdag %>%
    # Get only non-MPI tasks JobIds
    filter(!grepl("mpicom", .data$JobId)) %>%
    # Merge task information from the trace
    full_join(Application, by = "JobId")
  # Check dfl existence
  if (!is.null(dfl)) {
    starvz_log("Get MPI tasks (links) to enrich the DAG")
    dfdagl <- dfdag %>%
      # Get only MPI tasks JobIds
      filter(grepl("mpicom", .data$JobId)) %>%
      # Merge MPI communicaton task information from the trace (links: dfl)
      full_join(dfl, by = c("JobId" = "Key")) %>%
      # Align columns with state-based tasks
      # 1. Remove columns
      select(-"Container", -"Origin") %>%
      # 2. Dest becomes ResourceId for these MPI tasks
      rename(ResourceId = "Dest") %>%
      mutate(ResourceId = as.factor(.data$ResourceId)) %>%
      separate_res() %>%
      tibble() %>%
      mutate(Resource = as.factor(.data$Resource)) %>%
      mutate(Node = as.factor(.data$Node)) %>%
      mutate(ResourceType = as.factor(gsub("[[:digit:]]+", "", .data$Resource)))
    dfdag <- dfdags %>% bind_rows(dfdagl)
  } else {
    dfdag <- dfdags
  }
  # Finally, bind everything together, calculate cost to CPB
  dfdag <- dfdag %>%
    mutate(Dependent = as.factor(.data$Dependent)) %>%
    # Calculate the cost as the inverse of the duration (so boost's CPB code can work)
    mutate(Cost = ifelse(is.na(.data$Duration), 0, -.data$Duration)) %>%
    # Force the result as tibble for performance reasons
    select("JobId", "Dependent", "Start", "End", "Cost", "Value") %>%
    as_tibble()
}
```
The dfl tibble has the following format:
```
# A tibble: 1 × 13
  Container Type  Start   End     Duration  Size Origin Dest  Key   Tag   MPIType
  <fct>     <fct> <dbl>   <dbl>      <dbl> <int> <fct>  <fct> <fct> <fct> <fct>
1 program   Intr… -0.387  -0.0227    0.364 12800 MEMMA… MEMM… com_1 5602… x
# ℹ 2 more variables: Priority <int>, Handle <fct>
```

The levels of the Dest column are:
```
 [1] "CPU0"        "CPU1"        "CPU10"       "CPU11"       "CPU12"
 [6] "CPU13"       "CPU14"       "CPU15"       "CPU2"        "CPU3"
[11] "CPU4"        "CPU5"        "CPU6"        "CPU7"        "CPU8"
[16] "CPU9"        "CUDA0_0"     "CUDA1_0"     "MEMMANAGER0" "MEMMANAGER1"
[21] "MEMMANAGER2"
```

read_dag calls the C++ function separate_res, which tries to split the CUDA0_0 level into CUDA0 and 0, and then to convert CUDA0 to an integer, which is where the stoi error comes from.
I suppose in a real MPI app, we would have values like MEMMANAGER0_0, MEMMANAGER0_1, etc.
Lines 19 to 33 in 553b972
```cpp
for(int i=0; i<lvls; i++){
  std::string x = Rcpp::as<std::string>(res_levels[i]);
  size_t pos = x.find("_");
  // Single Node
  if(pos == std::string::npos){
    lvl_Node[i] = 0;
    lvl_Reso[i] = x;
  }else{ //Multi-Node
    lvl_Node[i] = std::stoi(x.substr(0, pos));
    if(pos+1 == x.size()){
      lvl_Reso[i] = NA_STRING;
    }else{
      lvl_Reso[i] = x.substr(pos+1, x.size());
    }
  }
```
I don't know why this FXT file has the CPU and CUDA levels. I executed the same code with different parameters and got a different dfl format:
```
# A tibble: 1 × 13
  Container Type    Start  End   Duration   Size Origin Dest  Key   Tag   MPIType
  <fct>     <fct>   <dbl>  <dbl>    <dbl>  <int> <fct>  <fct> <fct> <fct> <fct>
1 program   Intra-…  145.   145.    0.124 829440 MEMMA… MEMM… com_… 5566… x
# ℹ 2 more variables: Priority <int>, Handle <fct>
```

and the levels for the Dest column are:

```
[1] "MEMMANAGER0" "MEMMANAGER1"
```

Here is the FXT trace that works fine: prof_file_without_spaces.zip.
There is a small difference in the code, which shouldn't have any impact.
The "bugged" FXT trace defines the following variables:

```c
static int height_p = 400;
static int width_p = 640;
```

while the "right" FXT trace defines them as:

```c
static int height_p = 4320;
static int width_p = 7680;
```