
Stoi during MPI tasks #22

@pixelHat

Description

This FXT trace, prof_file_1.zip, generates the following error:

```
StarVZ Phase 1 - Start of /home/hall/Projects/CMP30(X)/mandelbrot/fxt2/20
Sat 19 Oct 2024 23:03:18 -03
~/Projects/CMP30(X)/mandelbrot/fxt2/20 ~/R/x86_64-pc-linux-gnu-library/4.4/starvz/tools
Convert from FXT to paje.sorted.trace
Execute stapu_fxt_tool
Sat 19 Oct 2024 23:03:18 -03
Sort paje.trace
Sat 19 Oct 2024 23:03:18 -03
Execute pmtool
Sat 19 Oct 2024 23:03:18 -03
Lionel's pmtool or platform_file.rec file are not available, skipping it.
Convert Rec files
Sat 19 Oct 2024 23:03:18 -03
Convert from paje.sorted.trace to paje.csv
Sat 19 Oct 2024 23:03:18 -03
Wait
Sat 19 Oct 2024 23:03:18 -03
Get states, links and variables in CSV
Sat 19 Oct 2024 23:03:18 -03
Convert (DAG) DOT to CSV
Sat 19 Oct 2024 23:03:18 -03
Convert (ATREE) DOT to CSV
Sat 19 Oct 2024 23:03:18 -03
Post-processing CSV files
Sat 19 Oct 2024 23:03:18 -03
23:03:19 Reading  ./entities.csv
23:03:19 Starting the tree filtering to create Y coordinates
23:03:19 Starting y_coordinates
23:03:19 Reading ./paje.worker_state.csv.gz
23:03:19 Selecting application states based on runtime states.
23:03:19 This is a single-node trace...
23:03:19 Outlier detection using standard model
23:03:20 Reading  ./paje.variable.csv.gz
23:03:20 This is a single-node trace...
23:03:20 Saving as parquet
23:03:20 ./variable.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Files ./pmtool.feather or ./pmtool.csv do not exist.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./rec.data_handles.csv.gz
23:03:20 Saving as parquet
23:03:20 ./data_handles.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 File ./rec.papi.csv.gz do not exist.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.memory_state.csv.gz
23:03:20 Saving as parquet
23:03:20 ./memory_state.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.comm_state.csv.gz
23:03:20 After reading Comm States, number of rows is zero.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.other_state.csv.gz
23:03:20 Saving as parquet
23:03:20 ./other_state.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.events.csv.gz
23:03:20 Saving as parquet
23:03:20 ./events.parquet
23:03:20 ./events_data.parquet
23:03:20 ./events_memory.parquet
23:03:20 Data for ./events_memory.parquet has not been feathered because is empty.
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:21 Reading  ./rec.tasks.csv.gz
23:03:21 Saving as parquet
23:03:21 ./tasks.parquet
23:03:21 ./task_handles.parquet
23:03:21 ./origin.parquet
23:03:21 Data for ./origin.parquet has not been feathered because is empty.
23:03:21 Reading  ./paje.link.csv.gz
23:03:21 File ./atree.csv do not exist.
23:03:21 Reading  ./dag.csv.gz
23:03:21 Merge state data with the DAG
23:03:21 Get MPI tasks (links) to enrich the DAG
Error: stoi
In addition: Warning messages:
1: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
2: The following named parsers don't match the column names: Control, Model
3: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
4: The following named parsers don't match the column names: Node, DependsOn
Execution halted
Error when executing phase1-workflow.R (exit status: 1)
Phase 1 Failed (exit status: 2) stopping
```

The issue arises in the read_dag function: dfl has an MPIType column, and the levels of its Dest column contain CPU and CUDA resources.

starvz/R/phase1_parse_csv.R

Lines 970 to 1040 in 553b972

```r
read_dag <- function(where = ".", Application = NULL, dfl = NULL) {
  dag.csv <- paste0(where, "/dag.csv.gz")
  if (file.exists(dag.csv)) {
    starvz_log(paste("Reading ", dag.csv))
    dfdag <- starvz_suppressWarnings(read_csv(dag.csv,
      trim_ws = TRUE,
      progress = FALSE,
      col_types = cols(
        Node = col_integer(),
        DependsOn = col_integer()
      )
    ))
  } else {
    starvz_warn(paste("File", dag.csv, "do not exist"))
    return(NULL)
  }
  # Read the DAG in the CSV format, do some clean-ups
  dfdag <- dfdag %>%
    # Put in the right order
    select("JobId", "Dependent") %>%
    # Communication task ids have too much information, clean-up both columns (JobId, Dependent)
    mutate(JobId = gsub("mpi_.*_", "mpicom_", .data$JobId)) %>%
    mutate(Dependent = gsub("mpi_.*_", "mpicom_", .data$Dependent))
  # Check Application existence
  stopifnot(!is.null(Application))
  starvz_log("Merge state data with the DAG")
  # Do the two merges (states and links)
  dfdags <- dfdag %>%
    # Get only non-MPI tasks JobIds
    filter(!grepl("mpicom", .data$JobId)) %>%
    # Merge task information from the trace
    full_join(Application, by = "JobId")
  # Check dfl existence
  if (!is.null(dfl)) {
    starvz_log("Get MPI tasks (links) to enrich the DAG")
    dfdagl <- dfdag %>%
      # Get only MPI tasks JobIds
      filter(grepl("mpicom", .data$JobId)) %>%
      # Merge MPI communicaton task information from the trace (links: dfl)
      full_join(dfl, by = c("JobId" = "Key")) %>%
      # Align columns with state-based tasks
      # 1. Remove columns
      select(-"Container", -"Origin") %>%
      # 2. Dest becomes ResourceId for these MPI tasks
      rename(ResourceId = "Dest") %>%
      mutate(ResourceId = as.factor(.data$ResourceId)) %>%
      separate_res() %>%
      tibble() %>%
      mutate(Resource = as.factor(.data$Resource)) %>%
      mutate(Node = as.factor(.data$Node)) %>%
      mutate(ResourceType = as.factor(gsub("[[:digit:]]+", "", .data$Resource)))
    dfdag <- dfdags %>% bind_rows(dfdagl)
  } else {
    dfdag <- dfdags
  }
  # Finally, bind everything together, calculate cost to CPB
  dfdag <- dfdag %>%
    mutate(Dependent = as.factor(.data$Dependent)) %>%
    # Calculate the cost as the inverse of the duration (so boost's CPB code can work)
    mutate(Cost = ifelse(is.na(.data$Duration), 0, -.data$Duration)) %>%
    # Force the result as tibble for performance reasons
    select("JobId", "Dependent", "Start", "End", "Cost", "Value") %>%
    as_tibble()
}
```

The dfl has the following format:

```
# A tibble: 1 × 13
  Container Type   Start     End Duration  Size Origin Dest  Key   Tag   MPIType
  <fct>     <fct>  <dbl>   <dbl>    <dbl> <int> <fct>  <fct> <fct> <fct> <fct>
1 program   Intr-0.387 -0.0227    0.364 12800 MEMMAMEMMcom_1 5602x
# ℹ 2 more variables: Priority <int>, Handle <fct>
```

The levels for the Dest column are:

```
 [1] "CPU0"        "CPU1"        "CPU10"       "CPU11"       "CPU12"
 [6] "CPU13"       "CPU14"       "CPU15"       "CPU2"        "CPU3"
[11] "CPU4"        "CPU5"        "CPU6"        "CPU7"        "CPU8"
[16] "CPU9"        "CUDA0_0"     "CUDA1_0"     "MEMMANAGER0" "MEMMANAGER1"
[21] "MEMMANAGER2"
```

read_dag calls the C++ function separate_res, which tries to split CUDA0_0 into CUDA0 and 0, and then to convert CUDA0 to an integer.

I suppose that in a real MPI app we would have values like MEMMANAGER0_0, MEMMANAGER0_1, etc.

```cpp
for(int i=0; i<lvls; i++){
  std::string x = Rcpp::as<std::string>(res_levels[i]);
  size_t pos = x.find("_");
  // Single Node
  if(pos == std::string::npos){
    lvl_Node[i] = 0;
    lvl_Reso[i] = x;
  }else{ //Multi-Node
    lvl_Node[i] = std::stoi(x.substr(0, pos));
    if(pos+1 == x.size()){
      lvl_Reso[i] = NA_STRING;
    }else{
      lvl_Reso[i] = x.substr(pos+1, x.size());
    }
  }
}
```

I don't know why this fxt file has CPU and CUDA entries in Dest. I executed the same code with different factors and got a different dfl format:

```
# A tibble: 1 × 13
  Container Type    Start   End Duration   Size Origin Dest  Key   Tag   MPIType
  <fct>     <fct>   <dbl> <dbl>    <dbl>  <int> <fct>  <fct> <fct> <fct> <fct>
1 program   Intra-145.  145.    0.124 829440 MEMMAMEMMcom_5566x
# ℹ 2 more variables: Priority <int>, Handle <fct>
```

and the levels for Dest are:

```
[1] "MEMMANAGER0" "MEMMANAGER1"
```

This fxt works fine: prof_file_without_spaces.zip.

There is a small difference in the code between the two runs, which shouldn't have any impact.

The "bugged fxt" defines the following variables

static int height_p = 400;
static int width_p = 640;

While the "right fxt" define them as:

static int height_p = 4320;
static int width_p = 7680;
