
Stoi during MPI tasks #22

@pixelHat

Description

This FXT trace, prof_file_1.zip, generates the following error:

```
StarVZ Phase 1 - Start of /home/hall/Projects/CMP30(X)/mandelbrot/fxt2/20
Sat 19 Oct 2024 23:03:18 -03
~/Projects/CMP30(X)/mandelbrot/fxt2/20 ~/R/x86_64-pc-linux-gnu-library/4.4/starvz/tools
Convert from FXT to paje.sorted.trace
Execute stapu_fxt_tool
Sat 19 Oct 2024 23:03:18 -03
Sort paje.trace
Sat 19 Oct 2024 23:03:18 -03
Execute pmtool
Sat 19 Oct 2024 23:03:18 -03
Lionel's pmtool or platform_file.rec file are not available, skipping it.
Convert Rec files
Sat 19 Oct 2024 23:03:18 -03
Convert from paje.sorted.trace to paje.csv
Sat 19 Oct 2024 23:03:18 -03
Wait
Sat 19 Oct 2024 23:03:18 -03
Get states, links and variables in CSV
Sat 19 Oct 2024 23:03:18 -03
Convert (DAG) DOT to CSV
Sat 19 Oct 2024 23:03:18 -03
Convert (ATREE) DOT to CSV
Sat 19 Oct 2024 23:03:18 -03
Post-processing CSV files
Sat 19 Oct 2024 23:03:18 -03
23:03:19 Reading  ./entities.csv
23:03:19 Starting the tree filtering to create Y coordinates
23:03:19 Starting y_coordinates
23:03:19 Reading ./paje.worker_state.csv.gz
23:03:19 Selecting application states based on runtime states.
23:03:19 This is a single-node trace...
23:03:19 Outlier detection using standard model
23:03:20 Reading  ./paje.variable.csv.gz
23:03:20 This is a single-node trace...
23:03:20 Saving as parquet
23:03:20 ./variable.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Files ./pmtool.feather or ./pmtool.csv do not exist.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./rec.data_handles.csv.gz
23:03:20 Saving as parquet
23:03:20 ./data_handles.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 File ./rec.papi.csv.gz do not exist.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.memory_state.csv.gz
23:03:20 Saving as parquet
23:03:20 ./memory_state.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.comm_state.csv.gz
23:03:20 After reading Comm States, number of rows is zero.
23:03:20 Saving as parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.other_state.csv.gz
23:03:20 Saving as parquet
23:03:20 ./other_state.parquet
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:20 Reading  ./paje.events.csv.gz
23:03:20 Saving as parquet
23:03:20 ./events.parquet
23:03:20 ./events_data.parquet
23:03:20 ./events_memory.parquet
23:03:20 Data for ./events_memory.parquet has not been feathered because is empty.
23:03:20 ./origin.parquet
23:03:20 Data for ./origin.parquet has not been feathered because is empty.
23:03:21 Reading  ./rec.tasks.csv.gz
23:03:21 Saving as parquet
23:03:21 ./tasks.parquet
23:03:21 ./task_handles.parquet
23:03:21 ./origin.parquet
23:03:21 Data for ./origin.parquet has not been feathered because is empty.
23:03:21 Reading  ./paje.link.csv.gz
23:03:21 File ./atree.csv do not exist.
23:03:21 Reading  ./dag.csv.gz
23:03:21 Merge state data with the DAG
23:03:21 Get MPI tasks (links) to enrich the DAG
Error: stoi
In addition: Warning messages:
1: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
2: The following named parsers don't match the column names: Control, Model
3: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
4: The following named parsers don't match the column names: Node, DependsOn
Execution halted
Error when executing phase1-workflow.R (exit status: 1)
Phase 1 Failed (exit status: 2) stopping
```

The issue arises in the read_dag function: dfl has an MPIType column, and the levels of its Dest column contain CPU and CUDA resources.

starvz/R/phase1_parse_csv.R

Lines 970 to 1040 in 553b972

```r
read_dag <- function(where = ".", Application = NULL, dfl = NULL) {
  dag.csv <- paste0(where, "/dag.csv.gz")
  if (file.exists(dag.csv)) {
    starvz_log(paste("Reading ", dag.csv))
    dfdag <- starvz_suppressWarnings(read_csv(dag.csv,
      trim_ws = TRUE,
      progress = FALSE,
      col_types = cols(
        Node = col_integer(),
        DependsOn = col_integer()
      )
    ))
  } else {
    starvz_warn(paste("File", dag.csv, "do not exist"))
    return(NULL)
  }
  # Read the DAG in the CSV format, do some clean-ups
  dfdag <- dfdag %>%
    # Put in the right order
    select("JobId", "Dependent") %>%
    # Communication task ids have too much information, clean-up both columns (JobId, Dependent)
    mutate(JobId = gsub("mpi_.*_", "mpicom_", .data$JobId)) %>%
    mutate(Dependent = gsub("mpi_.*_", "mpicom_", .data$Dependent))
  # Check Application existence
  stopifnot(!is.null(Application))
  starvz_log("Merge state data with the DAG")
  # Do the two merges (states and links)
  dfdags <- dfdag %>%
    # Get only non-MPI tasks JobIds
    filter(!grepl("mpicom", .data$JobId)) %>%
    # Merge task information from the trace
    full_join(Application, by = "JobId")
  # Check dfl existence
  if (!is.null(dfl)) {
    starvz_log("Get MPI tasks (links) to enrich the DAG")
    dfdagl <- dfdag %>%
      # Get only MPI tasks JobIds
      filter(grepl("mpicom", .data$JobId)) %>%
      # Merge MPI communicaton task information from the trace (links: dfl)
      full_join(dfl, by = c("JobId" = "Key")) %>%
      # Align columns with state-based tasks
      # 1. Remove columns
      select(-"Container", -"Origin") %>%
      # 2. Dest becomes ResourceId for these MPI tasks
      rename(ResourceId = "Dest") %>%
      mutate(ResourceId = as.factor(.data$ResourceId)) %>%
      separate_res() %>%
      tibble() %>%
      mutate(Resource = as.factor(.data$Resource)) %>%
      mutate(Node = as.factor(.data$Node)) %>%
      mutate(ResourceType = as.factor(gsub("[[:digit:]]+", "", .data$Resource)))
    dfdag <- dfdags %>% bind_rows(dfdagl)
  } else {
    dfdag <- dfdags
  }
  # Finally, bind everything together, calculate cost to CPB
  dfdag <- dfdag %>%
    mutate(Dependent = as.factor(.data$Dependent)) %>%
    # Calculate the cost as the inverse of the duration (so boost's CPB code can work)
    mutate(Cost = ifelse(is.na(.data$Duration), 0, -.data$Duration)) %>%
    # Force the result as tibble for performance reasons
    select("JobId", "Dependent", "Start", "End", "Cost", "Value") %>%
    as_tibble()
}
```

The dfl has the following format:

```
# A tibble: 1 × 13
  Container Type   Start     End Duration  Size Origin Dest  Key   Tag   MPIType
  <fct>     <fct>  <dbl>   <dbl>    <dbl> <int> <fct>  <fct> <fct> <fct> <fct>
1 program   Intr-0.387 -0.0227    0.364 12800 MEMMAMEMMcom_1 5602x
# ℹ 2 more variables: Priority <int>, Handle <fct>
```

The levels for the Dest column are:

```
 [1] "CPU0"        "CPU1"        "CPU10"       "CPU11"       "CPU12"
 [6] "CPU13"       "CPU14"       "CPU15"       "CPU2"        "CPU3"
[11] "CPU4"        "CPU5"        "CPU6"        "CPU7"        "CPU8"
[16] "CPU9"        "CUDA0_0"     "CUDA1_0"     "MEMMANAGER0" "MEMMANAGER1"
[21] "MEMMANAGER2"
```

read_dag calls the C++ function separate_res, which tries to split CUDA0_0 into CUDA0 and 0, and then to convert CUDA0 to an integer.

I suppose that in a real MPI app we would have values like MEMMANAGER0_0, MEMMANAGER0_1, etc.

```cpp
for(int i=0; i<lvls; i++){
  std::string x = Rcpp::as<std::string>(res_levels[i]);
  size_t pos = x.find("_");
  // Single Node
  if(pos == std::string::npos){
    lvl_Node[i] = 0;
    lvl_Reso[i] = x;
  }else{ //Multi-Node
    lvl_Node[i] = std::stoi(x.substr(0, pos));
    if(pos+1 == x.size()){
      lvl_Reso[i] = NA_STRING;
    }else{
      lvl_Reso[i] = x.substr(pos+1, x.size());
    }
  }
}
```

I don't know why this fxt file has CPU and CUDA entries in Dest. I executed the same code with different factors and got a different dfl format:

```
# A tibble: 1 × 13
  Container Type    Start   End Duration   Size Origin Dest  Key   Tag   MPIType
  <fct>     <fct>   <dbl> <dbl>    <dbl>  <int> <fct>  <fct> <fct> <fct> <fct>
1 program   Intra-145.  145.    0.124 829440 MEMMAMEMMcom_5566x
# ℹ 2 more variables: Priority <int>, Handle <fct>
```

and the levels for Dest are:

```
[1] "MEMMANAGER0" "MEMMANAGER1"
```

This fxt works fine: prof_file_without_spaces.zip.

There is a small difference in the code between the two runs, which shouldn't have any impact.

The "bugged fxt" defines the following variables

static int height_p = 400;
static int width_p = 640;

While the "right fxt" define them as:

static int height_p = 4320;
static int width_p = 7680;
