tf_idf is computed on unnecessary duplicates

Hi, 

In dynamic network analysis with overlapping time windows, the full list of nodes is longer than the original number of observations, since one observation can "travel" between different merged clusters across time windows. This is expected behavior of the functions provided by the package.

However, an observation can also be duplicated inside a single merged cluster, namely when it stays in the same cluster over several time windows instead of traveling. `extract_tfidf()` does not remove those duplicates when it runs: the list of tokens is built on a corpus that may contain the same document x times (x being the number of times the node appears in the cluster). At the very least, this behavior should be stated in the function documentation to ensure replicability.
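To illustrate the mechanism in isolation, here is a minimal sketch on made-up data. The `toy_nodes` table, its column names, and the `count_terms()` helper are all hypothetical, and it uses `tidytext::bind_tf_idf()` directly rather than the package's own function, so it only shows how repeated rows shift term frequencies, not what `extract_tfidf()` does internally.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one row per appearance of a node in a cluster.
# Node "a" stays in "cl_1" over three overlapping windows, so its title
# is repeated three times; nodes "b" and "c" appear once.
toy_nodes <- tibble::tribble(
  ~dynamic_cluster, ~node_id, ~Title,
  "cl_1",           "a",      "oil shock",
  "cl_1",           "a",      "oil shock",
  "cl_1",           "a",      "oil shock",
  "cl_1",           "b",      "wage inflation",
  "cl_2",           "c",      "wage bargaining"
)

# Hypothetical helper: tokenize titles, count terms per cluster,
# then compute tf-idf with clusters as the "documents".
count_terms <- function(nodes) {
  nodes %>%
    unnest_tokens(output = token, input = Title) %>%
    count(dynamic_cluster, token, name = "n") %>%
    bind_tf_idf(token, dynamic_cluster, n)
}

with_duplicates    <- count_terms(toy_nodes)
without_duplicates <- count_terms(distinct(toy_nodes))

# Compare the term frequency of "oil" and "inflation" in cl_1
with_duplicates %>% filter(dynamic_cluster == "cl_1")
without_duplicates %>% filter(dynamic_cluster == "cl_1")
```

In this toy case the repeated node pushes the term frequency of "oil" in `cl_1` from 1/4 to 3/8, and pulls "inflation" down from 1/4 to 1/8, even though each title belongs to exactly one distinct node.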

I also wonder whether this biases the representation of a community by overweighting non-traveling nodes relative to traveling ones. One could argue that this is a desirable feature, but the tf-idf score, and the idf in particular, already gives more weight to words used in nodes that tend to travel less.
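To make the argument concrete, here is the textbook definition as implemented by `tidytext::bind_tf_idf()`, with merged clusters playing the role of documents. Whether `extract_tfidf()` follows exactly this definition is an assumption on my part, and the symbols $m_{i,c}$ and $f_{t,i}$ are notation introduced only for this note.

$$
\mathrm{tf}(t, c) = \frac{n_{t,c}}{\sum_{t'} n_{t',c}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\left|\{c : n_{t,c} > 0\}\right|}, \qquad
n_{t,c} = \sum_{i} m_{i,c} \, f_{t,i}
$$

Here $N$ is the number of clusters, $f_{t,i}$ is the number of times term $t$ occurs in the text of node $i$, and $m_{i,c}$ is the number of times node $i$ appears in cluster $c$. Removing duplicates amounts to setting every nonzero $m_{i,c}$ to 1. Under this definition the duplicates change only the tf term, since the idf depends on whether a term is present in a cluster and not on how often.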

Here is a reproducible example, although the difference between the two computations is not very salient with the data provided by the package:

```r
library(networkflow)
library(tidyverse)
library(tidygraph) # activate() comes from tidygraph; attached explicitly to be safe
library(tidytext)

# Edge list: citing document -> cited reference
edges <- networkflow::Ref_stagflation %>%
  reframe(source_id = Citing_ItemID_Ref,
          target_id = ItemID_Ref)

# Node list, restricted to documents published after 1990
nodes <- networkflow::Nodes_stagflation %>%
  rename(source_id = ItemID_Ref) %>%
  filter(Year > 1990)

# One network per 5-year window, with overlapping windows
graph <- networkflow::build_dynamic_networks(
  nodes = nodes,
  directed_edges = edges,
  source_id = "source_id",
  target_id = "target_id",
  cooccurrence_method = "coupling_angle",
  time_variable = "Year",
  time_window = 5,
  overlapping_window = TRUE,
  filter_components = TRUE,
  keep_singleton = FALSE)

# Add clusters within each window
graph_with_cluster <- networkflow::add_clusters(graph,
                                                clustering_method = "leiden",
                                                objective_function = "modularity",
                                                resolution = 1,
                                                seed = 123)

# Merge clusters across windows
graph_with_dynamic_cluster <- networkflow::merge_dynamic_clusters(graph_with_cluster,
                                                                  cluster_id = "cluster_leiden",
                                                                  node_id = "source_id",
                                                                  threshold_similarity = 0.51,
                                                                  similarity_type = "partial")

# Computation 1: tf-idf as returned by the package (duplicates kept)
tf_idf <- networkflow::extract_tfidf(graph_with_dynamic_cluster,
                                     text_columns = "Title",
                                     grouping_columns = "dynamic_cluster_leiden",
                                     n_gram = 2,
                                     # stopwords_type = "smart",
                                     clean_word_method = "none",
                                     nb_terms = 10)

# Computation 2: tf-idf by hand, keeping each node once per merged cluster
tfidf2 <-
  lapply(graph_with_dynamic_cluster, function(graph)
    graph %>%
      activate(nodes) %>%
      as_tibble()) %>%
  bind_rows() %>%
  # Drop the window-specific columns so repeated nodes become identical rows
  select(-cluster_leiden, -time_window, -size_cluster_leiden) %>%
  group_by(dynamic_cluster_leiden) %>%
  distinct() %>%
  unnest_ngrams(input = Title,
                output = token,
                n = 2,
                n_min = 1,
                to_lower = TRUE) %>%
  # Remove tokens containing a stop word or digits
  filter(!str_detect(token, paste0("\\b", stop_words$word, "\\b", collapse = "|"))) %>%
  filter(!str_detect(token, "[:digit:]{1,3}")) %>%
  ungroup() %>%
  add_count(token, dynamic_cluster_leiden, name = "n") %>%
  select(dynamic_cluster_leiden, token, n) %>%
  distinct() %>%
  bind_tf_idf(token, dynamic_cluster_leiden, n) %>%
  group_by(dynamic_cluster_leiden) %>%
  slice_max(tf_idf, n = 10)

# Compare the two results for one merged cluster
print(tf_idf %>% filter(dynamic_cluster_leiden == "cl_1"))
print(tfidf2 %>% filter(dynamic_cluster_leiden == "cl_1"))
```
