Distill/fuzzymatch.Rmd at main · AbidAliShaikh/Distill · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
title: "Fuzzy Matching Using R"
author: "Abid Ali Shaikh"
date: "04/14/2023"
output:
  html_document:
    code_folding: show
    theme:
      bg: "#202123"
      fg: "#B8BCC2"
      primary: "#EA80FC"
      base_font:
        google: Prompt
      heading_font:
        google: Proza Libre
      version: 3
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r eval=FALSE, include=FALSE}
df1 <- readxl::read_excel("C:/Users/Admin/OneDrive/PROGRAMMING/RSpace/2022-23/data/FacultytoCNIC.xlsx")
df2 <- readxl::read_excel("C:/Users/Admin/OneDrive/PROGRAMMING/RSpace/2022-23/data/FacultytoSubID.xlsx",sheet = 2)

exclude <- c("Department", "Institute", "Dr.", "Mr.", "Ms."," of ","\\.","&","()")
levDist=0.3

names(df1)<- c("Key","toMatch")
names(df2)<- c("Key","toMatch")


```

#### provide below two data frames with columns "Key" and "toMatch" {.Description style="color:cyan;width:100%;"}

[*the 'toMatch' columns in df1 will be looked up in 'toMatch' df2 approximate values and finally a csv will be generated having all matched of both data according to levDist*]{.smallcaps}

```{r fuzzy matching, eval=FALSE, include=FALSE}

 library(stringr)
# Create a function to remove exclusions and spaces
clean_string <- function(x) {
  str_replace_all(x, paste0("(", paste(exclude, collapse = "|"), ")"), "")
}

signature=function(x){
                   sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
                                     return(sig)
                     }

fuzy <- function(df1,df2,exclude,levDeist){


  df1$toMatch<- clean_string(df1$toMatch)
  df2$toMatch<- clean_string(df2$toMatch)
  x=df1$toMatch

  y=df2$toMatch
  xx=data.frame(sig=sapply(x, signature),row.names=NULL)
  yy=data.frame(sig=sapply(y, signature),row.names=NULL)
#Add the original words to the data frame too...
  xx$raw=x;xx$Key=df1$Key
  yy$raw=y;yy$Key=df2$Key
  #We only want words that have a signature...
  xx=subset(xx,subset=(sig!=''))
  xx$partials= as.character(sapply(xx$sig, agrep, yy$sig,max.distance = levDist,value=T))
  #Bring the original text into the partial match list based on the sig key.
  xx=merge(xx,yy,by.x='partials',by.y='sig')
    #write.csv(xx,'../final.csv')

  }

```

| Name | Designation                    |
|------|--------------------------------|
| Abid | Senior Data Processing Officer |

Fuzzy matching is another technique that allows you to match strings that are similar but not identical. It can be useful in various applications, including data cleaning, data integration, and text analysis. When writing a script, fuzzy matching can be used to:

1.  Clean and standardize data: Fuzzy matching can help you identify and correct misspellings, inconsistencies, and other errors in your data. For example, you can use fuzzy matching to match names of people or organizations that are spelled differently in different datasets.

2.  Merge data from multiple sources: Fuzzy matching can help you merge data from different sources that have similar but not identical records. For example, you can use fuzzy matching to match records of customers who have the same name but different addresses or phone numbers.

3.  Find similar records: Fuzzy matching can help you identify records that are similar to a given record. For example, you can use fuzzy matching to find records that are similar to a given product or customer based on their names, descriptions, or other attributes.

To use fuzzy matching in your script, you can use packages such as **`stringdist`**, **`fuzzyjoin`**, or **`stringr`** in R. These packages provide functions that can be used to calculate string distances, match strings based on their similarity, and extract information from strings. Depending on the specific task you are trying to accomplish, you can choose the appropriate package and function to use.

[^1]Blog-Posts

[^1]: My other blog Posts

    [Forget ChatGPT instead... -- Distill (wordpress.com)](https://distillshiny.wordpress.com/2023/04/21/forget-chatgpt-instead/)