---
title: "Fuzzy Matching Using R"
author: "Abid Ali Shaikh"
date: "04/14/2023"
output:
  html_document:
    code_folding: show
    theme:
      bg: "#202123"
      fg: "#B8BCC2"
      primary: "#EA80FC"
      base_font:
        google: Prompt
      heading_font:
        google: Proza Libre
      version: 3
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
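```{r eval=FALSE, include=FALSE}
# Read the two lookup tables (the paths are specific to the author's machine).
df1 <- readxl::read_excel("C:/Users/Admin/OneDrive/PROGRAMMING/RSpace/2022-23/data/FacultytoCNIC.xlsx")
df2 <- readxl::read_excel("C:/Users/Admin/OneDrive/PROGRAMMING/RSpace/2022-23/data/FacultytoSubID.xlsx", sheet = 2)
# Regex fragments stripped from the text before matching.
# Literal dots and parentheses are escaped so they are not read as regex syntax.
exclude <- c("Department", "Institute", "Dr\\.", "Mr\\.", "Ms\\.", " of ", "\\.", "&", "\\(", "\\)")
# Maximum distance passed to agrep(), as a fraction of the pattern length.
levDist <- 0.3
names(df1) <- c("Key", "toMatch")
names(df2) <- c("Key", "toMatch")
```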
#### Provide two data frames, each with the columns "Key" and "toMatch" {.Description style="color:cyan;width:100%;"}
[*Each value in the `toMatch` column of `df1` is looked up approximately in the `toMatch` column of `df2`. The result, containing every match from both data frames within `levDist`, can then be written to a CSV file.*]{.smallcaps}
```{r fuzzy matching, eval=FALSE, include=FALSE}
library(stringr)

# Remove every excluded pattern (titles, filler words, punctuation) from x.
clean_string <- function(x, exclude) {
  str_replace_all(x, paste0("(", paste(exclude, collapse = "|"), ")"), "")
}

# Build an order-insensitive signature: lower-case the words, sort them,
# and paste them together without spaces.
signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

fuzy <- function(df1, df2, exclude, levDist) {
  df1$toMatch <- clean_string(df1$toMatch, exclude)
  df2$toMatch <- clean_string(df2$toMatch, exclude)
  xx <- data.frame(sig = vapply(df1$toMatch, signature, character(1), USE.NAMES = FALSE))
  yy <- data.frame(sig = vapply(df2$toMatch, signature, character(1), USE.NAMES = FALSE))
  # Add the original words and keys to the data frames too.
  xx$raw <- df1$toMatch; xx$Key <- df1$Key
  yy$raw <- df2$toMatch; yy$Key <- df2$Key
  # We only want words that have a signature.
  xx <- subset(xx, subset = (sig != ""))
  # For each signature in df1, keep the first approximate match in df2 (NA if none).
  xx$partials <- vapply(xx$sig, function(s) {
    agrep(s, yy$sig, max.distance = levDist, value = TRUE)[1]
  }, character(1), USE.NAMES = FALSE)
  # Bring the original text into the partial match list based on the sig key.
  xx <- merge(xx, yy, by.x = "partials", by.y = "sig")
  # write.csv(xx, "../final.csv")
  xx
}
```
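The matching steps in the chunk above can be tried on toy data. The following is a minimal sketch using base R only. The data frames `left` and `right`, the helper `make_sig()`, and the keys are hypothetical stand-ins for the two Excel sheets, and the threshold of 0.3 simply mirrors `levDist`.

```r
# Hypothetical toy inputs standing in for the two Excel sheets.
left  <- data.frame(Key = c("A1", "A2"),
                    toMatch = c("Computer Science", "Physics"))
right <- data.frame(Key = c("S9", "S7"),
                    toMatch = c("science computer", "Phisics"))

# Order-insensitive signature: lower-case, split on spaces, sort, paste.
make_sig <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

left$sig  <- vapply(left$toMatch,  make_sig, character(1), USE.NAMES = FALSE)
right$sig <- vapply(right$toMatch, make_sig, character(1), USE.NAMES = FALSE)

# First approximate match in right$sig for each left$sig (NA when none).
left$partials <- vapply(left$sig, function(s) {
  agrep(s, right$sig, max.distance = 0.3, value = TRUE)[1]
}, character(1), USE.NAMES = FALSE)

# Join the two sides on the matched signature.
matched <- merge(left, right, by.x = "partials", by.y = "sig",
                 suffixes = c(".df1", ".df2"))
print(matched[, c("Key.df1", "toMatch.df1", "Key.df2", "toMatch.df2")])
```

Because the signature sorts the words, "Computer Science" pairs with "science computer" despite the different word order, while "Physics" pairs with "Phisics" through the approximate match.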
| Name | Designation |
|------|--------------------------------|
| Abid | Senior Data Processing Officer |
Fuzzy matching is a technique for matching strings that are similar but not identical. It is useful in data cleaning, data integration, and text analysis. In a script, fuzzy matching can be used to:
1. Clean and standardize data: identify and correct misspellings and inconsistent spellings. For example, match names of people or organizations that are spelled differently in different datasets.
2. Merge data from multiple sources: link records that refer to the same entity but are not written identically. For example, match customer records whose names are spelled slightly differently in two systems, even when the addresses or phone numbers differ.
3. Find similar records: identify records that resemble a given record. For example, find products or customers whose names, descriptions, or other attributes are close to those of a given one.
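The first two uses can be illustrated with a small sketch that maps messy spellings onto a reference list by choosing the nearest entry by edit distance. It uses base R's `adist()`. The organization names and the variables `canonical`, `messy`, and `cleaned` are made up for illustration.

```r
# Hypothetical reference spellings and messy inputs.
canonical <- c("Acme Corporation", "Globex Industries", "Initech Limited")
messy     <- c("Acme Corporaton", "globex industries", "Initech Ltd")

# Pairwise generalized Levenshtein distances
# (rows = messy values, columns = canonical values).
d <- adist(messy, canonical, ignore.case = TRUE)

# Pick the nearest canonical spelling for each messy value.
nearest <- apply(d, 1, which.min)
cleaned <- data.frame(
  messy        = messy,
  standardized = canonical[nearest],
  distance     = d[cbind(seq_along(messy), nearest)]
)
print(cleaned)
```

In practice it is worth inspecting the `distance` column and rejecting matches above a threshold, since the nearest entry is returned even when nothing in the reference list is genuinely close.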
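To use fuzzy matching in your script, you can combine several tools in R. **`stringdist`** computes string distances and approximate matches, **`fuzzyjoin`** joins data frames on inexact matches, and **`stringr`** handles the cleaning steps such as pattern replacement, although it does not compute string distances itself. Base R also provides `agrep()` for approximate matching and `adist()` for edit distances. The script above uses `stringr` for cleaning and `agrep()` for the approximate match. Choose the package and function according to the task: a distance function when you need a score, an approximate-match function when you need a yes/no lookup, and a fuzzy join when you need to combine whole tables.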
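The difference between a distance score and an approximate lookup can be seen in a short sketch. To stay runnable without extra packages it uses only base R's `adist()` and `agrep()`, not the packages named above. The names and the 0.2 threshold are hypothetical.

```r
a <- "Jonathan Smith"
b <- "Jonathon Smith"

# Full-string edit distance (insertions, deletions, substitutions).
lev <- drop(adist(a, b))

# Normalized similarity in [0, 1]; 1 means the strings are identical.
similarity <- 1 - lev / max(nchar(a), nchar(b))

# agrep() performs approximate *substring* matching, so a short pattern
# can match inside a longer string. max.distance is given here as a
# fraction of the pattern length.
hits <- agrep("Smith", c("Dr. Jonathon Smyth", "Maria Garcia"),
              max.distance = 0.2, value = TRUE)

print(lev)
print(similarity)
print(hits)
```

A score such as `similarity` suits ranking candidates, while `agrep()` suits filtering a column for entries that contain something close to a search term.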
Blog posts[^1]
[^1]: My other blog posts: [Forget ChatGPT instead... -- Distill (wordpress.com)](https://distillshiny.wordpress.com/2023/04/21/forget-chatgpt-instead/)