Getting inconsistent results with Bert #3

@scovail

I'm getting inconsistent results when generating sentence vectors with BERT, which makes the cosine distances come out wrong. When I make two separate calls to vectorizer.bert, passing a list each time, and then calculate the cosine distance for each matching pair, identical sentences are not identified as identical (row 1 of each pair in the output). However, when I pass the two identical strings together as a single list and compare the resulting vectors, the distance is correct (row 2). In the example below, the strings in pairs 0, 1, 3, 4, 5 and 6 are identical and should have a cosine distance of 0.

In [9]: import pandas as pd
   ...: from scipy.spatial.distance import cosine
   ...: from sent2vec.vectorizer import Vectorizer
   ...:
   ...: vectorizer = Vectorizer()
   ...:
   ...: # Two-column CSV with no header row
   ...: df = pd.read_csv('temp2.csv', names=['String A', 'String B'])
   ...: newa = []
   ...: newb = []
   ...:
   ...: for index, row in df.iterrows():
   ...:     newa.append(row['String A'])
   ...:     newb.append(row['String B'])
   ...:
   ...: x = newa[50:60]
   ...: y = newb[50:60]
   ...:
   ...: # Separate calls: each column is vectorized as its own list
   ...: vectorizer.bert(x)
   ...: xvex = vectorizer.vectors
   ...: vectorizer.bert(y)
   ...: yvex = vectorizer.vectors
   ...:
   ...: for i in range(0, 10):
   ...:     # Row 1: distance between vectors from the two separate calls
   ...:     print('\n({:d}) {:45s} {:45s} {:f}'.format(i, x[i], y[i], cosine(xvex[i], yvex[i])))
   ...:     # Row 2: distance when both strings are vectorized in a single call
   ...:     vectorizer.bert([x[i], y[i]])
   ...:     print('({:d}) {:45s} {:45s} {:f}'.format(i, x[i], y[i], cosine(vectorizer.vectors[0], vectorizer.vectors[1])))
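
For reference, the expectation behind this check can be sketched without sent2vec at all. The snippet below is only an illustration: `cosine_distance` is a hypothetical helper that follows the usual definition of cosine distance for unweighted vectors, and the sample vectors are made up. It shows that two identical vectors, or two vectors pointing in the same direction, give a distance of (approximately) 0, which is why identical sentences are expected to score 0 if they are mapped to the same vector.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Made-up 3-dimensional vectors, for illustration only
# (real sentence vectors have far more dimensions).
a = np.array([0.2, -1.3, 0.7])

print(cosine_distance(a, a))            # identical vectors: approximately 0
print(cosine_distance(a, 2 * a))        # same direction, different length: approximately 0
print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors: approximately 1
```

Any nonzero distance between two identical strings therefore means the two strings were given different vectors.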

(0) String A: 401k retirement String B: 401k retirement
(1) String A: 401k retirement accounts String B: 401k retirement accounts
(2) String A: 401k retirement funds String B: 401k retirement plans
(3) String A: 401k retirement investing String B: 401k retirement investing
(4) String A: 401k retirement plan String B: 401k retirement plan
(5) String A: 401k retirement plans String B: 401k retirement plans
(6) String A: 401k retirement savings String B: 401k retirement savings
(7) String A: 401k retirement savings plan String B: 401k plan retirement
(8) String A: 401k retirement savings plan String B: 401k retirement plans
(9) String A: 401k retirement services String B: 401k plan retirement

(0) 401k retirement 401k retirement 0.002897
(0) 401k retirement 401k retirement 0.000000

(1) 401k retirement accounts 401k retirement accounts 0.006706
(1) 401k retirement accounts 401k retirement accounts 0.000000

(2) 401k retirement funds 401k retirement plans 0.012481
(2) 401k retirement funds 401k retirement plans 0.013344

(3) 401k retirement investing 401k retirement investing 0.004481
(3) 401k retirement investing 401k retirement investing 0.000000

(4) 401k retirement plan 401k retirement plan 0.006325
(4) 401k retirement plan 401k retirement plan 0.000000

(5) 401k retirement plans 401k retirement plans 0.005616
(5) 401k retirement plans 401k retirement plans 0.000000

(6) 401k retirement savings 401k retirement savings 0.006093
(6) 401k retirement savings 401k retirement savings 0.000000

(7) 401k retirement savings plan 401k plan retirement 0.013586
(7) 401k retirement savings plan 401k plan retirement 0.023076

(8) 401k retirement savings plan 401k retirement plans 0.008529
(8) 401k retirement savings plan 401k retirement plans 0.017313

(9) 401k retirement services 401k plan retirement 0.017170
(9) 401k retirement services 401k plan retirement 0.014167
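
The affected rows can also be picked out mechanically. The following is a minimal sketch, not part of sent2vec: `inconsistent_rows` and its `tol` threshold are hypothetical names chosen for illustration, and the strings and distances are copied from the output above.

```python
# Strings and distances copied from the output above.
x = [
    "401k retirement", "401k retirement accounts", "401k retirement funds",
    "401k retirement investing", "401k retirement plan", "401k retirement plans",
    "401k retirement savings", "401k retirement savings plan",
    "401k retirement savings plan", "401k retirement services",
]
y = [
    "401k retirement", "401k retirement accounts", "401k retirement plans",
    "401k retirement investing", "401k retirement plan", "401k retirement plans",
    "401k retirement savings", "401k plan retirement",
    "401k retirement plans", "401k plan retirement",
]
# Row 1 of each pair: vectors produced by two separate calls.
separate = [0.002897, 0.006706, 0.012481, 0.004481, 0.006325,
            0.005616, 0.006093, 0.013586, 0.008529, 0.017170]
# Row 2 of each pair: both strings vectorized in a single call.
combined = [0.000000, 0.000000, 0.013344, 0.000000, 0.000000,
            0.000000, 0.000000, 0.023076, 0.017313, 0.014167]


def inconsistent_rows(a, b, distances, tol=1e-6):
    """Return indices where the strings match exactly but the distance is not ~0."""
    return [i for i, (s, t, d) in enumerate(zip(a, b, distances))
            if s == t and abs(d) > tol]


print(inconsistent_rows(x, y, separate))  # [0, 1, 3, 4, 5, 6]
print(inconsistent_rows(x, y, combined))  # []
```

With the separate calls, every pair of identical strings is flagged; with the single call, none are.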
