BERT Sentiment Analysis on Amazon Product Review

This project uses the Industrial and Scientific 5-core dataset from the Amazon Review Data (2018). Each record contains a rating, the review text, and metadata such as the review time, reviewer name, and Unix review time.

The project fine-tunes a pre-trained BERT model on the review text and then predicts the sentiment of held-out test reviews. To create sentiment labels, it maps the product ratings (1 to 5) to three classes: 0 (Negative) for ratings 1 and 2, 1 (Neutral) for rating 3, and 2 (Positive) for ratings 4 and 5.

There are 4 parts in the project:

Data Exploration
Outlier Removal
Data Preprocessing
Model Fine-tuning

0. Loading Dataset

import pandas as pd

df = pd.read_json("./Industrial_and_Scientific_5.json.gz", lines=True)

# Unnecessary columns: reviewTime, style, reviewerName, unixReviewTime, image, summary, vote
df = df.drop(["reviewTime", "style", "reviewerName", "unixReviewTime", "image", "summary", "vote"], axis=1)

Load the JSON file into a DataFrame and drop the unnecessary columns.
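The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON array. As a minimal standard-library sketch of the same idea, the snippet below writes and reads a tiny gzipped JSON Lines file. The helper names (`write_json_lines`, `read_json_lines`) and the record values are hypothetical; only the field names follow the dataset.

```python
import gzip
import json
import os
import tempfile

# Hypothetical records; only the field names mirror the dataset.
records = [
    {"asin": "A1", "reviewerID": "U1", "overall": 5.0, "verified": True, "reviewText": "Works great"},
    {"asin": "A2", "reviewerID": "U2", "overall": 2.0, "verified": False, "reviewText": "Broke quickly"},
]

def write_json_lines(path, rows):
    """Write one JSON object per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

def read_json_lines(path):
    """Read a gzip-compressed JSON Lines file into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    write_json_lines(sample_path, records)
    loaded = read_json_lines(sample_path)

print(len(loaded), loaded[0]["reviewText"])
```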

1. Data Exploration

print(df.isna().sum())
df.dropna(subset=['reviewText'], inplace=True)

# Explore data stats
print('Counts:')
print(df.count(), '\n')
print('Averages:')
print(df.mean(numeric_only=True), '\n')
print('Medians:')
print(df.median(numeric_only=True), '\n')
print('Modes:')
print(df.mode(numeric_only=True).iloc[0], '\n')

Explore basic statistics of the dataset, check for missing values, and drop the rows whose review text is missing.

import matplotlib.pyplot as plt

# Count the reviews for each star rating
review_counts = df['overall'].value_counts().sort_index()
print("Number of reviews for each rating:")
print(review_counts)

plt.figure(figsize=(6, 4))
plt.bar(review_counts.index, review_counts.values, color='skyblue')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.title('Distribution of Reviews Across Ratings')
plt.show()

Visualize the distribution of reviews across ratings.

reviews_per_product = df.groupby(['asin', 'overall']).size().unstack(fill_value=0)
reviews_per_product['total_reviews'] = reviews_per_product.sum(axis=1)
print(reviews_per_product)

reviews_per_user = df.groupby(['reviewerID', 'overall']).size().unstack(fill_value=0)
reviews_per_user['total_reviews'] = reviews_per_user.sum(axis=1)
print(reviews_per_user)

Display the number of reviews per product and per user, broken down by rating.

# Review length in characters
df['review_length'] = df['reviewText'].str.len()

plt.figure(figsize=(10, 6))
plt.boxplot(df['review_length'], vert=False)
plt.title('Box Plot of Review Lengths')
plt.xlabel('Review Length')
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 6))
plt.boxplot(df['review_length'], vert=False, showfliers=False)
plt.title('Review Lengths without Outliers')
plt.xlabel('Review Length')
plt.grid(True)
plt.show()

Draw box plots of the review lengths (in characters), first with outliers and then without them, to inspect the outliers.

df_review = df["reviewText"]
duplicate_rows = df_review.duplicated()
num_duplicates = duplicate_rows.sum()
duplicate_data = df_review[duplicate_rows]
print("There are", num_duplicates, "duplicate rows in the dataset.")
print(duplicate_data)

print("Drop Duplicates")
df.drop_duplicates(subset=['reviewText'], inplace=True)
# Re-check the same column that was deduplicated, not whole rows
num_duplicates = df['reviewText'].duplicated().sum()
print("There are", num_duplicates, "duplicate reviews in the dataset.")

Check for duplicate review texts and remove them, keeping the first occurrence of each.

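# Sort by index so that False (unverified) comes before True (verified), matching the tick labels
verified_counts = df['verified'].value_counts().sort_index()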
plt.figure(figsize=(6, 4))
verified_counts.plot(kind='bar', color='skyblue')
plt.title('Verified vs Unverified Reviews')
plt.xlabel('Verification')
plt.ylabel('Number of Reviews')
plt.xticks(ticks=[0, 1], labels=['Unverified', 'Verified'], rotation=0)
plt.show()

Visualize the distribution of verified and unverified reviews.

2. Data Preprocessing

Outlier Removal

Q1 = df['review_length'].quantile(0.25)
Q3 = df['review_length'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# .copy() avoids pandas' SettingWithCopyWarning when the 'overall' column is modified later
df = df[(df['review_length'] >= lower_bound) & (df['review_length'] <= upper_bound)].copy()
print(df.shape)

plt.figure(figsize=(10, 6))
plt.boxplot(df['review_length'], vert=False)
plt.title('Box Plot of Review Lengths')
plt.xlabel('Review Length')
plt.grid(True)
plt.show()

import seaborn as sns

sns.boxplot(x='overall', y='review_length', data=df)
plt.title('Boxplot of Review Length for Each Rating (Outliers Removed)')
plt.show()

Identify outliers in review length with the IQR method (values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR) and remove them from the dataset.
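To make the IQR rule concrete, here is a small standard-library sketch on made-up review lengths. The helper `iqr_bounds` and the sample values are hypothetical and not part of the project; `statistics.quantiles` with `method="inclusive"` is used here as a stand-in for the linear interpolation that `DataFrame.quantile` applies by default.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical review lengths in characters; 900 is an obvious outlier.
lengths = [10, 20, 30, 40, 50, 60, 70, 80, 900]
lower, upper = iqr_bounds(lengths)
kept = [n for n in lengths if lower <= n <= upper]
print(lower, upper, kept)
```

Here Q1 is 30 and Q3 is 70, so the fences are -30 and 130, and only the 900-character review is dropped.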

Data Transformation

df['overall'] = df['overall'].replace({1: 0, 
                                       2: 0, 
                                       3: 1, 
                                       4: 2, 
                                       5: 2})

Transform the original ratings into the labels 0 (Negative), 1 (Neutral), and 2 (Positive).
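Because five ratings collapse into three classes, it can be worth checking how balanced the resulting labels are, since the macro-averaged metrics used later weigh every class equally. The sketch below applies the same mapping to a handful of made-up ratings; `label_distribution` and the sample list are hypothetical.

```python
from collections import Counter

# Same rating-to-label mapping as in the transformation step.
RATING_TO_LABEL = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
LABEL_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}

def label_distribution(ratings):
    """Map star ratings to sentiment names and count each class."""
    return Counter(LABEL_NAMES[RATING_TO_LABEL[int(r)]] for r in ratings)

# Hypothetical ratings, skewed towards 4 and 5 stars.
ratings = [5, 5, 4, 3, 1, 5, 2, 4]
counts = label_distribution(ratings)
print(counts)
```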

Random Sampling

from sklearn.model_selection import train_test_split

df_sample = df.sample(n=1000, random_state=2024)
df_sample = df_sample[['reviewText', 'overall']]
df_train, df_val = train_test_split(df_sample, test_size=0.1, random_state=42)

test_df = df[~df.index.isin(df_sample.index)]
test_df_sample = test_df.sample(n=1000, random_state=2024)
df_test = test_df_sample[['reviewText', 'overall']]

Randomly select 1,000 reviews and split them into a training set (90%) and a validation set (10%).

Randomly select another 1,000 reviews, none of which appear in the training or validation sets, as the test set.
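The important property of this step is that the three sets do not share any rows. The following pure-Python sketch illustrates that structure on row indices only. It is not the pandas or scikit-learn sampling used in the project, and `split_indices` and the row count of 5,000 are hypothetical.

```python
import random

def split_indices(n_rows, n_sample=1000, val_fraction=0.1, seed=2024):
    """Draw disjoint train/validation/test index lists from range(n_rows)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    train_val = indices[:n_sample]
    # The test set is drawn only from rows not used for training or validation
    test = indices[n_sample:2 * n_sample]
    n_val = round(n_sample * val_fraction)
    return train_val[n_val:], train_val[:n_val], test

train_idx, val_idx, test_idx = split_indices(n_rows=5000)
print(len(train_idx), len(val_idx), len(test_idx))

# No index may appear in more than one set
assert not set(train_idx) & set(val_idx)
assert not (set(train_idx) | set(val_idx)) & set(test_idx)
```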

Tokenization

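import torch
from datasets import Dataset
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(batch):
    # Pad or truncate every review to 128 tokens so that all examples have the same length
    tokenized_inputs = tokenizer(batch['reviewText'], padding='max_length', truncation=True, max_length=128, return_tensors='pt')
    tokenized_inputs["labels"] = torch.tensor(batch['overall'])
    return tokenized_inputs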

train_dataset = Dataset.from_pandas(df_train).map(tokenize, batched=True)
val_dataset = Dataset.from_pandas(df_val).map(tokenize, batched=True)
test_dataset = Dataset.from_pandas(df_test).map(tokenize, batched=True)

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Tokenize the data with the pre-trained BERT tokenizer and format the training, validation, and test datasets as PyTorch tensors.

This prepares the data for fine-tuning and evaluating the BERT-based model.
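To show what truncation and padding do to `input_ids` and `attention_mask`, here is a toy sketch with a whitespace tokenizer and a four-word vocabulary. It is only an illustration: the real BERT tokenizer uses WordPiece and adds special tokens such as `[CLS]` and `[SEP]`, which are left out here, and `encode`, `vocab`, and the id values are hypothetical.

```python
def encode(tokens, vocab, max_length=8, pad_id=0, unk_id=1):
    """Toy encoder: map tokens to ids, truncate, then pad to max_length."""
    ids = [vocab.get(tok, unk_id) for tok in tokens][:max_length]
    mask = [1] * len(ids)           # 1 marks a real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": mask + [0] * padding,  # 0 marks padding
    }

vocab = {"great": 2, "product": 3, "works": 4, "well": 5}
short_review = encode("great product".split(), vocab, max_length=4)
long_review = encode("great product works well great product".split(), vocab, max_length=4)
print(short_review)
print(long_review)
```

The short review is padded with two zeros, while the long review is cut to its first four tokens, so both end up with the same length.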

Model Fine-Tuning

import numpy as np
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(np.unique(df['overall']))  # 3 sentiment classes
)

Initialize a BERT-based sequence classification model with one output label per sentiment class.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    
    # Compute accuracy
    accuracy = accuracy_score(labels, preds)
    
    # Compute precision, recall, and F1-score
    macro_precision = precision_score(labels, preds, average='macro')
    macro_recall = recall_score(labels, preds, average='macro')
    macro_f1 = f1_score(labels, preds, average='macro')
    
    return {
        "accuracy": accuracy,
        "precision": macro_precision,
        "recall": macro_recall,
        "f1_score": macro_f1
    }

# Evaluate after every epoch; newer transformers releases name this argument eval_strategy
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Define the evaluation metrics (accuracy and macro-averaged precision, recall, and F1 score) and the training arguments.
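Macro averaging computes each metric per class and then takes the unweighted mean, so a small class such as Neutral counts as much as Positive. The pure-Python sketch below works through this on six made-up predictions. `macro_scores` is a hypothetical helper written for illustration, not the scikit-learn implementation, and it assigns 0 to a metric whose denominator is zero.

```python
def macro_scores(labels, preds, classes=(0, 1, 2)):
    """Compute macro-averaged precision, recall and F1 from two label lists."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(labels, preds))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical true labels and predictions for six reviews.
labels = [0, 0, 1, 2, 2, 2]
preds = [0, 1, 1, 2, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(labels, preds)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

For this example the per-class precisions are 0.5, 0.5, and 1.0, and the per-class recalls are 0.5, 1.0, and 2/3, which gives a macro precision of about 0.667 and a macro recall of about 0.722.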

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

Set up a Trainer object and start the training loop: the model is trained on the training set and evaluated on the validation set after each epoch.

predictions = trainer.predict(test_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)

eval_metrics = compute_metrics((predictions.predictions, predictions.label_ids))
print("Accuracy:", eval_metrics["accuracy"])
print("Macro Precision:", eval_metrics["precision"])
print("Macro Recall:", eval_metrics["recall"])
print("Macro F1 Score:", eval_metrics["f1_score"])

Make predictions on the test set with the fine-tuned model and compute the evaluation metrics from the predictions and the true labels. The test accuracy is around 90%.
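`predictions.predictions` holds one row of raw logits per review, which `compute_metrics` turns into class indices with `argmax`. As a minimal sketch of how such rows could be turned into readable sentiment names with a confidence value, the snippet below applies a softmax and picks the most likely class. The logits are made up, and `softmax` and `decode` are hypothetical helpers.

```python
import math

LABEL_NAMES = ["Negative", "Neutral", "Positive"]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return (label name, probability) for each row of logits."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((LABEL_NAMES[best], probs[best]))
    return results

# Hypothetical logits for two reviews, one row per review.
example_logits = [[-1.2, 0.3, 2.5], [1.9, 0.1, -0.7]]
decoded = decode(example_logits)
for name, prob in decoded:
    print(name, round(prob, 3))
```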

Reference:

Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. Empirical Methods in Natural Language Processing (EMNLP), 2019. https://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf
