This project uses the Industrial and Scientific K-core dataset from the Amazon Review Data (2018). The dataset contains reviews, each with a rating and review text, alongside metadata such as the review time, reviewer name, and Unix review time.
The project fine-tunes a pre-trained BERT model to classify the sentiment of the review text, then predicts the sentiment of held-out test data. To create sentiment labels, product ratings from 1 to 5 are mapped to three classes: ratings 1 and 2 become 0 (Negative), rating 3 becomes 1 (Neutral), and ratings 4 and 5 become 2 (Positive).
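The rating-to-label mapping described above can be sketched as a small standalone function. The names `rating_to_sentiment` and `SENTIMENT_NAMES` are hypothetical and used only for illustration; the notebook itself applies the same mapping directly to the DataFrame column.

```python
# Hypothetical helper illustrating the mapping from star ratings to sentiment labels.
SENTIMENT_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}

def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a 3-class sentiment label."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 2:
        return 0  # 1-2 stars: Negative
    if rating == 3:
        return 1  # 3 stars: Neutral
    return 2      # 4-5 stars: Positive

labels = [rating_to_sentiment(r) for r in [1, 2, 3, 4, 5]]
print([SENTIMENT_NAMES[label] for label in labels])
```

Collapsing five ratings into three classes keeps the task simple, at the cost of treating 1-star and 2-star reviews (and 4-star and 5-star reviews) as equivalent.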
There are 4 parts in the project:
- Data Exploration
- Outlier Removal
- Data Preprocessing
- Model Fine-tuning
import pandas as pd

# Read the gzipped, line-delimited JSON file (one review per line)
df = pd.read_json("./Industrial_and_Scientific_5.json.gz", lines=True)
# Unnecessary columns: reviewTime, style, reviewerName, unixReviewTime, image, summary, vote
df = df.drop(["reviewTime", "style", "reviewerName", "unixReviewTime", "image", "summary", "vote"], axis=1)

Load the JSON file into a DataFrame and drop the unnecessary columns.
print(df.isna().sum())
df.dropna(subset=['reviewText'], inplace=True)
# Explore data stats
print('Counts:')
print(df.count(), '\n')
print('Averages:')
print(df.mean(numeric_only=True), '\n')
print('Medians:')
print(df.median(numeric_only=True), '\n')
print('Modes:')
print(df.mode(numeric_only=True).iloc[0], '\n')

Check the dataset for missing values, drop rows without review text, and explore basic summary statistics.
import matplotlib.pyplot as plt

# Display the distribution of the number of reviews across ratings
reviews_across_products = df.groupby('overall')['asin'].count()
review_counts = df['overall'].value_counts().sort_index()
print("Number of reviews for each rating in the dataset:")
print(review_counts)
plt.figure(figsize=(6, 4))
plt.bar(reviews_across_products.index, reviews_across_products.values, color='skyblue')
plt.xlabel('Ratings')
# The bars count reviews per rating, not products
plt.ylabel('Number of Reviews')
plt.title('Distribution of Reviews Across Ratings')
plt.show()

Visualize the distribution of reviews per rating.
reviews_per_product = df.groupby(['asin', 'overall']).size().unstack(fill_value=0)
reviews_per_product['total_reviews'] = reviews_per_product.sum(axis=1)
print(reviews_per_product)
reviews_per_user = df.groupby(['reviewerID', 'overall']).size().unstack(fill_value=0)
reviews_per_user['total_reviews'] = reviews_per_user.sum(axis=1)
print(reviews_per_user)

Display the number of reviews per product and per user, broken down by rating, together with the total for each.
# Review length in characters
df['review_length'] = df['reviewText'].str.len()
plt.figure(figsize=(10, 6))
plt.boxplot(df['review_length'], vert=False)
plt.title('Box Plot of Review Lengths')
plt.xlabel('Review Length')
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
plt.boxplot(df['review_length'], vert=False, showfliers=False)
plt.title('Review Lengths without Outliers')
plt.xlabel('Review Length')
plt.grid(True)
plt.show()

Display a box plot of review lengths to check for outliers, then show the same box plot with the outliers hidden.
df_review = df["reviewText"]
duplicate_rows = df_review.duplicated()
num_duplicates = duplicate_rows.sum()
duplicate_data = df_review[duplicate_rows]
print("There are", num_duplicates, "duplicate rows in the dataset.")
print(duplicate_data)
print("Drop Duplicates")
df.drop_duplicates(subset=['reviewText'], inplace=True)
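# Re-check the same column that was deduplicated, not whole rows
num_duplicates = df['reviewText'].duplicated().sum()
print("There are", num_duplicates, "duplicate review texts left in the dataset.")

Check the dataset for duplicate review texts and remove them.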
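Because duplicates are detected on the review text alone, two reviews with the same text but different ratings collapse into one, and the first occurrence is kept (the default behaviour of `drop_duplicates`). A minimal pure-Python sketch of that rule, using a hypothetical `drop_duplicate_texts` helper and made-up toy reviews:

```python
def drop_duplicate_texts(reviews):
    """Keep the first review for each distinct text, preserving order."""
    seen = set()
    kept = []
    for review in reviews:
        text = review["reviewText"]
        if text not in seen:
            seen.add(text)
            kept.append(review)
    return kept

# Toy data for illustration only
toy_reviews = [
    {"reviewText": "Works great", "overall": 5},
    {"reviewText": "Broke in a week", "overall": 1},
    {"reviewText": "Works great", "overall": 4},  # same text, different rating
]
deduped = drop_duplicate_texts(toy_reviews)
print(len(toy_reviews) - len(deduped), "duplicate review text removed")
```

In this toy case the 4-star copy of "Works great" is discarded and the 5-star copy survives, so the order of the rows decides which rating is retained.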
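# Sort by index so False (unverified) comes before True (verified), matching the tick labels below
verified_counts = df['verified'].value_counts().sort_index()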
plt.figure(figsize=(6, 4))
verified_counts.plot(kind='bar', color='skyblue')
plt.title('Verified vs Unverified Reviews')
plt.xlabel('Verification')
plt.ylabel('Number of Reviews')
plt.xticks(ticks=[0, 1], labels=['Unverified', 'Verified'], rotation=0)
plt.show()

Visualize the distribution of verified and unverified reviews.
import seaborn as sns

# IQR rule: keep review lengths within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
Q1 = df['review_length'].quantile(0.25)
Q3 = df['review_length'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# .copy() avoids pandas chained-assignment warnings when the filtered frame is modified later
df = df[(df['review_length'] >= lower_bound) & (df['review_length'] <= upper_bound)].copy()
print(df.shape)
plt.figure(figsize=(10, 6))
plt.boxplot(df['review_length'], vert=False)
plt.title('Box Plot of Review Lengths (Outliers Removed)')
plt.xlabel('Review Length')
plt.grid(True)
plt.show()
sns.boxplot(x='overall', y='review_length', data=df)
plt.title('Boxplot of Review Length for Each Rating (Outliers Removed)')
plt.show()

Identify outliers in review length with the IQR method and remove them from the dataset.
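As a worked example of the IQR rule, the sketch below computes the bounds for a handful of made-up review lengths. The helper names `quantile` and `iqr_bounds` are hypothetical, and the quantile uses linear interpolation, which is the default in pandas.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy review lengths for illustration only
lengths = [12, 40, 55, 60, 75, 90, 110, 130, 2500]
low, high = iqr_bounds(lengths)
kept = [n for n in lengths if low <= n <= high]
print(low, high, kept)
```

Here Q1 = 55 and Q3 = 110, so the IQR is 55 and the bounds are -27.5 and 192.5. Since a length can never be negative, only the upper bound removes anything: the 2500-character review is dropped.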
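# Map star ratings to sentiment classes: 1-2 -> 0, 3 -> 1, 4-5 -> 2
# astype(int) keeps the labels as integers, which the classification loss expects
df['overall'] = df['overall'].replace({1: 0, 2: 0, 3: 1, 4: 2, 5: 2}).astype(int)

Transform the original ratings into the labels 0 (Negative), 1 (Neutral), and 2 (Positive).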
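from sklearn.model_selection import train_test_split

# Sample 1,000 reviews and split them 90% / 10% into training and validation sets
df_sample = df.sample(n=1000, random_state=2024)
df_sample = df_sample[['reviewText', 'overall']]
df_train, df_val = train_test_split(df_sample, test_size=0.1, random_state=42)
# Sample 1,000 test reviews from the rows not already used above
test_df = df[~df.index.isin(df_sample.index)]
test_df_sample = test_df.sample(n=1000, random_state=2024)
df_test = test_df_sample[['reviewText', 'overall']]

Randomly select 1,000 reviews and split them into a training set (900 reviews) and a validation set (100 reviews). Then randomly select another 1,000 reviews, disjoint from the first sample, as the test set.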
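The sampling logic can be checked on plain integer IDs. The sketch below is a stand-in that uses only the standard library rather than pandas or scikit-learn; the ID range of 10,000 and all variable names are assumptions made for illustration.

```python
import random

rng = random.Random(2024)        # fixed seed for reproducibility
all_ids = list(range(10_000))    # stand-in for the DataFrame index

# Pool of 1,000 rows for training and validation
sample_ids = rng.sample(all_ids, 1000)
rng.shuffle(sample_ids)
val_ids, train_ids = sample_ids[:100], sample_ids[100:]  # 10% validation, 90% training

# Test rows are drawn only from rows outside the first sample
remaining = sorted(set(all_ids) - set(sample_ids))
test_ids = rng.sample(remaining, 1000)

print(len(train_ids), len(val_ids), len(test_ids))
print(set(test_ids).isdisjoint(sample_ids))
```

Excluding the first sample before drawing the test set is what keeps training, validation, and test reviews from overlapping.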
import torch
from datasets import Dataset
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(batch):
    # Pad to the longest review in the batch and truncate to at most 128 tokens
    tokenized_inputs = tokenizer(batch['reviewText'], padding=True, truncation=True, max_length=128, return_tensors='pt')
    tokenized_inputs["labels"] = torch.tensor(batch['overall'])
    return tokenized_inputs

train_dataset = Dataset.from_pandas(df_train).map(tokenize, batched=True)
val_dataset = Dataset.from_pandas(df_val).map(tokenize, batched=True)
test_dataset = Dataset.from_pandas(df_test).map(tokenize, batched=True)
# Expose only the columns the model needs, as PyTorch tensors
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Tokenize the review text with the tokenizer of the pre-trained BERT model and format the training, validation, and test datasets as PyTorch tensors. This prepares the data for fine-tuning and evaluating the BERT-based model.
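import numpy as np
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(np.unique(df['overall']))  # one output per distinct sentiment label
)

Initialize a BERT-based sequence classification model with one output per sentiment class.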
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(pred):
    logits, labels = pred
    # The predicted class is the one with the highest logit
    preds = np.argmax(logits, axis=-1)
    # Compute accuracy
    accuracy = accuracy_score(labels, preds)
    # Compute macro-averaged precision, recall, and F1-score
    macro_precision = precision_score(labels, preds, average='macro')
    macro_recall = recall_score(labels, preds, average='macro')
    macro_f1 = f1_score(labels, preds, average='macro')
    return {
        "accuracy": accuracy,
        "precision": macro_precision,
        "recall": macro_recall,
        "f1_score": macro_f1
    }

# Evaluate on the validation set at the end of every epoch
# (recent transformers releases name this argument eval_strategy)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Set up the evaluation metrics (accuracy and macro-averaged precision, recall, and F1-score) and the training arguments.
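To make the macro averaging concrete, the sketch below recomputes the same four metrics in plain Python for a toy set of labels. The function name `macro_scores` and the toy data are hypothetical, and an undefined precision or recall (no predictions or no true samples for a class) is taken to be 0, which is an assumption of this sketch.

```python
def macro_scores(labels, preds, num_classes=3):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(y == c and p == c for y, p in zip(labels, preds))
        fp = sum(y != c and p == c for y, p in zip(labels, preds))
        fn = sum(y == c and p != c for y, p in zip(labels, preds))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_classes,  # unweighted mean over classes
        "recall": sum(recalls) / num_classes,
        "f1_score": sum(f1s) / num_classes,
    }

# Toy case: eight Positive reviews, one Negative, one Neutral; the model always predicts Positive
toy_labels = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]
toy_preds = [2] * 10
scores = macro_scores(toy_labels, toy_preds)
print(scores)
```

In this toy case the accuracy is 0.8, yet the macro recall is only 1/3 because two of the three classes are never predicted. Macro averaging weights every class equally, so it exposes a model that ignores the less frequent classes.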
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

Set up a Trainer object and start the training loop. The model is trained on the training dataset and evaluated on the validation dataset after each epoch.
predictions = trainer.predict(test_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)
eval_metrics = compute_metrics((predictions.predictions, predictions.label_ids))
print("Accuracy:", eval_metrics["accuracy"])
print("Macro Precision:", eval_metrics["precision"])
print("Macro Recall:", eval_metrics["recall"])
print("Macro F1 Score:", eval_metrics["f1_score"])

Make predictions on the test dataset with the fine-tuned model and compute the evaluation metrics from the predictions and the true labels. The test accuracy is around 90%.
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019. https://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf





