phrabit/TwitterUserLogAnalysis-Hadoop

TwitterUserLogAnalysis-Hadoop

2022 ITM DB_WEB Hadoop Project - User analysis from Twitter log data

Introduction

  • Team5

Data Collecting : Emilie Greeker

Data Engineering : Jooseung Lee

Data Analysis : Suho Lee ✌️

Web : Jaeyou Lee

  • Term

1st semester of 2022, ITM Programme (05/01 ~ 05/31)

  • Received an A+ grade in this course and a high score on the project as well 💯

✅ Collecting log data from Twitter

  • Original Plan

Flume: connection and storage with a Flume client and agent; this approach ran into obstacles

  • New Plan

Combination with C#

Microsoft.Net.Http

  • API testing with Postman

image

  • C# Script

Flume Client Substitute

Quick and Effective collection

image

  • Result of log data collection

1,018 user log data files

1,000,000 user log records in JSON format.

image
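As a rough illustration of how the collected files might be read back for analysis, the sketch below loads every `.json` file in a directory. It assumes each file holds either a JSON array of user objects or one JSON object per line; the actual layout written by the C# script is not documented here, and the function name `load_user_logs` is hypothetical.

```python
import json
from pathlib import Path


def load_user_logs(directory):
    """Return a list of user records read from every *.json file in `directory`.

    Assumption: each file is either a JSON array of objects or
    newline-delimited JSON (one object per line).
    """
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty files
        if text.startswith("["):
            records.extend(json.loads(text))  # whole file is one JSON array
        else:
            # newline-delimited JSON: parse each non-blank line separately
            records.extend(
                json.loads(line) for line in text.splitlines() if line.strip()
            )
    return records
```

Files are read in sorted name order, so repeated runs return the records in the same order.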

✅ Hadoop Clustering with GCP VM instances

  1. Name Node

  2. Resource Manager

  3. Worker Nodes

image

Installation

Hadoop / Java / Setting environment variables

Start Spark

  • Used to start the Spark master on the name node.

image

  • Used to start a Spark worker on each worker node.

image

Load

Copy the data from the local machine to the local disk of the worker node 3 instance, then put it into HDFS.

image
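The put step could be scripted along the following lines. This is only a sketch: it builds the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands and runs them only when asked, and any paths passed to it (for example `/home/user/logs` and `/twitter/raw`) are hypothetical, not the ones used in the project.

```python
import subprocess


def hdfs_put_commands(local_dir, hdfs_dir):
    """Build the `hdfs dfs` commands that create `hdfs_dir` and upload `local_dir` into it."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_dir, hdfs_dir],
    ]


def upload(local_dir, hdfs_dir, dry_run=True):
    """Print the commands; run them only when dry_run is False (needs a Hadoop client)."""
    for cmd in hdfs_put_commands(local_dir, hdfs_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the command fails
```

Keeping `dry_run=True` as the default makes it possible to check the commands on a machine without Hadoop installed before running them on the worker node.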

PySpark

  • Result of PySpark (partitioned output) => copied back to the local machine

image

image

image
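Spark writes partitioned results as a directory of `part-*` files. Once that directory has been copied to the local machine, the parts can be joined into a single file. The sketch below shows one way to do this, assuming plain-text output (such as CSV or JSON lines) without header rows; the function name `merge_part_files` is hypothetical.

```python
from pathlib import Path


def merge_part_files(output_dir, merged_path):
    """Concatenate Spark-style `part-*` files in `output_dir` into one file.

    Returns the number of part files merged. Marker files such as `_SUCCESS`
    are ignored because they do not match the `part-*` pattern.
    """
    parts = sorted(Path(output_dir).glob("part-*"))
    with open(merged_path, "w", encoding="utf-8") as merged:
        for part in parts:
            text = part.read_text(encoding="utf-8")
            merged.write(text)
            if text and not text.endswith("\n"):
                merged.write("\n")  # keep records on separate lines
    return len(parts)
```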


✅ Data Analysis

1. Average ‘Tweet count’ by time of day

The analysis of the Twitter log data indicates that most users post tweets between 20:00 and 24:00.

  • Optimal Timing:

It is recommended to schedule important or high-engagement tweets during this time frame.

  • Targeted Content:

Analyzing the type of content that performs well during this time period can provide insights for creating targeted content.

  • Engage with Users:

Monitor relevant hashtags, join conversations, and respond to user queries or comments in real-time.

image
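The aggregation behind this chart can be sketched in plain Python. The example groups records by the hour of a timestamp and averages `tweet_count` within each hour. It assumes each record carries a `created_at` string in ISO 8601 form (for example `2022-05-12T21:03:10.000Z`); that field name and format are assumptions, and the hours are taken as recorded, with no time-zone conversion.

```python
from collections import defaultdict
from datetime import datetime


def average_tweet_count_by_hour(records):
    """Map hour of day (0-23) to the mean `tweet_count` of records in that hour."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for rec in records:
        # Assumed timestamp format, e.g. "2022-05-12T21:03:10.000Z"
        ts = datetime.strptime(rec["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
        totals[ts.hour] += rec["tweet_count"]
        counts[ts.hour] += 1
    # Only hours that actually appear in the data are returned
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}
```

On the full dataset the same grouping would more naturally be expressed as a PySpark group-by on the cluster; the plain-Python version is only meant to make the logic explicit.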

2. tweet_count and followers_count

Most users had a value of 0 for both tweet_count and followers_count. The next largest group was users with values of 1-10, followed by users with values of 100 or greater.

  • Targeting new users:

Users with a tweet_count and followers_count of 0 are identified as users who are not yet actively tweeting and do not yet have many followers.

  • Build a small community:

Users with a tweet_count and followers_count of 1-10 have already posted a few tweets and have a small number of followers. A small community can be formed through communication and interaction with these users.

  • Target influential users:

For users with a tweet_count and followers_count of 100 or greater, you can encourage their support and sharing through collaboration, partnership, brand mention, etc.

image
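The grouping into value ranges can be sketched as follows. The ranges `0`, `1-10`, and `100+` come from the analysis above; the `11-99` range is an assumption added so that every value lands in some bucket, and values are assumed to be non-negative integers.

```python
from collections import Counter


def bucket(value):
    """Assign a count to one of the ranges discussed above.

    The "11-99" range is an assumption added so every value falls in a bucket.
    """
    if value == 0:
        return "0"
    if value <= 10:
        return "1-10"
    if value < 100:
        return "11-99"
    return "100+"


def bucket_distribution(records, field):
    """Count how many records fall into each bucket for the given field."""
    return Counter(bucket(rec[field]) for rec in records)
```

Calling `bucket_distribution(records, "tweet_count")` and `bucket_distribution(records, "followers_count")` gives the two distributions separately.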

3. Location Analysis

To prepare a heat map for each continent, each user's location must first be converted into 'latitude' and 'longitude'.

image
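The conversion step can be illustrated with a small lookup table. This is not the method used in the project, which is not described here; the table `KNOWN_LOCATIONS` is hypothetical, its coordinates are approximate, and a real run over free-text profile locations would need a geocoding service or a much larger gazetteer.

```python
# Hypothetical lookup table with approximate coordinates; a real run would
# need a geocoding service or a much larger gazetteer.
KNOWN_LOCATIONS = {
    "seoul": (37.57, 126.98),
    "london": (51.51, -0.13),
    "new york": (40.71, -74.01),
}


def to_lat_lon(location):
    """Return (latitude, longitude) for a free-text location, or None if unknown."""
    if not location:
        return None  # missing or empty profile location
    key = location.split(",")[0].strip().lower()  # "Seoul, Korea" -> "seoul"
    return KNOWN_LOCATIONS.get(key)
```

Records for which `to_lat_lon` returns `None` would simply be left out of the heat map.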

4. User Language Data Analysis

Visualized using a word cloud.

image


✅ Web by using 'Flutter' & 'Firebase'

image
