A real-time data streaming pipeline that captures live posts from Bluesky regarding the NBA, performs sentiment analysis using Apache Spark Structured Streaming, and visualizes the results in Grafana via InfluxDB.
While watching the NBA playoffs and scrolling through social media, I wondered if I could gauge general fan sentiment in real-time. Originally designed for Twitter, this project now leverages the Bluesky Jetstream API to fetch live posts without rate-limit headaches. The project focuses on building a robust ETL pipeline using Kafka, Spark, and time-series databases.
The pipeline consists of four main stages:
- Ingestion: Python script connects to Bluesky Jetstream (WebSocket) and filters for "NBA" posts in English.
- Buffering: Raw JSON data is pushed to Apache Kafka (topic: `twitterdata`).
- Processing: PySpark reads the stream, cleans the text, and calculates sentiment (Polarity/Subjectivity) using TextBlob.
- Storage & Viz: Processed data is sent back to Kafka, consumed by a loader script, stored in InfluxDB, and visualized in Grafana.
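The Grafana query later in this README filters on a `polarity_cat` tag, which the processing stage derives from TextBlob's polarity score. A minimal sketch of that bucketing (the function name and thresholds are assumptions, not taken from the project code):

```python
# Hypothetical sketch: TextBlob reports polarity in [-1.0, 1.0]; the
# pipeline buckets it into a coarse category (the thresholds below are
# an assumption, not taken from the project source).

def categorize_polarity(polarity: float) -> str:
    """Map a TextBlob polarity score to a sentiment label."""
    if polarity > 0.05:
        return "Positive"
    if polarity < -0.05:
        return "Negative"
    return "Neutral"

print(categorize_polarity(0.8))   # → Positive
print(categorize_polarity(-0.3))  # → Negative
print(categorize_polarity(0.0))   # → Neutral
```

Subjectivity (TextBlob's second score, in [0.0, 1.0]) is stored alongside polarity so dashboards can separate opinionated posts from factual ones.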
- Language: Python 3.8+
- Streaming Platform: Apache Kafka & Zookeeper
- Processing Engine: Apache Spark (PySpark 3.5.x)
- Database: InfluxDB (Time Series) & Elasticsearch (NoSQL/Search)
- Visualization: Grafana & Kibana
- Libraries: `websockets`, `kafka-python`, `textblob`, `influxdb-client`
- Java (JDK 8 or 11) is required for Kafka and Spark.
- Docker (optional, but recommended for InfluxDB/Grafana).
Create a virtual environment and install the dependencies:

```bash
pip install pyspark textblob python-dotenv
pip install kafka-python influxdb-client elasticsearch
```

Create a `.env` file in the root directory. This keeps your credentials out of version control.
```env
# --- KAFKA ---
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_TOPIC=twitterdata
KAFKA_OUTPUT_TOPIC=twitterdata-clean
KAFKA_GROUP_ID=bluesky-group

# --- SPARK ---
SPARK_APP_NAME=BlueskySentimentAnalysis
SPARK_CHECKPOINT_DIR=/tmp/spark-checkpoint

# --- INFLUXDB ---
INFLUXDB_URL=http://localhost:8086
INFLUXDB_TOKEN=YOUR_INFLUX_TOKEN_HERE
INFLUXDB_ORG=YOUR_ORG_NAME
INFLUXDB_BUCKET=YOUR_BUCKET_NAME

# --- ELASTICSEARCH ---
ELASTICSEARCH_HOST=https://localhost:9200
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=YOUR_GENERATED_PASSWORD
ELASTICSEARCH_INDEX=twitter_dataset
```

Start Zookeeper and Kafka (from the Kafka installation directory):

```bash
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
```
Start Elasticsearch:

```bat
cd C:\Elastic\elasticsearch-8.x\bin
.\elasticsearch.bat
```

Note the generated password for the `elastic` user.

Start Kibana:

```bat
cd C:\Elastic\kibana-8.x\bin
.\kibana.bat
```

Link Kibana:

- Generate an enrollment token:

  ```bat
  elasticsearch-create-enrollment-token.bat -s kibana
  ```

- Open http://localhost:5601 and enter the token.

Setup Index Pattern:

- Go to Stack Management > Data Views.
- Create a data view for the index `twitter_dataset`.
Start InfluxDB & Grafana:

```bash
sudo systemctl start influxdb
sudo systemctl enable influxdb       # auto-start at boot
sudo systemctl status influxdb

sudo systemctl start grafana-server
sudo systemctl enable grafana-server # auto-start at boot
sudo systemctl status grafana-server
```
Run the scripts in the following order using separate terminal tabs:
Step 1: Start the Producer (Ingestion). Connects to Bluesky and sends data to Kafka.

```bash
python extract_bluesky_to_kafka.py
```

Step 2: Start Spark Streaming (Processing)
Reads from Kafka, applies sentiment analysis, and writes back to Kafka.
Note: the `spark-sql-kafka` connector is supplied to Spark via the `--packages` flag at submit time.

```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 process_spark_streaming.py
```

Step 3: Start the Loader (Storage). Reads from the processed Kafka topic and saves to InfluxDB.

```bash
python load_kafka_to_influxdb.py
```

Step 4: Visualize
- Open Grafana (http://localhost:3000).
- Add InfluxDB as a Data Source (Flux language).
- Create a dashboard to query the measurement `bluesky_post`.
- Example query to count "Positive" butterflies:

```flux
from(bucket: "YOUR_BUCKET")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "bluesky_post")
  |> filter(fn: (r) => r["polarity_cat"] == "Positive")
```
Here is the pipeline in action:
The Python Producer filters for posts and pushes them to Kafka.
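The producer's job can be sketched as follows. This is a hypothetical outline, not the project's `extract_bluesky_to_kafka.py`: the Jetstream endpoint URL, record field names, and helper names are assumptions.

```python
# Hypothetical producer sketch: subscribe to the Bluesky Jetstream
# firehose over WebSocket, keep English posts mentioning "NBA", and
# forward them to Kafka. Endpoint and field names are assumptions.
import asyncio
import json
import os

JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)

def is_nba_post(record: dict) -> bool:
    """Keep English-language posts whose text mentions the NBA."""
    text = record.get("text", "")
    langs = record.get("langs", [])
    return "nba" in text.lower() and ("en" in langs or not langs)

async def run_producer() -> None:
    # Third-party imports are deferred so the module loads (and the
    # filter above is testable) without websockets/kafka-python installed.
    import websockets
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for raw in ws:
            event = json.loads(raw)
            record = event.get("commit", {}).get("record", {})
            if is_nba_post(record):
                producer.send(os.getenv("KAFKA_TOPIC", "twitterdata"), record)

# To start the producer: asyncio.run(run_producer())
```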

Processed sentiment data (polarity and subjectivity) stored in the database.
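The loader's role can be sketched like this. It is a hypothetical outline, not the project's `load_kafka_to_influxdb.py`; the measurement, tag, and field names mirror the Grafana query in this README, and everything else is an assumption.

```python
# Hypothetical loader sketch: consume the cleaned Kafka topic and write
# each record to InfluxDB as line protocol. Measurement/tag/field names
# follow the Grafana query in this README; the rest is assumed.
import json
import os
import time

def to_line_protocol(rec: dict, ts_ns: int) -> str:
    """Render one record as InfluxDB line protocol: measurement,tags fields ts."""
    return (
        f"bluesky_post,polarity_cat={rec['polarity_cat']} "
        f"polarity={rec['polarity']},subjectivity={rec['subjectivity']} {ts_ns}"
    )

def run_loader() -> None:
    # Deferred third-party imports so the module loads without the clients.
    from kafka import KafkaConsumer
    from influxdb_client import InfluxDBClient
    from influxdb_client.client.write_api import SYNCHRONOUS

    consumer = KafkaConsumer(
        os.getenv("KAFKA_OUTPUT_TOPIC", "twitterdata-clean"),
        bootstrap_servers=os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        group_id=os.getenv("KAFKA_GROUP_ID", "bluesky-group"),
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    client = InfluxDBClient(
        url=os.getenv("INFLUXDB_URL", "http://localhost:8086"),
        token=os.getenv("INFLUXDB_TOKEN"),
        org=os.getenv("INFLUXDB_ORG"),
    )
    write_api = client.write_api(write_options=SYNCHRONOUS)
    for msg in consumer:
        line = to_line_protocol(msg.value, time.time_ns())
        write_api.write(bucket=os.getenv("INFLUXDB_BUCKET"), record=line)
```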

The video below demonstrates the live data flow, showing the sentiment analysis updating in real-time on the Grafana Dashboard.

Exploring the raw posts and performing full-text search.

- InfluxDB: Optimized for Time Series. It handles high write loads and is perfect for mathematical aggregations over time (e.g., "Average sentiment per minute").
- Elasticsearch: Optimized for Search. It allows us to find specific posts (e.g., "Find all negative posts about 'Monarch'").
Kafka acts as the central nervous system, decoupling the ingestion (Bluesky) from the processing (Spark) and the storage (Elastic/Influx). This ensures that if the database loader crashes, no data is lost; it remains buffered in Kafka.
In streaming, "Event Time" (when the post was written) differs from "Processing Time" (when Spark receives it).
- The Problem: What if a post from 12:00 arrives at 12:05 due to network lag?
- The Solution: Spark uses Watermarking.
  `df.withWatermark("ts", "10 seconds")` tells the engine: "Wait for late data up to 10 seconds. Anything older than that is dropped."
- This prevents the application from keeping unbounded state in memory while waiting for old data.
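The watermark rule can be made concrete with a toy check in plain Python (not Spark code; times are simplified to plain seconds):

```python
# Toy illustration of the watermarking rule (plain Python, not Spark):
# the engine tracks the maximum event time seen so far, and an event is
# accepted only if it is no older than max_event_ts - delay.

def is_accepted(event_ts: float, max_event_ts: float, delay: float = 10.0) -> bool:
    """True if an event with timestamp event_ts is not behind the watermark."""
    watermark = max_event_ts - delay
    return event_ts >= watermark

# A 12:00:00 post arriving while the stream has only reached 12:00:05 is kept...
print(is_accepted(event_ts=0.0, max_event_ts=5.0))   # → True
# ...but the same post arriving after the stream reached 12:00:15 is dropped.
print(is_accepted(event_ts=0.0, max_event_ts=15.0))  # → False
```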
- Append Mode: Only new rows are added to the result table. (Used in this project for writing to Kafka).
- Complete Mode: The entire result table is rewritten every trigger. (Useful for aggregations/counts).
- Update Mode: Only rows that changed are written.
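The difference between the three modes can be shown with a toy simulation in plain Python (not Spark code; the result tables are simplified to dicts):

```python
# Toy simulation of Structured Streaming output modes (not Spark code).
# `prev` and `curr` are result tables as {window: count} dicts; the
# function returns which rows would be emitted on this trigger.

def rows_emitted(prev: dict, curr: dict, mode: str) -> dict:
    if mode == "append":
        # Only brand-new rows (e.g. windows finalized by the watermark).
        return {k: v for k, v in curr.items() if k not in prev}
    if mode == "complete":
        # The entire result table, every trigger.
        return dict(curr)
    if mode == "update":
        # New rows plus rows whose value changed.
        return {k: v for k, v in curr.items() if prev.get(k) != v}
    raise ValueError(f"unknown mode: {mode}")

prev = {"12:00": 4, "12:01": 7}
curr = {"12:00": 4, "12:01": 9, "12:02": 1}
print(rows_emitted(prev, curr, "append"))  # → {'12:02': 1}
print(rows_emitted(prev, curr, "update"))  # → {'12:01': 9, '12:02': 1}
```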
Ideas for extending the project:

- Dockerize: Create a `docker-compose.yml` to launch Kafka, Spark, InfluxDB, and the Python scripts with one command.
- Better NLP: Replace `TextBlob` with a pre-trained Transformer model (like BERT) for higher accuracy on slang/sports terms.
- Direct Sink: Write from Spark directly to InfluxDB (skipping the second Kafka topic) to reduce latency.
⭐ If this project was useful to you, feel free to give it a star! ⭐

