This Python program processes Apache web server log files using Spark (RDD API) and stores structured, uniquely identified records in Elasticsearch. Here’s a detailed breakdown of what it does:
- `json`: For encoding structured data as JSON.
- `hashlib`: For generating unique document IDs using the SHA-224 hash.
- `re`: For regular expressions to parse log lines.
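For reference, those imports amount to just a few lines; any Spark setup is assumed to come from the PySpark shell, which provides `sc`:

```python
import json     # serialize parsed records as JSON strings
import hashlib  # SHA-224 hashing for stable document IDs
import re       # regular-expression parsing of Apache log lines
```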
- `addID(data)`
  - Serializes a Python dictionary to JSON bytes.
  - Computes a SHA-224 hash as a unique identifier (`doc_id`) for the record.
  - Adds `doc_id` to the dictionary and returns a tuple: `(doc_id, serialized_json)`.
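A minimal sketch of such a function, assuming the whole serialized record is used as the hash input (the original's exact encoding and error handling may differ):

```python
def addID(data):
    # Serialize the parsed record to JSON bytes so it can be hashed.
    j = json.dumps(data).encode("ascii", "ignore")
    # A SHA-224 digest of the serialized record serves as a stable doc_id.
    data["doc_id"] = hashlib.sha224(j).hexdigest()
    # Return (doc_id, serialized_json) so the ID can later drive es.mapping.id.
    return (data["doc_id"], json.dumps(data))
```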
- `parse(str)`
  - Uses a precompiled regex to extract fields from an Apache log line:
    - IP address
    - Access date
    - HTTP operation/method
    - Request URI
  - Returns a dictionary with these keys: `ip`, `date`, `operation`, `uri`.
- Defines a regex pattern matching the standard Apache combined log format.
- Compiles it with `re.compile` for use in parsing.
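A sketch of how the compiled pattern and the `parse` function described above might fit together; the pattern shown is a simplified placeholder (`APACHE_RE` is not a name from the original) that captures only the four fields listed, whereas the original likely matches the full combined log format:

```python
# Simplified Apache log pattern: client IP, identity, user, [timestamp], "METHOD URI ..."
APACHE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

def parse(line):
    match = APACHE_RE.search(line)
    if match:
        return {
            "ip": match.group(1),         # client IP address
            "date": match.group(2),       # access timestamp
            "operation": match.group(3),  # HTTP method (GET, POST, ...)
            "uri": match.group(4),        # requested URI
        }
    # Handling of non-matching lines is an assumption; the original may differ.
    return {"ip": "", "date": "", "operation": "", "uri": ""}
```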
- Reads logs: `sc.textFile("/home/ubuntu/walker/apache_logs")` loads the raw log file into an RDD, one string per line.
- Parses logs: `rdd.map(parse)` maps each line to a parsed dictionary with the required fields.
- Adds IDs: `rdd2.map(addID)` generates a unique hash-based document ID for each record and serializes it as JSON.
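Chained together, the pipeline described above looks roughly like this (the variable names `rdd`, `rdd2`, `rdd3` follow the breakdown; the file path is the one quoted above):

```python
# One RDD element per raw log line.
rdd = sc.textFile("/home/ubuntu/walker/apache_logs")

# Parse each line into a dict with ip, date, operation, and uri keys.
rdd2 = rdd.map(parse)

# Attach a SHA-224 doc_id and serialize each record, yielding (doc_id, json) pairs.
rdd3 = rdd2.map(addID)
```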
- Sets up configuration for the Elasticsearch Hadoop connector, specifying:
- Elasticsearch host and port
- Index/resource to write to
- JSON input handling and the mapping ID field
- Calls `saveAsNewAPIHadoopFile()` with the Elasticsearch output classes and this configuration to store the transformed records in Elasticsearch, using the generated `doc_id` as the document ID in the index.
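A sketch of that output step using the standard elasticsearch-hadoop classes; the host, port, and index/resource values below are placeholders rather than values confirmed by the original:

```python
# Configuration for the elasticsearch-hadoop connector.
es_write_conf = {
    "es.nodes": "localhost",         # Elasticsearch host (placeholder)
    "es.port": "9200",               # Elasticsearch port (placeholder)
    "es.resource": "walker/apache",  # target index/type to write to (placeholder)
    "es.input.json": "yes",          # record values are already JSON strings
    "es.mapping.id": "doc_id",       # use the generated doc_id as the document ID
}

# Write the (doc_id, json_record) pairs through the Hadoop OutputFormat API.
rdd3.saveAsNewAPIHadoopFile(
    path="-",  # ignored by EsOutputFormat but required by the API
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf,
)
```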
- Purpose: Efficiently transform and index large-scale Apache log data into Elasticsearch for querying and analysis.
- Process:
  - Parse each log line into a fielded dictionary,
  - Generate a strong, unique document identifier,
  - Serialize each record to JSON,
  - Bulk load these records into Elasticsearch using Spark RDD's Hadoop API.
- Use Case: Enables scalable indexing and searching of log data for analytics, monitoring, or troubleshooting in distributed systems.
Note: The code assumes execution within a PySpark environment and uses the elasticsearch-hadoop connector for Elasticsearch integration.