werowe/elasticsearch

This Python program processes Apache web server log files using Spark (RDD API) and stores structured, uniquely identified records in Elasticsearch. Here’s a detailed breakdown of what it does:


1. Import Libraries

  • json: For encoding structured data as JSON.
  • hashlib: For generating unique document IDs using the SHA-224 hash.
  • re: For regular expressions to parse log lines.
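
A minimal sketch of that import block (nothing beyond the three standard-library modules the description names):

```python
import json      # serialize parsed records to JSON strings
import hashlib   # SHA-224 digests used as Elasticsearch document IDs
import re        # regular expressions for parsing Apache log lines
```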

2. Functions

  • addId(data)

    • Serializes a Python dictionary to JSON bytes.
    • Computes a SHA-224 hash as a unique identifier (doc_id) for the record.
    • Adds doc_id to the dictionary and returns a tuple: (doc_id, serialized_json).
  • parse(str)

    • Uses a precompiled regex to extract fields from an Apache log line:
      • IP address
      • Access date
      • HTTP operation/method
      • Request URI
    • Returns a dictionary with these keys: ip, date, operation, uri (a sketch of both functions follows this list).
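
A minimal sketch of the two functions. The compiled pattern is assumed to be bound to a module-level name (APACHE_PATTERN here), the regex is assumed to use named groups, the parse parameter is renamed from str to line to avoid shadowing the built-in, and the empty-dict fallback for unparseable lines is an illustrative choice, not a detail taken from the repository:

```python
import json
import hashlib


def addId(data):
    """Attach a SHA-224 document ID and return (doc_id, json_record)."""
    # Hash the serialized record so identical records always get the same ID.
    serialized = json.dumps(data).encode("utf-8")
    data["doc_id"] = hashlib.sha224(serialized).hexdigest()
    # The connector expects (key, value) pairs; the value is the JSON document.
    return (data["doc_id"], json.dumps(data))


def parse(line):
    """Extract ip, date, operation, and uri from one Apache log line."""
    match = APACHE_PATTERN.match(line)  # compiled regex from section 3
    if match is None:
        return {}  # assumption: unparseable lines become empty records
    return {
        "ip": match.group("ip"),
        "date": match.group("date"),
        "operation": match.group("operation"),
        "uri": match.group("uri"),
    }
```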

3. Log Pattern

  • Defines a regex pattern matching the standard Apache combined log format.
  • Compiles it with re.compile for use in parsing (an illustrative pattern is shown below).
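
The exact pattern is not reproduced in this description; the following is one plausible combined-log-format regex, written with named groups to match the parse sketch above and offered purely as an illustration:

```python
import re

# Illustrative pattern for the Apache combined log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
APACHE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                       # client IP, identd, userid
    r'\[(?P<date>[^\]]+)\] '                      # access date/time
    r'"(?P<operation>\S+) (?P<uri>\S+)[^"]*" '    # HTTP method and request URI
    r'(?P<status>\d{3}) (?P<size>\S+)'            # status code and response size
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'  # optional referer / user agent
)
```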

4. Spark RDD Pipeline

  • Reads logs: sc.textFile("/home/ubuntu/walker/apache_logs") loads the raw log file into an RDD, one string per line.
  • Parses logs: rdd.map(parse) maps each line to a parsed dictionary with the required fields.
  • Adds IDs: rdd2.map(addId) generates a unique hash-based document ID for each record and serializes the record as JSON (see the sketch after this list).
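
A sketch of those three steps, assuming sc is the SparkContext supplied by the PySpark shell or spark-submit:

```python
rdd = sc.textFile("/home/ubuntu/walker/apache_logs")  # one log line per RDD element
rdd2 = rdd.map(parse)   # line -> {"ip": ..., "date": ..., "operation": ..., "uri": ...}
rdd3 = rdd2.map(addId)  # dict -> (doc_id, json_record)
```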

5. Elasticsearch Configuration

  • Sets up configuration for the Elasticsearch Hadoop connector, specifying:
    • Elasticsearch host and port
    • Index/resource to write to
    • JSON input handling and the field to use as the document ID (a sample configuration dictionary follows this list)
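
A sample configuration dictionary. The host, port, and index names are placeholders rather than values taken from the repository; the es.* keys are the standard elasticsearch-hadoop settings:

```python
es_write_conf = {
    "es.nodes": "localhost",          # Elasticsearch host (placeholder)
    "es.port": "9200",                # Elasticsearch REST port (placeholder)
    "es.resource": "apachelog/logs",  # index/type to write to (placeholder)
    "es.input.json": "yes",           # records are already serialized as JSON
    "es.mapping.id": "doc_id",        # use the generated hash as the document _id
}
```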

6. Saving to Elasticsearch

  • Calls saveAsNewAPIHadoopFile() with the connector's output classes and the configuration above to store the transformed records in Elasticsearch, using the generated doc_id as each document's _id (a sketch of the call follows).
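
A sketch of that call, using the output classes the elasticsearch-hadoop documentation pairs with PySpark and the es_write_conf dictionary from the previous section:

```python
rdd3.saveAsNewAPIHadoopFile(
    path="-",  # ignored by EsOutputFormat but required by the API
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf,
)
```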

Summary

  • Purpose: Efficiently transform and index large-scale Apache log data into Elasticsearch for querying and analysis.

  • Process:

    1. Parse each log line into a fielded dictionary.
    2. Generate a collision-resistant, unique document identifier.
    3. Serialize each record to JSON.
    4. Bulk-load the records into Elasticsearch via the RDD Hadoop output API (saveAsNewAPIHadoopFile).
  • Use Case: Enables scalable indexing and searching of log data for analytics, monitoring, or troubleshooting in distributed systems.

Note: The code assumes execution within a PySpark environment and uses the elasticsearch-hadoop connector for Elasticsearch integration.
