Navigation: Main README | Pipeline Guide | Java Architecture | Java Data Sources | TODO
- Overview
- Core Components
- Data Flow
- Project Structure
- Building and Running
- Adding a New Data Source
- Error Handling
- Dependencies
- Related Documentation
## Overview

The Java module fetches and consolidates financial data from multiple sources into a unified long-format CSV for the Python ML pipeline.
```
┌────────────────────────────────────────────────────────────────────────────┐
│                         JAVA INGESTION ARCHITECTURE                        │
└────────────────────────────────────────────────────────────────────────────┘

                  ┌─────────────────────┐
                  │      Main.java      │
                  │    (Entry Point)    │
                  └──────────┬──────────┘
                             │
                             │ Registers data sources
                             ▼
                  ┌─────────────────────┐
                  │    IngestManager    │
                  │     (Singleton)     │
                  │                     │
                  │  • sources: Set     │
                  │  • data: Set        │
                  └──────────┬──────────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
     ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
     │   YfPrices   │ │  YfFinances  │ │    NzGdp     │
     │              │ │              │ │              │
     │ Yahoo OHLCV  │ │  Yahoo Fin.  │ │   Stats NZ   │
     └──────────────┘ └──────────────┘ └──────────────┘
            │                │                │
            └────────────────┼────────────────┘
                             │
                             │ Parallel fetch
                             ▼
                  ┌─────────────────────┐
                  │   Set<DataPoint>    │
                  │                     │
                  │ (timestamp, ticker, │
                  │  feature, value)    │
                  └──────────┬──────────┘
                             │
                             ▼
                  ┌─────────────────────┐
                  │    CsvLongParser    │
                  │                     │
                  │    saveCsv(path)    │
                  └──────────┬──────────┘
                             │
                             ▼
                  ┌─────────────────────┐
                  │    data_long.csv    │
                  │                     │
                  │  timestamp,ticker,  │
                  │  feature,value      │
                  └─────────────────────┘
```
## Core Components

### DataPoint

The fundamental data structure:

```java
public class DataPoint {
    LocalDateTime timestamp;
    String ticker;
    String featureName;
    Double value;
}
```

### DataSourceBase

Abstract base class for all data sources. Its constructor registers each concrete source with the singleton manager, so instantiating a source is all that is needed to include it in the next fetch:

```java
public abstract class DataSourceBase {
    public abstract Set<DataPoint> getDataPoints();

    public DataSourceBase() {
        IngestManager.INSTANCE.sources.add(this);
    }
}
```

### IngestManager

Singleton that orchestrates data collection:
```java
public enum IngestManager {
    INSTANCE;

    public Set<DataSourceBase> sources = new HashSet<>();
    // Merged into from a parallel stream, so a concurrent set is required
    public Set<DataPoint> data = ConcurrentHashMap.newKeySet();

    public void fetchDataFromSources() {
        data.clear();
        sources.parallelStream().forEach(source -> {
            var dataPoints = source.getDataPoints();
            // Drop points with missing values before merging
            dataPoints = dataPoints.stream()
                    .filter(dp -> dp.getValue() != null)
                    .collect(Collectors.toSet());
            this.data.addAll(dataPoints);
        });
    }
}
```

### CsvLongParser

Outputs data to CSV:
```java
public class CsvLongParser {
    public static void saveCsv(String path) {
        // Sort by timestamp, ticker, feature
        // Write: timestamp,ticker,feature,value
    }
}
```

## Data Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                                DATA FLOW                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. REGISTRATION                                                        │
│     ────────────                                                        │
│     new YfPrices();   // Registers with IngestManager                   │
│     new NzGdp();      // Each source self-registers                     │
│     ...                                                                 │
│                                                                         │
│  2. PARALLEL FETCH                                                      │
│     ──────────────                                                      │
│     IngestManager.fetchDataFromSources()                                │
│       → sources.parallelStream() → source.getDataPoints()               │
│                                                                         │
│  3. DATA POINTS                                                         │
│     ───────────                                                         │
│     {                                                                   │
│       timestamp: 2024-01-15 00:00:00,                                   │
│       ticker: "AIR.NZ",                                                 │
│       feature: "Close",                                                 │
│       value: 0.65                                                       │
│     }                                                                   │
│                                                                         │
│  4. CSV OUTPUT                                                          │
│     ──────────                                                          │
│     timestamp,ticker,feature,value                                      │
│     1705276800000,AIR.NZ,Close,0.65                                     │
│     1705276800000,AIR.NZ,Volume,1234567                                 │
│     ...                                                                 │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
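The `saveCsv` stub can be sketched as follows. The sort order and the epoch-millisecond timestamps match the sample output in step 4; the nested `Point` record stands in for `DataPoint`, and UTC is assumed for the `LocalDateTime` conversion (both are assumptions, not the project's actual code):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Comparator;
import java.util.List;

public class CsvSketch {
    // Stand-in for DataPoint
    record Point(LocalDateTime timestamp, String ticker, String feature, Double value) {}

    static String toCsv(List<Point> points) {
        StringBuilder sb = new StringBuilder("timestamp,ticker,feature,value\n");
        points.stream()
                // Sort by timestamp, then ticker, then feature
                .sorted(Comparator.comparing(Point::timestamp)
                        .thenComparing(Point::ticker)
                        .thenComparing(Point::feature))
                // LocalDateTime -> epoch milliseconds, assuming UTC
                .forEach(p -> sb.append(p.timestamp().toInstant(ZoneOffset.UTC).toEpochMilli())
                        .append(',').append(p.ticker())
                        .append(',').append(p.feature())
                        .append(',').append(p.value()).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2024, 1, 15, 0, 0);
        System.out.print(toCsv(List.of(
                new Point(ts, "AIR.NZ", "Volume", 1234567.0),
                new Point(ts, "AIR.NZ", "Close", 0.65))));
        // Prints (Volume sorts after Close):
        // timestamp,ticker,feature,value
        // 1705276800000,AIR.NZ,Close,0.65
        // 1705276800000,AIR.NZ,Volume,1234567.0
    }
}
```

2024-01-15 00:00 UTC is 1,705,276,800 seconds after the epoch, which is where the `1705276800000` values in step 4 come from.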
## Project Structure

```
java/
├── pom.xml                          # Maven config
├── docs/
│   ├── ARCHITECTURE.md              # This file
│   └── DATA_SOURCES.md              # Source documentation
└── src/main/java/lazic/
    ├── Main.java                    # Entry point
    ├── sources/
    │   ├── config/
    │   │   └── Tickers.java         # Ticker configuration
    │   ├── YfPrices.java            # Yahoo Finance prices
    │   ├── YfFinances.java          # Yahoo Finance financials
    │   ├── NzGdp.java               # NZ GDP data
    │   ├── NzBusinessConfidence.java
    │   ├── NzRatesFx.java           # NZ rates & FX
    │   ├── NzVehicleRegistrations.java
    │   ├── NzLaborStats.java
    │   ├── NzRoadFatalities.java
    │   ├── NzBalanceOfPayments.java
    │   ├── NzTaxRevenue.java
    │   ├── NzPensions.java
    │   ├── NzLaborTaxation.java
    │   └── GlobalAquacultureProduction.java
    └── utils/
        ├── ingest/
        │   ├── DataPoint.java       # Data structure
        │   ├── DataSourceBase.java  # Base class
        │   ├── IngestManager.java   # Orchestrator
        │   ├── CsvLongParser.java   # CSV output
        │   └── WebHtmlGetter.java   # HTTP client
        └── db/
            └── ...                  # Database utilities
```
## Building and Running

Prerequisites:

- Java 17+
- Maven 3.6+

```shell
cd java
mvn clean compile
mvn exec:java -Dexec.mainClass="lazic.Main"
```

Or run `Main.java` from your IDE. Data is written to `data/data_long.csv` (relative to the project root).
## Adding a New Data Source

1. Create a class in `lazic.sources` that extends `DataSourceBase` and implements `getDataPoints()`:

```java
package lazic.sources;

import lazic.utils.ingest.DataPoint;
import lazic.utils.ingest.DataSourceBase;
import lazic.utils.ingest.WebHtmlGetter;

import java.util.*;

public class MyNewSource extends DataSourceBase {
    @Override
    public Set<DataPoint> getDataPoints() {
        Set<DataPoint> dataPoints = new HashSet<>();

        // Fetch data from API/file
        // (url and parseData are placeholders for your source's endpoint and parsing logic)
        String rawData = WebHtmlGetter.get(url);

        // Parse and create DataPoints
        for (var item : parseData(rawData)) {
            dataPoints.add(new DataPoint(
                    item.getDate(),
                    item.getTicker(),   // or "MACRO_" + featureName for macro data
                    item.getFeature(),
                    item.getValue()
            ));
        }
        return dataPoints;
    }
}
```

2. Instantiate it in `Main.java`; construction registers it with `IngestManager`:

```java
public static void main(String[] args) {
    // Existing sources
    new YfPrices();
    new NzGdp();

    // Add your source
    new MyNewSource();

    // Run ingestion
    IngestManager.INSTANCE.fetchDataFromSources();
    CsvLongParser.saveCsv(outputPath);
}
```

## Error Handling

- Null values: filtered out by `IngestManager`
- Failed fetches: logged to stderr, source skipped
- Invalid data: source-specific validation
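The failed-fetch policy can be sketched as a guard around each source's fetch call (illustrative names, not the project's actual implementation):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Callable;

public class SafeFetchDemo {
    // Run one source's fetch; on failure, log to stderr and skip it (empty result)
    static Set<String> fetchSafely(Callable<Set<String>> fetch, String name) {
        try {
            return fetch.call();
        } catch (Exception e) {
            System.err.println("Skipping source " + name + ": " + e.getMessage());
            return Collections.emptySet();
        }
    }

    public static void main(String[] args) {
        Set<String> ok  = fetchSafely(() -> Set.of("point-1"), "GoodSource");
        Set<String> bad = fetchSafely(() -> { throw new RuntimeException("HTTP 500"); }, "BadSource");
        System.out.println(ok.size() + " " + bad.size()); // prints "1 0"
    }
}
```

Returning an empty set keeps one unreachable API from failing the whole ingestion run; the other sources still contribute their points.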
## Dependencies

```xml
<!-- pom.xml -->
<dependencies>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.11.0</version>
    </dependency>
</dependencies>
```

## Related Documentation

- Data Sources — Detailed source configuration
- Main README — Project overview
- Python Pipeline — Data consumer