**Author**: Evgueni Poloukarov
**Commit**: feat: complete Phase 1 ENTSO-E asset-specific outage validation (27cb60a)
# Final Domain Collection Research
## Summary of Findings
### Available Methods in jao-py
The `JaoPublicationToolPandasClient` class provides three domain query methods:
1. **`query_final_domain(mtu, presolved, cne, co, use_mirror)`** (Line 233)
- Final Computation: final flow-based (FB) parameters after long-term nominations (LTN) are applied
- Published: 10:30 D-1
- Most complete dataset (recommended for Phase 2)
2. **`query_prefinal_domain(mtu, presolved, cne, co, use_mirror)`** (Line 248)
- Pre-Final (EarlyPub) - Pre-final FB parameters before LTN
- Published: 08:00 D-1
- Earlier publication time, but before LTN application
3. **`query_initial_domain(mtu, presolved, cne, co)`** (Line 264)
- Initial Computation (Virgin Domain) - Initial flow-based parameters
- Published: Early in D-1
- Before any adjustments
### Method Parameters
```python
def query_final_domain(
    mtu: pd.Timestamp,               # Market Time Unit (1 hour, timezone-aware)
    presolved: bool | None = None,   # Filter: True=binding, False=non-binding, None=ALL
    cne: str | None = None,          # CNEC name keyword filter (NOT EIC-based!)
    co: str | None = None,           # Contingency keyword filter
    use_mirror: bool = False,        # Use mirror.flowbased.eu for faster bulk download
) -> pd.DataFrame
```
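Since `mtu` must be a timezone-aware hourly timestamp, the calling code needs to build MTUs explicitly. A minimal sketch (pandas only; the commented client call assumes a configured jao-py client):

```python
import pandas as pd

# Build the 24 hourly MTUs for one delivery day in Europe/Amsterdam,
# as expected by query_final_domain (one call per MTU).
day = pd.Timestamp("2025-09-23", tz="Europe/Amsterdam")
mtus = pd.date_range(day, periods=24, freq=pd.Timedelta(hours=1))

assert mtus[0].tzinfo is not None  # timezone-aware, as required
assert mtus[-1].hour == 23         # last MTU of the delivery day

# Each MTU is then passed to the client, e.g.:
# df = client.query_final_domain(mtu=mtus[0], presolved=None)  # None -> ALL CNECs
```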
### Key Findings
1. **DENSE Data Acquisition**:
- Set `presolved=None` to get ALL CNECs (binding + non-binding)
- This provides the DENSE format needed for Phase 2 feature engineering
2. **Filtering Limitations**:
- ❌ NO EIC-based filtering on server side
- ✅ Only keyword-based filters (cne, co) available
- **Solution**: Download all CNECs, filter locally by EIC codes
3. **Query Granularity**:
- Method queries **1 hour at a time** (mtu = Market Time Unit)
- For 24 months: Need 17,520 API calls (1 per hour)
- Alternative: Use `use_mirror=True` for whole-day downloads
4. **Mirror Option** (Recommended for bulk collection):
- URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD`
- Returns full day (24 hours) as CSV in ZIP file
- Much faster than hourly API calls
- Set `use_mirror=True` OR set env var `JAO_USE_MIRROR=1`
5. **Data Structure** (from `parse_final_domain()`):
- Returns pandas DataFrame with columns:
- **Identifiers**: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction`
- **Contingency**: `contingency_*` fields (nested structure flattened)
- **Presolved field**: Indicates if CNEC is binding (True) or redundant (False)
- **RAM breakdown**: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc.
- **PTDFs**: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones)
- Timestamps converted to Europe/Amsterdam timezone
- snake_case column names (except PTDFs)
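Because there is no server-side EIC filter (finding 2 above), the EIC selection happens locally after download. A minimal sketch on a toy frame (column names follow the parsed schema above; the EIC values are fabricated for illustration):

```python
import pandas as pd

# Toy stand-in for a parsed final-domain frame (EIC values are made up).
df_day = pd.DataFrame({
    "cnec_eic": ["10T-AT-DE-00001", "10T-BE-FR-00002", "10T-NL-DE-00003"],
    "ram": [812.0, 455.0, 990.0],
})
target_cnec_eics = ["10T-AT-DE-00001", "10T-NL-DE-00003"]

# Local EIC filter, replacing the missing server-side option.
df_filtered = df_day[df_day["cnec_eic"].isin(target_cnec_eics)]
assert len(df_filtered) == 2
```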
### Recommended Implementation for Phase 2
**Option A: Mirror-based (FASTEST)**:
```python
from pathlib import Path

import pandas as pd
import polars as pl

# JAOClient is this project's wrapper around jao-py's
# JaoPublicationToolPandasClient.

def collect_final_domain_sample(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],  # 200 EIC codes from Phase 1
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data for specific CNECs using the mirror."""
    client = JAOClient()
    all_data = []
    for date in pd.date_range(start_date, end_date):
        # Query the full day (all CNECs) via the mirror
        df_day = client.query_final_domain(
            mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
            presolved=None,   # ALL CNECs (DENSE!)
            use_mirror=True,  # Fast bulk download
        )
        # Keep only the target CNECs
        df_filtered = df_day[df_day['cnec_eic'].isin(target_cnec_eics)]
        all_data.append(df_filtered)
    # Combine and save
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
**Option B: Hourly API calls (SLOWER, but more granular)**:
```python
def collect_final_domain_hourly(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data hour by hour."""
    client = JAOClient()
    all_data = []
    # freq='h' (lowercase) replaces the 'H' alias deprecated in pandas 2.2
    for ts in pd.date_range(start_date, end_date, freq='h'):
        try:
            df_hour = client.query_final_domain(
                mtu=pd.Timestamp(ts, tz='Europe/Amsterdam'),
                presolved=None,  # ALL CNECs
            )
            df_filtered = df_hour[df_hour['cnec_eic'].isin(target_cnec_eics)]
            all_data.append(df_filtered)
        except NoMatchingDataError:
            continue  # Hour may have no data
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
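With up to 17,520 hourly calls, transient network failures are likely. A small retry-with-backoff wrapper (not part of jao-py; the attempt count and delays are illustrative assumptions) can harden the hourly loop:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff (illustrative values)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Usage inside the hourly loop, e.g.:
# df_hour = with_retries(lambda: client.query_final_domain(mtu=ts, presolved=None))
```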
### Data Volume Estimates
**Full Download (all ~20K CNECs)**:
- 20,000 CNECs × 17,520 hours = 350M records
- ~27 columns × 8 bytes/value = ~75 GB uncompressed
- Parquet compression: ~10-20 GB
**Filtered (200 target CNECs)**:
- 200 CNECs × 17,520 hours = 3.5M records
- ~27 columns × 8 bytes/value = ~750 MB uncompressed
- Parquet compression: ~100-150 MB
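The volume estimates above follow from simple arithmetic and can be reproduced directly:

```python
HOURS_24_MONTHS = 730 * 24       # ~2 years of hourly MTUs
COLS, BYTES_PER_VALUE = 27, 8    # per the column estimate above

full = 20_000 * HOURS_24_MONTHS  # records, all CNECs
filtered = 200 * HOURS_24_MONTHS # records, 200 target CNECs

assert HOURS_24_MONTHS == 17_520
assert full == 350_400_000       # ~350M records
assert filtered == 3_504_000     # ~3.5M records

full_gb = full * COLS * BYTES_PER_VALUE / 1e9        # ~75.7 GB uncompressed
filtered_mb = filtered * COLS * BYTES_PER_VALUE / 1e6  # ~757 MB uncompressed
```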
### Implementation Strategy
1. **Phase 1 complete**: Identify top 200 CNECs from SPARSE data
2. **Extract EIC codes**: Save to `data/processed/critical_cnecs_eic_codes.csv`
3. **Test on 1 week**: Validate DENSE collection with mirror
```python
# Test: 2025-09-23 to 2025-09-30 (8 days)
# Expected: 200 CNECs × 192 hours = 38,400 records
```
4. **Collect 24 months**: Using mirror for speed
5. **Validate DENSE structure**:
```python
# df is the polars DataFrame returned by the collection step
unique_cnecs = df['cnec_eic'].n_unique()
unique_hours = df['mtu'].n_unique()
expected = unique_cnecs * unique_hours
actual = len(df)
assert actual == expected, f"Not DENSE! {actual} != {expected}"
```
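The same completeness check can be packaged as a reusable helper. This is a pandas variant (an assumption, for pipelines that stay in pandas before converting to polars), demonstrated on a toy frame:

```python
import pandas as pd

def assert_dense(df: pd.DataFrame) -> None:
    """Fail unless the frame is a full CNEC x hour grid (DENSE)."""
    expected = df["cnec_eic"].nunique() * df["mtu"].nunique()
    actual = len(df)
    assert actual == expected, f"Not DENSE! {actual} != {expected}"

# Toy example: 2 CNECs x 2 hours = 4 rows -> DENSE
toy = pd.DataFrame({
    "cnec_eic": ["A", "A", "B", "B"],
    "mtu": pd.to_datetime(["2025-01-01 00:00", "2025-01-01 01:00"] * 2),
})
assert_dense(toy)
```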
### Advantages of Mirror Method
- ✅ Faster: 1 request/day vs 24 requests/day
- ✅ Rate limit friendly: 730 requests vs 17,520 requests
- ✅ More reliable: Less chance of timeout/connection errors
- ✅ Complete days: Guarantees all 24 hours present
### Next Steps
1. Add `collect_final_domain_dense()` method to `collect_jao.py`
2. Test on 1-week sample with target EIC codes
3. Validate DENSE structure and data quality
4. Run 24-month collection after Phase 1 complete
5. Use DENSE data for Tier 1 & Tier 2 feature engineering
---
**Research completed**: 2025-11-05
**jao-py version**: 0.6.2
**Source**: C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py