| # Final Domain Collection Research | |
| ## Summary of Findings | |
| ### Available Methods in jao-py | |
| The `JaoPublicationToolPandasClient` class provides three domain query methods: | |
| 1. **`query_final_domain(mtu, presolved, cne, co, use_mirror)`** (Line 233) | |
| - Final Computation - Final FB parameters following LTN | |
| - Published: 10:30 D-1 | |
| - Most complete dataset (recommended for Phase 2) | |
| 2. **`query_prefinal_domain(mtu, presolved, cne, co, use_mirror)`** (Line 248) | |
| - Pre-Final (EarlyPub) - Pre-final FB parameters before LTN | |
| - Published: 08:00 D-1 | |
| - Earlier publication time, but before LTN application | |
| 3. **`query_initial_domain(mtu, presolved, cne, co)`** (Line 264) | |
| - Initial Computation (Virgin Domain) - Initial flow-based parameters | |
| - Published: Early in D-1 | |
| - Before any adjustments | |
| ### Method Parameters | |
| ```python | |
| def query_final_domain( | |
|     mtu: pd.Timestamp,        # Market Time Unit (1 hour, timezone-aware) | |
|     presolved: bool = None,   # Filter: True=binding, False=non-binding, None=ALL | |
|     cne: str = None,          # CNEC name keyword filter (NOT EIC-based!) | |
|     co: str = None,           # Contingency keyword filter | |
|     use_mirror: bool = False  # Use mirror.flowbased.eu for faster bulk download | |
| ) -> pd.DataFrame | |
| ``` | |
| ### Key Findings | |
| 1. **DENSE Data Acquisition**: | |
| - Set `presolved=None` to get ALL CNECs (binding + non-binding) | |
| - This provides the DENSE format needed for Phase 2 feature engineering | |
| 2. **Filtering Limitations**: | |
| - ❌ NO EIC-based filtering on server side | |
| - ✅ Only keyword-based filters (cne, co) available | |
| - **Solution**: Download all CNECs, filter locally by EIC codes | |
| 3. **Query Granularity**: | |
| - Method queries **1 hour at a time** (mtu = Market Time Unit) | |
| - For 24 months: Need 17,520 API calls (1 per hour) | |
| - Alternative: Use `use_mirror=True` for whole-day downloads | |
| 4. **Mirror Option** (Recommended for bulk collection): | |
| - URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD` | |
| - Returns full day (24 hours) as CSV in ZIP file | |
| - Much faster than hourly API calls | |
| - Set `use_mirror=True` OR set env var `JAO_USE_MIRROR=1` | |
| 5. **Data Structure** (from `parse_final_domain()`): | |
| - Returns pandas DataFrame with columns: | |
| - **Identifiers**: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction` | |
| - **Contingency**: `contingency_*` fields (nested structure flattened) | |
| - **Presolved field**: Indicates if CNEC is binding (True) or redundant (False) | |
| - **RAM breakdown**: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc. | |
| - **PTDFs**: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones) | |
| - Timestamps converted to Europe/Amsterdam timezone | |
| - snake_case column names (except PTDFs) | |
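To make the request-count difference between findings 3 and 4 concrete, the sketch below builds one mirror URL per calendar day from the pattern quoted in finding 4 and compares it with the nominal hourly count. The helper name `mirror_day_urls` is hypothetical, and whether jao-py assembles its mirror requests exactly this way is an assumption, not something confirmed from the library source.

```python
from datetime import date, timedelta

# URL pattern as quoted in finding 4; treated here as an assumption
MIRROR_BASE = "https://mirror.flowbased.eu/dacc/final_domain"


def mirror_day_urls(start: date, end: date) -> list[str]:
    """Return one mirror URL per calendar day in [start, end], inclusive."""
    n_days = (end - start).days + 1
    return [
        f"{MIRROR_BASE}/{(start + timedelta(days=i)).isoformat()}"
        for i in range(n_days)
    ]


urls = mirror_day_urls(date(2025, 9, 23), date(2025, 9, 30))
# Nominal comparison: 24 hourly calls per day (DST-change days differ by one hour)
print(f"{len(urls)} mirror requests vs {len(urls) * 24} hourly requests")
print(urls[0])
```

For the 8-day test window this gives 8 mirror requests against 192 hourly ones.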
| ### Recommended Implementation for Phase 2 | |
| **Option A: Mirror-based (FASTEST)**: | |
| ```python | |
| from pathlib import Path | |
| import pandas as pd | |
| import polars as pl | |
| from jao import JaoPublicationToolPandasClient | |
| def collect_final_domain_sample( | |
|     start_date: str, | |
|     end_date: str, | |
|     target_cnec_eics: list[str],  # 200 EIC codes from Phase 1 | |
|     output_path: Path, | |
| ) -> pl.DataFrame: | |
|     """Collect DENSE CNEC data for specific CNECs using the mirror.""" | |
|     client = JaoPublicationToolPandasClient() | |
|     all_data = [] | |
|     # One tz-aware timestamp per calendar day (inclusive range) | |
|     for day in pd.date_range(start_date, end_date, tz='Europe/Amsterdam'): | |
|         # Query full day (all CNECs) via mirror | |
|         df_day = client.query_final_domain( | |
|             mtu=day, | |
|             presolved=None,   # ALL CNECs (DENSE!) | |
|             use_mirror=True,  # Fast bulk download | |
|         ) | |
|         # Filter to target CNECs only | |
|         df_filtered = df_day[df_day['cnec_eic'].isin(target_cnec_eics)] | |
|         all_data.append(df_filtered) | |
|     # Combine and save | |
|     df_full = pd.concat(all_data) | |
|     pl_df = pl.from_pandas(df_full) | |
|     pl_df.write_parquet(output_path) | |
|     return pl_df | |
| ``` | |
| **Option B: Hourly API calls (SLOWER, but more granular)**: | |
| ```python | |
| from pathlib import Path | |
| import pandas as pd | |
| import polars as pl | |
| from jao import JaoPublicationToolPandasClient | |
| # Import path assumed; adjust if your jao-py version exposes it elsewhere | |
| from jao.exceptions import NoMatchingDataError | |
| def collect_final_domain_hourly( | |
|     start_date: str, | |
|     end_date: str, | |
|     target_cnec_eics: list[str], | |
|     output_path: Path, | |
| ) -> pl.DataFrame: | |
|     """Collect DENSE CNEC data hour-by-hour.""" | |
|     client = JaoPublicationToolPandasClient() | |
|     all_data = [] | |
|     # A tz-aware range skips the nonexistent hour on DST-change days; | |
|     # 'h' replaces the deprecated 'H' frequency alias | |
|     for mtu in pd.date_range(start_date, end_date, freq='h', tz='Europe/Amsterdam'): | |
|         try: | |
|             df_hour = client.query_final_domain( | |
|                 mtu=mtu, | |
|                 presolved=None,  # ALL CNECs | |
|             ) | |
|         except NoMatchingDataError: | |
|             continue  # Hour may have no data | |
|         df_filtered = df_hour[df_hour['cnec_eic'].isin(target_cnec_eics)] | |
|         all_data.append(df_filtered) | |
|     df_full = pd.concat(all_data) | |
|     pl_df = pl.from_pandas(df_full) | |
|     pl_df.write_parquet(output_path) | |
|     return pl_df | |
| ``` | |
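Both options keep every filtered frame in memory until the final `pd.concat`, and a failure late in a 24-month run loses everything collected so far. One possible mitigation, sketched here as an assumption rather than part of the researched API, is to split the range into calendar-month windows and run the collector once per window with its own output file. The helper `month_windows` is hypothetical.

```python
from datetime import date, timedelta


def month_windows(start: date, end: date):
    """Yield (first_day, last_day) pairs that cover [start, end] month by month."""
    cur = start
    while cur <= end:
        # First day of the month after `cur`
        if cur.month == 12:
            nxt = date(cur.year + 1, 1, 1)
        else:
            nxt = date(cur.year, cur.month + 1, 1)
        # Clip the last window to the requested end date
        yield cur, min(nxt - timedelta(days=1), end)
        cur = nxt


for first, last in month_windows(date(2023, 11, 15), date(2024, 1, 10)):
    print(first.isoformat(), "->", last.isoformat())
```

Each window's dates can then be passed as `start_date` and `end_date` to either collector, so a failed month can be rerun on its own.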
| ### Data Volume Estimates | |
| **Full Download (all ~20K CNECs)**: | |
| - 20,000 CNECs × 17,520 hours ≈ 350M records | |
| - 350M records × ~27 columns × 8 bytes/value ≈ 75 GB uncompressed | |
| - Parquet compression: ~10-20 GB | |
| **Filtered (200 target CNECs)**: | |
| - 200 CNECs × 17,520 hours ≈ 3.5M records | |
| - 3.5M records × ~27 columns × 8 bytes/value ≈ 750 MB uncompressed | |
| - Parquet compression: ~100-150 MB | |
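The estimates above follow from rows × columns × bytes per value. The short sketch below reproduces that arithmetic so it can be rerun with a different CNEC count or period. The function name `estimate_size` is hypothetical, and the flat 8 bytes per value is the same simplification used in the estimates (string columns such as `cnec_name` will differ in practice).

```python
def estimate_size(n_cnecs: int, n_hours: int, n_cols: int = 27,
                  bytes_per_value: int = 8) -> tuple[int, int]:
    """Return (row_count, uncompressed_bytes) for a DENSE CNEC x hour table."""
    rows = n_cnecs * n_hours
    return rows, rows * n_cols * bytes_per_value


HOURS_24_MONTHS = 17_520  # 2 years x 8,760 hours

for label, n_cnecs in [("full", 20_000), ("filtered", 200)]:
    rows, size = estimate_size(n_cnecs, HOURS_24_MONTHS)
    print(f"{label}: {rows:,} rows, {size / 1e9:.2f} GB uncompressed")
# full: 350,400,000 rows, 75.69 GB uncompressed
# filtered: 3,504,000 rows, 0.76 GB uncompressed
```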
| ### Implementation Strategy | |
| 1. **Phase 1 complete**: Identify top 200 CNECs from SPARSE data | |
| 2. **Extract EIC codes**: Save to `data/processed/critical_cnecs_eic_codes.csv` | |
| 3. **Test on 1 week**: Validate DENSE collection with mirror | |
| ```python | |
| # Test: 2025-09-23 to 2025-09-30 (8 days) | |
| # Expected: 200 CNECs × 192 hours = 38,400 records | |
| ``` | |
| 4. **Collect 24 months**: Using mirror for speed | |
| 5. **Validate DENSE structure**: | |
| ```python | |
| unique_cnecs = df['cnec_eic'].n_unique() | |
| unique_hours = df['mtu'].n_unique() | |
| expected = unique_cnecs * unique_hours | |
| actual = len(df) | |
| assert actual == expected, f"Not DENSE! {actual} != {expected}" | |
| ``` | |
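When the assertion in step 5 fails, it helps to know which CNEC and hour combinations are absent. The sketch below works on plain `(cnec_eic, mtu)` tuples so it does not depend on a DataFrame library. The function name `missing_pairs` and the sample EIC strings are hypothetical. Passing the Phase 1 target list as `expected_cnecs` also catches CNECs that never appear at all, which the row-count check alone would miss.

```python
from itertools import product


def missing_pairs(observed, expected_cnecs=None):
    """Return sorted (cnec_eic, mtu) combinations that are absent from `observed`."""
    observed = set(observed)
    # Fall back to the CNECs actually seen when no target list is given
    cnecs = set(expected_cnecs) if expected_cnecs is not None else {c for c, _ in observed}
    hours = {h for _, h in observed}
    return sorted(set(product(cnecs, hours)) - observed)


rows = [
    ("EIC_A", "2025-09-23T00:00"),
    ("EIC_A", "2025-09-23T01:00"),
    ("EIC_B", "2025-09-23T00:00"),
]
print(missing_pairs(rows))
# [('EIC_B', '2025-09-23T01:00')]
```

With a polars frame, the input tuples could come from `df.select(["cnec_eic", "mtu"]).rows()`.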
| ### Advantages of Mirror Method | |
| - ✅ Faster: 1 request/day vs 24 requests/day | |
| - ✅ Rate limit friendly: 730 requests vs 17,520 requests | |
| - ✅ More reliable: Less chance of timeout/connection errors | |
| - ✅ Complete days: Each request returns the full day (nominally 24 hours), so single hours are not lost to failed calls | |
| ### Next Steps | |
| 1. Add `collect_final_domain_dense()` method to `collect_jao.py` | |
| 2. Test on 1-week sample with target EIC codes | |
| 3. Validate DENSE structure and data quality | |
| 4. Run 24-month collection after Phase 1 complete | |
| 5. Use DENSE data for Tier 1 & Tier 2 feature engineering | |
| --- | |
| **Research completed**: 2025-11-05 | |
| **jao-py version**: 0.6.2 | |
| **Source**: C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py | |