# Final Domain Collection Research
## Summary of Findings
### Available Methods in jao-py
The `JaoPublicationToolPandasClient` class provides three domain query methods:
1. **`query_final_domain(mtu, presolved, cne, co, use_mirror)`** (Line 233)
   - Final Computation: final FB parameters following LTN
   - Published: 10:30 D-1
   - Most complete dataset (recommended for Phase 2)
2. **`query_prefinal_domain(mtu, presolved, cne, co, use_mirror)`** (Line 248)
   - Pre-Final (EarlyPub): pre-final FB parameters before LTN
   - Published: 08:00 D-1
   - Earlier publication time, but before LTN application
3. **`query_initial_domain(mtu, presolved, cne, co)`** (Line 264)
   - Initial Computation (Virgin Domain): initial flow-based parameters
   - Published: early on D-1
   - Before any adjustments are applied
### Method Parameters
```python
def query_final_domain(
    mtu: pd.Timestamp,        # Market Time Unit (1 hour, timezone-aware)
    presolved: bool = None,   # Filter: True = binding only, False = non-binding only, None = ALL
    cne: str = None,          # CNEC name keyword filter (NOT EIC-based!)
    co: str = None,           # Contingency keyword filter
    use_mirror: bool = False  # Use mirror.flowbased.eu for faster bulk downloads
) -> pd.DataFrame
```
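For orientation, a single-hour query could look like the sketch below; the argument-free constructor and the printed column names are assumptions based on the parsed output described later in this document.
```python
import pandas as pd
from jao import JaoPublicationToolPandasClient  # import path assumed

# Assumption: the publication-tool client needs no constructor arguments.
client = JaoPublicationToolPandasClient()

# One MTU = one delivery hour, timezone-aware (Europe/Amsterdam).
mtu = pd.Timestamp("2025-09-23 12:00", tz="Europe/Amsterdam")

# presolved=None returns ALL CNECs (binding and non-binding) for that hour.
df_hour = client.query_final_domain(mtu=mtu, presolved=None)

# Column names as described under "Data Structure" below (assumed, not verified here).
print(df_hour[["mtu", "cnec_name", "cnec_eic", "presolved", "ram"]].head())
```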
### Key Findings
1. **DENSE Data Acquisition**:
   - Set `presolved=None` to get ALL CNECs (binding + non-binding)
   - This provides the DENSE format needed for Phase 2 feature engineering
2. **Filtering Limitations**:
   - ❌ NO EIC-based filtering on the server side
   - ✅ Only keyword-based filters (`cne`, `co`) are available
   - **Solution**: Download all CNECs and filter locally by EIC codes (see the sketch after this list)
3. **Query Granularity**:
   - The method queries **1 hour at a time** (mtu = Market Time Unit)
   - For 24 months: ~17,520 API calls (1 per hour)
   - Alternative: use `use_mirror=True` for whole-day downloads
4. **Mirror Option** (recommended for bulk collection):
   - URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD`
   - Returns the full day (24 hours) as a CSV inside a ZIP file
   - Much faster than hourly API calls
   - Set `use_mirror=True` OR set the env var `JAO_USE_MIRROR=1`
5. **Data Structure** (from `parse_final_domain()`):
   - Returns a pandas DataFrame with columns:
     - **Identifiers**: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction`
     - **Contingency**: `contingency_*` fields (nested structure flattened)
     - **Presolved**: indicates whether the CNEC is binding (True) or redundant (False)
     - **RAM breakdown**: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc.
     - **PTDFs**: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones)
   - Timestamps are converted to the Europe/Amsterdam timezone
   - Column names are snake_case (except the PTDFs)
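To make the local-filtering workaround concrete, here is a short sketch continuing from the single-hour query above; the EIC values are placeholders and the column names follow the parsed structure described in point 5.
```python
# Placeholder EIC codes; in practice these come from Phase 1
# (data/processed/critical_cnecs_eic_codes.csv).
target_cnec_eics = ["10T-EXAMPLE-00001", "10T-EXAMPLE-00002"]

# No server-side EIC filter exists, so filter the parsed DataFrame locally.
df_targets = df_hour[df_hour["cnec_eic"].isin(target_cnec_eics)]

# Column groups as described above (names assumed from parse_final_domain()).
ptdf_cols = [c for c in df_targets.columns if c.startswith("ptdf_")]
id_cols = ["mtu", "tso", "cnec_name", "cnec_eic", "direction", "presolved"]
print(df_targets[id_cols + ["ram", "fmax", "frm"] + ptdf_cols].head())
```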
### Recommended Implementation for Phase 2
**Option A: Mirror-based (FASTEST)**:
```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_sample(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],  # 200 EIC codes from Phase 1
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data for specific CNECs using the mirror."""
    client = JAOClient()  # project client wrapper (see collect_jao.py)
    all_data = []
    for date in pd.date_range(start_date, end_date):
        # Query the full day (all CNECs) via the mirror
        df_day = client.query_final_domain(
            mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
            presolved=None,   # ALL CNECs (DENSE!)
            use_mirror=True,  # fast bulk download
        )
        # Keep only the target CNECs
        df_filtered = df_day[df_day['cnec_eic'].isin(target_cnec_eics)]
        all_data.append(df_filtered)
    # Combine and save
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
**Option B: Hourly API calls (SLOWER, but more granular)**:
```python
def collect_final_domain_hourly(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data hour-by-hour."""
    client = JAOClient()
    all_data = []
    for date in pd.date_range(start_date, end_date, freq='H'):
        try:
            df_hour = client.query_final_domain(
                mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
                presolved=None,  # ALL CNECs
            )
            df_filtered = df_hour[df_hour['cnec_eic'].isin(target_cnec_eics)]
            all_data.append(df_filtered)
        except NoMatchingDataError:
            continue  # Hour may have no data
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
### Data Volume Estimates
**Full Download (all ~20K CNECs)**:
- 20,000 CNECs × 17,520 hours ≈ 350M records
- 350M records × ~27 columns × 8 bytes/value ≈ 75 GB uncompressed
- Parquet compression: ~10-20 GB
**Filtered (200 target CNECs)**:
- 200 CNECs × 17,520 hours ≈ 3.5M records
- 3.5M records × ~27 columns × 8 bytes/value ≈ 750 MB uncompressed
- Parquet compression: ~100-150 MB
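The arithmetic behind these figures, under the stated assumptions (CNEC counts, ~27 columns, 8 bytes per numeric cell):
```python
# Back-of-the-envelope sizing; the counts and 8-byte cells are the assumptions stated above.
HOURS_24_MONTHS = 730 * 24   # ≈ 17,520 MTUs
COLUMNS = 27
BYTES_PER_VALUE = 8

for label, n_cnecs in [("all CNECs", 20_000), ("200 target CNECs", 200)]:
    records = n_cnecs * HOURS_24_MONTHS
    raw_gb = records * COLUMNS * BYTES_PER_VALUE / 1e9
    print(f"{label}: {records:,} records, ~{raw_gb:.1f} GB uncompressed")
# all CNECs: 350,400,000 records, ~75.7 GB uncompressed
# 200 target CNECs: 3,504,000 records, ~0.8 GB uncompressed
```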
### Implementation Strategy
1. **Phase 1 complete**: Identify the top 200 CNECs from the SPARSE data
2. **Extract EIC codes**: Save to `data/processed/critical_cnecs_eic_codes.csv`
3. **Test on 1 week**: Validate DENSE collection with the mirror (see the driver sketch after this list)
   ```python
   # Test: 2025-09-23 to 2025-09-30 (8 days)
   # Expected: 200 CNECs × 192 hours = 38,400 records
   ```
4. **Collect 24 months**: Use the mirror for speed
5. **Validate DENSE structure**:
   ```python
   unique_cnecs = df['cnec_eic'].n_unique()
   unique_hours = df['mtu'].n_unique()
   expected = unique_cnecs * unique_hours
   actual = len(df)
   assert actual == expected, f"Not DENSE! {actual} != {expected}"
   ```
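A possible driver for steps 3 and 5, reusing `collect_final_domain_sample()` from Option A; the CSV column name and the output filename are assumptions.
```python
from pathlib import Path

import polars as pl

# Step 2 output; the column name "cnec_eic" is an assumption.
eic_path = Path("data/processed/critical_cnecs_eic_codes.csv")
target_cnec_eics = pl.read_csv(eic_path)["cnec_eic"].to_list()

# Step 3: one-week DENSE test via the mirror (2025-09-23 to 2025-09-30).
df = collect_final_domain_sample(
    start_date="2025-09-23",
    end_date="2025-09-30",
    target_cnec_eics=target_cnec_eics,
    output_path=Path("data/processed/final_domain_dense_test_week.parquet"),  # filename assumed
)

# Step 5: DENSE check - every target CNEC present for every hour.
expected = df["cnec_eic"].n_unique() * df["mtu"].n_unique()
assert len(df) == expected, f"Not DENSE! {len(df)} != {expected}"
print(f"OK: {len(df):,} rows (expected 38,400 for 200 CNECs × 192 hours)")
```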
### Advantages of Mirror Method
- ✅ Faster: 1 request/day vs 24 requests/day
- ✅ Rate limit friendly: 730 requests vs 17,520 requests
- ✅ More reliable: Less chance of timeout/connection errors
- ✅ Complete days: Guarantees all 24 hours present
### Next Steps
1. Add `collect_final_domain_dense()` method to `collect_jao.py`
2. Test on 1-week sample with target EIC codes
3. Validate DENSE structure and data quality
4. Run the 24-month collection once Phase 1 is complete
5. Use DENSE data for Tier 1 & Tier 2 feature engineering
---
**Research completed**: 2025-11-05
**jao-py version**: 0.6.2
**Source**: C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py