This is a tiny prototype dataset (about 6 TB of data, but only 62,489 grid cells of ~100 km² each) that makes it possible to instantly connect existing Major TOM data with AlphaEarth embeddings.
I curated it to support several relevant research projects, but I figured it could help more people in the community experiment and explore new applications of AlphaEarth.
𝐃𝐢𝐫𝐞𝐜𝐭𝐢𝐨𝐧𝐬 𝐟𝐨𝐫 𝐔𝐬𝐞
Each embedding sample comes from the original annual dataset produced by Google DeepMind. This means that, unlike samples from Sentinel-2 or Sentinel-1, it contains aggregated annual information from a particular year and is not linked to one particular observation. The existing Major TOM samples from physical sensors provide information that is potentially (and likely) contained in the AlphaEarth embedding sample, but they lack the temporal component represented within the AEF embedding fields.
For more information, please check the dataset card on HuggingFace.
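As a rough illustration of the pairing, here is a minimal sketch of joining AlphaEarth embedding rows with Major TOM samples on the shared grid cell. Column names and values are illustrative placeholders, not the published schema; see the dataset card for the real layout.

```python
import numpy as np
import pandas as pd

# Toy sketch: pair annual AlphaEarth embedding rows with Major TOM
# Sentinel-2 rows via the shared Major TOM grid-cell ID.
# All column names and values below are made up for illustration.
aef = pd.DataFrame({
    "grid_cell": ["0U_0R", "1U_5R"],
    "year": [2023, 2023],
    "embedding": [np.zeros(64), np.ones(64)],
})
s2 = pd.DataFrame({
    "grid_cell": ["0U_0R", "1U_5R"],
    "timestamp": ["2023-06-01", "2023-07-14"],
})

# Inner join: each annual embedding is matched with every Sentinel-2
# observation that falls in the same grid cell.
paired = aef.merge(s2, on="grid_cell", how="inner")
```

Note that one annual embedding can match many sensor observations from that year, which is exactly the temporal mismatch described above.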
YAML engineering is becoming more important than ever, from infra provisioning to model training recipes.
As a first step, I built a simple editor for @dstackai configurations, and I will share the live endpoint this week. Let me know what you think about this approach.
If people find this useful, I plan to do the same for LLM training recipes in popular frameworks such as Hugging Face open-r1, Axolotl, and others. Let me hear from you.
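For context, this is the kind of dstack task configuration such an editor targets. A minimal sketch with illustrative names and values; check the dstack documentation for the full schema:

```yaml
# Minimal dstack task config; name, commands, and resources are illustrative.
type: task
name: train-example
python: "3.11"
commands:
  - pip install -r requirements.txt
  - python train.py
resources:
  gpu: 24GB
```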
Today we release a prototype of COP-GEN, a universal generative model for Copernicus data. 𝐂𝐎𝐏-𝐆𝐄𝐍-𝐁𝐞𝐭𝐚 is a model trained globally on thumbnails of the Major TOM Core datasets, including Sentinel-2 L1C, Sentinel-2 L2A, Sentinel-1 RTC, and COP-DEM GLO-30.
How is it universal? COP-GEN learns a joint generative process over all modalities, which means it can reconstruct data from any subset of available observations. 𝐖𝐢𝐭𝐡𝐨𝐮𝐭 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐬𝐩𝐞𝐜𝐢𝐟𝐢𝐜𝐚𝐥𝐥𝐲 for any of these tasks, it can be used to approximate:
✅ Sentinel-1 to Sentinel-2 translation
✅ Elevation estimation from Sentinel-2 or Sentinel-1
✅ Atmospheric Correction (L1C to L2A pipeline)
✅ Atmospheric Generation (L2A to L1C)
✅ ...and any other task involving translation between the supported modalities
On its own, the model can be used as a useful prior for estimating the likelihood of Copernicus data. COP-GEN-Beta learns joint, conditional, and marginal distributions within a single unified backbone, allowing it to flexibly sample any modality given any condition.
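The joint/conditional/marginal idea can be illustrated with a toy model. The sketch below replaces the actual diffusion backbone with a single joint Gaussian over the four modalities (all numbers are made up), just to show how one distribution supports sampling any modality given any observed subset:

```python
import numpy as np

# Toy illustration of subset conditioning. COP-GEN-Beta is a diffusion
# model, not a Gaussian; this only demonstrates how one joint
# distribution yields marginals and conditionals for free.
rng = np.random.default_rng(0)
modalities = ["S2L1C", "S2L2A", "S1RTC", "DEM"]

# Assumed joint covariance with cross-modality correlation (made up).
cov = np.full((4, 4), 0.6) + 0.4 * np.eye(4)

def conditional_sample(observed: dict[str, float]) -> dict[str, float]:
    """Sample the unobserved modalities given any subset of observed ones."""
    obs = [modalities.index(k) for k in observed]
    hid = [i for i in range(4) if i not in obs]
    if not obs:  # no conditioning -> sample from the joint (marginal of all)
        return dict(zip(modalities, rng.multivariate_normal(np.zeros(4), cov)))
    # Standard Gaussian conditioning (zero mean assumed).
    s_oo = cov[np.ix_(obs, obs)]
    s_ho = cov[np.ix_(hid, obs)]
    x_o = np.array([observed[modalities[i]] for i in obs])
    mu_h = s_ho @ np.linalg.solve(s_oo, x_o)
    s_hh = cov[np.ix_(hid, hid)] - s_ho @ np.linalg.solve(s_oo, s_ho.T)
    x_h = rng.multivariate_normal(mu_h, s_hh)
    out = dict(observed)
    out.update({modalities[i]: v for i, v in zip(hid, x_h)})
    return out

# e.g. "elevation estimation": condition on S2 L2A, sample the rest incl. DEM
sample = conditional_sample({"S2L2A": 1.0})
```

The same function covers every ✅ task above by changing which modalities sit in the observed set; in COP-GEN the Gaussian conditioning step is replaced by conditional diffusion sampling.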
Why is it Beta? Because thumbnails are a low-cost representation of the data that scales well, which allowed us to develop this prototype quickly. We are currently developing the more costly COP-GEN model that supports the original data. For now, we wanted to showcase the prototype and make it available to the community to test!
𝐌𝐄𝐒𝐀 🏔️ 𝐓𝐞𝐱𝐭-𝐛𝐚𝐬𝐞𝐝 𝐭𝐞𝐫𝐫𝐚𝐢𝐧 𝐠𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐨𝐧 𝐦𝐨𝐝𝐞𝐥
MESA is a novel generative model based on latent denoising diffusion, capable of generating 2.5D representations (co-registered colour and depth maps) of terrain conditioned on text prompts.
Work developed by Paul Borne–Pons (@NewtNewt) during his joint internship at Adobe & ESA, and in collaboration with asterisk labs.