Exoplanet Detection With Contrastive Learning (92%)
I researched exoplanet detection using contrastive learning (SimCLR, Siamese nets), reaching 92% accuracy on Kepler light curve data. Here's the full methodology.
Exoplanet detection from light curve data is a binary classification problem with severe class imbalance — confirmed exoplanets represent a tiny fraction of stellar observations. I researched contrastive learning approaches (SimCLR and Siamese networks) to learn better representations from this imbalanced dataset, reaching 92% accuracy on the Kepler mission dataset.
The problem: class imbalance in transit photometry
The transit method detects exoplanets by measuring the tiny brightness dip when a planet crosses its star. In the Kepler dataset:
- Total light curves: ~150,000
- Confirmed exoplanets: ~2,400 (1.6%)
- False positives (binary stars, eclipses): ~5,000 (3.3%)
- Non-transiting: ~142,600 (95.1%)
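Standard supervised learning on this distribution collapses to the majority class: a CNN trained directly can score 95%+ accuracy by predicting "no planet" for everything while detecting no planets at all. Accuracy alone is therefore a poor metric here; precision and recall on the planet class are what matter.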
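To make the imbalance concrete, here is a minimal sketch that scores an always-negative classifier on the counts listed above. The helper name `always_negative_metrics` is made up for illustration, and it assumes everything that is not a confirmed exoplanet counts as the negative class.

```python
def always_negative_metrics(n_total, n_positive):
    """Metrics for a classifier that predicts "no planet" for every light curve."""
    tp, fp = 0, 0               # it never predicts "planet"
    fn = n_positive             # so every real planet is missed
    tn = n_total - n_positive   # and every non-planet is trivially correct
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


accuracy, recall = always_negative_metrics(n_total=150_000, n_positive=2_400)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# accuracy=0.984, recall=0.000
```

High accuracy with zero recall is exactly the failure mode the rest of this post tries to avoid.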
Why contrastive learning
Contrastive learning learns representations without labels — it learns that two augmented views of the same light curve should map to nearby points in embedding space, while different light curves should map far apart.
This matters for exoplanet detection because:
- Labels are expensive (require expert review of telescope data)
- The minority class (exoplanets) is underrepresented for direct supervised learning
- Good representations transfer: an encoder that has already learned "what makes a transit" can be fine-tuned with far fewer than 2,400 positive examples
SimCLR for light curves
SimCLR applies two random augmentations to each sample and trains an encoder to maximize agreement between augmented views:
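```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightCurveSimCLR(nn.Module):
    """SimCLR model: a 1D encoder followed by an MLP projection head."""

    def __init__(self, encoder_dim=256, projection_dim=128):
        super().__init__()
        # ResNet1D is the 1D ResNet encoder used here; its definition is not shown.
        self.encoder = ResNet1D(in_channels=1, out_dim=encoder_dim)
        self.projector = nn.Sequential(
            nn.Linear(encoder_dim, encoder_dim),
            nn.ReLU(),
            nn.Linear(encoder_dim, projection_dim),
        )

    def forward(self, x1, x2):
        # x1, x2: two augmented views of the same batch of light curves.
        h1 = self.encoder(x1)
        h2 = self.encoder(x2)
        # The contrastive loss is computed on the projections z, not on h.
        z1 = self.projector(h1)
        z2 = self.projector(h2)
        return h1, h2, z1, z2


def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy (NT-Xent)."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)
    z = F.normalize(z, dim=1)  # unit vectors, so the dot product is cosine similarity
    similarity = torch.mm(z, z.T) / temperature

    # The positive for row i is its other view: i + n in the first half, i - n in the second.
    labels = torch.arange(n, device=z.device)
    labels = torch.cat([labels + n, labels])

    # Mask the diagonal so a sample is never compared with itself.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    similarity = similarity.masked_fill(mask, float('-inf'))
    return F.cross_entropy(similarity, labels)
```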
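For reference, the function above corresponds to the NT-Xent loss as defined in the SimCLR paper. For a positive pair $(i, j)$ in a batch of $N$ light curves (so $2N$ augmented views), with cosine similarity $\mathrm{sim}$ and temperature $\tau$:

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}
```

In the code, `F.normalize` followed by `torch.mm` computes the cosine similarities, the diagonal mask implements the indicator $k \neq i$, and `F.cross_entropy` averages $\ell$ over all $2N$ anchors.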
Light curve augmentations (domain-specific):
- Gaussian noise injection (simulates photon noise)
- Random phase shift (planet orbital phase is arbitrary)
- Time dilation (accounts for different orbital periods)
- Flux normalization jitter (baseline flux varies per star)
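The four augmentations could look roughly like the sketch below. This is not the exact code from the project: the function names, the noise level, the 0.8 to 1.2 dilation range, and the 1% jitter are all assumed values chosen for illustration, and the circular phase shift assumes a phase-folded light curve. It uses plain Python lists so it runs without extra dependencies.

```python
import random


def add_gaussian_noise(flux, sigma=1e-4, rng=random):
    """Add zero-mean Gaussian noise to every sample (stand-in for photon noise)."""
    return [f + rng.gauss(0.0, sigma) for f in flux]


def random_phase_shift(flux, rng=random):
    """Circularly shift the curve by a random number of samples."""
    k = rng.randrange(len(flux))
    return flux[k:] + flux[:k]


def time_dilate(flux, min_scale=0.8, max_scale=1.2, rng=random):
    """Stretch or compress the time axis around the midpoint, keeping the length."""
    n = len(flux)
    scale = rng.uniform(min_scale, max_scale)
    center = (n - 1) / 2
    out = []
    for i in range(n):
        # Source position for output sample i, clamped to the valid range.
        pos = min(max(center + (i - center) / scale, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(flux[lo] * (1 - frac) + flux[hi] * frac)  # linear interpolation
    return out


def flux_jitter(flux, max_jitter=0.01, rng=random):
    """Rescale the whole curve by a random gain close to 1."""
    gain = 1.0 + rng.uniform(-max_jitter, max_jitter)
    return [f * gain for f in flux]


def two_views(flux, rng=random):
    """Return two independently augmented views of one light curve."""
    def augment(x):
        x = random_phase_shift(x, rng)
        x = time_dilate(x, rng=rng)
        x = flux_jitter(x, rng=rng)
        return add_gaussian_noise(x, rng=rng)
    return augment(flux), augment(flux)


rng = random.Random(0)
flux = [1.0] * 45 + [0.99] * 10 + [1.0] * 45  # flat baseline with a box-shaped dip
view_a, view_b = two_views(flux, rng)
print(len(view_a), len(view_b))  # 100 100
```

The two views returned by `two_views` are what would be passed as `x1` and `x2` to the SimCLR model, after conversion to tensors.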
Siamese network for few-shot classification
After SimCLR pre-training, I fine-tuned a Siamese network for few-shot transit classification:
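```python
import torch
import torch.nn as nn


class SiameseTransitNet(nn.Module):
    def __init__(self, backbone, embed_dim=256):
        super().__init__()
        self.backbone = backbone  # SimCLR encoder (frozen or partially frozen)
        # embed_dim must match the encoder output size (encoder_dim, 256 by default).
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim * 2, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 1),
            nn.Sigmoid(),  # outputs a probability, so train with nn.BCELoss
        )

    def forward(self, anchor, candidate):
        # Both inputs pass through the same backbone (shared weights).
        z_anchor = self.backbone(anchor)
        z_candidate = self.backbone(candidate)
        combined = torch.cat([z_anchor, z_candidate], dim=1)
        return self.classifier(combined)
```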
Anchor: a confirmed exoplanet transit. Candidate: unknown. The Siamese model learns "does this look like a transit?" relative to known examples, rather than from scratch.
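A Siamese model needs labeled pairs to train on. The post does not spell out how pairs were built, so the sketch below is only one plausible scheme, stated as an assumption: every anchor is a confirmed transit, and positive and negative pairs alternate so the labels are balanced 50/50 regardless of the class imbalance in the raw data. The function name `sample_pairs` and the use of integer IDs are hypothetical.

```python
import random


def sample_pairs(transit_ids, other_ids, n_pairs, rng=random):
    """Sample (anchor, candidate, label) triples with a 50/50 label balance.

    The anchor is always a confirmed transit. label=1 means the candidate
    is also a transit, label=0 means it is not.
    """
    if len(transit_ids) < 2 or not other_ids:
        raise ValueError("need at least two transits and one non-transit")
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:
            # Positive pair: two distinct confirmed transits.
            anchor, candidate = rng.sample(transit_ids, 2)
            pairs.append((anchor, candidate, 1))
        else:
            # Negative pair: a confirmed transit against a non-transit.
            pairs.append((rng.choice(transit_ids), rng.choice(other_ids), 0))
    rng.shuffle(pairs)
    return pairs


transits = list(range(5))        # toy IDs of confirmed transits
others = list(range(100, 120))   # toy IDs of everything else
pairs = sample_pairs(transits, others, n_pairs=10, rng=random.Random(0))
print(sum(label for _, _, label in pairs))  # 5
```

Balancing at the pair level is one way to sidestep the 1.6% positive rate without discarding any of the scarce transit examples.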
Results
| Model | Accuracy | Precision | Recall | F1 |
|-------|----------|-----------|--------|-----|
| Baseline CNN (supervised) | 87.3% | 0.61 | 0.43 | 0.50 |
| Random Forest (features) | 83.1% | 0.55 | 0.51 | 0.53 |
| SimCLR + linear probe | 89.4% | 0.71 | 0.68 | 0.69 |
| SimCLR + Siamese fine-tune | 92.1% | 0.84 | 0.79 | 0.81 |
The SimCLR + Siamese approach improves both precision and recall over the baseline CNN: it catches 79% of true exoplanet transits at 84% precision, versus the baseline's 43% recall at 61% precision.
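As a quick sanity check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The snippet below is a small sketch that uses only the numbers from the results table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
results = {
    "Baseline CNN (supervised)": (0.61, 0.43),
    "Random Forest (features)": (0.55, 0.51),
    "SimCLR + linear probe": (0.71, 0.68),
    "SimCLR + Siamese fine-tune": (0.84, 0.79),
}

for model, (p, r) in results.items():
    print(f"{model}: F1 = {f1_score(p, r):.2f}")
# Baseline CNN (supervised): F1 = 0.50
# Random Forest (features): F1 = 0.53
# SimCLR + linear probe: F1 = 0.69
# SimCLR + Siamese fine-tune: F1 = 0.81
```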
Key lessons
- Domain-specific augmentations matter more than architecture. Time dilation and phase shifting improved SimCLR representations more than increasing model depth.
- Contrastive pre-training + fine-tuning beats end-to-end supervised for imbalanced datasets with expensive labels.
- Precision-recall trade-off is domain-specific. In exoplanet detection, false negatives (missed planets) are worse than false positives (candidate sent to human review). Tune the threshold toward higher recall.
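Tuning the threshold toward recall can be done on a held-out validation set. The sketch below shows one simple way to do it, under assumptions: the helper names are hypothetical, the scores are the sigmoid outputs of the classifier, and the toy data is made up. It picks the highest threshold that still reaches a target recall, then reports the precision paid for it.

```python
import math


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold t such that `score >= t` recovers the target share of positives."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    k = max(1, math.ceil(target_recall * len(positives)))
    k = min(k, len(positives))
    return positives[k - 1]


def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting "planet" for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy validation scores, made up for illustration (1 = planet, 0 = not a planet).
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

t = threshold_for_recall(scores, labels, target_recall=0.75)
print(t, precision_recall(scores, labels, t))  # 0.6 (0.75, 0.75)
```

Raising `target_recall` lowers the threshold and sends more candidates to human review, which is the acceptable side of the trade-off in this domain.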
FAQ
What is the transit method for exoplanet detection? The transit method detects exoplanets by measuring tiny drops in stellar brightness as a planet passes in front of its star. The Kepler space telescope used this method to detect thousands of exoplanet candidates.
What is contrastive learning? Contrastive learning trains neural networks to produce similar representations for similar inputs (augmented views of the same sample) and dissimilar representations for different inputs, without requiring labels.
What is SimCLR? SimCLR (Simple Framework for Contrastive Learning of Visual Representations) is a self-supervised learning method that applies random augmentations and maximizes agreement between different views of the same sample.
What accuracy did you achieve on exoplanet detection? 92.1% accuracy with 84% precision and 79% recall using SimCLR pre-training followed by Siamese network fine-tuning on the Kepler dataset.
What dataset did you use? The NASA Kepler mission dataset — light curves from ~150,000 stars observed over 4 years, with ~2,400 confirmed exoplanet transits.
Written by Shihab Shahriar Antor, AI Engineer & Founder of Shahriar Labs.