Job Description
About the Role
We are building next-generation language models for life sciences.
This internship focuses on developing and evaluating large-scale sequence models that can learn rich representations across DNA, RNA, proteins, small molecules, and human-curated biomedical knowledge.
You will explore multi-modal and multi-sequence fusion, helping to bridge biological data with domain knowledge using foundation model techniques.
What You'll Do
Design and fine-tune transformer-based language models for biological and chemical sequences.
Integrate multiple biological modalities: genomic sequences, protein sequences, SMILES strings, and related annotations.
Build and experiment with methods for cross-modal embedding, contrastive learning, or retrieval.
Analyze biological and chemical sequence datasets (e.g., reference genomes, UniProt, PubChem).
Collaborate with a multi-disciplinary team of ML engineers, bioinformaticians, and computational biologists.
Who You Are
✅ Must-Have
Familiarity with large language models, transformer architectures, and sequence modeling.
Hands-on experience with PyTorch, TensorFlow, or JAX.
Solid coding skills for data preprocessing and model training.
Basic understanding of biological sequences (DNA, RNA, proteins) or chemical representations (SMILES).
✅ Nice-to-Have
Experience training or fine-tuning LLMs with domain-specific corpora.
Exposure to multi-modal learning techniques.
Knowledge of relevant open datasets: PDB, UniProt, GenBank, or chemical libraries.
Basic grasp of computational biology or cheminformatics concepts.
Familiarity with NLP evaluation pipelines and retrieval tasks.
✅ Mindset
You are curious about applying LLMs to new scientific frontiers.
You enjoy learning from both ML and life science literature.
You’re comfortable working in an interdisciplinary research setting.
Why Join Us
Be part of a team pushing the boundaries of LLMs for biology and chemistry.
Access high-performance computing and cutting-edge models.
Collaborate with experts at the intersection of NLP, bioinformatics, and drug discovery.
Contribute to open science and next-generation bio-foundation models.