Breaking down barriers in Longevity x AI research with a $10k Hackathon!

Or, why wrangling biological datasets onto HuggingFace is nontrivial

Jun 17, 2025

Me presenting about Seanome to the lovely Biopunk Hackathon attendees! Photo Credit: Elliot Roth.

Last weekend, Seanome hosted a Longevity x AI Hackathon with Frontier Tower (thanks, Elliot Roth!). The backstory: I was part of a team (along with JJ Hales and Karl Pfleger) that won $10k at Foresight Institute’s Vision Weekend in December 2024. We were the Longevity/Biotech team, competing against Neurotech, Nanotech, AI Safety, Space, and Existential Hope for $10k to accelerate our field. Our team won with the premise of breaking down barriers for AI/ML in aging: getting molecular profiling datasets of biological aging (especially those curated by Karl and Francesco Neri in the Aging Biotech database!) into the hands of AI/ML practitioners. Originally, we wanted to run a Kaggle Competition, but running one now takes $100k (!), so we were priced out with our measly $10k. We brainstormed and came up with a hackathon!

For this hackathon, our goal was to upload as many aging datasets as possible to HuggingFace, a popular AI/ML platform. Of course, someone asked …

“Uploading datasets? Isn’t that trivial?”

I responded, “… I think you underestimate the complexity of biological datasets.”

Currently, working with a biological dataset requires know-how for (1) locating the data, (2) understanding what it means (e.g. row and column definitions), and (3) knowing its potential uses (e.g. does it indicate gene activation? Can it identify pathways?). With Karl’s curation, (1) is somewhat solved, but (2) and (3) are still challenging: you need to read the paper and dig into supplementary files to know what is going on. Datasets on cellxgene (“cell-by-gene”) are MUCH easier to use because they are formatted and conform to a schema, but they still take know-how (I’ve heard it can take engineering teams ~months to ingest them!). Most biological data is NOT on cellxgene, though; it lives on the National Center for Biotechnology Information’s (NCBI) Gene Expression Omnibus (GEO). If it isn’t there either, maybe the author made the data available as a supplementary table, if you’re lucky! If you’re unlucky, you need to email the author to get your hands on the data :(

Datasets on cellxgene still need formatting to get them HuggingFace ready. See seanome/2025-longevity-x-ai-hackathon for example data cleaning notebooks.
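To give a flavor of what “formatting” can mean, here is a minimal sketch in plain Python. It is not taken from the hackathon notebooks. It assumes a toy gene-by-sample matrix (the gene names, sample IDs, and values are all made up) and reshapes it so that every row and column has an explicit definition, which is the kind of thing point (2) above is about.

```python
import csv
import io

# Toy wide-format matrix: rows are genes, columns are samples.
# All names and values here are invented for illustration.
WIDE_CSV = """gene,sample_young_1,sample_old_1
GENE_A,10.0,4.5
GENE_B,0.0,7.25
"""


def wide_to_long(wide_text):
    """Reshape a gene-by-sample matrix into one record per (gene, sample)."""
    reader = csv.DictReader(io.StringIO(wide_text))
    sample_columns = [name for name in reader.fieldnames if name != "gene"]
    records = []
    for row in reader:
        for sample in sample_columns:
            records.append(
                {
                    "gene": row["gene"],
                    "sample_id": sample,
                    "expression": float(row[sample]),
                }
            )
    return records


def to_csv(records):
    """Serialize the long-format records with explicit column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["gene", "sample_id", "expression"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


long_records = wide_to_long(WIDE_CSV)
long_csv = to_csv(long_records)
```

The long format is only one possible choice; the point is that a newcomer can read the column names and know what each value is without opening the paper.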

I had low expectations: more attendees than just me and the organizers, and at least 1 dataset uploaded. We crushed them! Over two days, we had:

16 participants! (The winning group was split across two continents: SF to London, wow!)
36 longevity molecular profiling datasets uploaded to HuggingFace!
166.6 GB uploaded to HuggingFace!
$7k awarded to 1st ($4k), 2nd ($2k), and 3rd ($1k) place teams!

Thank you to our judges Zoe Isabel Senón, JJ Hales, William Poole, and Aabha Morey for lending their expertise and helping select our winners!

Our excellent judges for the Longevity x AI Hackathon! From left to right: myself, Zoe, JJ, William, and Aabha.

Things I’d do differently for the next hackathon

This was my first time running a cash prize hackathon, and it’s different from the nf-core hackathons I’ve run before.

Open by default

Our hackathon goal was to contribute longevity datasets to HuggingFace. However, to advance the field, it’s not enough to share the data; we also need to share the code that created it. Users will want to see how the data was prepared, and the code gives others worked examples of cleaning up similar data in the future! We only added the requirement to share the processing code later, which didn’t work for all participants, and we can be clearer about this up front in the future.

I’d also be clearer about another requirement: the HuggingFace dataset repository should link to the GitHub repository and vice versa, so future users can understand the data processing and coders can see the data’s HuggingFace format.
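One way to make the cross-linking concrete is to generate both README snippets from the same two URLs. This is a rough sketch, not something we required at the hackathon. It assumes the dataset card is a README.md with a YAML front-matter block, and every name and URL below is a hypothetical placeholder.

```python
def render_dataset_card(pretty_name, tags, code_repo_url, source_description):
    """Return README.md text for a dataset card that links to the processing code."""
    tag_lines = "\n".join(f"- {tag}" for tag in tags)
    front_matter = f"---\npretty_name: {pretty_name}\ntags:\n{tag_lines}\n---\n"
    body = (
        f"# {pretty_name}\n\n"
        f"Source: {source_description}\n\n"
        f"Processing code: {code_repo_url}\n"
    )
    return front_matter + "\n" + body


def render_code_readme_link(dataset_repo_url):
    """Return the line to add to the GitHub README, pointing back at the dataset."""
    return f"Processed dataset on HuggingFace: {dataset_repo_url}\n"


# Placeholder values; swap in the real repositories.
card = render_dataset_card(
    pretty_name="Example aging dataset",
    tags=["biology", "aging"],
    code_repo_url="https://github.com/example-org/example-cleaning-code",
    source_description="placeholder for the original accession or paper",
)
back_link = render_code_readme_link(
    "https://huggingface.co/datasets/example-org/example-aging-dataset"
)
```

Writing both snippets from one place makes it harder to forget one direction of the link.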

Invite more people!

We budgeted food for ~40 people and, in my naiveté, I thought inviting 40 meant 40 would come! But no… San Francisco has a LOT going on, and I was told to expect 50% (!) to show up. I ignored that advice and kept the total count at 40, asking people to decline if they couldn’t make it anymore. We had 14/40 (35%) in-person participants, which is in line with the 30-60% show-up rate I’ve heard for events in SF. If I ran this again, I’d have a 100-person guest list and advertise more widely and earlier.
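For anyone sizing their own guest list, the arithmetic is simple enough to sketch. This assumes the show-up rate from this one event (14 of 40) carries over to the next, which is a big assumption; the function names are just for illustration.

```python
def expected_attendees(invited, observed_attended, observed_invited):
    """Scale a guest list by a previously observed show-up rate."""
    return invited * observed_attended / observed_invited


def invitations_needed(target_attendees, observed_attended, observed_invited):
    """Smallest guest list whose expected turnout reaches the target (ceiling division)."""
    return -(-target_attendees * observed_invited // observed_attended)


# Using this hackathon's numbers: 14 in-person participants out of 40 invited.
turnout_for_100 = expected_attendees(100, 14, 40)
list_for_40 = invitations_needed(40, 14, 40)
```

At a 35% rate, a 100-person list would be expected to bring in about 35 people, and filling all 40 budgeted seats would take a list of 115.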

Teams

I don’t mind if you code in a room by yourself outside of a hackathon. HOWEVER, the point of a hackathon is to meet other people! We had a few people going solo and one big (5-person) team. A team of 2-3 is a nice sweet spot: you’re still collaborating, but there aren’t so many people that the coordination cost gets too high. Next time, I would state directly in the event text that you’re required to work in a team of 2-3, and that we won’t judge submissions from a single individual.

Overall, I’m grateful for the opportunity from Foresight Institute to meet the lovely hackathon attendees and people of the Frontier Tower! Looking forward to more events there!

A guest post by

Olga Botvinnik

Founder/CEO, Seanome. Peppy computational biologist
