As our collections of phages grow, keeping track of our phages will become increasingly important. As we run more experiments across our growing phage collections, the trove of data we collect about our phages will increase exponentially. In this highly informal, conversational series, I’ll explore ideas on how we could rein in all that phage, microbial, and experimental data before it gets out of control.
In this first piece, I’ll introduce myself, and cover the idea of why we should track phages, what to track, and roughly how to track them.
Let me start by briefly introducing my background:
I come from the world of computer science and “information architecture” (Read: What is Information Architecture?). I’ve worked on many projects at companies like L’Oreal to make complex data problems more manageable and understandable.
Recently, I’ve been building systems that help Phage Australia manage biobanking and clinical/characterization/experimental phage data. This data will in turn help clinicians and microbiologists make decisions that relate to phage therapy.
Throughout the year, I will write a series of informal posts about designing and building databases and web applications for managing phage data.
As more labs are starting their own biobanks and phage databases, I hope that my explorations could prove useful. Much of this work is iterative and still in progress, and I’ll probably get things wrong along the way. I will also probably have more questions than answers. I welcome any suggestions, tips and corrections; send them to me at firstname.lastname@example.org.
Why track our phages?
When we work with just a few phages, we might not need a “system” — a lab notebook will suffice. Even then, our phages should have names, and retrieving experimental and bioinformatics data should be effortless.
Once we have more than a few dozen phages, or once we’ve decided to create a phage biobank, the amount of data and experiments we collect about our phages will grow exponentially. Once we start running a large number of experiments, it might get easier to lose track of our phages, along with all the experimental data.
Knowing where our phages and data exist is also important when students graduate and fresh ones arrive. Without a system, our lab could lose several weeks of lab time. Experimental data collected by former students months or years ago might take new students several hours to several days to find. Some data might even be lost forever.
Knowing where experimental data exists on our phages also helps speed up our data analysis, lab work, and publishing cadence. If anything, it would free up time to conduct more experiments. Without a system, we could easily spend much of our time tracking down images of plates/TEMs, charts and graphs, collections of sequencing data, and multiple versions of Excel files all labeled “final”, spread across several lab computers… for every single phage. If we had trouble finding data for 50 phages, imagine how difficult it could get to find data for 500 or 5,000 phages.
Keeping phage data clearly organized could also speed up onboarding collaborators. Team members could quickly find and share relevant experimental data, without back-and-forth emails and permission requests.
How to think about “how to think about” our phages
In computing, we have this concept of “data”. Data just means a collection of information about something. There’s also the concept of “metadata”, which is the data that describes the data. A passport, for example, will have attributes like first, middle and last names. It might also have gender, hair length, or hair color. These attributes make up the metadata of a passport.
In the phage world, metadata describes what we collect about a phage, like “plaque size” and “host range”. Data describes what we know about the phage, like “1 mm” and “kills S. aureus”.
When we consider “what do we need to know?” about our phages, we are thinking about what metadata we want to collect about our phages. We call the collection of metadata that we choose to describe our phages our “phage schema”.
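To make the metadata/data split concrete, here is a minimal Python sketch. The field names (`plaque_size_mm`, `host_range`) and the example phage are hypothetical, picked only to mirror the examples above. This is not Phage Australia’s actual schema.

```python
from typing import List

# The schema is the metadata: which attributes we choose to collect, and
# what type each one should be. Field names are illustrative assumptions.
PHAGE_SCHEMA = {
    "name": str,
    "plaque_size_mm": float,
    "host_range": list,
}


def validate(record: dict) -> List[str]:
    """Return a list of problems found when checking a record against the schema."""
    problems = []
    for field_name, expected_type in PHAGE_SCHEMA.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            problems.append(f"wrong type for {field_name}")
    for field_name in record:
        if field_name not in PHAGE_SCHEMA:
            problems.append(f"unknown field: {field_name}")
    return problems


# The data: what we actually know about one (made-up) phage.
example_phage = {
    "name": "ExamplePhage1",
    "plaque_size_mm": 1.0,
    "host_range": ["S. aureus"],
}
```

Calling `validate(example_phage)` returns an empty list, while a record that is missing a field or carries an unexpected one gets a readable list of problems. The schema lives in one place, so changing what we collect means editing `PHAGE_SCHEMA` rather than every record.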
Where did they come from, where do they go?
The most fundamental qualities we want to know about our phages are probably “what do we call our phage” and “where do we find our phage” — the “name” and “location” of the phage. We might also want to know “who’s the person in charge of the phage”, “is it currently expired? (e.g. does it need to be re-amplified?)” and “where and when was it isolated?”. We probably will want to know either “where is the data on our phage” or at least, “who has the experimental data on our phage”.
We probably also want to know which bacterial strain was used to isolate the phage, and the number (and results) of experiments that have been conducted on the phage.
Going deeper, we might consider the intellectual property, material transfer agreements and usage rights of the phage (e.g. what can and can’t we do with a phage, which we received from another lab).
Oh, and we should probably also track things like “who sent us the phage” and “who we sent our phage to” — along with the agreements and restrictions we put in place for the receiving labs. And also “where did we publish about this phage”. The list of metadata that we could collect goes on and on.
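As a rough sketch, the questions above could map onto a single record type like the one below. Every field name here is my own assumption for illustration, and the “expired” check assumes the lab records a date by which the phage should be re-amplified.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PhageRecord:
    # "What do we call it" and "where do we find it"
    name: str
    storage_location: str
    # "Who's the person in charge of the phage"
    person_in_charge: str
    # Bacterial strain used to isolate the phage
    isolation_host: str
    # "Where and when was it isolated"
    isolated_at: Optional[str] = None
    isolated_on: Optional[date] = None
    # Assumed convention: the date by which the stock should be re-amplified
    reamplify_by: Optional[date] = None
    # Provenance, sharing, and publications
    received_from: Optional[str] = None
    sent_to: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)

    def is_expired(self, today: date) -> bool:
        """True when a re-amplification date is set and has already passed."""
        return self.reamplify_by is not None and today > self.reamplify_by


# A made-up record, for illustration only.
example = PhageRecord(
    name="ExamplePhage1",
    storage_location="Freezer A, Box 3",
    person_in_charge="A. Student",
    isolation_host="S. aureus (hypothetical strain)",
    reamplify_by=date(2024, 1, 1),
)
```

Agreements, restrictions and usage rights would likely need their own linked records rather than more fields here, since one phage can be covered by several of them.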
The most crucial piece of information we need about our phage, however, is its identity. What makes a phage a phage, and when should we consider a “slightly different phage” an altogether different phage? How should we think about how these two phages relate? Are they ancestors, twins, siblings, or cousins twice removed?
How we decide to identify a phage will depend on which aspects of our phages we care about.
What do we track about our phages?
What a lab tracks about its phages depends on what the lab cares about. An ecology lab’s interests and needs will differ from those of a plant pathogen lab or a phage therapy lab.
Every lab should divide what they know about their phages into two main categories: Core Identifying Characteristics and Conditional/Mutable Characteristics.
Core Identifying Characteristics define “what makes this phage unique”. These should answer questions like “what are its defining characteristics”, “where did it come from”, “where is it”, “where/how can I get it”, “who is responsible for it”, and “is it expired”.
Core Identifying Characteristics are the things printed on our passports, driver’s licenses and social security / tax ID cards. These are essential, as they help us find the phage, which we could use to derive all of its other characteristics. Clearly being able to identify a phage is also necessary for communicating about it in papers and social media, and for describing/comparing/sharing the phage.
Conditional or Mutable Characteristics are attributes that might or might not hold true under all circumstances. This is like my current hair color, or my current friends and coworkers. These can change over time. They can have different magnitudes. They can be different in various conditions, be temporary or permanent, or only exist relative to other objects like strains, phages, and antibiotics.
When establishing our phage’s identity, there’s a “gray area” around what actually defines a phage’s identity. For example, my hair color is on my passport. If I dye it blue tomorrow, does that make me a different person? Does it make my passport invalid? Similarly, my friends and coworkers would never appear on my passport — as those would fluctuate over time.
Similarly, if I learned a new skill tomorrow, does that make me a different person? If I know how to ski, but I’m at the beach and haven’t been observed skiing — would I still be a skier? Would “skier” be a part of my identity?
In the phage world, we classify phages based on what we observe. These observations are sometimes conditional, mutable characteristics. Sometimes they’re abilities that depend on our observation conditions. Other times, they could be core identifying characteristics.
What we define as a phage’s identity depends on what we care about. If we cared about friends, family and coworkers, we’d absolutely define that as part of a person’s identity (Facebook and LinkedIn do this). If we cared about someone’s skiing abilities, we’d define that as part of someone’s identity.
When building out our phage data system for Phage Australia, I think of a phage’s name, isolation host, origin (e.g. where was it found, or what lab did it come from), and sequencing information as Core characteristics.
If a phage is engineered or evolved to be “significantly different” from another phage, I would try to determine its degrees of similarity to other phages in the database. This could potentially draw on phenotypic data, but most likely I would rely on the differences in the genome sequence data. And of course, “significantly different” is subjective.
Any observed characteristics and abilities that can change based on conditions and relationships (e.g. host bacteria) I would classify as “conditional or mutable characteristics”. Some characteristics or abilities could be considered “temporary” or “impermanent”. These would also be considered “conditional” and not part of a phage’s core identity.
Characteristics like host range, growth curves, antibiotic synergy, and even morphology data like plaque size would depend on a phage’s observed conditions. Some conditions like temperature, media, and presence of bacteria and antibiotics could change our observed characteristics. It’s important to note that many of these observations occur relative to other entities like antibiotics and bacteria.
Take host range for example. Instead of a list of strains that a phage could lyse, I would want to know its plaque morphologies on the strains it had been tested against. This gives me a better picture of the full range of hosts, and the various degrees of potency against strains that phage has been tested against.
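One way to sketch this idea: store each plaque result as its own observation, together with the conditions it was observed under, and derive “host range” from those observations when needed. The class, field names and strain names below are hypothetical, and the rule of “at least one result with plaques” is just one possible definition of a lysed host.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PlaqueObservation:
    phage: str
    host_strain: str
    plaques_formed: bool
    morphology: str       # free text, e.g. "clear, 1 mm"
    temperature_c: float  # observation conditions travel with the result
    media: str


def host_range(observations: List[PlaqueObservation],
               phage: str) -> Dict[str, List[PlaqueObservation]]:
    """Group one phage's observations by host strain, keeping successes and failures."""
    by_host: Dict[str, List[PlaqueObservation]] = {}
    for obs in observations:
        if obs.phage == phage:
            by_host.setdefault(obs.host_strain, []).append(obs)
    return by_host


def lysed_hosts(observations: List[PlaqueObservation], phage: str) -> List[str]:
    """Hosts with at least one observation where plaques formed, sorted by name."""
    grouped = host_range(observations, phage)
    return sorted(
        host for host, obs_list in grouped.items()
        if any(o.plaques_formed for o in obs_list)
    )


# Made-up observations, for illustration only.
observations = [
    PlaqueObservation("ExamplePhage1", "Strain A", True, "clear, 1 mm", 37.0, "LB agar"),
    PlaqueObservation("ExamplePhage1", "Strain A", False, "no plaques", 25.0, "LB agar"),
    PlaqueObservation("ExamplePhage1", "Strain B", False, "no plaques", 37.0, "LB agar"),
    PlaqueObservation("OtherPhage", "Strain B", True, "turbid", 37.0, "LB agar"),
]
```

Because failed results are kept too, `host_range(observations, "ExamplePhage1")` shows that Strain B was tested and not lysed, which a bare list of lysed strains cannot tell us.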
| Core Identifying Characteristics | Conditional / Mutable Characteristics |
|---|---|
| Phage Identifier / Accession ID | Plaque info (e.g. plaque morphology) |
| Origin / discovery info | Other experimental data |
| Material transfer agreements | |
| Intellectual property info | |
What do we care about?
How we identify our phages depends entirely on what we care about. For Phage Australia, our host ranges will always be derived from successful and unsuccessful plaque results, and the underlying conditions.
Just like listing “natural hair color” might or might not make sense on a passport, “natural host range” might or might not make sense on a phage passport. It just depends, and there are no obvious answers.
However, how we define a phage’s identity affects how we record data about a phage. For example, Sydney is known as a “sunny” city. Except it’s the rainiest city I have ever lived in. Since moving here almost three weeks ago, it’s rained almost every single day. Flash floods have washed away roads and caused evacuations (thankfully we’re fine). Should I classify Sydney as “Rainy” or “Sunny”? Or do I instead not call it anything, and just show the number of days that it’s rained vs. the number of days it’s been sunny? Do I record the amount of rain per day? Or do I not talk about the weather at all?
As we develop better sequence analysis tools and characterize more phage genes, we’ll get a better understanding of both core and conditional/mutable phage characteristics. While genome size is probably an identifying characteristic, a phage’s known antibiotic resistance genes will probably change over time.
Additionally, how do we classify engineered, mutated and synthetic phages?
Creating identities around phages is really hard. There are many characteristics we could use to define the identity of our phages. Which ones we choose will depend on our lab and funding needs.
I previously mentioned that establishing the “identity” of our phage is crucial. And the most crucial characteristic of a phage’s identity is its Name. A name enables us to access, discuss and communicate our phage. But most importantly, a name lets us compare our phage against other phages.
In our next post, we’ll geek out on the importance of both generic and memorable phage names, the many ways we could name our phage, and explore why comparing phages is necessary.
Special thanks to Jessica Sacher, Evelien Adriaenssens, and the Phage Australia team (Ruby Lin, Nouri Ben Zakour, Stephanie Lynch, Jon Iredell) for helping me hash some of these ideas out.
More special thanks to the various phage labs and biobanks we’ve spoken to over the years about data management. Some of these labs include: Queen Astrid Military Hospital, Sciensano, the Félix d’Hérelle Reference Center for Bacterial Viruses, DSMZ, ATCC, NCTC, TAILOR, Israeli Phage Bank, The Bacteriophage Bank of Korea, Fagenbank, Citizen Phage Library, Japan Phage Bank, and many more. Thanks so much for putting up with my incessant questioning!
Thanks so much to Steph Lynch and Atif Khan for handling the Updates/Jobs/Community section this week (and most weeks!!)