10 skills you’ll learn as a Bioinformatician

Feb 1, 2021


Transitioning into a new job working with genomics data has been a dream. I previously worked as a data analyst and a software engineer on projects I wasn’t particularly passionate about, and those hours spent solving problems could be draining, especially when I wasn’t interested in the big picture.

Bioinformatics image

Now that I’m working with metagenomics data and the big picture is the microbiome, every day is different and the process is genuinely fulfilling.

I’ve read that there is a huge overlap between the skills required for a career in data science and those needed in bioinformatics, so I decided to make a list of the new challenges and software I’ve encountered, as well as the differences in how I approach the analysis.

The demand for bioinformatics skills is only going to grow, given that sequencing data is being generated faster than we can make sense of it, especially in concert with future aspirations of personalized medicine.

Whether you are a PhD student learning bioinformatics along the way, a new joiner on a bioinformatics team, or someone considering a change to a new field, there are probably a few things you’ve encountered (or will encounter) while working with genomics data.


1. Managing Large Files… and Lots of Them

On previous projects as an analyst, I was usually working on one dataset or CSV file at a time, or occasionally a few that I would merge into one. The code I wrote pulled data from multiple tables or databases to create a single dataframe.

These days there isn’t a SQL query in sight, as all the files are sitting on the server. Genetic sequences don’t really suit a tabular format…
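For context, a FASTQ file stores each read as a four-line record (identifier, sequence, separator, per-base quality scores) rather than as rows and columns; the record below is invented for illustration:

```text
@SEQ_ID_001
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII
```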

Server screenshot

In a bioinformatics pipeline you have to work on multiple files simultaneously. For each subject or pool in the experiment there are multiple samples which may reflect different time points, treatments etc.

This means everything is done in bulk. You need to be careful that your program or pipeline has run on every single file and you haven’t missed any during the processing. This calls for immense organization — in your workspace and workflow.

Maybe your program will fail on one file in your folder of samples and cause you trouble at a later stage. Making spreadsheets can help to keep tabs on files at different steps of the pipeline.

Having solid and consistent naming conventions for your folders and files is always important but I find it to be especially necessary to keep track of all the files and their locations.
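Alongside spreadsheets, a quick shell check can confirm that every input file produced an output. This is a minimal sketch, assuming a (hypothetical) layout where raw reads live in `raw/` and the pipeline writes `trimmed/<sample>.fastq`:

```shell
#!/usr/bin/env bash
# Report every sample whose trimmed output is missing, so nothing
# slips through the pipeline unnoticed. Directory names are assumed.
for f in raw/*.fastq; do
    sample=$(basename "$f" .fastq)
    if [ ! -f "trimmed/${sample}.fastq" ]; then
        echo "MISSING: ${sample}"
    fi
done
```

Run after each pipeline stage, this takes seconds and catches the one file out of hundreds that silently failed.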

The files also happen to be huge. Which brings me to my next point…


2. Bash Scripting

The FASTQ files on the server are huge. A 300GB folder of bacterial sequences is too large to work on locally. Hacking together a small bash script is an extremely powerful and quick way to perform an operation on every file in a folder.

In fact, at this scale, it is often the only practical way.

Maybe you need to add a prefix to every single node ID in an 8 million line file or need to count the number or length of contigs in multiple files. At some stage you’ll need to generate a batch file of commands which will run on each file in a folder. This is where a bash script comes into play.

Writing programs in vim or nano is necessary to handle and manipulate large swathes of files. for loops, while loops, read loops, and piping output from one command to another quickly become second nature.
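A minimal sketch of both patterns mentioned above; the file names and the `sample1_` prefix are made up for illustration:

```shell
#!/usr/bin/env bash
# Add a prefix to every FASTA header (lines beginning with ">"),
# writing to a new file rather than overwriting the original.
sed 's/^>/>sample1_/' contigs.fasta > contigs_renamed.fasta

# Generate a batch file of commands, one per FASTA file in the folder,
# so you can inspect it before running it with `bash batch_commands.sh`.
for f in *.fasta; do
    echo "grep -c '^>' ${f}"
done > batch_commands.sh
```

Writing the batch file out rather than executing commands directly gives you a record of exactly what ran on each file.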

Data scientists wrangling big data are probably accustomed to the occasional bash script but for me this was new.


3. Increased Repertoire of Linux Commands

I have definitely increased my understanding of Linux command-line utilities from spending more time rooting around on the server. My most used ones are probably:

The trusty grep — a go-to for finding snippets of text in files, counting the number of contigs or sequences in a fasta file, or counting unique samples in a file:

grep -c "^>" viral_sequences.fasta

I would definitely recommend learning regular expression syntax and getting comfortable with accessing and changing strings based on a pattern.

  • awk – a full programming language within Linux and very efficient for working with text

  • find – recursively finds files with a certain prefix or name

  • sed – another powerful text processing command

Regular expressions turn up in every one of these tools, so time invested in learning them pays off quickly.
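A few one-liners in that spirit; the file name is illustrative:

```shell
# awk: print the length of every sequence line (non-header) in a FASTA file
awk '!/^>/ { print length($0) }' viral_sequences.fasta

# sed: strip everything after the first space in each header line,
# keeping just the sequence ID
sed 's/^\(>[^ ]*\).*/\1/' viral_sequences.fasta

# find: locate every FASTQ file under the current directory
find . -name "*.fastq"
```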

4. Knowing Your Server

Data scientists with cloud computing experience are no strangers to connecting to a remote server or VM instance. If you are only working on a couple of files locally, you probably won’t need to open your terminal except to start an IDE or Jupyter notebook.

In no time you’ll be able to find your way around a terminal blindfolded. An absolute lifesaver is using Remote-SSH in Visual Studio Code to connect to your server.

Something I had to learn was server etiquette. If you are sharing a server, there is likely a process or thread limit per user to prevent overload.

Useful commands:

htop                        # see running processes, load, and memory
htop -u linda               # filter the view to one user's processes
pkill <process>             # kill your runaway processes by name
df -h                       # free disk space, in human-readable units
du -h <foldername>          # size of a folder and its contents
scp file user@server:path   # copy files to or from the server

It’s also best practice to be conservative about space usage.

5. Understanding What Tools to Apply to Your Data

Biological problems are complex and layered. Coming from financial data, where I calculated rolling averages or time-series percentages, I found that sequencing data involves dense, interdependent processing steps.

Now I understand what a “pipeline” truly means — a long list of programs that you push your data through, transforming it step-by-step until you generate a result.

Below is a great flow chart outlining a generic metagenomics workflow:

Metagenomics overview

Metagenomics is just one facet of bioinformatics. There is also transcriptomics (RNA-Seq), phylogenomics, proteomics, pangenomics and more — each with its own pipelines and tools.
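The driver for such a pipeline often starts life as a simple bash script. This skeleton uses placeholder echoes where a real pipeline would invoke QC, assembly, and classification tools; the step names just mirror the generic workflow:

```shell
#!/usr/bin/env bash
set -euo pipefail   # abort the whole pipeline if any step fails

# Placeholder step runner: in a real pipeline the echo below would be
# replaced by an actual tool invocation for that stage.
run_step () {
    echo "running: $1"
}

run_step "quality control"
run_step "host read removal"
run_step "assembly"
run_step "taxonomic classification"
```

The `set -euo pipefail` line is the important habit: it stops the run at the first failed step instead of pushing bad intermediate files downstream.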

6. Statistics

The statistics involved are far more extensive than what I previously encountered.

Some examples:

  • Normalization and rarefaction
  • Alpha diversity (Shannon entropy, Simpson’s index, Chao, ACE)
  • PCoA / NMDS plots
  • PERMANOVA
  • Differential gene expression (log2 fold change)
  • ANOVA, t-test
  • Kruskal-Wallis
  • P-value correction
  • Machine learning model validation
  • E-values in sequence alignment

I’m not a statistics wizard by any means but am working toward understanding when and how to apply these methods correctly.
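Some of these are simpler than they sound. For instance, the Shannon index is just H = −Σ pᵢ·ln(pᵢ) over the relative abundances, which even fits in an awk one-liner; `counts.txt` is a made-up one-column file of species counts, assumed positive:

```shell
# Shannon diversity index from a one-column file of species counts.
awk '{ c[NR] = $1; total += $1 }
     END { for (i in c) { p = c[i] / total; H -= p * log(p) }
           printf "%.4f\n", H }' counts.txt
```

Two species with equal counts give H = ln 2 ≈ 0.6931, the maximum for two categories.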

7. R

A bioinformatician’s weapon of choice is R. Many statistical packages were written specifically for genomics analysis such as DESeq2 and phyloseq. Bioconductor contains a full bioinformatics toolkit.

Coming from Python, the functional syntax of R can feel different from object-oriented programming. However, the statistical depth and the power of ggplot2 for publication-quality figures make it incredibly valuable.

A lot of bioinformatics software is also written in Python (e.g. MetaPhlAn2), so being proficient in Python is equally valuable. The Biopython package is handy for quickly processing fasta files.

8. Data Visualization

If you enjoy data visualization, you will love bioinformatics.

Boxplots, PCoA, NMDS, MA plots, violin plots, volcano plots, heatmaps, phylogenetic trees — the diversity is enormous. It’s a data scientist’s dream.

9. Cloud Computing

If your workplace doesn’t have a bare-metal server, you will likely be connecting to a VM instance in the cloud.

AWS and GCP are popular platforms, both with command-line tools.

Even if you have local resources, backing up your work on independent cloud storage (like Google Cloud Storage) is always a good idea.

10. Interpreting Biological Context and Experimental Design

You are often handed the result of someone’s weeks or months of lab work to analyze. That requires understanding the experimental design and biological context.

You’ll find yourself reading scientific literature, learning about instruments and wet lab methods, and becoming deeply curious about the biology behind the data.

Being exposed to many different projects and experiments makes the job incredibly interesting and varied.