Transitioning into a new job working with genomics data has been a dream. I previously worked as a data analyst and a software engineer on projects I wasn’t particularly passionate about, and the hours spent solving problems could be draining when I wasn’t invested in the big picture.
Now that I’m working with metagenomics data and the big picture is the microbiome, every day is different and the process is genuinely fulfilling.
I’ve read that there is a huge overlap between the skills required for a career in data science and those required for bioinformatics, so I decided to make a list of the new challenges and software I’ve encountered, as well as the differences in how I approach the analysis.
The demand for bioinformatics skills is only going to grow, given that sequencing data is being generated faster than we can make sense of it, especially in concert with future aspirations of personalized medicine.
Whether you are a PhD student learning bioinformatics along the way, a new joiner on a bioinformatics team, or someone considering a change to a new field, there are probably a few things you’ve encountered (or will encounter) while working with genomics data.
1. Managing Large Files… and lots of them
On previous projects as an analyst, I was usually working on one dataset or CSV file at a time, or occasionally a few that I would merge into one dataset. The code I wrote pulled data from multiple tables or databases to create a single dataframe.
These days there isn’t a SQL query in sight, as all the files are sitting on the server. Genetic sequences don’t really suit a tabular format…
In a bioinformatics pipeline you have to work on multiple files simultaneously. For each subject or pool in the experiment there are multiple samples, which may reflect different time points, treatments, etc.
This means everything is done in bulk. You need to be careful that your program or pipeline has run on every single file and that you haven’t missed any during processing. This calls for immense organization in both your workspace and your workflow.
Maybe your program will fail on one file in your folder of samples and cause you trouble at a later stage. Keeping a spreadsheet can help you keep tabs on files at different steps of the pipeline.
Having solid and consistent naming conventions for your folders and files is always important but I find it to be especially necessary to keep track of all the files and their locations.
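One lightweight way to catch a missed file is to diff the sample IDs going into a step against the ones coming out. A minimal sketch, where the demo/ folders, the file names, and the .trimmed suffix are all made up for illustration:

```shell
# Fake a run where sampleC never made it through the trimming step
mkdir -p demo/raw demo/trimmed
touch demo/raw/sampleA.fastq demo/raw/sampleB.fastq demo/raw/sampleC.fastq
touch demo/trimmed/sampleA.trimmed.fastq demo/trimmed/sampleB.trimmed.fastq

# List the sample IDs present at each stage, then compare the lists
ls demo/raw     | sed 's/\.fastq$//'          | sort > expected.txt
ls demo/trimmed | sed 's/\.trimmed\.fastq$//' | sort > processed.txt
comm -23 expected.txt processed.txt   # IDs expected but never processed
# → sampleC
```

Running a check like this after each stage makes it obvious when a sample silently dropped out of the pipeline.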
The files also happen to be huge, which brings me to my next point…
2. Bash Scripting
The FASTQ files on the server are huge. A 300GB folder of bacterial sequences is too large to work on locally. Hacking together a small bash script is an extremely powerful and quick way to perform an operation on every file in a folder.
In fact, at that scale it is often the only practical way.
Maybe you need to add a prefix to every single node ID in an 8-million-line file, or to count the number or length of contigs across multiple files. At some stage you’ll need to generate a batch file of commands to run on each file in a folder. This is where a bash script comes into play.
Writing programs in vim or nano is necessary to handle and manipulate large swathes of files. For loops, while loops, while-read loops, and piping output from one program or command to another become your bread and butter.
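Tagging every FASTA header with its sample name is a classic example of this kind of loop. A sketch, where the asm/ folder, poolA.fasta, and the .renamed suffix are invented for the demo:

```shell
# Create a tiny assembly to operate on
mkdir -p asm
printf '>node_1\nACGT\n>node_2\nGGCC\n' > asm/poolA.fasta

# Prefix every header line (starting with ">") with the sample name
for f in asm/*.fasta; do
    sample="$(basename "$f" .fasta)"       # strip path and extension
    sed "s/^>/>${sample}_/" "$f" > "asm/${sample}.renamed.fasta"
done

head -1 asm/poolA.renamed.fasta   # → >poolA_node_1
```

The same loop skeleton works for any per-file operation: swap the sed command for whatever program each sample needs to be pushed through.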
Data scientists wrangling big data are probably accustomed to the occasional bash script but for me this was new.
3. Increased repertoire of Linux commands
I have definitely deepened my understanding of Linux command-line utilities from spending more time rooting around on the server. My most used ones are probably:
grep is my go-to for finding snippets of text in files, counting the number of contigs or sequences in a FASTA file, or counting the unique samples in a file:
grep -c "^>" virome_sequences.fasta
grep has many common uses in bioinformatics. I would definitely recommend learning regular-expression syntax and getting comfortable with accessing and changing strings based on a pattern.
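Counting unique samples, for instance, is usually a grep match piped through sort -u. A sketch, with reads_manifest.txt as a made-up input file:

```shell
# Three reads, but only two distinct samples
printf 'sampleA_day1\nsampleA_day7\nsampleB_day1\n' > reads_manifest.txt

# Pull out just the sample ID, de-duplicate, and count
grep -o '^sample[A-Z]' reads_manifest.txt | sort -u | wc -l
# → 2
```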
awk is a whole programming language in itself and is very efficient for working with text.
find recursively finds all files with a certain name or prefix inside your current directory (or any other).
sed is another text-processing command, most often used for find-and-replace.
Familiarizing yourself with regular expressions in general is extremely important.
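A few one-liners tying these commands together (sequences.fasta is a throwaway input created on the spot for the demo):

```shell
# A tiny FASTA file: two sequences of length 8 and 3
printf '>seq1\nACGTACGT\n>seq2\nGGG\n' > sequences.fasta

# awk: total length of all sequence (non-header) lines
awk '!/^>/ { total += length($0) } END { print total }' sequences.fasta   # → 11

# sed: swap the dots in a sample ID for underscores
echo 'subject.1.day7' | sed 's/\./_/g'   # → subject_1_day7

# find: every FASTA file at or below the current directory
find . -name '*.fasta'
```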
4. Knowing your Server
Data scientists with cloud computing experience are no strangers to connecting to a remote server or VM instance. Chances are, if you are working on a couple of files locally, you won’t need to open your terminal apart from starting an IDE or a Jupyter notebook.
In no time you’ll be able to find your way around a terminal blindfolded. An absolute lifesaver is using the Remote-SSH extension in Visual Studio Code to connect to your remote server.
Something I had to learn was server etiquette. If you are sharing a server for work, there is likely a process or thread limit per user, to avoid overloading the system and keep it in balance.
The best way to check what processes you are currently running is with the htop command. You can also filter by user:
htop -u linda
Use kill to stop a process by its ID, or pkill to kill a process by name.
It’s also best practice to be conservative about the space you use. Check free disk space with df -h, and the size of a folder with du -h <foldername>.
You’ll also be using scp a lot to transfer files from your local machine to remote servers and vice versa.
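Before copying anything around, it’s worth checking what a folder actually weighs. A quick sketch (bigdir and the dummy read file are just for illustration):

```shell
mkdir -p bigdir
head -c 1048576 /dev/zero > bigdir/reads.fastq   # a 1 MiB dummy file

du -sh bigdir   # total size of the folder
df -h .         # free space on the current filesystem
```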
5. Understanding what tools to apply to your data
Biological problems are complex and there are many layers to the analysis. Coming from financial data, where you might be concerned with rolling averages or time-series percentages, the processing and calculations behind sequencing data are dense, involving many dependent steps.
I had heard the term ‘pipeline’ thrown around a lot in connection with genomics, but it never really clicked. Now I know it means a long list of functions and programs that you ‘push’ your data through, transforming it at different steps to ultimately generate a result.
Also depending on the question being asked by the experiment, the angle of your analysis will change the pipeline and the approach to the analysis. Below is a great flow chart I found on https://astrobiomike.github.io outlining a generic metagenomics workflow.
Metagenomics is just one facet of bioinformatics. There are also transcriptomics (e.g. RNA-Seq analysis), phylogenomics, proteomics, pangenomics… each opening avenues to different pipelines and tools to apply to your dataset and biological question. You can see how a large-scale study spanning multiple threads of bioinformatics compounds the skills and knowledge required.
6. R (and Python)
A bioinformatician’s weapon of choice is R. There are many statistical packages in R written especially for analysing genomic data, such as DESeq2 and phyloseq. Bioconductor is the R project that hosts a whole bioinformatics armoury.
Coming from the world of Python, R’s functional syntax can be at odds with Python’s object-oriented style. The numerous libraries and built-in stats packages in R, which are quick to implement, will soon warm you to its syntax. R’s ggplot2 library also makes fantastic publication-quality plots with a high level of customization.
A lot of bioinformatics software is written in Python, for example MetaPhlAn2 and metaSPAdes, so there is no harm in being proficient with it. The Biopython library (e.g. Bio.SeqIO) can come in handy for hacking together a quick script to process FASTA files.
7. Statistics
The statistics side of things is, not surprisingly, a lot more extensive than anything I had encountered before. Pulling out an average or a percentage for a dataset seems trivial in comparison to the stats involved in making sense of genomics data.
A non-exhaustive list of stats I’ve encountered off the top of my head include:
- Methods for normalizing and rarefying your data
- Calculating alpha diversity, which involves probability metrics such as Shannon entropy and Simpson’s index, and measures of species richness such as Chao1 or ACE
- PCoA/NMDS plots to visualize distance metrics, with PERMANOVA testing for significant differences between the centroids of clustered groups
- Differential gene expression calculates the log2 fold change between groups
- Comparing means of groups – t-test, ANOVA
- Comparing ranks of groups – Kruskal-Wallis
- P-value adjustment and correction
- Machine learning and model accuracy validation
- The E-value in sequence alignment
I’m definitely not a statistics wizard by any means but am working towards understanding the various formulae and when it is appropriate to apply them.
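As a taste, Shannon entropy is just H = -Σ p_i ln(p_i) over the relative abundances, which even awk can handle. A sketch, where counts.txt (one taxon count per line) is invented for the demo:

```shell
printf '10\n10\n10\n10\n' > counts.txt   # four equally abundant taxa

# Accumulate the counts, then convert each to a proportion and sum -p*ln(p)
awk '{ c[NR] = $1; total += $1 }
     END { for (i in c) { p = c[i] / total; H -= p * log(p) }; print H }' counts.txt
# → 1.38629 (ln 4, the maximum diversity for four taxa)
```

In practice you would reach for a package like phyloseq rather than hand-rolling this, but seeing the formula spelled out makes the metric much less mysterious.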
8. Data Visualization
If you enjoy data visualization you will love bioinformatics! The diversity of plots for representing data is huge. Boxplots, PCoA, NMDS, MA, violin and volcano plots, heatmaps, phylogenetic trees… all exotic in comparison to bar plots and line plots. It’s a data scientist’s dream!
9. Cloud Computing
If your workplace doesn’t have a bare-metal server, you will be connecting to a VM instance in the cloud. AWS and GCP are probably the most popular cloud computing platforms, both with their own command-line tools.
Even if you have enough space and resources on your work server, it’s definitely a good idea to back your work up on an independent server. Google cloud storage is a cheap and easy option to use.
10. Interpreting Biological context and experimental design
Because you are in the unique and privileged position of being handed the result of someone’s weeks or months of lab work to analyze, you will need to understand the experimental design and the biological context.
You’ll soon find yourself reading the scientific literature surrounding the topic and getting intrigued about the instruments and methods required for the wet lab side of things.
Being exposed to many different projects and types of experiments producing sequencing data makes the job extremely interesting and varied.