An anonymous programmer under the pseudonym Satoshi Nakamoto published the Bitcoin whitepaper on October 31, 2008. This set in motion a domino effect that spawned the adoption of blockchain technology and consensus algorithms. A number of fields have been invigorated by this, the most obvious being fintech and businesses with a supply chain, where a decentralized ledger of transactions benefits every participant in the network.
Not only did Satoshi catalyze a shift in financial payment systems leading to the dawn of cryptocurrency, they also founded a movement of people who value transparency, privacy and decentralization (or complete autonomy) of their financial dealings above all else.
You’re probably wondering how blockchain can be applicable to genomics, and you’d probably be surprised how ruthless genomic companies can be with your data. Even though you paid them for the service of sequencing and analyzing your DNA, there is no limit to how many times they can profit from your data in the future. And that is just the tip of the iceberg.
To understand how blockchain and genomics can intertwine we must first get the gist of what blockchain even is.
- What is blockchain?
- The mainstream genomics data business model.
- The need for privacy in genomics data.
- DNA storage concerns.
- The concept of storing genomics data on a blockchain.
- Who currently occupies the genomics-blockchain space?
What is blockchain?
Blockchain, first and foremost, is a database. What makes blockchain so notable is the level of security it provides, the pseudonymity of transactions (accounts are addresses rather than names) and the decentralized nature of its architecture. Blockchain provides a single source of truth. This means that all transactions are public and auditable.
A very simple analogy to understand blockchain is to imagine a massive spreadsheet.
- Every transaction becomes a new row to the spreadsheet. Columns in the spreadsheet would include sender, receiver, transaction ID and a timestamp.
- Every transaction is broadcast to the network (everyone can view transactions or rows being added in real-time).
- Everyone on the network can access the entire blockchain history i.e. every transaction that has ever taken place.
- But no one can change the history; transactions are immutable and permanently associated with the blockchain.
- New transactions that occur at the same time are bundled into blocks. These can be thought of as multiple contiguous rows of a spreadsheet.
- Before the blocks get authorized and added to the chain of blocks (below earlier spreadsheet rows), the transactions they contain must first be checked by miners or validators on the network.
- Miners are participants on the network who each hold a copy of the entire blockchain ledger. They repeatedly run a cryptographic hash function over the candidate block, changing a “magic number” (the nonce) each time, until the resulting hash meets the network’s difficulty target. Finding that number is costly but checking it is cheap, and it is what seals the validated transactions into the block and fixes the order of the blocks.
- This algorithm is called proof-of-work (PoW) and is one method of validating blocks and adding them to the chain.
- In return for the miners’ work in legitimizing blocks they receive a financial incentive in the form of a token or the network’s currency.
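To make the mining steps above a little more concrete, here is a minimal Python sketch of a proof-of-work loop. It is a toy and not how bitcoin is actually implemented: the block layout, the `block_hash` and `mine_block` names, and the "hash must start with N zeros" difficulty rule are all simplifying assumptions of mine.

```python
import hashlib
import json
import time


def block_hash(block: dict) -> str:
    """SHA-256 digest of every field in the block except its stored hash."""
    body = {key: value for key, value in block.items() if key != "hash"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def mine_block(transactions: list, previous_hash: str, difficulty: int = 3) -> dict:
    """Try nonces until the block hash starts with `difficulty` zero hex digits."""
    block = {
        "timestamp": time.time(),
        "transactions": transactions,    # the bundled "spreadsheet rows"
        "previous_hash": previous_hash,  # link to the block before this one
        "nonce": 0,                      # the "magic number" being searched for
    }
    target_prefix = "0" * difficulty
    digest = block_hash(block)
    while not digest.startswith(target_prefix):
        block["nonce"] += 1
        digest = block_hash(block)
    block["hash"] = digest
    return block


# Hypothetical example: a genesis block followed by one block of transactions.
genesis = mine_block([], previous_hash="0" * 64)
row = {"sender": "alice", "receiver": "bob", "tx_id": "tx-1"}
block_1 = mine_block([row], previous_hash=genesis["hash"])
print(block_1["nonce"], block_1["hash"])
```

In this toy rule, raising `difficulty` by one makes the search about 16 times longer on average, while checking a finished block still takes a single hash.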
“Decentralization prevents a single entity from controlling the data; immutability guarantees that data cannot be altered; and security is ensured by protecting accounts with enhanced cryptographic method.” (Security and Privacy on Blockchain)
Electricity goes in, valid transactions come out.
There is no longer any middleman or centralized office that can withhold or delay your transaction, deny you access to your account, or hide information about the network’s total capacity or traffic. What’s happening on the network is completely transparent.
Blockchain development is still in its infancy, and what I just described is the most energy-inefficient approach, the one underlying bitcoin and other PoW cryptocurrencies. Other algorithms such as proof-of-stake (PoS), delegated proof-of-stake (DPoS) and proof-of-space (PoSpace) are evolutions of PoW which aim to maximize speed and minimize computation, but to achieve this they compromise decentralization and security on some level.
Future iterations of this algorithm and modifications to the blockchain data structure can be tailored to a specific business requirement and will undoubtedly become more sophisticated.
The take-home point is that the nature of the blockchain ecosystem and the flow of data through it make it practically impossible to cheat the system by inserting false transactions or altering data after the fact.
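Here is a small Python sketch of why rewriting history gets noticed. It leaves out mining and networking entirely and only shows the hash linking between blocks. The `append_block` and `is_valid` helpers and the block fields are illustrative assumptions, not any real blockchain’s format.

```python
import hashlib
import json


def digest(body: dict) -> str:
    """SHA-256 digest of a block body in a canonical JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Add a block whose hash covers its transactions and the previous hash."""
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {
        "index": len(chain),
        "transactions": transactions,
        "previous_hash": previous_hash,
    }
    chain.append({**body, "hash": digest(body)})


def is_valid(chain: list) -> bool:
    """Recompute every hash and check each block points at its predecessor."""
    for position, block in enumerate(chain):
        body = {key: value for key, value in block.items() if key != "hash"}
        if block["hash"] != digest(body):
            return False
        if position > 0 and block["previous_hash"] != chain[position - 1]["hash"]:
            return False
    return True


chain = []
append_block(chain, [{"sender": "alice", "receiver": "bob", "amount": 5}])
append_block(chain, [{"sender": "bob", "receiver": "carol", "amount": 2}])
print(is_valid(chain))  # True

# Quietly edit an old transaction: the stored hash no longer matches.
chain[0]["transactions"][0]["amount"] = 500
print(is_valid(chain))  # False
```

On a real PoW network an attacker would also have to redo the mining for the edited block and every block after it, faster than the honest miners keep extending the chain.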
Mainstream Genomic Data Model
The current model for commercial genetic data acquisition and analysis is dominated by private companies such as 23andMe and Ancestry, which have databases of roughly 10 million and 18 million samples respectively.
In the case of 23andMe, 80% of their customers consent to their data being used (sold) for research. Fair enough, I would probably consent too, imagining my data would be used for the greater good. Just a 3 GB drop in the pool of 10 million samples (around 30 petabytes at that size), my DNA could contribute to discovering new insights about human health and disease.
What most of these customers do not know is that their data is sold multiple times over to whoever wants to buy it. The companies profit from the data their customers pay them to take. A pretty ingenious business model.
Companies like 23andMe have a monopoly on genomic data and are gatekeeping it from the research community and anyone who doesn’t have the $300M to access it. A research institution with modest means is restricted to an application process with a limited number of participants.
Imagine how valuable a search engine or database of complete human genomes, where you could filter by disease type, lifestyle, body characteristics or demographic, would be for the scientific community. Like an open-source PubMed containing millions of completely anonymous human genomes.
The need for privacy in genomics
Acquiring data has become a game for many companies. Human behaviour, personality and ideology can be predicted very closely with social media data alone.
Genome analysis is rapidly expanding our understanding of human health and our predispositions. Genomic markers can predict your propensity for addiction, developing a mental health disorder, the likelihood of getting acne, heart disease… the list goes on.
The drive towards providing a genomics-based personal medicine solution and the behemoth of data being generated through NGS technology are only going to accelerate our understanding of what makes us human. Google’s DeepMind AI, AlphaFold, can now predict protein structures with almost 90% accuracy.
It really is remarkable, yet I am hesitant to provide my DNA to a corporate entity.
Your DNA is 100% you, present in every cell of your body, and so much more valuable than the signup form info we are so accustomed to providing. Together with the heart rate, GPS and sleep tracking data your smart watch gathers, your DNA sequence would complete your profile.
In a not-too-distant future one could imagine how biometric data could be hacked and someone posing as you could access your data, or implicate you in a crime through DNA fraud (or a genuine crime, if that’s how you roll). Insurance companies could increase your healthcare premium because you have a biomarker for a disease, or marketing agencies could direct more targeted advertisements tailored to your propensities.
For those of us who are interested in learning about ourselves through our DNA but are concerned about the privacy of our data, how can a fair and transparent transfer of our DNA as a resource be accommodated, ensuring the owner has access and distribution rights while allowing scientists and medical professionals to study our data without us being exploited?
DNA Storage concerns
Sequencing data presents a scalability challenge and a computational burden given the disk space and memory required to store and process FASTQ files. Cloud computing is an obviously appealing storage solution, but an expensive one which compromises privacy.
Currently companies store your DNA data on a centralized server such as a cloud computing platform or a bare metal server. Already this presents privacy and data autonomy issues. Technicians maintaining the server and analysts with access permission can download, view or potentially corrupt any of the files. Centralized servers also represent a point of vulnerability where a data breach or server outage could lead to loss of the data.
It’s clear industries generating genomic data need a unified method and standard for data storage. I see the potential blockchain can have for revolutionizing the way we think about and access DNA data.
Blockchain could be leveraged to:
- Reward resource sharing (computation and storage).
- Facilitate decentralized data distribution.
- Promote collaborative work.
- Enable genome privacy.
Cryptography and compression
The architecture of a genomic blockchain would be no easy task to set up. Storing huge files in a decentralized manner is not a trivial endeavour.
A Python implementation built on top of the open-source MultiChain blockchain was contributed by the authors of the 2020 paper Storing and analyzing a genome on a blockchain. Their approach was to store DNA sequences on a blockchain as BAM files, a compressed binary format for aligned sequencing reads. Over this they built a layer of abstraction which they called a “data stream”. This layer contains metadata about the individual BAM file it is associated with (publisher, transaction ID, block confirmation etc.). The data stream metadata, stored in a dictionary format, serves as an index for the BAM files, making data querying and retrieval fast through key-value lookups.
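As I read it, the index boils down to a dictionary lookup. Here is a minimal Python sketch of that idea. It does not use the MultiChain API or the authors’ code: the `StreamItem` and `DataStream` classes, their field names and the example values are all hypothetical and only illustrate key-value retrieval of metadata that points at a BAM file.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StreamItem:
    """Metadata for one stored BAM file (field names are illustrative)."""
    key: str
    publisher: str
    txid: str
    confirmations: int
    bam_path: str


class DataStream:
    """An in-memory stand-in for a metadata index over BAM files."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: Dict[str, StreamItem] = {}

    def publish(self, item: StreamItem) -> None:
        # Index the item by its key so later lookups are a single dict access.
        self._items[item.key] = item

    def get(self, key: str) -> Optional[StreamItem]:
        return self._items.get(key)

    def by_publisher(self, publisher: str) -> List[StreamItem]:
        # A linear scan; a real system would likely keep a second index.
        return [item for item in self._items.values() if item.publisher == publisher]


stream = DataStream("genomes")
stream.publish(StreamItem("sample-001:chr1", "lab-a", "tx-abc", 6, "sample-001.chr1.bam"))
stream.publish(StreamItem("sample-002:chr1", "lab-b", "tx-def", 2, "sample-002.chr1.bam"))

item = stream.get("sample-001:chr1")
print(item.bam_path)
print([entry.key for entry in stream.by_publisher("lab-b")])
```

The point of the design is that a query only touches the small metadata records, and the large BAM file is fetched once the right record has been found.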
They’ve also set in place the architecture for helper modules to be built upon their blockchain data structure. These modules can process and operate on the BAM files directly.
This is just one example of an open-source implementation of a genomics blockchain. With additional layers, such as an API to send your BAM files to the network for mining, token exchanges for miners and smart contracts, you could see how this project could become a budding genomics data ecosystem.
One caveat for adoption of a genomic blockchain model is GDPR. EU rules in force since 2018 stipulate that any user can have their personal data deleted from a provider’s platform. Enabling a deletion request on an immutable ledger would require a new solution or a modification of blockchain, which in turn would require some planning and community consensus.
Who currently occupies the blockchain-genomics space?
A number of startups that embrace the philosophy of decentralization have realized the potential market gap for the incorporation of blockchain into the genomics industry.
Of the currently active projects, Encrypgen with their Gene-Chain blockchain is a company I understand to have a solid vision and values. It is run by a husband and wife team: David Koepsell, an ethicist and author of many genomics books including Who Owns You? Science, Innovation, and the Gene Patent Wars, and Dr. Vanessa Gonzalez, a genomic scientist. They founded Encrypgen after winning a lawsuit against Myriad Genetics, who were trying to patent the breast cancer genes BRCA1 and BRCA2. This ultimately prevented Myriad from gaining a monopoly on breast cancer testing and essentially owning the rights to a segment of DNA endogenous to all humans.
Their whitepaper provided a lot of inspiration for this post. They clearly present the cons of the current genomic data market and how Gene-Chain aims to create an open and currency-agnostic market for buying and selling data. You can register for free and upload your DNA to https://encrypgen.com/ and get paid in their DNA token. You can also buy data of interest. I made an account and tried searching for data I could buy.
There are a number of options you can filter from such as data provider, ethnicity, allergies etc.
I surmised they don’t have very much metadata about the sequences users have provided (or many sequences at all), as the filtered options didn’t return much data. That’s to be expected. I understand this concept of selling your DNA data is fairly underground, and if you weren’t interested in blockchain, how would you find out about storing your DNA on one? A UI revamp and modern rebranding campaign would have Encrypgen well on their way.
It’s an exciting project as they have laid the groundwork for the open-source genomics database and marketplace I envision.
Another notable point is that they founded the Genomic Blockchain consortium to further the open source nature of genomic blockchain development and discuss democratic governance structure. This gives me the impression that they are an overall ethical organization.
You can look at the transaction logs for their blockchain here: https://etherscan.io/address/0x82b0E50478eeaFde392D45D1259Ed1071B6fDa81
- Aiming to create a decentralized genomic marketplace
- You have to provide your own sequence
- Have their own token – DNA – built upon the Ethereum blockchain
- Whitepaper outlines clearly their transparent objective and vision
- Interested in collaborating and consolidating standards in genomics blockchain development
- Ethical founders
- Project still in its infancy, launched in November 2018
A well-funded start-up is Nebula Genomics, whose founder Prof. George Church of Harvard University wants to incentivize sharing of genomic data by paying users who contribute their data with crypto tokens.
Church has been involved in some interesting projects such as cloning woolly mammoth genes into Asian elephants and would probably consider himself a serial entrepreneur, having co-founded 22 companies.
Church is also a founder of the successful genome sequencing company Veritas, and Nebula seem to be jumping on the blockchain bandwagon to leverage this marketing angle. After I signed up to their website, I received multiple emails about deals to have my genome sequenced. They are very much offering a product rather than a philosophy, and blockchain is offered as a feature rather than a focus.
Nebula do 30x Whole Genome Sequencing, which basically reads every letter in your genome 30 times on average to ensure clinical accuracy, resulting in files around 100 GB in size.
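As a rough sanity check on that file size, here is a back-of-the-envelope calculation in Python. The inputs are my own assumptions and not Nebula’s figures: a genome of about 3.1 billion base pairs, and about 2 bytes per sequenced base in an uncompressed FASTQ file (one character for the base and one for its quality score, ignoring read headers).

```python
def estimate_wgs_size(genome_size_bp: float = 3.1e9,
                      coverage: float = 30.0,
                      bytes_per_base: float = 2.0):
    """Return (total bases sequenced, uncompressed size in GB) for a WGS run."""
    total_bases = genome_size_bp * coverage               # coverage = bases read / genome size
    uncompressed_gb = total_bases * bytes_per_base / 1e9  # decimal gigabytes
    return total_bases, uncompressed_gb


total_bases, uncompressed_gb = estimate_wgs_size()
print(f"{total_bases:.2e} bases sequenced")
print(f"about {uncompressed_gb:.0f} GB before compression")
```

Under these assumptions the raw text comes out at a couple of hundred gigabytes, so a compressed file on the order of 100 GB is in the right ballpark.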
On their website, they state they are currently building a blockchain system. They have partnered with Oasis Labs who will be responsible for the development and implementation of the underlying blockchain architecture.
Nebula have some interesting design principles for their blockchain. They plan for:
- A ranking algorithm to measure the value of nodes on the network (similar to Google’s PageRank).
- “Self-evolving” architecture, meaning the ability to update all nodes on the network in the case of implementing a system update or software patch.
- A proof of devotion consensus algorithm which rewards users who are the most “devoted” to the network.
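For readers who haven’t met PageRank, here is a minimal Python sketch of the classic algorithm that the first design principle is compared to. This is plain PageRank on a made-up graph, not Nebula’s ranking algorithm, whose details I haven’t described here. The `pagerank` function, the node names and the damping factor of 0.85 are illustrative assumptions.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank.

    `links` maps each node to the list of nodes it points at. Every node
    that appears as a target must also appear as a key in `links`.
    """
    nodes = list(links)
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the small "teleport" share.
        new_rank = {node: (1.0 - damping) / count for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / count
        rank = new_rank
    return rank


# Hypothetical network: three nodes interact with "c", which interacts with "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]})
for node, score in sorted(ranks.items(), key=lambda pair: -pair[1]):
    print(node, round(score, 3))
```

The node that the most other nodes point at ends up with the highest score, which is the intuition behind ranking participants by how much the rest of the network interacts with them.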
From a product and technical viewpoint, Nebula seems solid. You can watch a product review here where it’s evident there is a good bit of functionality built into the Nebula backend.
- Technical whitepaper and innovative blockchain design
- Published proof-of-concept articles in scientific journals
- Sequence your DNA for you ($300)
- Sophisticated analysis of your sequence and probably have a large existing bank of sequences from partner company Veritas
- Updated analysis and access to new NGS findings (for a subscription fee)
- Corporate blockchain
- Blockchain design tends towards centralization e.g. Nebula Force implements system-wide updates
- Partner Oasis Labs is a “privacy first” blockchain company
- Blockchain as a feature rather than a focus
- High profile collaborators and investors
The concept of genome data being incorporated onto a blockchain is very exciting. Hopefully more people will get working on implementing an open source blockchain for widespread adoption with genomics data.
It’s great that companies like Nebula are offering to store your data on a blockchain; however, it still seems that there are terms and conditions you would have to operate within in their blockchain ecosystem.
Feel free to comment or correct me if there are any aspects regarding the projects or technology I have missed or have misunderstood.
Encrypgen’s Gene-Chain whitepaper https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4woQm
Realizing the potential of blockchain technologies in genomics https://genome.cshlp.org/content/28/9/1255.full.pdf+html
Blockchain Technology: Beyond Bitcoin https://j2-capital.com/wp-content/uploads/2017/11/AIR-2016-Blockchain.pdf
Data privacy in the age of personal genomics https://drive.google.com/file/d/1ctEU0haB553aKjyXN4CxqcNxOaUC9d35/view