DATA-MINING TIPS 


A wealth of unexploited NGS data is available online, ready for researchers to scrutinize. Why repeat a study that has already been conducted, spending money and time on something that can be downloaded directly from online resources?


Here we give you some tips and tricks on how to get started with digging for data gold.

DATA-MINING SERIES PART 1

How to do molecular biology research from home


Did you know that most journals today require you to upload NGS data upon acceptance for publication?


What that means for you is that there is a wealth of information available in the form of a massive collection of high-throughput data sets in a central repository.


All you need is a computer and an internet connection. With these you can:

Improve the impact of your current study by comparing your findings to related published investigations

OR find a strong basis for designing your next project.


DATA-MINING SERIES PART 2: INTRODUCTION TO GENE EXPRESSION OMNIBUS (GEO)


Introduction


Researching development and disease entails the use of comparative techniques that explore the changes in a cell between, for example, young and old, wild type (WT) and mutant, or treated and untreated. In addition to these functional changes, techniques that assess changes in RNA levels can also be applied to detect molecular features such as alternative splicing, circularization and editing, and to interpret mutations affecting transcriptional regulation, for example in promoter regions, enhancers or regulatory RNAs.


Advanced comparative technologies, such as Next Generation Sequencing (NGS), provide a wealth of data. Previous techniques, including qPCR and microarrays, provided the researcher with information on the expression of a specific, pre-selected subset of the cellular genes. NGS, on the other hand, gives you a complete catalogue detailing the expression of each individual gene expressed. When used in a publication, NGS data supports the specific aim of that publication. Very often this is not a comprehensive analysis, and a wealth of information is still left in that dataset.


Once a particular study is published, the complete dataset is uploaded to a publicly available repository, the Gene Expression Omnibus (GEO), where it is freely available to other researchers, who can investigate the data in the light of their own research focus.

 

Gene Expression Omnibus (GEO)


GEO is an international, publicly accessible repository for high-throughput functional genomic data sets, including results from microarray and next-generation sequencing projects. The resource contains raw data, processed data and metadata, all of which are searchable and linked to the original research paper. The sample data available in GEO can be used directly for third-party reanalysis in your publication. Today, GEO archives data for approximately 125,000 studies, representing more than 3 million samples from laboratories around the world.


Barrett T. Gene Expression Omnibus (GEO). 2013 May 19. In: The NCBI Handbook [Internet]. 2nd edition. Bethesda (MD): National Center for Biotechnology Information (US); 2013-. https://www.ncbi.nlm.nih.gov/books/NBK159736/

 

 

 

How to use Gene Expression Omnibus

 

RNA-seq data from GEO can be accessed in a number of ways. Here we give a few hints for two approaches: 1) an automated online tool called BioJupies, which does not require you to download and analyse the data locally, and 2) downloading GEO data for local analysis.


 

1. A range of tools have been developed with the specific aim of utilizing the data stored in the GEO repository.

One such tool is BioJupies, a web server application that enables automated RNA-seq analysis through a graphical user interface. This tool can be used for local data or GEO data. Video tutorials are available and serve as an easy way to learn about BioJupies:

https://www.youtube.com/watch?v=KMIrW3wb690&list=PLfq4yYrYksVjtq2-vjwnywqGMLAJwrBsR

 

BioJupies is great because it makes a limited number of advanced analysis steps easily available to any researcher. Using this tool, datasets available online can be further explored and examined for interesting observations.

No bioinformatics skills are needed, since analyses are all done using a graphical user interface.

A limitation of this kind of approach is that only the analyses already made available by BioJupies can be performed, and there is limited scope for changing parameters. Ease of use often comes at the cost of flexibility and options. However, BioJupies is an excellent starting point for researchers new to GEO.

 


 

2. The freely available NGS data on GEO is stored in SRA files, which can be downloaded and converted to fastq files. This download and conversion can be done using fastq-dump from the sra-toolkit.

Install instructions and download links for sra-toolkit can be found here:

https://ncbi.github.io/sra-tools/install_config.html

 

The fastq-dump tool has many options, but a good way to use the tool could be:

fastq-dump --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip [SRR_ID]

[SRR_ID] is the SRA run accession (an ID typically starting with SRR) that is linked from the sample record in GEO.
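If you need the run IDs for a whole study, it can help to collect them in one place first. The sketch below assumes you have saved a comma-separated run table (for example, one exported from the SRA pages linked from a GEO series) in which the first column holds the run accession. The function name list_runs and the file names are our own illustrations, not part of sra-toolkit.

```shell
# Print the run accessions found in a comma-separated run table.
# Assumption: the first column holds the run accession (e.g. SRR...),
# and the first line of the file is a header row.
list_runs() {
    awk -F',' 'NR > 1 && $1 ~ /^[SED]RR[0-9]+$/ { print $1 }' "$1"
}
```

For example, list_runs runs.csv > run_ids.txt would write one run ID per line to a hypothetical file called run_ids.txt.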

 

The above command will download the data with the ID of your choice, convert it from SRA to fastq format, split the output into separate files if the reads are paired-end, and compress the fastq files with gzip so they do not take up too much hard drive space.
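If you want to fetch several runs, the same command can be wrapped in a small loop. This is a minimal sketch, assuming a plain-text file with one run ID per line. The function name dump_runs and the FASTQ_DUMP variable are our own additions for illustration; the fastq-dump options are the ones shown above.

```shell
# Program to call for each run; override it (e.g. FASTQ_DUMP=echo)
# to preview the commands without downloading anything.
FASTQ_DUMP="${FASTQ_DUMP:-fastq-dump}"

# Run fastq-dump for every run ID listed in the file given as $1.
dump_runs() {
    # The list is read on file descriptor 3 so that the called
    # program cannot consume it through standard input.
    while IFS= read -r run_id <&3 || [ -n "$run_id" ]; do
        # Skip blank lines and lines starting with '#'.
        case "$run_id" in
            ''|'#'*) continue ;;
        esac
        "$FASTQ_DUMP" --gzip --skip-technical --readids --read-filter pass \
            --dumpbase --split-3 --clip "$run_id" || return 1
    done 3< "$1"
}
```

Calling dump_runs run_ids.txt (a hypothetical file name) processes the runs one after another and stops at the first run that fails.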

You can see this tutorial on using fastq-dump:

https://edwards.sdsu.edu/research/fastq-dump/

 

The downloaded data can now be analysed as normal local fastq data.
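Before starting an analysis, a quick sanity check of the downloaded files can be useful. The sketch below counts the reads in a gzip-compressed fastq file, assuming the usual four lines per read, and compares the counts of two files from a paired-end run. The function names count_reads and same_read_count are hypothetical helpers, not part of sra-toolkit.

```shell
# Count the reads in a gzip-compressed fastq file.
# Assumption: each read occupies exactly four lines.
count_reads() {
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Succeed only if two fastq files (e.g. the two mates of a
# paired-end run) contain the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}
```

You would pass the names of your own downloaded files as arguments, for example the two files produced for a paired-end run.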

Downloading data from GEO allows you to add to your own data, strengthening your publication or helping you formulate hypotheses for future studies.


© Copyright 2018. All Rights Reserved.