Where To Get Fast5 Files

Researchers and bioinformaticians working with nanopore sequencing data often need to know where to get fast5 files. Fast5 is an HDF5-based format that stores the raw electrical signal recorded during a sequencing run. That signal is required for re-basecalling, for signal-level analyses such as modified-base detection, and for developing new algorithms. The files are produced by Oxford Nanopore Technologies devices, such as MinION, GridION, and PromethION.

This guide covers the main sources of fast5 files: public repositories, institutional datasets, sequencing service providers, and your own sequencing runs. It is written for both beginners and experienced researchers who need to locate and download raw signal data efficiently.

---

Public Repositories and Data Archives for fast5 Files

One of the most accessible ways to obtain fast5 files is through public repositories that host nanopore sequencing datasets. These repositories are invaluable for benchmarking, method development, or educational purposes.

1. National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA)

The NCBI SRA is a comprehensive repository that stores raw sequencing data, including nanopore datasets.

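- Accessing fast5 files:
SRA mainly serves basecalled reads in its own normalized format. When a submitter uploaded fast5 files, SRA keeps them as original-format files. These are listed on the run's "Data access" tab in the SRA Run Browser and are typically served from cloud storage, not extracted with `fastq-dump`.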

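- How to download:
The SRA Toolkit, a command-line tool, downloads runs by accession:
```bash
# SRRXXXXXXX is a placeholder; substitute a real run accession.
ACCESSION=SRRXXXXXXX
prefetch "$ACCESSION"
fastq-dump --split-files --skip-technical --readids --dumpbase --clip "$ACCESSION"
```
Note: `fastq-dump` writes basecalled reads as FASTQ and does not produce fast5 files. Use it for the sequences, and fetch the original-format files separately when the run includes them.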

- Limitations:
Not all datasets in SRA include fast5 files; some are only basecalled or processed.
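
Because archives mix basecalled and raw files, it can help to confirm that a download really is a fast5 container before starting an analysis. The sketch below relies on the fact that fast5 is HDF5 and checks for the 8-byte HDF5 signature at the start of the file. The function names are illustrative, and the check assumes the file has no HDF5 user block in front of the signature.

```python
from pathlib import Path

# The 8-byte signature that opens an HDF5 file (fast5 is an HDF5 container).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_fast5(path):
    """Return True if the file starts with the HDF5 signature.

    This only checks the container format; it does not prove that the
    file holds nanopore signal data.
    """
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def split_downloads(directory):
    """Sort *.fast5 files under `directory` into (valid, suspicious) lists."""
    valid, suspicious = [], []
    for path in sorted(Path(directory).rglob("*.fast5")):
        (valid if looks_like_fast5(path) else suspicious).append(path)
    return valid, suspicious
```

Files that end up in the suspicious list are usually truncated downloads or text files saved with the wrong extension.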

---

2. European Nucleotide Archive (ENA)

ENA is another major public archive hosting sequencing data, including nanopore datasets.

- Access points:
ENA offers raw data downloads through its web portal or APIs.

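- Downloading fast5 files:
Files uploaded by the submitter, which is where fast5 archives appear when they were deposited, are listed under "Submitted files" in the ENA Browser and in the `submitted_ftp` field of the ENA Portal API file report. Look for file names containing "fast5", often packaged as `.tar.gz` archives, and download them from ENA's FTP servers.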

- Advantages:
ENA often provides more detailed metadata, facilitating easier identification of relevant datasets.
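
As a rough illustration of the API route, the sketch below builds an ENA Portal API file-report URL and filters the returned table for submitted files whose names mention fast5. The helper names are illustrative, and the name-based filter is an assumption: submitters are free to name their archives differently.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build a file-report URL listing submitted and FASTQ files per run."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,submitted_ftp,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def submitted_fast5(report_tsv):
    """Map run accession -> submitted files whose names mention fast5."""
    hits = {}
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        # ENA separates multiple files in one field with semicolons.
        files = [f for f in (row.get("submitted_ftp") or "").split(";") if f]
        fast5 = [f for f in files if "fast5" in f.lower()]
        if fast5:
            hits[row["run_accession"]] = fast5
    return hits
```

In use, you would fetch the text at `filereport_url(...)` for a study or run accession (for example with `urllib.request.urlopen`) and pass the decoded response to `submitted_fast5`.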

---

3. Open-access Nanopore Data Repositories

Several repositories are specifically dedicated to nanopore sequencing datasets:

- Nanopore Community and ONT Open Data:
Oxford Nanopore Technologies shares datasets through its Nanopore Community site and through the ONT Open Data project hosted on AWS. Many of these releases include raw signal files, although newer ones may use the POD5 format instead of fast5.

- Nanopore WGS Consortium:
Whole-genome sequencing of the human reference sample NA12878, released publicly with raw fast5 files.

- Data repositories like Zenodo and Figshare:
Researchers sometimes upload their raw fast5 datasets here, often associated with publications.

---

4. Specialized Data Repositories and Initiatives

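- NIH-hosted data:
NIH-hosted nanopore datasets are generally reached through NCBI, which is part of the NIH's National Library of Medicine, and not through a separate nanopore-specific archive. Start with SRA when looking for them.

- The 1000 Genomes Project:
Long-read resequencing of 1000 Genomes samples, much of it aimed at structural variants, has included Oxford Nanopore data. Whether raw signal files are released varies by project, so check each dataset's file listing.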

---

Accessing fast5 Files from Sequencing Service Providers

Many researchers outsource sequencing to commercial providers or core facilities. These services can usually deliver raw fast5 data on request, though many provide only basecalled reads by default.

1. Oxford Nanopore Technologies Directly

- How to obtain fast5 files:
When ordering a sequencing run, specify that raw fast5 files are required. The provider will deliver data via download links, FTP servers, or cloud storage.

- Data delivery options:
- Direct download through secure links.
- Cloud-based access like AWS or Google Cloud.
- Physical delivery on external drives or storage media in some cases.

2. Commercial Sequencing Services and Core Facilities

- Many sequencing centers and commercial labs offer fast5 files as part of their data package. Confirm during the contract or order process that raw data in fast5 format will be included.

- Advantages:
- High-quality data with proper metadata.
- Support for large datasets.

- Considerations:
- Data privacy and usage rights.
- Possible costs associated with data delivery.
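
Large deliveries are prone to incomplete transfers, so it is worth verifying checksums once the files arrive. The sketch below assumes the provider supplies an md5sum-style manifest (one `<hash>  <file name>` pair per line) next to the data; both that manifest and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte archives never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Check files against an md5sum-style manifest.

    Returns the names that are missing or whose checksum differs.
    """
    manifest_path = Path(manifest_path)
    failures = []
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with "*"
        target = manifest_path.parent / name
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_manifest` means every listed file is present and matches its recorded checksum.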

---

Generating fast5 Files from Raw Data

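Fast5 is itself the raw data format, so "generating" fast5 files in practice means recording them during your own sequencing run, having a basecaller write annotated copies, or converting raw signal that is stored in another format.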

1. Basecalling and Data Conversion Tools

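- Guppy Basecaller:
Oxford Nanopore's proprietary basecaller. It reads fast5 files and writes basecalled reads (FASTQ or BAM) and summary files. It can optionally write fast5 copies annotated with the basecalls, but it does not create raw signal.

- MinKNOW:
The control software for nanopore sequencing devices and the original source of fast5 files; it writes raw signal to disk during runs. Recent versions default to the newer POD5 format, so fast5 output may need to be selected in the run's output settings.

- Albacore:
An older, discontinued basecaller that likewise took fast5 input and could write basecalls back into fast5 files.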

2. Using Raw Signal Data Files

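- If you already have `.fast5` files, you can re-basecall them directly with Guppy or other compatible tools; archives such as `.tar.gz` need to be unpacked first. Raw signal stored in other containers, such as POD5 (`.pod5`) or SLOW5/BLOW5 (`.slow5`, `.blow5`), can be converted to fast5 with the `pod5` and `slow5tools` utilities.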

3. Reconstructing fast5 Files from Processed Data

- While less common, some tools and scripts wrap processed data in pseudo-fast5 files. The original signal cannot be recovered from basecalled reads, so these files are not suitable for signal-level analysis and are generally not recommended.
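
Raw signal is sometimes delivered in containers other than fast5, such as Oxford Nanopore's newer POD5 format or the community SLOW5/BLOW5 format. The sketch below picks a conversion command by file extension and returns it as an argument list without running it. The `pod5 convert to_fast5 ... --output` and `slow5tools s2f ... -o` invocations reflect my understanding of those tools' command lines and should be treated as assumptions; check each tool's `--help` before relying on them. The function name is illustrative.

```python
from pathlib import Path


def conversion_command(signal_file, out_dir):
    """Return the command that would turn `signal_file` into fast5, or None.

    None means the file is already fast5 and needs no conversion.
    """
    path = Path(signal_file)
    suffix = path.suffix.lower()
    if suffix == ".fast5":
        return None
    if suffix == ".pod5":
        # Assumed invocation of ONT's pod5 package.
        return ["pod5", "convert", "to_fast5", str(path), "--output", str(out_dir)]
    if suffix in {".slow5", ".blow5"}:
        # Assumed invocation of slow5tools (s2f = SLOW5 to fast5).
        out_file = Path(out_dir) / (path.stem + ".fast5")
        return ["slow5tools", "s2f", str(path), "-o", str(out_file)]
    raise ValueError(f"unrecognised raw signal format: {path.name}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the corresponding tool is installed. Basecalled formats such as FASTQ raise an error on purpose, because they contain no signal to convert.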

---

Important Considerations When Accessing fast5 Files

- Data Privacy and Usage Rights:
Always verify licensing and permissions, especially when datasets are from private or unpublished sources.

- Metadata and Sample Information:
Ensure that datasets include sufficient metadata, such as sample origin, sequencing conditions, and run parameters.

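- File Size and Storage:
Fast5 files are large. A single run often produces tens to hundreds of gigabytes, and high-output PromethION flow cells can produce more. Prepare adequate storage and transfer bandwidth.

- Compatibility and Software Requirements:
Use compatible tools for handling fast5 files, such as the HDF5 libraries (with the VBZ compression plugin for recent files), `ont_fast5_api`, or nanopolish. Older tools such as poretools predate the multi-read fast5 layout and may not read current files.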
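
To plan storage and transfers, a quick inventory of what is already on disk is often enough. The following sketch, using only the Python standard library, counts fast5 files and sums their sizes per directory; the function names are illustrative.

```python
from collections import defaultdict
from pathlib import Path


def fast5_inventory(root):
    """Return {directory: (file_count, total_bytes)} for *.fast5 under `root`."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*.fast5"):
        entry = totals[str(path.parent.relative_to(root))]
        entry[0] += 1
        entry[1] += path.stat().st_size
    return {folder: tuple(counts) for folder, counts in sorted(totals.items())}


def human_size(num_bytes):
    """Format a byte count with binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

Looping over `fast5_inventory(run_dir).items()` and printing `human_size(total)` for each directory gives a per-folder summary before copying or archiving a run.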

---

Summary of Key Sources for fast5 Files

- Public repositories:
- NCBI SRA
- ENA
- Nanopore Community Data Portal
- Zenodo, Figshare

- Sequencing service providers:
- Oxford Nanopore Technologies’ official channels
- Commercial and core sequencing facilities

- Generating from raw data:
- Using Oxford Nanopore’s Guppy, MinKNOW, or Albacore software

- Other specialized repositories and initiatives
- NIH Nanopore Data Archive
- 1000 Genomes Project datasets

---

Conclusion

Obtaining fast5 files is an essential step for many nanopore sequencing applications, from research and development to method validation and education. The most straightforward approach is to use publicly available datasets in repositories such as NCBI SRA, ENA, or nanopore-specific data portals, which offer raw signal data for free, provided you confirm that a given dataset includes the submitted fast5 files and not only basecalled reads. For ongoing projects, ask your sequencing service provider to include raw fast5 files in the delivery. Researchers with their own instruments can record fast5 files directly with MinKNOW, which gives flexibility beyond publicly available datasets.

Understanding where to find these files, how to access them, and the considerations involved will empower researchers to utilize nanopore sequencing data effectively, advancing scientific discovery in genomics, transcriptomics, epigenetics, and more.

---

Note: Always ensure compliance with data usage policies and cite datasets appropriately when using publicly available fast5 files in your research.

Frequently Asked Questions

Where can I download fast5 files for Oxford Nanopore sequencing data?

You can access fast5 files from public repositories such as the NCBI Sequence Read Archive (SRA), ENA (European Nucleotide Archive), or from specific sequencing project websites and databases like the Nanopore Community or GitHub repositories associated with nanopore research.

Are there any online datasets that provide fast5 files for educational purposes?

Yes, platforms like the Nanopore Community, Zenodo, and Dryad host datasets containing fast5 files for educational and research purposes, often linked within published papers or shared by research groups.

Can I get fast5 files directly from Oxford Nanopore's MinKNOW software?

MinKNOW writes raw signal files during sequencing runs to the output directory configured for the run, on the sequencing device or the attached computer, and you can copy them from there. Check the run's output settings, because recent MinKNOW versions write POD5 by default instead of fast5.

Is it possible to access fast5 files from cloud-based nanopore data repositories?

Yes, the Registry of Open Data on AWS lists nanopore sequencing datasets, including some with fast5 files, which can be downloaded directly from cloud storage for analysis.

Are there any community forums or platforms where I can find fast5 files shared by researchers?

Platforms like the Nanopore Community Forum, GitHub repositories, and bioinformatics data sharing websites often have users sharing fast5 datasets for collaborative research and troubleshooting.

How do I find fast5 files for specific species or samples?

Search public databases such as SRA or ENA using relevant keywords or accession numbers. Many published studies also provide links to their raw fast5 data in supplementary materials.
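
For ENA, a species-specific search can be scripted. The sketch below builds an ENA Portal API search URL that restricts results to Oxford Nanopore runs for one NCBI taxonomy ID and asks for the `submitted_ftp` field, where uploaded fast5 archives would be listed. The function name and the chosen fields are illustrative.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def nanopore_runs_url(taxon_id, limit=100):
    """Build an ENA search URL for Oxford Nanopore runs of one NCBI taxon."""
    query = f'instrument_platform="OXFORD_NANOPORE" AND tax_eq({int(taxon_id)})'
    params = {
        "result": "read_run",
        "query": query,
        "fields": "run_accession,scientific_name,instrument_model,submitted_ftp",
        "format": "tsv",
        "limit": limit,
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"
```

For example, `nanopore_runs_url(562)` targets Escherichia coli (taxonomy ID 562). The response is a tab-separated table in which runs with an empty `submitted_ftp` column have no submitter-provided files.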

Can I generate fast5 files from my own nanopore sequencing experiment?

Yes. When you sequence a sample on an Oxford Nanopore device, MinKNOW records the raw signal locally as fast5 (or POD5, depending on version and settings), which can then be basecalled and analyzed.

Are there any open-source tools to convert fast5 files into other formats?

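Open-source options include `ont_fast5_api`, Oxford Nanopore's Python library and command-line tools for reading and converting fast5 files, and poretools for older single-read files. Guppy, the basecaller that turns fast5 signal into FASTQ for downstream analysis, is widely used but is proprietary, not open source.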

What are the best practices for storing and sharing fast5 files publicly?

Use reliable data repositories with proper metadata, ensure data privacy if applicable, and adhere to community standards for data sharing, such as submitting to SRA, ENA, or Zenodo with detailed descriptions.

Are there any restrictions or licenses when downloading fast5 files from public sources?

Yes, licensing varies; some datasets are openly available under licenses like CC-BY, while others may have restrictions. Always check the data's usage terms and cite appropriately when using public datasets.