INSDC Environment Sample Sequences

Citation

European Bioinformatics Institute (EMBL-EBI), GBIF Helpdesk (2024). INSDC Environment Sample Sequences. Version 1.97. European Nucleotide Archive (EMBL-EBI). Occurrence dataset https://doi.org/10.15468/mcmd5g accessed via GBIF.org on 2024-08-12.

Description

This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: environmental_sample=True & host=""
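As a rough illustration of the query described above, the sketch below only builds a request URL against the ENA Portal API base address quoted in the description. The `search` endpoint, the `result`, `format` and `limit` parameters, and the `AND`-joined query syntax are assumptions that do not appear in the description; the actual extraction code may differ.

```python
from urllib.parse import urlencode

# Base address quoted in the dataset description.
ENA_PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api/"


def build_ena_search_url(limit=10):
    """Build (but do not send) a search URL for environmental sample sequences."""
    # The two conditions come from the description; joining them with AND
    # and writing "true" in lowercase is an assumption about the query syntax.
    query = 'environmental_sample=true AND host=""'
    params = {
        "result": "sequence",  # assumption: result type to search
        "query": query,
        "format": "tsv",       # assumption: tab-separated output
        "limit": limit,        # assumption: cap on returned rows
    }
    return ENA_PORTAL_API + "search?" + urlencode(params)


if __name__ == "__main__":
    print(build_ena_search_url(limit=5))
```

The function stops at URL construction so that the sketch makes no claim about response fields or pagination.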

EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

The data was then processed as follows:

1. Human sequences were excluded.

2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or to a group of organisms of the same species in the same sample). Only one record was kept for each combination of scientific name and sample accession number.

3. Contigs and whole genome shotgun (WGS) records were added individually.

4. Records missing key information were excluded. Only records associated with a specimen voucher, or records containing both a location AND a date, were kept.

5. Records associated with the same voucher were aggregated together.

6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession number was missing. To identify these potential duplicates, we grouped all the remaining records by scientific_name, collection_date, location, country, identified_by, collected_by and sample_accession (when available), then excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

7. To improve the matching of EBI scientific names to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
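The filtering logic in steps 1, 2, 4 and 6 can be sketched as follows. This is a minimal illustration, not the embl-adapter implementation: the record layout (plain dictionaries), the `is_contig`, `specimen_voucher` keys, the function names, and the use of the scientific name "Homo sapiens" to detect human sequences are all assumptions. Voucher aggregation (step 5) and taxonomy enrichment (step 7) are left out.

```python
from collections import defaultdict

# Grouping fields and threshold as listed in step 6.
GROUP_FIELDS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)
MAX_GROUP_SIZE = 50


def has_required_info(rec):
    """Step 4: keep records with a voucher, or with both a location and a date."""
    has_voucher = bool(rec.get("specimen_voucher"))  # hypothetical key name
    has_place_and_date = bool(rec.get("location")) and bool(rec.get("collection_date"))
    return has_voucher or has_place_and_date


def filter_records(records):
    # Step 1: exclude human sequences (assumption: identified by name).
    kept = [r for r in records if r.get("scientific_name") != "Homo sapiens"]

    # Step 2: for non-CONTIG records with a sample accession, keep one
    # record per (scientific name, sample accession) pair. Contig records
    # pass through untouched, which mirrors step 3.
    seen = set()
    deduplicated = []
    for r in kept:
        accession = r.get("sample_accession")
        if not r.get("is_contig") and accession:  # hypothetical flag
            key = (r.get("scientific_name"), accession)
            if key in seen:
                continue
            seen.add(key)
        deduplicated.append(r)

    # Step 4: drop records lacking the required information.
    complete = [r for r in deduplicated if has_required_info(r)]

    # Step 6: group by the listed fields and drop oversized groups.
    groups = defaultdict(list)
    for r in complete:
        groups[tuple(r.get(f) for f in GROUP_FIELDS)].append(r)
    return [r for g in groups.values() if len(g) <= MAX_GROUP_SIZE for r in g]
```

Missing fields are treated as `None` when grouping, so records without a sample accession still fall into a common group with otherwise identical records, which is the case step 6 is meant to catch.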

More information is available here: https://github.com/gbif/embl-adapter#readme

The mapping used to convert the EMBL data to a Darwin Core Archive is available here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

Taxonomic Coverages

Geographic Coverages

Worldwide

Bibliographic Citations

Contacts

European Bioinformatics Institute (EMBL-EBI)
originator
email: datasubs@ebi.ac.uk
homepage: http://www.ebi.ac.uk

GBIF Helpdesk
metadata author
email: helpdesk@gbif.org

European Bioinformatics Institute (EMBL-EBI)
administrative point of contact
email: datasubs@ebi.ac.uk
homepage: http://www.ebi.ac.uk