Data Preparation

This section describes how to download the MEDLINE abstracts from the Medline FTP server using Spark. We use HDInsight Spark 2.1 on Linux (HDI 3.6). The FTP server hosts about 812 XML files, each containing about 30,000 abstracts. The fields extracted from the XML files are listed below; we use the extracted abstracts to train the word embedding model.

The [Notebook](1_Download_and_Parse_Medline_Abstracts.ipynb) describes how to download the abstracts to the local drive of the head node of the Spark cluster. Since the data is large (about 30 GB), the download may take a while. We parse the XML files as we download them, using the publicly available Pubmed Parser, and save the results as tab-separated values (TSV) files. The parsed XMLs are stored in a local folder on the head node (you can change this by specifying a different location in the notebook). The parser returns the following fields (a sketch of the download-and-parse loop follows the list):

    abstract
    affiliation
    authors
    country
    delete: boolean; if False, the paper was updated, so you may see two XML records for the same paper
    file_name
    issn_linking
    journal
    keywords
    medline_ta: abbreviation of the journal name
    mesh_terms: list of MeSH terms
    nlm_unique_id
    other_id: other IDs
    pmc: Pubmed Central ID
    pmid: Pubmed ID
    pubdate: publication date
    title
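
For orientation, here is a minimal sketch of the download-and-parse loop the notebook implements. The FTP host and directory, the file-name pattern, the output folder, and the `MAX_FILES` counter are assumptions for illustration; the parsing call assumes the publicly available `pubmed_parser` package is installed on the head node.

    import ftplib
    import os

    import pandas as pd
    import pubmed_parser as pp

    FTP_HOST = "ftp.ncbi.nlm.nih.gov"   # Medline FTP server (assumption)
    FTP_DIR = "/pubmed/baseline"        # directory with the gzipped XMLs (assumption)
    OUT_DIR = "medline_tsv"             # local folder on the head node
    MAX_FILES = None                    # set to a small int to download only a subset

    os.makedirs(OUT_DIR, exist_ok=True)

    ftp = ftplib.FTP(FTP_HOST)
    ftp.login()                         # anonymous login
    ftp.cwd(FTP_DIR)

    xml_files = sorted(n for n in ftp.nlst() if n.endswith(".xml.gz"))
    for name in xml_files[:MAX_FILES]:
        # Download one XML file to the local drive of the head node.
        with open(name, "wb") as f:
            ftp.retrbinary("RETR " + name, f.write)
        # Parse the file into one dict per abstract and save the batch as TSV.
        records = pp.parse_medline_xml(name)
        df = pd.DataFrame(records)
        # Strip embedded newlines from the affiliation field so the TSV
        # stays rectangular (see the note below).
        df["affiliation"] = df["affiliation"].astype(str).str.replace("\n", " ")
        df.to_csv(os.path.join(OUT_DIR, name.replace(".xml.gz", ".tsv")),
                  sep="\t", index=False)
        os.remove(name)                 # free disk space before the next file

    ftp.quit()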

Notes:

  • There are more than 800 XML files on the Medline FTP server. The code in the notebook downloads all of them, but you can change this in the last cell of the notebook (e.g., download only a subset by reducing the counter).

  • Tab-separated files: the Pubmed parser emits a newline for every affiliation, which can break the structure of the TSV files. To avoid this, we explicitly remove newlines from the affiliation field.

  • To install unidecode, you can use a script action on your Spark cluster. Add the following lines to your script file (.sh); other dependencies can be installed in a similar way:

      #!/usr/bin/env bash
      /usr/bin/anaconda/bin/conda install unidecode
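      # Example of installing another dependency (package name illustrative):
      # /usr/bin/anaconda/bin/conda install pandas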
    
  • An egg file is needed to run the Pubmed Parser on the cluster; if it is not included in the repository, you can build one from the Pubmed Parser source.
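
Once you have the egg, you can attach it to the Spark context so the parser is importable on the executors. A minimal sketch in a PySpark notebook (where `sc` is predefined); the egg's file name and location are assumptions:

    sc.addPyFile("pubmed_parser.egg")   # ship the egg to the executors
    import pubmed_parser as pp          # now importable on driver and workers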