This repository was archived by the owner on Jun 29, 2019. It is now read-only.

Reviewer#2-angustaylor #1

Open · wants to merge 7 commits into master
@@ -5,25 +5,26 @@
"metadata": {},
"source": [
"## Download and Parse MEDLINE Abstracts\n",
"This Notebook describes the way you can download and parse the publically available Medline Abstracts. There are about 812 XML files that are available on the ftp server. Each XML file conatins about 30,000 Document Abstracts.\n",
"This notebook shows how you can download and parse the publicly available Medline abstracts. There are about 812 XML files that are available on the ftp server. Each XML file conatins about 30,000 Document Abstracts.\n",
"<ul>\n",
"<li> First we download the Medline XMLs from their FTP Server and store them in a local directory on the head node of the Spark Cluster </li>\n",
"<li> Next we parse the XMLs using a publically available Medline Parser and store the parsed content in Tab separated files on the container associated with the spark cluster. </li>\n",
"<li> Next we parse the XMLs using a publicly available Medline Parser and store the parsed content in tab separated files on the container associated with the spark cluster. </li>\n",
"</ul>\n",
"<br>Note: This Notebook is meant to be run on a Spark Cluster. If you are running it through a jupyter notebbok, make sure to use the PySpark Kernel."
"<br>Note: This notebook is meant to be run on a Spark Cluster. If you are running it through a jupyter notebbok, make sure to use the PySpark Kernel."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using the Parser \n",
"Download and install the pubmed_parser library into the spark cluster nodes. You can us the egg file available in the repo or produce the .egg file by running<br>\n",
"Download and install the pubmed_parser library into the spark cluster nodes. You can use the egg file available in the repo or produce the .egg file by running<br>\n",
"<b>python setup.py bdist_egg </b><br>\n",
"in repository and add import for it. The egg file file can be read from the blob storage. Once you have the egg file ready you can put it in the container associated with your spark cluster.\n",
"in repository and add import for it. The egg file can be read from the blob storage. Once you have the egg file ready you can put it in the container associated with your spark cluster.\n",
"<br>\n",
"**AT** I DO NOT SEE THIS .EGG FILE IN THE REPO\n",
"\n",
"#### Installing a additional packages on Spark Nodes\n",
"#### Installing additional packages on Spark Nodes\n",
"To install additional packages you need to use script action from the azure portal. see <a href = \"https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux\">this</a> <br>\n",
"Here's an example:\n",
"<br> To install unidecode, you can use script action (on your Spark Cluster)\n",
@@ -132,7 +133,7 @@
"metadata": {},
"source": [
"<b> Parse the XMLs and save them as a Tab separated File </b><br>\n",
"There are a total of 812 XML files. It would take time for downloading that much data. Its advisable to do it in batches of 50.\n",
"There are a total of 812 XML files. It will take some time to download this much data. It's advisable to do it in batches of 50.\n",
"Downloading and parsing 1 file takes approximately 25-30 seconds. "
]
},
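The batching advice above might translate into a loop like the following sketch; the FTP URL pattern and file names are placeholders, not the actual Medline paths:

```python
import gzip
import shutil
import urllib.request

import pandas as pd
import pubmed_parser as pp

# Placeholder URL pattern -- substitute the real Medline FTP path used in the
# notebook's download cell.
FTP_PATTERN = "ftp://ftp.example.org/medline/medline_batch_{:04d}.xml.gz"

# Download and parse the first batch of 50 out of the ~812 files.
for i in range(1, 51):
    local_name = "medline_batch_{:04d}.xml.gz".format(i)
    urllib.request.urlretrieve(FTP_PATTERN.format(i), local_name)

    # Decompress before parsing, to keep the sketch independent of whether the
    # parser accepts gzipped input directly.
    xml_name = local_name[:-3]
    with gzip.open(local_name, "rb") as src, open(xml_name, "wb") as dst:
        shutil.copyfileobj(src, dst)

    records = pp.parse_medline_xml(xml_name)   # one dict per abstract
    pd.DataFrame.from_dict(records).to_csv(
        xml_name.replace(".xml", ".tsv"),
        sep="\t", index=False, encoding="utf-8")
```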
@@ -192,9 +193,9 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "NLP_DL_EntityRecognition local",
"display_name": "Python 3",
"language": "python",
"name": "nlp_dl_entityrecognition_local"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
17 changes: 8 additions & 9 deletions Code/01_Data_Acquisition_and_Understanding/ReadMe.md
@@ -1,13 +1,10 @@
## **Data Preparation**
This section describes how to download the [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html) abstracts from the Website using **Spark**. We are using HDInsight Spark 2.1 on Linux (HDI 3.6).
The FTP server for Medline has about 812 XML files where each file contains about 30000 abstracts. Below you can see the fields present in the XML files. We are currently using the Abstracts extracted from
the XML files to train the word embedding model
This section describes how to download the [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html) abstracts from the website using **Spark**. We are using HDInsight Spark 2.1 on Linux (HDI 3.6).
The FTP server for Medline has about 812 XML files where each file contains about 30,000 abstracts. Below you can see the fields present in the XML files. We use the Abstracts extracted from the XML files to train the word embedding model.


### [Downloading and Parsing Medline Abstracts](1_Download_and_Parse_Medline_Abstracts.ipynb)
The [Notebook]((1_Download_and_Parse_Medline_Abstracts.ipynb) describes how to download the local drive of the head node of the Spark cluster. Since the data is big (about 30 Gb), it might take a while to download. We parse the XML files as we download them. We are using a publicly available
[Pubmed Parser](https://github.com./titipata/pubmed_parser) to parse the downloaded XMLs and saving them in a tab separated file (TSV). The parsed XMLs are stored in a local folder on the head node (you can change this by specifying a different location in
the Notebook). The parse XML returns the following fields:
The [Notebook]((1_Download_and_Parse_Medline_Abstracts.ipynb) **AT** LINK DOESN'T WORK describes how to download the data to the local drive of the head node of the Spark cluster. Since the data is big (about 30 GB), it might take a while to download. We parse the XML files as we download them. We use a publicly available [Pubmed Parser](https://github.com./titipata/pubmed_parser) to parse the downloaded XMLs and save them in a tab-separated file (TSV). The parsed XMLs are stored in a local folder on the head node (you can change this by specifying a different location in the notebook). The parsed XML returns the following fields:

abstract
affiliation
@@ -28,11 +25,13 @@ the Notebook). The parse XML returns the following fields:
title

**Notes**:
- There are more that 800 XML files that are present on the Medline ftp server. The code in the Notebook downloads them all. But you can change that in the last cell of the notebook (e.g. download only a subset by reducing the counter).
- With using Tab separated files: The Pubmed parser add a new line for every affiliation. This may cause the TSV files to become unstructured. To **avoid** this we explicitly remove the new line from the affiliation field.
- There are more than 800 XML files present on the Medline ftp server. The code in the notebook downloads them all, but you can change that in the last cell of the notebook (e.g. download only a subset by reducing the counter).
- When using tab-separated files: the Pubmed parser adds a new line for every affiliation, which may cause the TSV files to become unstructured. To **avoid** this we explicitly remove the new line from the affiliation field (see the sketch below).
- To install unidecode, you can use [script action](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux) on your Spark Cluster. Add the following lines to your script file (.sh).
You can install other dependencies in a similar way

#!/usr/bin/env bash
/usr/bin/anaconda/bin/conda install unidecode
- The egg file needed to run the Pubmed Parser is also included in the repository.

- The egg file needed to run the Pubmed Parser is also included in the repository. **AT** I DO NOT SEE THIS .EGG FILE IN THE REPO
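A sketch of the affiliation clean-up mentioned in the notes above; the sample record is made up for illustration:

```python
# Replace embedded newlines in the affiliation field so that each parsed
# record stays on a single row of the tab-separated output.
records = [{"pmid": "12345",
            "abstract": "Example abstract text.",
            "affiliation": "Department of X,\nUniversity of Y"}]

for record in records:
    if record.get("affiliation"):
        record["affiliation"] = record["affiliation"].replace("\r", " ").replace("\n", " ")
```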

24 changes: 12 additions & 12 deletions Code/02_Modeling/01_FeatureEngineering/2_Train_Word2Vec.ipynb
@@ -5,17 +5,17 @@
"metadata": {},
"source": [
"## Train, Evaluate and Visualize the Word Embeddings\n",
"In this Notebook we detail the process of how to train a word2vec model on the Medline Abstracts and obtaining the word embeddings for the biomedical terms. This is the first step towards building an Entity Extractor. We are using the Spark's <a href = \"https://spark.apache.org/mllib/\">MLLib</a> package to train the word embedding model. We also show how you can test the quality of the embeddings by an intrinsic evaluation task along with visualization.\n",
"In this notebook we detail the process of how to train a word2vec model on the Medline abstracts and obtaining the word embeddings for the biomedical terms. This is the first step towards building an entity extractor. We are using the Spark's <a href = \"https://spark.apache.org/mllib/\">MLLib</a> package to train the word embedding model. We also show how you can test the quality of the embeddings by an intrinsic evaluation task along with visualization.\n",
"<br>\n",
"The Word Embeddings obtained from spark are stored in parquet files with gzip compression. In the next notebook we show how you can easily extract the word embeddings form these parquet files and visualize them in any tool of your choice.\n",
"The word embeddings obtained from spark are stored in parquet files with gzip compression. In the next notebook we show how you can easily extract the word embeddings form these parquet files and visualize them in any tool of your choice.\n",
"\n",
"<br> This notebook is divided into several sections, details of which are presented below.\n",
"<ol>\n",
"<li>Load the data into the dataframe and preprocess it.</li>\n",
"<li>Tokenize the data into words and train Word2Vec Model</li>\n",
"<li>Evaluate the Quality of the Word Embeddings by comparing the Spearman Correlation on a Human Annoted Dataset</li>\n",
"<li>Use PCA to reduce the dimension of the Embeddings to 2 for Visualization</li>\n",
"<li>Use t-SNE incombination with PCA to improve the Quality of the Visualizations </li>\n",
"<li>Tokenize the data into words and train Word2Vec model</li>\n",
"<li>Evaluate the quality of the word embeddings by comparing the Spearman Correlation on a human annoted dataset</li>\n",
"<li>Use PCA to reduce the dimension of the embeddings to 2 for visualization</li>\n",
"<li>Use t-SNE incombination with PCA to improve the quality of the visualizations </li>\n",
"</ol>"
]
},
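A rough sketch of the first two steps above with Spark MLlib's Word2Vec; the TSV path is a placeholder, and the simple whitespace tokenization stands in for whatever preprocessing the notebook applies:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2Vec

spark = SparkSession.builder.appName("medline-word2vec-sketch").getOrCreate()

# Parsed abstracts written by the previous notebook; the path is a placeholder.
abstracts = spark.read.csv("wasb:///Medline/*.tsv", sep="\t", header=True)

# Very simple tokenization: lower-case and split on whitespace.
tokens = abstracts.rdd.map(lambda row: (row["abstract"] or "").lower().split())

# Parameter values discussed later in the notebook.
model = (Word2Vec()
         .setVectorSize(50)
         .setWindowSize(5)
         .setMinCount(400)
         .fit(tokens))
```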
@@ -262,7 +262,7 @@
"metadata": {},
"source": [
"<b> Filter the Abstracts based on the dictionary loaded above</b>\n",
"<br> This will help to filter out abstracts that donot contain words that you care about\n",
"<br> This will help to filter out abstracts that do not contain words that you care about\n",
"<br> Its an optional preprocessing step"
]
},
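The optional filter above might look like this sketch; the term set is a stand-in for the dictionary loaded in the notebook, and `abstracts` is the DataFrame of parsed abstracts from the training sketch:

```python
# Keep only abstracts that mention at least one term from the dictionary.
biomedical_terms = {"protein", "gene", "tumor"}   # stand-in for the loaded dictionary

def mentions_term(text):
    return any(token in biomedical_terms for token in (text or "").lower().split())

filtered = abstracts.rdd.filter(lambda row: mentions_term(row["abstract"]))
```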
@@ -525,7 +525,7 @@
"<ul>\n",
"<li> It runs in slighlty over 15 minutes when run on 15 million abstracts with a window size of 5, vector size of 50 and mincount of 400.</li><li> This performance is on a spark cluster with 11 worker nodes each with 4 cores.</li>\n",
"<li> The parameter values of 5 for window size, 50 for vector size and mincount of 400 work well for the Entity Recognition Task.</li><li> However optimal performance (Rho = 0.5632) during the Intrinsic Evaluation is achieved with a window size of 30, vector size of 100 and mincount of 400. </li><li>This difference can be attributed to the fact that a bigger window size does not help the Entity Recognition task and a simpler model is preferred.</li>\n",
"<li> To speed up the evaluation of word2vec change number of partitions to a higher value (&lt; number of cores available), but this may decrease the accuracy o fthe model.\n",
"<li> To speed up the evaluation of word2vec change number of partitions to a higher value (&lt; number of cores available), but this may decrease the accuracy of the model.\n",
"</ul>\n",
"<br> For the cell below we are using a cluster size of 4 worker nodes each with 4 cores. "
]
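A sketch of that partition trade-off, building on the earlier training sketch (`Word2Vec` import and `tokens` RDD); the 4 worker nodes with 4 cores each mentioned above give 16 cores in total:

```python
# More partitions train faster but can reduce embedding quality.
faster_model = (Word2Vec()
                .setVectorSize(50)
                .setWindowSize(5)
                .setMinCount(400)
                .setNumPartitions(8)   # kept below the 16 available cores
                .fit(tokens))
```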
@@ -582,7 +582,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<b> Manually Evaluate Similar words by getting nearest neighbours </b>"
"<b> Manually evaluate similar words by getting nearest neighbours </b>"
]
},
{
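The manual check above can be done with `findSynonyms`, which returns the nearest vocabulary words together with their cosine similarity; the query term below is an arbitrary example, and `model` is the trained Word2Vec model from the earlier sketch:

```python
# Ten nearest neighbours of an arbitrary query word.
for word, similarity in model.findSynonyms("cancer", 10):
    print(word, similarity)
```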
@@ -1207,7 +1207,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<b> For a better Visualization we use PCA + t-SNE</b>\n",
"<b> For a better visualization we use PCA + t-SNE</b>\n",
"<br> We first use PCA to reduce the dimensions to 45, then pick 15000 word vectors and apply t-SNE on them. We use the same word list for visualization as used in PCA (above) to see the differences between the 2 visualizations"
]
},
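A sketch of this two-stage projection using scikit-learn (assumed to be available where the visualization runs); `model` is the Word2Vec model from the training sketch:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Pull the learned vectors out of the model: a mapping of word -> embedding.
vectors = dict(model.getVectors())
words = list(vectors)[:15000]                    # 15,000 word vectors, as above
X = np.array([list(vectors[w]) for w in words])

X_pca = PCA(n_components=45).fit_transform(X)    # first reduce to 45 dimensions
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)  # then t-SNE to 2-D
```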
@@ -1410,9 +1410,9 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "NLP_DL_EntityRecognition local",
"display_name": "Python 3",
"language": "python",
"name": "nlp_dl_entityrecognition_local"
"name": "python3"
},
"language_info": {
"codemirror_mode": {