From cortisone to graphene: 60 years of breakthroughs in PubMed publications
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
From cortisone to graphene: 60 years of breakthroughs in PubMed publications André Panisson, Marco Quaggiotto Data Science Laboratory, ISI Foundation, Torino, Italy This report refers to the methodology used to create the visualization entitled “From cortisone to graphene: 60 years of breakthroughs in PubMed publica- tions”, submitted to the Data Visualization Challenge of the 2014 ACM Web Science Conference. Biomedical science advances can often be described by brief periods of dis- covery followed by decades of study and research. This analysis aims to show an overview of the breakthroughs inferred from the titles of 21.5 millions of PubMed publications. For this visualization, we built a timeline of terms that have experienced a sudden increase in popularity in PubMed publication titles. The position of the terms is related to the years in witch the breakthrough has occurred, the size and color of the word are indicative of the popularity of the term. Authors that contributed significantly to the breakthroughs have been linked to their areas of research. In order to create the visualization we: 1. Extracted words and bigrams from publication titles 2. Analyzed the occurrence of such terms in the years between 1950 and 2010 3. Detected peaks of activity for each term, identifying the years with high acceleration in the term usage 4. Filtered out terms with low entropy as they are too generic 5. Identified related authors (with at least 100 publications, of which at least 50 contain the given term in the title) 6. Created a weighted graph with years, terms and authors 7. Used a force-directed layout to position the terms next to their years and authors next to their related terms In the next sections, we discuss this methodology in detail. 1
1 Dataset PubMed comprises more than 23 million citations for biomedical literature from MEDLINE, life science journals, and online books. In this work, we used meta- data for the complete set of all PubMed records, including title, authors, and year of publication. All data provided originates from NLMs PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database 1;2 . The MEDLINE R /PubMed R dataset contains articles starting before 1900, however the number of publications is very small for each year in the 19th century (less than 1000 per year). After 1900, the number of publications per year grows above 1000, but a big increase starts in the 1940s, when the number of publications jumps from 3357 in 1944 to 17148 in 1945. Figure 1 shows a timeline with the number of publications in the dataset for each year, starting from 1900. 0.9M 0.8M 0.7M Number of publications 0.6M 0.5M 0.4M 0.3M 0.2M 0.1M 0.0M 1900 1920 1940 1960 1980 2000 2020 Year of publication Figure 1: Number of publications in the MEDLINE R /PubMed R dataset start- ing from 1900 In order to limit the analysis to years with high number of publications, we decided to show in the visualization only data from 1950 to 2010. 2 Text analysis Our method starts with text analysis of the article titles. In order to analyze the article titles, we use the “bag of words” approach, and we create a matrix A ∈ RN ×M , where N is the number of articles and M is the number of terms extracted from the dataset. Each position ai,j represents the number of times the term j appears in the title of article i. In order to extract the terms, we use the steps that we show in the following. We illustrate each step with the example of the title ‘Drugs and the high cost of health care.’: (1) choose a list of stop words (common words in English and in article titles): [is, and, the, of, ..., letter, article, conference, ...] (2) extract a list of words from the article title: [drugs, and, the, high, cost, of, health, care] 2
(3) lemmatize the words of the list (e.g., drugs becomes drug): [drug, and, the, high, cost, of, health, care] (4) remove the stop words and create n-grams with maximum size of 2 by concatenating words only if they are not separated by a stop word: [drug, high, cost, health, care, high cost, health care] For each article title, we apply the above process and add the resulting terms to a counter. The result is a dictionary that associates each term with the number of times the term appears in the dataset. We select only the 100 thousand most common terms and proceed to the construction of matrix A, where the number of terms M is 100 thousand. 3 Time series analysis Since we want to select terms that could represent discoveries and breakthroughs, we analyze the time-series of the occurrences of each term over the years. For each term, we start by creating a time series with the number of term occurrences in each year. For example, the term hiv exhibits a time series of occurrences as shown in Figure 2. 8000 7000 6000 nr of ocurrences of hiv 5000 4000 3000 2000 1000 0 1950 1960 1970 1980 1990 2000 2010 year Figure 2: Time series with number of occurrences of the term hiv from 1950 to 2010. In this figure, we see the increasing occurrence of the term hiv, but this time series is affected by the increasing number of articles in the dataset. In order to remove the effect of increasing number of articles, we normalize the frequencies by dividing the term occurrence by the sum of all term occurrences in each year. This gives us the time series with the relative frequency of the terms, in the following referred as f (t). For the term hiv, we show its f (t) in Figure 3. We derive the following time series from f (t): δ(f (t)) = f (t) − f (t − 1) 2 δ (f (t)) = δ(f (t)) − δ(f (t − 1)) In this example, δ 2 (f (t)) corresponds to the second discrete derivative of f (t), i.e., the years where the frequency acceleration is at its maximum. 3
0.012 0.010 relative frequency of hiv 0.008 0.006 0.004 0.002 0.000 1950 1960 1970 1980 1990 2000 2010 year Figure 3: Time series with relative frequency (f (t)) of the term hiv from 1950 to 2010. f (t) ϕ(f (t)) = f (t − 1) + ε 1−ϕ(f (t)) ω(f (t)) = 1 − e C g(t) = δ 2 (f (t))δ(f (t))ω(f (t)) ε is a small number used to avoid zero division. The parameter C is used to give less weight to frequencies that jump from high values to higher values, and more weight to frequencies that jump from very low values to high values. In our experiments, we used C = 10. The resulting time series g(t) is given by the product of three time series: δ 2 (f (t))δ(f (t))ω(f (t)). We multiply δ 2 (f (t)) - the second discrete derivative of f (t) - by δ(f (t)) - the first discrete derivative of f (t), since we want the years where there is a high acceleration in the term usage and also the years where the increase in term usage is positive. We multiply by ω(f (t)) in order to give less weight to the increases that have a relative small change from t − 1 to t. Finally, we set all negative values to 0 in order to keep only positive peaks. g(t) has peaks in the years where the frequency goes from near zero to a high value. In order to classify the terms by their frequency jumps, we use a measure of entropy. In Figure 4, for some terms, we show in blue the initial time series with the relative frequency -f (t), in red the resulting g(t) for the term, and we show next to the term name the resulting entropy calculated over g(t). Using this methodology to analyze the time series of each term, we are able to extract two types of information: 1. Most of the terms with high entropy (next to 1) are terms that are used in all years, or that have no visible jump in the frequency. By selecting only terms with low entropy and high accumulated frequency, we have a good chance to extract terms that represent discoveries and breakthroughs from 1950 to 2010. 4
0.09 hiv (0.820) 0.5 graphene (0.251) 0.25 cortisone (0.193) 0.08 0.07 0.4 0.20 0.06 0.05 0.3 0.15 0.04 0.2 0.10 0.03 0.02 0.1 0.05 0.01 0.00 0.0 0.00 1950 1960 1970 1980 1990 2000 2010 1950 1960 1970 1980 1990 2000 2010 1950 1960 1970 1980 1990 2000 2010 0.030 cell (0.999) 0.08 global (0.999) 0.030 variable (0.999) 0.025 0.07 0.025 0.06 0.020 0.05 0.020 0.015 0.04 0.015 0.010 0.03 0.010 0.02 0.005 0.01 0.005 0.000 0.00 0.000 1950 1960 1970 1980 1990 2000 2010 1950 1960 1970 1980 1990 2000 2010 1950 1960 1970 1980 1990 2000 2010 Figure 4: For some terms, we show (i) in blue the time series with the relative frequency (f (t)), (ii) in red the processed g(t) for the term, and (iii) next to the term name, the resulting entropy calculated over g(t). 2. Terms with low entropy have high peaks that can be associated to the year of their frequency increase. The values in g(t) associated to each year can be used as a weight to associate a given term to the year of its discovery. 3.1 A small validation One simple way to validate our method for finding the discoveries associated to each year is to check if terms associated to Nobel Prizes are present in the extracted terms. The 1950 Nobel Prize in Medicine was given to Edward Calvin Kendall, Tadeus Reichstein and Philip Showalter Hench “for their discoveries relating to the hormones of the adrenal cortex, their structure and biological effects”. The term associated to this discovery is “cortisone”. The 2010 No- bel Prize in Physics was given to Andre Geim and Konstantin Novoselov “for groundbreaking experiments regarding the two-dimensional material graphene”, and the associated term is “graphene”. These two terms motivated the title of our work: “From cortisone to graphene”. We were able to verify that both terms are present in the visualization. Other terms like “cell”, “global” and “variable” have high frequency in the dataset, but since they present low entropy for their corresponding g(t), they were excluded from the visualization. 4 Author selection The MEDLINE R /PubMed R dataset contains around 10.7 million author names associated to the articles comprised between 1950 and 2010. Each author name is expressed as ‘Surname, Name’ (e.g., ‘Smith, John’) or as ‘Surname, intitals’ (e.g., ‘Smith, J’). From the articles and author names, we can build a matrix B ∈ RN ×O , where N is the number of articles and O is the number of author names. Each position bi,j is set to 1 if the author name j is associated to article i, and is set to 0 otherwise. 5
The author name associated to the highest number of articles is ‘Suzuki, T’. Clearly, this author name represents more than one real author. In fact, the page of PubMed Advanced Search shows that there are tens of authors with surname ‘Suzuki’ and initial ‘T’. In order to create a list of authors associated to each term, we wanted to avoid using author names that do not refer to a single author. For this, we also used a measure of entropy for each author. The idea behind it is that author names that are associated to a high number of terms are most probably different authors, since authors are committed to few topics. The entropy of each author is calculated using the following equation: M X Hj = − p(i|j) log(p(i|j)) i=1 where p(i|j) is the probability of having a term i given an author j. From this, we calculate a scaled entropy: Hj Ej = 1 − Hmax Figure 5 shows an histogram with the distribution of number of authors for each bin of Ej values and the number of removed author names when selecting only author names with Ej above a threshold t (Ej > t). We verified that setting Ej > 0.25, we still keep more than 99.7% of the author names and, at the same time, we remove all top author names that refer to more than one real author. 600000 1.0 500000 0.8 percent of authors removed 400000 number of authors 0.6 300000 0.4 200000 100000 0.2 0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 enthropy enthropy threshold Figure 5: In the left, an histogram showing the distribution of number of author names for each bin of Ej values (we used 100 bins). In the right, the number of removed author names when selecting only author names with Ej above an entropy threshold t (Ej > t). From the list of author names with Ej > 0.25, we identified only the authors with least 100 publications. Next, use the document-term matrix A to calculate the co-occurrence of authors and terms: C = AT B, with C ∈ RM ×O . From the matrix C we select only the rows that refer to the terms chosen for the visualization. Finally, for each row, we select only the top 2 authors which used at least 50 times the given term in their articles. 6
5 Graph creation and layout We used the years, terms and authors to create a weighted graph as follows: • V1 is the set of nodes representing the years, i.e., from 1950 to 2010 (61 nodes). • V2 is the set of nodes representing the selected terms. • V3 is the set of nodes representing the selected authors. • E12 is a set of weighted edges connecting years with terms. The weights are proportional to the values of the g(t) corresponding to each term. • E22 is a set of weighted edges connecting terms with terms. The weights are proportional to the co-occurrence of the terms in article titles. • E23 is a set of weighted edges connecting terms with authors. The weights are proportional to the number of times an author used the term in article titles. • V is the union of V1 , V2 and V3 . • E is the union of E12 , E22 and E23 . G is a graph with nodes V and edges E. This graph was exported to a file and imported to a graph exploration tool (in this case, Gephi 3 ). Finally, we set the year nodes to fixed positions, and used two different layout algorithms to distribute spatially the term and author nodes: Force Atlas and Label Adjust. The size and color of term nodes were set proportionally to the term frequency, as shown in Figure 6. Hardie, D Grahame Viollet, Benoit mitogen activate cyclin fluorescent protein factor kappab activate protein overexpression knockout matrix metalloproteinase Li, Wen-Jun O'Hair, Richard A J Minamino, N Davis, M M tumor suppressor Devroey, P alpha1 Van Steirteghem, A intellectual disability lamivudine soil gas phase vibrational natriuretic peptide bariatric Ginsberg, M H cell receptor factor alpha integrin Baird, A asymmetric Shibasaki, Masakatsu dendritic Zhou, Qixing intracytoplasmic rna interference nanocomposite recombinant human exon cyclin dependent endovascular treatment catenin sediment von Boehmer, H continuous ambulatory Emmrich, F Matsumoto, Kunio Presta, M Humphries, M J Umezawa, Kazuo drug discovery pegylated Christou, George allograft cd4 endothelin basic fibroblast base medicine effluent sars sister chromatid phosphoinositide Burnett, John C Tilney, N L hepatocyte growth Chen, Shih-Ann antisense integrins Mantzoros, Christos S kappab Aggarwal, Bharat B phase transition coronary intervention single molecule burgdorferi Myers, Martin G Low, John N hydrogen bond raft amphiphilic monoclonal trisphosphate gene delivery wastewater wall carbon Nakamura, Toshikazu polyarthritis catheter ablation care reform apoptotic cell her2 microarray Bown, S G tubercle bacillus adrenaline Häyry, P color doppler transesophageal echocardiography adiponectin photodynamic therapy propofol fatigue syndrome Calkins, Hugh leptin photocatalytic Glidewell, Christopher gastrointestinal stromal Kihara, Shinji nmda Wernsdorfer, Wolfgang september symptomatology inquiry Blumberg, P M orthotopic liver antiphospholipid syndrome thoracoscopic cd44 regulate kinase Taketo, Makoto M electronic structure fluoxetine prion resonance image software synthesis characterization Schiffrin, E L lithotripsy Davenport, A P west nile Aguzzi, Adriano egfr therapeutic effect diaphragmatic birthday inoculate endobronchial metabotropic glutamate nurse practitioner bipolar disorder cd25 chain reaction signal transduction Herrlich, P nucleotide polymorphism toll like lanthanide Stang, Peter J Funahashi, Tohru biologic nitric oxide neural stem Marks, F Wilske, B Cervera, R Prusiner, S B legionella pneumophila zeolite Wood, B P Young, L W personal experience niche france Glick, P L Tibboel, D bufo possibility pathogenesis Vale, W adenovirus monoamine hodgkin lymphoma immunosorbent natural killer intravenous immunoglobulin phorbol ester Fikrig, E helicobacter pylorus helicobacter workforce Vieta, Eduard Calabrese, Joseph R total synthesis Nicolaou, K C nile self assembly self assemble reader Kim, Seung U observational study tooth Zewail, Ahmed H tuberculous Blumberg, B S McCann, S M phi annexin deficient mouse stereoselective ghrelin ultrafast quantum dot Pao, William radiological Miyamoto, E Graham, David Y Steinman, R M Seiki, M ionize radiation compute tomography Perretti, Mauro van de Beek, Diederik vegetative streptokinase tumor necrosis cd3 protein tyrosine icam Banchereau, J transcriptome meningitis Corma, Avelino theoretical study Fürstner, Alois intramolecular Moretta, A calmodulin dependent endocrine orthodontic Sinsheimer, R L Blaser, M J nf kappab Pestell, Richard G Bechstein, W O radiofrequency ablation pten 21st century multidetector h2o italian Bando, Yoshio gene therapy Armstrong, B K Alitalo, Kari O'Neill, Luke A J mammary carcinoma hypoglycemicCuriel, David T Doerfler, W corticotropin aortic stenosis Cerhan, James R macrophage colony indication Choi, DongilSpringer, T A Miyake, Kensuke spontaneous pneumothorax Thesleff, I trh Herberman, R B Salas, M Noelle, R J cd40 hormone release nanotube Terhorst, C immunodeficiency syndrome metabolic syndrome Kavoussi, L R australia Cozen, Wendy Steinman, Ralph M orthodontic treatment endothelial growth signal regulate Rotello, Vincent M uranium immune complex relate quality mapk akt association study Cademartiri, Filippo Gill, Inderbir S Kojima, Masayasu pneumothorax Gottlieb, E L Topol, E J malpractice resveratrol mitral stenosis meningeal cell therapy guerin liver transplantation Solcia, E Hosoda, Hiroshi intestinal mucosa laparoscopic desorption ionization club Robbins, P D McCracken, G H Gold, P W Oberg, K appliance cardiac transplantation alpha fetoprotein list clinical medicine bromo april remote asian Pibarot, Philippe echo polychlorinated biphenyl Arimura, A endorphin forum cyclosporine subtraction Brimacombe, J reform human herpesvirus highly active catalyst Kobayashi, Shu Diamond, Michael S like receptor Grundy, Scott M Mirkin, Chad A pneumoperitoneum mycobacterium tuberculosis dentistry Negri, E Chanock, Stephen J assist laser nanoparticle Achenbach, Stephan bcg La Vecchia, C Stinson, E B carcinoembryonic midazolam Oreopoulos, D G laryngeal mask single nucleotide dentist genome sequence Kowitz, A mycoplasma adenylate cyclase ranitidine minimally invasive online toll Auricchio, Angelo antihistamine gynecological hyaluronidase Malendowicz, L K Storb, R Misra, Anoop prednisone amphotericin regime hong peer review Robberecht, P link immunosorbent Svehag, S E refugee Van Buren, C T htlv Karin, M shortage gabaa receptor mediate apoptosis privacy photonic climate change Kubo, Michiaki Lapidus, Alla pylorus Bennett, W M borrelia burgdorferi Kerr, Richard A amp hybridoma coronary angioplasty edge resynchronization systematic review Wiesner, R H campaign cor Rastogi, Nalin casualty Loevy, H T massachusetts tetrahydrocannabinol Temin, H M Reis, Rui L oncogene Gessain, A evidence base Vacanti, J P typhoid adrenal cortex aplastic anemia technetium 99m Bishop, J M Wherry, E John dna vaccine cyclization xii synergism soviet eeg immunosuppressive fish oil cd8 polyelectrolyte Taylor-Robinson, D argentina Jacobs, William R Collen, D styrene year result Iijima, Sumio mtor pi3k Seo, Pil Ja automate jun Cooks, R Graham Thomas, E D isoniazid prednisolone imipramine Hardie, D G bromocriptine zidovudine Ahmed, Rafi chemokine receptor beta catenin Schalij, Martin J Zhao, Dongyuan experimental research technic function test rheumatic disease Kobayashi, G S virus vaccine actinomycin immune deficiency tissue engineer photonic crystal spectrum disorder microrna hiv Job, C K rubella kinase pathway dynamical proteomic Razin, S Budoff, Matthew J Tseng, L F retrovirus Schlom, J shock wave proto oncogene oxy pathological transaminase radioimmunoassay mesoporous Higgins, C B Bhatia, Kailash P cell mediate Muasher, S J Clotet, Bonaventura map kinase ambulatory blood Ferrara, N splice variant inducible nitric Butterfield, D Allan Ernst, Edzard compute fluconazole Jusko, W J thrombolytic dentition cyclosporin tunable Tobey, J A legislation Brin, M F echinococcosis hormone therapy charles leprosy ehrlich news Medoff, G Hilleman, M R dental caries mycoplasma pneumoniae Sever, J L Saletu, B Chase, T N Yamada, K M natriuretic factor Hricak, H yag laser Rosenwaks, Z proto kawasaki tat receptor gamma Mancia, G rapamycin aquaporin Weiner, David B electrospray ionization dendritic cell imatinib time pcr nanoscale Möhwald, Helmuth nanostructure oxide synthase biomarker Jerrold, Laurance dystonia erythromycin reserpine haloperidol drug resistance poultry Best, J M Pfurtscheller, G levodopa Clark, B F Schally, A V 125i Mosher, D F lymphocyte subset yag patient safety millennium nile virus rituximab Fanti, Stefano Shi, Jianlin bevacizumab pulmonary tuberculosis furosemide vitro fertilization Jonas, Jost B rheumatism cortisone defense Ramu, G lead poison triamcinolone MUHLER, J C elongation factor 99mtc somatostatin Li, C H fibronectin captopril Gallo, R C gene relate simvastatin Pfaller, M A tnf alpha mrsa Huang, P L cyclin d1 diffusion weight severe acute Blennow, Kaj Pitluck, Sam manage pet ct Kantarjian, Hagop butyl anabolic Oreland, Leung, Gabriel M innate immune Zetterberg, Henrik 1950 bronchial carcinoma orthodontics tanzania heart transplant Parmeggiani, A cimetidine hdl acyclovir poverty octreotide Moon, Randall T amenorrhea kda Youdim, M B Moncada, S Nishida, E salmonella enterica 1960 1970 silicosis memoriam flow rate Lam, T H MacIntyre, I Sodroski, J reverse transcription 1990 Richman, D D Rubello, Domenico 1980 Potter, B V Lee, Myeong Soo L Houslay, M D Reubi, J C Ferrone, S Klein, T W Weissenstein, E Harris, A G Gisbert, J P Khamashta, M A gabaa maldi liquid crystal Holmes, Aebersold, Ruedi elevation myocardial benzyl personalize 2000 Soriano, Vincent Kaminskaia, G O internal medicine VIDAL, J King, S B Nahorski, S R Maggi, C A Aggarwal, B B Wessely, S Billiar, T R Agre, P Curiel, D T West, K W David R human embryonic iranian Calin, George A 2010 puerperium 1951 1952 1953 1954 1955 1956 1957 1958 1959 1961 1962 1963 1964 1965 1966 1967 1968 1969 1971 1972 1973 1974 1975 1976 1977 1978 1979 1988 1991 1992 van Gunsteren, W F Ng, Seik Weng Volinia, Stefano graphene typhoid fever b12 1981 1982 1989 1993 1994 surgical therapy physiopathology 1983 1984 1985 1986 1987 1995 1996 1997 1998 1999 2001 2002 2003 2004 2005 2006 2008 pluripotent crown medicare sulfoxide 2007 Nussdorfer, G G paris dental pulp adoption Conn, P M 2009 Tour, James M cea methadone Grimaldi, P L professor Stitzer, M L vinyl McCulloch, E A Kreek, M J Hoppin, Jane A Magrath, I T beta endorphin cftr clinical aspect reactive protein cholesterol level burkitt Zamcheck, N Ziegler, J L Conn, P Jeffrey allergic disease FLOCH, H laparotomy hong kong ambulatory peritoneal payment cytokine production Soderling, T R release hormone calcitonin gene anti hiv 1beta De Clercq, Erik pathogenic tnf Pannecouque, Christophe blast layer chromatography McEver, R P process citation agricultural triiodothyronine Chopra, I J glutamate receptor transition metal Hogg, Robert S Ruoff, Rodney S Bocci, A Tonks, N K organic matter tio2 french civil relaxant anticholinergic Ridker, Paul M test result adriamycin computerize tomography zinc finger gemcitabine antiretroviral therapy genomics sirolimus solar cell Lord, Catherine layer Soriano, V Lieberman, J A Prato, Maurizio peculiarity family plan chloro gastroduodenal korean york city rock Goldenberg, D M Amyes, S G prostacyclin jejuni Rivier, J release factor deficiency syndrome hiv infect immunodeficiency virus quantum meth Sandler, Dale P Wainberg, M A Leon, Martin B Yamanaka, Shinya dental practice tubercle nitric Pfeiffer, L N july Oppenheimer, J H Pepys, M B Thielemann, H selectin Grätzel, Michael mitral laparoscopic surgery molecular dynamic zno Strano, Michael S open heart Tremblay, Michel L knockout mouse Cortes, Jorge mmp river Miller, D Craig dental Hercberg, Serge security white rat han Alfieri, Ottavio enterovirus body composition trimethoprim mediate cytotoxicity Dang, C V Peters, G J Sorsa, T hif silico Katoh, Masaru dft Fun, Hoong-Kun microbiota mir disk hypothermia ketamine Lee, Hyejung address profound health plan Gurney, B F rifampicin observational Vane, J R legionella etoposide Howell, S B Gutkowska, J Cantin, M docetaxel Tedder, T F p38 Ivanov, A Jang, Sung Ho Jones, R N president robert Kim, Jong-Won Galea, Sandro Papakostas, George I hyperbaric cyclic amp Dumont, J E acupuncture glycosylated Greco, F A quality improvement rt pcr Georgoulias, V Andersson, Gerhard Barceló, Damià collision initio ruthenium Klionsky, Daniel J h1n1 Mizushima, Noboru leukotriene cell clone p450 missensematrix assist internet induce apoptosis cell apoptosis gold nanoparticle McCammon, J Andrew depressive dexamethasone Sleytr, U B Stone, Gregg W vietnam human immunodeficiency dynamic simulation iran Barry, A L Newman, Anne B medicinal plant Ki, Chang-Seok associate antigen biphenyl focal Harris, Tamara B Herbert, V hemorrhagic fever kong concanavalin adjuvant chemotherapy Edelstein, P H nmda receptor clozapine Knochel, Paul diffusion tensor tuberculous meningitis clinical result anatomo belgium femoral neck hbsag sludge extracellular signal Katoh, Masuko medico meta analysis Narayanan, P R Kimmel, K Waldman, H B Ford-Hutchinson, A W Gibson, C Michael heart transplantation diagnostics Shulman, S T tetracycline serum cholesterol Vlahov, David dopa prostaglandin e2 enkephalin autophagy tetra Motoyoshi, K Waksman, Ron auxin Dobryszycka, W Meyer, Helmut E model base drug elute Maruoka, Keiji partial denture teeth buffalo monoclonal antibody ciprofloxacin ecosystem Iannuzzi, L massage transesophageal tuberculosis Wood, Robin ultrasonic ussr prof Misu, Y presidential october myc amyloid precursor Beyreuther, K bmp functionalized synuclein extracorporeal sevoflurane Fava, Maurizio Thompson, John F cell leukemia cisplatinGolomb, Eisenman, R N polychlorinated paclitaxel beta1 alpha2 proteome single wall Friml, Jirí Masters, C L Hollenberg, Morley D atrial natriuretic Ratcliffe, Peter J hydrocortisone Golub, L M Walle, T Gowda, B Thimme Levy, S B apc traditional chinese Mori, Susumu hyperbaric oxygen dental care root canal nurse care 17beta nuclear antigen beta2 sentinel genetic variant mediate immunity erythropoietin Thorsby, E surface plasmon phenyl mass index Kunkel, S L hydatid cyst precede Declerck, P J chromatography tandem invite adhesion molecule Marsh, S G chlorpromazine Saijo, N transduction pathway Ross, Merrick I Trost, Barry M Lee, Young Ho ivf mr image provoke König, K G ct scan telomerase hydroxydopamine fetoprotein Chrambach, A extracorporeal circulation Bartlett, R H Ozsoylu, S haptoglobin term memory dental health stellate cell enterica Cannon, Christopher P Ronco, Claudio propranolol minimally Coutinho, R A Baeuerle, P A vitamin b12 dihydroxyvitamin pneumophila homosexual Samuelsson, B transluminal HM activator inhibitor caries hla respiratory syndrome Montori, Victor M acute kidney infliximab biofilm chiral phen autism spectrum preliminary note stock inositol polymerase chain Fisher, mycophenolate Harvengt, C Lindner, Wolfgang picture immunosorbent assay nf kappa van Soolingen, Dick tcr hcv Schachner, M electrophoresis monoamine oxidase dependent kinase activate receptor Taketa, K claim neoplasm lupus erythematosus Deliens, Luc Tsuru, H Norman, A W Edelman, G M Calhoun, Vince D wastewater treatment accord czech denture chromatid exchange doppler echocardiography elute DeLuca, H F chloramphenicol actual carcinoembryonic antigen hospice hla dr acquire immunodeficiency Loskutoff, D J antiphospholipid Shay, 1alpha van Duijn, Cornelia M thalidomide microsatellite Guengerich, F Peter Peacock, Sharon J Narod, Steven A Levey, Andrew S image guide hiv infection JW Bowen, W H affinity chromatography hla class sentinel lymph fmri target therapy supramolecular Knoll, Wolfgang genome wide transgenic mouse burkholderia mild cognitive JW Meijer, E W sedimentation fifty english Fehm, H L ferment eczema lipoprotein cholesterol Feingold, M Goldwasser, E campylobacter jejuni gamma interferon adenoviral hypotension disease cause clinical picture catechol methylprednisolone proteomic analysis Isenberg, D A Vandamme, Peter Righetti, P G Mitchell, M D adjuvant therapy oximetry stat3 active antiretroviral reactor Schoepp, D D coronary syndrome Petersen, Ronald C wave natriuretic density functional Serruys, Patrick W calmodulin Werner, Liliana antibiotic therapy Williams, H C Means, A R nifedipine bulimia nd yag Zeuzem, Stefan hiv type Shay, Jerry W stat radiologic acth cure thirty Lubinski, Jan especially igf Montaner, Julio S G Tunnessen, W W scid intraocular lens necrosis factor antiretroviral irinotecan Madias, John E metabotropic p53 elute stent phen yl Kawaoka, Yoshihiro erythematosus Hayaishi, O Shultz, L D Semenza, Gregg L Rivadeneira, Fernando Yu, R K Brinster, R L coleoptera prostaglandin lepidoptera endothelial nitric biomass Palmiter, R D Webster, R G amiodarone Mitchell, J E Staels, Bart inducible factor pandemic influenza Ohman, E Magnus Fujimoto, James G eosinophil blood protein animal experiment Apple, D J gangliosides Guengerich, F P Choi, Yung Hyun o157 brca1 innate immunity ab initio coherence tomography Jove, Richard Colombo, Antonio wnt Levin, Adeera anniversary Gleich, G J hydrazide interferon gamma colony stimulate Metcalf, D rflp Rosenfeld, R G pylorus infection Kroemer, G atorvastatin clopidogrel Angiolillo, Dominick J Hoveyda, Amir H Feng, Xiaoming Venge, P factor beta chronic fatigue LeRoith, D high throughput excite state nano apoptosis Karch, H Guldi, Dirk M st elevation chronic kidney digestive multiple sclerosis autocrine Ungerstedt, U cytochrome p450 immunodeficiency Oren, M jnk vegf olanzapine throughput fullerene acute respiratory ionic liquid Gurbel, Paul A Clevers, Hans Semenza, G L gm csf Westerink, B H Lane, D P Komaroff, A L Ito, Osamu mesenchymal stem Miyazono, K Bellomo, Rinaldo Miller, D H Filippi, M cdna encode microdialysis Roberts, A B tyrosine phosphatase tgf beta cochlear implant enantioselective hypoxia inducible pcr assay Tohen, Mauricio Alivisatos, A Paul hmo Tyler, R S kawasaki disease tgf telemedicinerisperidone apoptotic Krajewski, S intercellular adhesion Tettamanti, G Letvin, N L Fujimoto, J G Sporn, M B Klip, A Merrell, Ronald C Antman, Elliott M avian influenza carbon nanotube catalyze Creager, Mark A Ferrara, Napoleone Traska, M R Kenkel, P J granulocyte macrophage hiv positive glucose transporter bcl Alitalo, K wetland anthrax single center functional theory Leppla, Stephen H manage care pcr Swayne, David E Rao, Mahendra S Nakatsuji, Norio Pawson, T mouse lack american college Broxmeyer, H E laparoscopic cholecystectomy snp microfluidic nanocrystal nanowire Salvesen, Guy S embryonic stem percutaneous coronary Doarn, Charles R antiphospholipid antibody caspase Smith, Sidney C web base phase microextraction transform growth Gelber, R D Akerblom, H K node biopsy ppar cholecystectomy Yang, Peidong tyrosine kinase oxide production Vandenabeele, P tacrolimus Pawliszyn, Janusz Mascitelli, Luca expression profile Capua, Ilaria Devarajan, Prasad kidney injury study group press bax Beller, Matthias Wang, Zhong Lin inflammatory cytokine Korsmeyer, S J proteasome Strieter, R M chemokine statin Kastelein, John J P sirna coronavirus Hughes, G R cd34 web Dou, Q Ping interleukin 1beta coli o157 proliferator activate Lai, M M Wahli, Walter groundwater palladium Figure 6: The graph G with year nodes, term nodes, author nodes and their connections. We don’t show year-term and term-term connections, since there are too many connections between these nodes. 6 Future improvements Many techniques for automatic term recognition (ATR) are available in the literature 4 . The approach for term extraction used in this study is very simple, and a better approach could improve the results of this work. For example, the terms human, immunodeficiency and virus are part of a single multi-word term: ‘human immunodeficiency virus’. However, the term immunodeficiency is extracted as a relevant term. ‘human’ and ‘virus’ are not recognized as relevant, since they are too broad. However, both ‘human immunodeficiency’ and ‘immunodeficiency virus’ are extracted as relevant terms. The full multi-word term is not extracted, since we limited the n-grams 7
to a maximum of 2 words. We can find other examples of terms that are affected by this problem, and it could be solved by using c-value for term extraction 5 , a method that combines linguistic and statistical information to extract multi- word terms. Another problem results from the choice of using lemmatization in the term extraction phase. For example, the term ‘ionize radiation’ refers to ionizing radiation, but the word ‘ionizing’ was lemmatized in the process, and was maintained in this form for all the process to the visualization. This problem could also be avoided by using better term extraction approaches. Since this is an unsupervised method to extract terms and construct a visu- alization, the method is susceptible to errors and noise in the data. For example, the terms ‘july’ and ‘possibility’ were considered as relevant terms, even if they can’t be associated to any discovery or breakthrough. A semi-supervised methodology, where the help of a domain expert could be used to clean up the data, would be preferred in this case. Another way to validate the terms extracted by our approach is using the UMLS vocabulary, which is also extracted from the MEDLINE R /PubMed R dataset 6 . This vocabulary integrates over 730,000 biomedical concepts, and some visual approaches were proposed for exploring semantic groups in this data 7 . Such vocabulary can be used to verify the validity and relevance of terms. References [1] R. P. Light, D. E. Polley, and K. Borner, “Open data and open code for big science of science studies,” in Proceedings of International Society of Scientometrics and Informetrics Conference, pp. 1342–1356, 2013. [2] G. L. Rowe, S. A. Ambre, J. W. Burgoon, W. Ke, and K. Borner, “The scholarly database and its utility for scientometrics research,” Scientometrics, vol. 79, May 2009. [3] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software for exploring and manipulating networks.,” in ICWSM, The AAAI Press, 2009. [4] K. Kageura and B. Umino, “Methods of automatic term recognition: A review,” Terminology, vol. 3, no. 2, pp. 259–289, 1996. [5] K. Frantzi, S. Ananiadou, and H. Mima, “Automatic recognition of multi-word terms: the C-value/NC-value method,” International Journal on Digital Libraries, vol. 3, no. 2, pp. 115–130, 2000. [6] O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminology,” Nucleic Acids Research, vol. 32, no. Database-Issue, pp. 267–270, 2004. [7] O. Bodenreider and A. T. McCray, “Exploring semantic groups through visual approaches,” Journal of Biomedical Informatics, vol. 36, no. 6, pp. 414 – 432, 2003. Unified Medical Language System. 8
You can also read