Statistical Computing with Python
Jason Anastasopoulos, Ph.D.
Upcoming Seminar: June 8-11, 2021, Remote Seminar
HTML and Markup Languages
- Tree-structured (hierarchical) format.
- Elements are surrounded by opening and closing tags.
- Attribute values are embedded in opening tags; data sits between the tags.
Web Scraping Basics
- Use HTML and JSON data structures to build databases.
- JSON is used to extract:
  - Most data from APIs.
  - Data from exchange systems such as databases (SQL and MongoDB).
Getting Data
- Easiest: JSON from APIs.
- HTML: very difficult; a last resort when the data is not available through an API (rare now).
- Other options:
  - Write a bot.
  - Pretend to be a browser (Selenium), as in the sketch below.
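A minimal sketch of the Selenium route, assuming the selenium package and a browser driver are installed; the URL is just an example:

    from selenium import webdriver  # assumes: pip install selenium, plus a Chrome driver

    driver = webdriver.Chrome()                       # opens a real browser session
    driver.get("https://anastasopoulos.io/research")  # load the page like a browser would
    html = driver.page_source                         # HTML after any JavaScript has run
    driver.quit()

    print(html[:200])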
HTML
- Hyper Text Markup Language.
- Formats web pages.
- Uses tags.
Example

    <html>
      <head>
        <title>Page title here</title>
      </head>
      <body>
        This is sample text...
        <p>This is text within a paragraph. <b>I really mean that</b></p>
      </body>
    </html>
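A minimal sketch of reading a page like this as a tree in Python; the slides do not name a parser, so BeautifulSoup (a third-party package, pip install beautifulsoup4) is assumed here:

    from bs4 import BeautifulSoup  # assumed parser, not named in the slides

    html = """<html><head><title>Page title here</title></head>
    <body>This is sample text...
    <p>This is text within a paragraph. <b>I really mean that</b></p>
    </body></html>"""

    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.string)   # Page title here
    print(soup.p.get_text())   # This is text within a paragraph. I really mean that
    print(soup.b.string)       # I really mean that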
Webpage example: view-source:https://anastasopoulos.io/research
urllib Package
- Retrieve files from web and FTP servers.
- Connect to web servers using the HTTP protocol.
- Uses Request and response data types, as in the sketch below.
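A minimal sketch with urllib from the standard library; the URL is just a placeholder:

    from urllib.request import Request, urlopen

    # Build a Request object; some servers expect a User-Agent header
    req = Request("https://example.com", headers={"User-Agent": "Mozilla/5.0"})

    # urlopen returns a response object carrying the status, headers, and body
    with urlopen(req) as response:
        print(response.status)                 # e.g. 200
        html = response.read().decode("utf-8")

    print(html[:200])  # first 200 characters of the page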
JSON
- JavaScript Object Notation.
- Data interchange format.
- "Lightweight" format for data representation:
  - Easy for users to read.
  - Easy for parsers to translate.
Main Structures
- Object: uses {}; identical to a dictionary structure, with key names and values separated by commas.
- Array: a list structure; uses []; contains values.
- Value: the lowest level; strings, numbers, etc.
Simple JSON Sample
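An illustrative record (the field names here are made up) showing an object, an array, and plain values:

    {
      "name": "Jane Doe",
      "age": 34,
      "interests": ["statistics", "python"],
      "address": { "city": "Athens", "state": "GA" }
    }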
Accessing JSON Data
- Data accessed via APIs is formatted in JSON.
- Easy to access using Python's json package.
- Data is accessed as in a dictionary.
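A minimal sketch with the json package, reading a record like the one above:

    import json

    raw = '{"name": "Jane Doe", "age": 34, "interests": ["statistics", "python"]}'

    record = json.loads(raw)       # parse a JSON string into a Python dict
    print(record["name"])          # Jane Doe    (object -> dict key)
    print(record["interests"][0])  # statistics  (array -> list index)
    print(record["age"] + 1)       # 35          (value -> plain Python int)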
Databases
- A means of exchanging information.
- SQL: Structured Query Language.
- MongoDB: a NoSQL database that uses JSON-like ways of storing data.
- Brief code demonstration below, but each of these databases requires more time to cover.
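A minimal sketch of the SQL side using sqlite3 from the standard library (the table and rows are made up for illustration; MongoDB would be reached similarly through a driver such as pymongo):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # in-memory database, leaves no file behind
    cur = conn.cursor()

    cur.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
    cur.execute("INSERT INTO tweets (text) VALUES (?)", ("I really mean that",))
    conn.commit()

    for row in cur.execute("SELECT id, text FROM tweets"):
        print(row)  # (1, 'I really mean that')

    conn.close()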
Statistical Computing in Python
Unstructured Data and Natural Language Processing
Text Processing
1. Tokenization: splits the document into tokens, which can be words or n-grams (phrases).
2. Formatting: punctuation, numbers, case, spacing (see the sketch below).
3. Stop word removal: removal of "stop words".
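A minimal sketch of the formatting step, using only the standard library (the sentence is made up):

    import re
    import string

    text = "He said: 'I really mean that!' 100 times."

    clean = text.lower()               # normalize case
    clean = re.sub(r"\d+", "", clean)  # strip numbers
    clean = clean.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    clean = " ".join(clean.split())    # collapse extra spacing

    print(clean)  # he said i really mean that times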
Tokenization
- "Bag of words" model: most text analysis methods treat documents as a big bunch of words or terms.
- Order is generally not taken into account, just word and term frequencies.
- There are ways to parse documents into n-grams or words, but we'll stick with words for now.
Tokenization
- Tokenized tweet (1-gram): [“I”, “don’t”, “think”, “you’re”, “the”, …]
- Tokenized tweet (2-gram): [“I don’t”, “don’t think”, “think you’re”, “you’re the”, …]
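A minimal sketch that produces tokens like these with a plain split (the full tweet is made up; real tokenizers, such as NLTK's, handle punctuation more carefully):

    tweet = "I don't think you're the worst"  # made-up completion of the slide's example

    unigrams = tweet.split()
    bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]

    print(unigrams)  # ['I', "don't", 'think', "you're", 'the', 'worst']
    print(bigrams)   # ["I don't", "don't think", "think you're", "you're the", 'the worst']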
Stop Words
- Stop words are simply words that are removed during text processing.
- They tend to be words that are very common: "the", "and", "is", etc.
- These common words can cause problems for machine learning algorithms and search engines because they add noise.
- BEWARE: each package defines a different list of stop words, and sometimes removal can decrease the performance of supervised machine learning classifiers.
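A minimal sketch of stop word removal, assuming NLTK is installed and its stop word list has been downloaded with nltk.download("stopwords"):

    from nltk.corpus import stopwords  # assumes nltk.download("stopwords") has been run

    stops = set(stopwords.words("english"))

    tokens = ["the", "cat", "is", "on", "the", "mat"]
    kept = [t for t in tokens if t not in stops]

    print(kept)  # ['cat', 'mat']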
Sentiment Analysis
- Sentiment analysis is a type of supervised machine learning that is used to predict the sentiment of texts.
- Without going into too much detail, we will use what is known as a pretrained sentiment analysis algorithm.
- This is basically how it works...
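A minimal sketch with a pretrained analyzer; the slides do not say which model, so NLTK's VADER is assumed here (requires nltk.download("vader_lexicon")):

    from nltk.sentiment.vader import SentimentIntensityAnalyzer  # assumed model, not named in the slides

    analyzer = SentimentIntensityAnalyzer()

    for text in ["I love this seminar!", "This is the worst."]:
        scores = analyzer.polarity_scores(text)  # dict with neg/neu/pos and a compound score
        print(text, "->", scores["compound"])    # compound > 0 leans positive, < 0 negative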