I booked onto the CDCS Introduction to Text Analysis course and installed NLTK in preparation. This two-hour “silent disco” online event is an intermediate Python course on using the NLTK package.
Introduction to Text Analysis
The workshop uses Noteable, but I ran the notebook in Visual Studio (because it’s a good IDE). The notebook is written in markdown text with code blocks interspersed in “cells” throughout the document.
The workshop begins with some exercises in the IDE to get used to running Python code within the cells to do something with text strings, using variables, e.g.
```python
first_name = "ada"
last_name = "lovelace"
full_name = first_name + ' ' + last_name  # combine strings with a space in between
print(full_name)
```
Basic file actions are rehearsed including opening and reading a file, and using text manipulation functions:
```python
file = open("origin-intro.txt")
txt = file.read()
txt = txt.lower()
```
The Python package NLTK provides a set of natural-language processing algorithms, e.g. tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. We played with some of these using the text extracts we had been given.
Text analysis begins with breaking down a block of text into smaller chunks such as words or sentences. This is called Tokenization.
```python
import nltk
from nltk.tokenize import sent_tokenize  # or word_tokenize for words

f = open("origin-intro.txt")         # open file
txt = f.read()                       # read file contents into a variable
tokenized_text = sent_tokenize(txt)
print(tokenized_text)
```
The initial steps in preparing text for analysis are:
- Open the file and assign it to a variable
- Convert to lower case
- Split into tokens
- Remove stop words
“Stop” words are sometimes called filler words: they are the common words that don’t add much meaning to a sentence.
```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

f = open("origin-intro.txt")                 # open file
contents = f.read()                          # read file contents into a variable
contents = contents.lower()                  # lower-case the text
contents = re.sub(r'[^\w\s]', '', contents)  # remove punctuation (note use of regex)
tokenized_word = word_tokenize(contents)
filtered_word = []
for w in tokenized_word:
    if w not in stop_words:
        filtered_word.append(w)
print("Filtered Words:", filtered_word)
```
Further functions and libraries
Working through the notebook, running code and downloading libraries as required, I was able to easily clean blocks of text; tokenize and count words; plot frequency distributions; tag parts of speech; identify common bigrams and n-grams (word pairs and n-word groups).
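As a sketch of the kind of thing covered, frequency distributions and bigram counts can be produced with NLTK’s `FreqDist` and `bigrams` (the sample sentence here is my own stand-in for the cleaned workshop text):

```python
import nltk
from collections import Counter

# A small sample token list (hypothetical, standing in for the cleaned text)
tokens = ("the quick brown fox jumps over the lazy dog "
          "the quick brown fox sleeps").lower().split()

# Frequency distribution of words
fdist = nltk.FreqDist(tokens)
print(fdist.most_common(3))  # → [('the', 3), ('quick', 2), ('brown', 2)]

# Common bigrams (word pairs)
bigram_counts = Counter(nltk.bigrams(tokens))
print(bigram_counts.most_common(2))
```

`FreqDist` is a subclass of `collections.Counter`, so the same counting idiom works for words and for the word-pair tuples that `nltk.bigrams` yields.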
Fun with Trump
A delightful bit of fun was had analysing President Trump’s tweets, using this code to fetch the text:
```python
import urllib.request

url = 'https://learn.edina.ac.uk/intro-ta/files/trump-tweet-archive.csv'
data = urllib.request.urlopen(url).read()  # download the file contents
with open('tweets.csv', 'wb') as file:     # create a new file and save the contents to it
    file.write(data)
print('CSV file created')
```
What do you think was the top 4-gram in all of this data?
make america great again! (390 occurrences)
Of course it was.
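That 4-gram count could be reproduced along these lines with `nltk.ngrams`, sketched here on a toy token list (the real notebook worked from the downloaded tweets.csv):

```python
import nltk
from collections import Counter

# Toy token list standing in for the cleaned tweet text (hypothetical sample)
tokens = ("make america great again make america great again "
          "crooked media make america great again").split()

# Count every 4-word sequence and show the most common one
fourgram_counts = Counter(nltk.ngrams(tokens, 4))
top_fourgram, count = fourgram_counts.most_common(1)[0]
print(' '.join(top_fourgram), count)  # → make america great again 3
```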
This was a very nice way to run through a tutorial notebook and learn a few tricks in Python for text analysis, with support running in the background if I needed it. The format is powerful; it focuses you on the task within the time window allocated and helps you avoid distractions. Although it’s self-study (and as such relies on good quality materials in the first place), it’s OK to play a bit too, because there’s help at hand if you come unstuck. Well done, CDCS.