WaliyaYohannaJoseph / text-analysis

Session on text analysis with NLTK, including discussion of cleaning data, creating text corpora, and analyzing texts programmatically.

Introduction to Text Analysis with Python and the Natural Language ToolKit (NLTK)

Digital technologies have made vast amounts of text available to researchers, and this same technological moment has provided us with the capacity to analyze that text. The first step in that analysis is to transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural Language ToolKit (commonly called NLTK), this workshop introduces strategies to turn qualitative texts into quantitative objects. Through that process, we will present a variety of strategies for simple analysis of text-based data.
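As a rough illustration of what "turning text into numbers" can look like, the sketch below lowercases a sentence, splits it into word tokens, and counts them. It uses only the Python standard library, and the `tokenize` helper and sample sentence are made up for this example. NLTK offers its own tokenizers and a `FreqDist` class for the same kind of counting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return a list of word tokens (hypothetical helper)."""
    # Keep runs of letters and apostrophes; punctuation and digits are dropped.
    return re.findall(r"[a-z']+", text.lower())


# Made-up sample sentence, used only for illustration.
sample = "The whale surfaced. The crew watched the whale dive again."

tokens = tokenize(sample)   # text -> list of words
counts = Counter(tokens)    # list of words -> word frequencies

print(counts.most_common(2))  # [('the', 3), ('whale', 2)]
```

Once a text is a list of tokens with counts attached, it can be sorted, compared, and plotted like any other quantitative data.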

By the end of this workshop, you will be able to:

  • Identify strategies for transforming texts into numbers
  • Explain what a concordance is, how to find one, and why it matters
  • Compare frequency distributions of words in a text to quantify the narrative arc
  • Explain what stop words are and why they are often removed
  • Remove stop words in a variety of languages
  • Use Part-of-Speech tagging to gather insights about a text
  • Transform any document that you have (or have access to) in a .txt format into a text that can be analyzed computationally
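Two of the ideas in the list above, concordances and stop-word removal, can be sketched in a few lines of plain Python. This is a minimal illustration and not how NLTK implements them. The `concordance` and `remove_stop_words` helpers, the tiny stop-word list, and the sample sentence are all assumptions made for this example. In the workshop itself, NLTK's `Text.concordance` method and its `stopwords` corpus cover the same ground.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
# NLTK ships much fuller lists for many languages.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and return a list of word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            # Uppercase the matched word so it stands out in its context.
            lines.append(" ".join(left + [tok.upper()] + right))
    return lines


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop very common function words that carry little topical meaning."""
    return [t for t in tokens if t not in stop_words]


# Made-up sample sentence, used only for illustration.
sample = "The whale surfaced and the crew watched the whale dive in the sea."
tokens = tokenize(sample)

hits = concordance(tokens, "whale", width=2)
content = remove_stop_words(tokens)

for line in hits:
    print(line)
print(content)
```

A concordance shows every use of a word in its immediate context, which helps reveal how the word is actually being used. Removing stop words leaves the tokens that say more about what the text is about.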


Text as Data
Cleaning and Normalizing
NLTK Methods with the NLTK Corpus
Searching For Words
Positioning Words
Built-In Python Functions
Making Your Own Corpus: Data Cleaning
Make Your Own Corpus
Part-of-Speech Tagging

Session Leaders: Michelle A. McSweeney, PhD and Rachel Rakov
Based on previous work by: Michelle A. McSweeney, PhD and Rachel Rakov


Digital Research Institute (DRI) Curriculum by Graduate Center Digital Initiatives is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/DHRI-Curriculum. When sharing this material or derivative works, preserve this paragraph, changing only the title of the derivative work, or provide comparable attribution.

Contributors kallewesterling michellejm smythp rachelrakov dhinstitutes kchatlosh lmrhody story645
