Random Text Introduction to Python Exercise
"United Colonies are, and of consanguinity. We must, therefore,
acquiesce in the Course of human events it becomes necessary for the
sole purpose of fatiguing them into compliance with his measures. He
has combined with others to be totally dissolved; and that as Free and
Independent States, that they are endowed by their legislature to
extend an unwarrantable jurisdiction over us. We have Petitioned for
Redress in the Name, and by Authority of the United States of America,
in General Congress, Assembled, appealing to the opinions of mankind
requires that they are accustomed..." - Thomas Jefferson (before he
honed his message)
Background
Human language is highly structured and nonrandom:
not every sequence of letters forms
a word, and not every sequence of words forms an intelligible sentence.
But some aspects of random language begin to become familiar if
sufficient structure is captured. In addition,
the patterns of word choice vary between different authors. You can generate
surprising messages, with much of the style maintained (but none of the
meaning) by duplicating the nearest and next-nearest neighbor correlations
between words in an otherwise random sequence.
Learning Goals
Science:
This is mostly for fun, and to introduce the Python programming language.
Similar concepts are at work, however, in a number of important scientific
problems that use Markov chains to model probabilistic processes. In genomics,
for example, there is much evidence of higher-order structure in DNA
sequences; that is, it is not sufficient to describe only the frequency
of nucleotide bases A,C,G,T in a sequence, but also higher-order correlations.
The use of base triplet codons to encode amino acids in protein sequences,
for example, introduces important three-letter correlations.
In terms of human text, you will gain a qualitative
feel for how the three-word correlations help provide a characteristic
feel to the English language, and how they appear
intrinsic to the voice of different authors.
Computation:
This is a simple exercise to familiarize you with some
of the syntax of Python, how to use the editor, etc.
You will learn some of how to work with file input/output, lists, and
dictionaries. For those new to programming this will also introduce
the concept of a "function".
Write a program which will takes as input an arbitrary text file, and
returns as output a string which is a random text with the same statistical
structure (up to three-word correlations) as the original text.
Please verify that you gain an
understanding of the programming
concepts described in the notes.
Procedure
- Create a new directory
- Download the following files:
- Rename RandTextHints.py
as RandText.py, and open it in a text editor (emacs).
- Open a terminal window, move to the correct directory and start ipython
- In the file "RandText.py", write a function (read_file_into_word_list) which will take the name of a file
(say "shelovesyou.txt") and returns a list of words.
- Test your function by typing in your python session
%run RandText.py,
then read_file_into_word_list("shelovesyou.txt")
- If that works try asigning this list of words to a variable:
words1=read_file_into_word_list("shelovesyou.txt")
- You can manipulate this object with commands like
print len(words1)
- Note: this is a general purpose function that you could use in other programs.
- In the file "RandText.py", write a function
(make_prefix_dictionary) which will take the name of a file
and return a dictionary
where every two consecutive words are the "key" and the
next word is the "value".
The file "RandTextHints.py" gives a detailed sample algorithm.
- You should call your previous function from this function.
- Again test it using "shelovesyou.txt"
- In the file "RandText.py", write a function (make_random_text)
which will take the name of a file and
the number of words desired. This function will return the desired string.
The file "RandTextHints.py" gives a detailed sample algorithm.
- You should call your previous function from this function.
- Again test it using "shelovesyou.txt"
- Generate some random text from all of the different text files.
- Do you see any differences in the output?
- Do any of the random strings make sense?
- Could you do the same thing but with 3 word correlations?
What do you expect would happen?
- Once you've generated some random text,
head on over to Wordle to make some
"word clouds" from your text. While visually interesting,
word clouds seem to deal only with word frequencies, rather than
higher-order correlations, so it doesn't matter much that you've
scrambled the original text. The FAQ
indicates that words can be tied together in the clouds, but that's
different than grouping together words probabilistically
based on correlations, which seems similar in spirit to questions
regarding optimal graph layout algorithms.
If you're interested in generating higher-order word
clouds, that could be an interesting project for the course. The
Python Imaging Library could be a useful tool for generating those
sorts of images.
(Unfortunately,
we can't peek at the Wordle source code to get some insight into
their algorithm.)
Background and Notes
Functions and procedures
For us, a function is a small
computer program that does a well defined task. Typically a function
takes some sort of input, and produces output. One strings together functions
to build up a program.
Some programming languages distinguish between functions (which return
a value and typically have no side effects) and procedures (which do
not return a value and therefore only operate through side effects).
Python does not make this distinction: all functions return a value
(even if that value is the object None, which is the default
return value if no return statement is included), and functions
can have side effects (e.g., to modify objects that are passed to
the function).
Functions are used for several reasons:
- They allow you to reuse codes
- Less programming
- Less debugging
- They lead to more easily readable code
- They separate the "what" from the "how"
- In an interactive environment they let you play:
you can operate on objects just as you would with paper and pencil
More info: Wikipedia Subroutine
[Syntax:
Python documentation
or Python tutorial]
Dictionaries (Hash lists)
A python dictionary is an object that associates a
value with a key in such a way that
looking up the value is quick. This is a very useful construct,
with many nice examples
in the textbook. A good introduction
into python dictionaries can be found in
.
[Syntax: Python Manual
and Mapping Types ]
Useful Constructions/Syntax
Files
- Full list of useful files:
Links
James P. Sethna,
Christopher R. Myers,
Erich J. Mueller.
Last modified: August 14, 2008