Taricorp.net Hurtling toward an uncertain future

18 Jul 2010

markov.py

This was a little for-fun project that I built: a Python module/script that can be used to semi-randomly generate words, based on Markov chains.

Background, implementation

I was inspired by the story of the Automated Curse Generator, which seemed like it would be fun to implement in my own time, and it was.  In short, the module examines input text and generates a graph with edges weighted by character frequency, then traverses the graph to generate a word.

To generate the chains, the module builds a directed graph from the seed text, in which each character is linked to every character known to follow it.  Each edge is weighted by the fraction of that character's followers which the target character accounts for.  For example, the string "zezifadi r00lz dr" would generate the following graph, where the value of each edge is the probability of choosing that edge to leave the associated vertex:

[Figure: Graphviz rendering of the character graph for "zezifadi r00lz dr", with the ' ' node drawn in red.]
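The graph construction described above can be sketched in a few lines of Python.  This is only a minimal illustration, not the code from markov.py: the function name build_graph is made up here, and padding the text with a space on each end (so that words at the edges link to and from ' ') is an assumption.

```python
from collections import Counter, defaultdict

def build_graph(text):
    """Map each character to {following_char: probability}.

    Hypothetical helper for illustration; not the MarkovMap API.
    """
    counts = defaultdict(Counter)
    # Assumption: pad with spaces so the first and last words
    # are connected to the ' ' node like every other word.
    padded = " " + text.strip() + " "
    for current, following in zip(padded, padded[1:]):
        counts[current][following] += 1

    graph = {}
    for char, followers in counts.items():
        total = sum(followers.values())
        # Edge weight = share of this character's followers
        graph[char] = {nxt: n / total for nxt, n in followers.items()}
    return graph

graph = build_graph("zezifadi r00lz dr")
for char in sorted(graph):
    print(repr(char), graph[char])
```

In this sketch 'd' is followed once by 'i' and once by 'r', so each of those edges gets a weight of 0.5.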

Generating a word, then, can be as simple as starting at ' ' (the red node) and traversing the graph until another ' ' is encountered.  In reality, while that worked, it was awfully boring.  When seeded with some text in English, there was a disappointing number of short, boring (not to mention unpronounceable) words and far too few amusing longer ones.  Think 'ad' and 's' rather than 'throm'.
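A rough sketch of that naive traversal follows.  The function name naive_word and the tiny hand-written graph (what the seed "ad s" would produce) are assumptions for illustration only.

```python
import random

# Hand-written graph for the hypothetical seed "ad s":
# each character maps to {following_char: probability}.
GRAPH = {
    " ": {"a": 0.5, "s": 0.5},
    "a": {"d": 1.0},
    "d": {" ": 1.0},
    "s": {" ": 1.0},
}

def naive_word(graph, rng=random):
    """Walk from ' ' along weighted edges until ' ' comes up again."""
    word = ""
    char = " "
    while True:
        followers = graph[char]
        # Pick the next character in proportion to its edge weight
        char = rng.choices(list(followers), weights=list(followers.values()))[0]
        if char == " ":
            return word
        word += char

print(naive_word(GRAPH))
```

With a seed this small the only possible results are 'ad' and 's', which is exactly the kind of output that makes the naive approach dull.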

It was rather easy to generate more interesting words, however, by simply adding some word-length limits, defaulting to a minimum of 4 characters and a maximum of 12, tunable via arguments to the word generation method of the map.  Rather than blindly following edges, as long as the word generated is shorter than the minimum, any chaining result of ' ' will be ignored.  When the maximum length is reached, the word will be terminated immediately, provided the current character has any connection to blank space.  If not, generation continues until such a connection is found.
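The length rules can be sketched roughly as below.  The name generate_word and its parameters are hypothetical, not the real method on the map, and the handling of a character whose only follower is ' ' while the word is still too short (carry on from the ' ' node without ending the word) is a guess at one reasonable behavior.

```python
import random

# Graph for "zezifadi r00lz dr", written out by hand
GRAPH = {
    " ": {"z": 1 / 3, "r": 1 / 3, "d": 1 / 3},
    "z": {"e": 1 / 3, "i": 1 / 3, " ": 1 / 3},
    "e": {"z": 1.0},
    "i": {"f": 0.5, " ": 0.5},
    "f": {"a": 1.0},
    "a": {"d": 1.0},
    "d": {"i": 0.5, "r": 0.5},
    "r": {"0": 0.5, " ": 0.5},
    "0": {"0": 0.5, "l": 0.5},
    "l": {"z": 1.0},
}

def generate_word(graph, min_len=4, max_len=12, rng=random):
    word = ""
    char = " "
    while True:
        followers = graph[char]
        # At maximum length: stop as soon as the current
        # character has any edge to blank space.
        if len(word) >= max_len and " " in followers:
            return word
        if len(word) < min_len:
            # Too short: ignore any ' ' result
            options = {c: w for c, w in followers.items() if c != " "}
            if not options:
                # Assumption: at a dead end, continue from the ' '
                # node instead of ending the word.
                options = {c: w for c, w in graph[" "].items() if c != " "}
        else:
            options = followers
        char = rng.choices(list(options), weights=list(options.values()))[0]
        if char == " ":
            return word
        word += char

print(generate_word(GRAPH))
```

Note that the maximum is soft: a word can run past max_len while the walk looks for a character that connects to ' '.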

What makes this so entertaining, I think, is its versatility.  Since word generation is based entirely on the character frequency statistics of the input text, it works for any language.  By extension, that means it could easily be made to generate whole phrases in $(East-Asian language of your choice) by feeding it ideographs rather than Latin characters (ばかです (yes, I'm aware this is actually Kana)), or just nonsense that sounds a lot like Simlish by putting in some other Simlish nonsense.

The script

Having implemented word generation in the module, it was reasonably short work to wrap the whole thing in a script so it could be invoked from the command line for great lulz.  Something like the following does a decent job of providing amusement by generating a word every 15 seconds.  For more fun, pipe the output into a speech synthesizer.

Tari@Kerwin ~ $ while markov.py; do sleep 15; done

Of course, before anything can be generated, a graph must be generated, which can be done via the -s option on the script or by invoking the addString method of MarkovMap.  Quick example:

Tari@Kerwin ~ $ # Add the given string to the current graph, or to a new one.
Tari@Kerwin ~ $ markov.py -s"String to seed with" -ffoo.pkl
IO error on foo.pkl, creating new map
seeeeed
Tari@Kerwin ~ $ # Add some Delmore Schwartz to the map via stdin
Tari@Kerwin ~ $ markov.py -ffoo.pkl -s- << EOF
> (This is the school in which we learn...)
> What is the self amid this blaze?
> What am I now that I was then
> Which I shall suffer and act again,
> The theodicy I wrote in my high school days
> Restored all life from infancy,
> The children shouting are bright as they run
> (This is the school in which they learn...)
> Ravished entirely in their passing play!
> (...that time is the fire in which they burn.)
> EOF
idagheam
Tari@Kerwin ~ $ # Generate a word from the default graph in file markov.pkl
Tari@Kerwin ~ $ markov.py
awaike
Tari@Kerwin ~ $

Easy enough.  I've found that a Maori seed (via Project Gutenberg) makes for some of the more easily pronounced words, but any language will (mostly) generate words that are pronounceable via that language's pronunciation rules.
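The .pkl file names and the "IO error on foo.pkl, creating new map" message in the session above suggest the map is stored with Python's pickle module.  A minimal sketch of that load-or-create pattern, assuming a plain dict stands in for the map (load_map and save_map are hypothetical names, not functions from markov.py):

```python
import os
import pickle
import tempfile

def load_map(path):
    """Load a pickled map, or start a new one if the file can't be read."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except IOError:
        print("IO error on %s, creating new map" % path)
        return {}  # assumption: an empty dict stands in for a new map

def save_map(graph, path):
    with open(path, "wb") as f:
        pickle.dump(graph, f)

# Usage: the first load fails and creates a new map, the second finds the file
path = os.path.join(tempfile.mkdtemp(), "foo.pkl")
graph = load_map(path)
graph["a"] = {"b": 1.0}
save_map(graph, path)
print(load_map(path))
```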

For seeding with non-Latin character sets, the script can take the -l or --lax option ('strict' keyword parameter to MarkovMap.addString()), which lifts the restriction that only alphabetic characters are graphed.  The downside, then, is that everything in the input is mapped out, so you're much more likely to get garbage out unless the input is carefully sanitized of punctuation and such (GIGO, after all).
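To illustrate the difference between the two modes, here is a small sketch of what such a filter might look like.  The function filter_text is hypothetical, and treating "alphabetic" as ASCII letters only is an assumption; the real check inside addString() may differ.

```python
import string

def filter_text(text, strict=True):
    """Return the text that would actually be graphed.

    Hypothetical stand-in for the 'strict' behavior of addString().
    """
    if not strict:
        # Lax: everything in the input is mapped, punctuation included
        return text
    # Strict (assumed): keep ASCII letters, treat any whitespace as a break
    kept = "".join(c for c in text if c in string.ascii_letters or c.isspace())
    # Collapse runs of whitespace so words are separated by one ' '
    return " ".join(kept.split())

print(filter_text("r00lz, dr!"))              # rlz dr
print(filter_text("ばかです", strict=False))  # ばかです
```

Under this assumption a strict pass over kana leaves nothing to graph, which is why the lax mode is needed for such input.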

Code

Enough talk; I'm sure you just want to pick apart my code and play with nonsense words at this point.  Download link is below.  I'm providing the code under the Simplified BSD License, so you're allowed to do nearly anything with it.  I just ask that you credit me for it in some way if you reuse or redistribute it.

Download markov.py

30 May 2010

PuTTYJL

After putting up with the lack of support for Windows 7's jump lists in PuTTY for a while, I finally got tired enough of it to do something.  Nothing as cool as patching PuTTY to do them itself, but I wrote a wrapper which indexes the saved sessions, allowing the user to select which ones should be included in the list.

From the project page:

PuTTYJL is a wrapper and patch for PuTTY written in C# for .NET 3.5 and Windows 7, adding support for the new Jump Lists, allowing you to create jump list entries for saved sessions in the registry and optionally just launch the wrapper to start a default session in PuTTY.

Get it here.

3 May 2010

CPU Comparison Shopping

I've been slowly working towards putting together a new PC build to replace my current one, a Core 2 Duo-based system I built about three years ago, which is starting to show its age.  In the interest of comparison shopping, I put together a spreadsheet and some charts looking at the newer Intel (i5/i7) and AMD (Phenom X4/X6) processors.  Turns out that Intel's Core i5-750 seems to be the best deal in processors for what I'm looking for in a system at the moment.

Raw Data

Clock speeds are in MHz, TDP in watts, and cost is the price in USD at Newegg as of 5/3/2010.  Processors with SMT (hyperthreading) are noted in the Cores column.

Manufacturer  Model                Cores    Clock (MHz)  TDP (W)  Cost (USD)
AMD           Phenom II X4 955 BE  4        3200         125      159.99
AMD           Phenom II X4 940 BE  4        3000         125      161.99
AMD           Phenom II X4 965 BE  4        3400         125      180.99
AMD           Phenom II X6 1090T   6        3200         125      309.99
Intel         Core i5-650          2        3200         73       184.99
Intel         Core i5-661          2        3330         87       199.99
Intel         Core i7-920          4 (SMT)  2660         130      279.99
Intel         Core i7-930          4 (SMT)  2800         130      294.99
Intel         Core i5-750          4        2660         95       199.99
Intel         Core i7-860          4 (SMT)  2800         95       279.99
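
One simple way to slice the raw data above is dollars per core-GHz.  This is only a sketch of a crude metric chosen for illustration: it ignores per-clock performance, SMT and TDP, so it is not meant to reproduce the conclusion reached earlier.

```python
# (model, physical cores, clock in MHz, cost in USD), copied from the table
cpus = [
    ("Phenom II X4 955 BE", 4, 3200, 159.99),
    ("Phenom II X4 940 BE", 4, 3000, 161.99),
    ("Phenom II X4 965 BE", 4, 3400, 180.99),
    ("Phenom II X6 1090T", 6, 3200, 309.99),
    ("Core i5-650", 2, 3200, 184.99),
    ("Core i5-661", 2, 3330, 199.99),
    ("Core i7-920", 4, 2660, 279.99),
    ("Core i7-930", 4, 2800, 294.99),
    ("Core i5-750", 4, 2660, 199.99),
    ("Core i7-860", 4, 2800, 279.99),
]

def cost_per_core_ghz(cores, clock_mhz, cost):
    """USD per (core x GHz); lower means more raw clock for the money."""
    return cost / (cores * clock_mhz / 1000)

ranked = sorted(cpus, key=lambda cpu: cost_per_core_ghz(*cpu[1:]))
for model, cores, clock, cost in ranked:
    print(f"{model:20s} {cost_per_core_ghz(cores, clock, cost):6.2f}")
```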