Words
I wrote an Objective-C program to parse all of my iChat log files and produce some SQL insert statements, one per chat message. The program was necessary because the log format is a binary property list that I really didn't want to have to parse by hand. The logs go back to early 2005; the pre-2005 logs are backed up elsewhere and in another format, so I'm ignoring them for now. After much anguish regarding how to properly escape string literals, the doubly-escaped SQL output was dumped into Postgres and we finally have something to work with. (The anguish was specifically about backslashes. Hint: double-escape the backslash and use the octal representation, i.e. in a C-like language, replace "\\" with "\\\\134", which is really "\\134" in the SQL statement, which the Postgres parser then turns into "\134" for bytea insertion. Thank you, Steven, for typing a :-\ smiley face and costing me an hour of my life.)
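As a rough illustration of the backslash rule above, here is a minimal Python sketch (not the original Objective-C code). The function name `escape_for_bytea_literal` is made up for this example. It assumes plain ASCII text and the older Postgres behavior where a backslash inside an ordinary string literal is treated as an escape character. It also doubles single quotes, which is ordinary SQL quoting rather than something described above.

```python
def escape_for_bytea_literal(text):
    """Return text as a quoted SQL string literal suitable for a bytea column.

    Hypothetical helper. Assumes ASCII input and a server that treats
    backslashes in plain string literals as escapes.
    """
    pieces = []
    for ch in text:
        if ch == "\\":
            # One backslash in the data becomes the five characters \\134
            # in the SQL text. The string parser reduces that to \134,
            # which bytea input reads as octal 134, a single backslash.
            pieces.append("\\\\134")
        elif ch == "'":
            # Standard SQL quoting: a single quote is written twice.
            pieces.append("''")
        else:
            pieces.append(ch)
    return "'" + "".join(pieces) + "'"


if __name__ == "__main__":
    # The smiley that started it all.
    print(escape_for_bytea_literal(":-\\"))
```

If a database driver is available, a parameterized query would sidestep this kind of hand escaping entirely.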
I had 291,640 total chat messages (150,091 received, 141,549 sent) comprising 2,503,521 words (41,415 distinct words, including typos and garbage), contributed by 170 "people" (i.e., screen names) over 911 days between March 9, 2005 and March 31, 2009. If the set of distinct words is reduced to just those words used at least five times, it works out to 13,342 words (8,640 received, 9,374 sent). Not exactly a testament to my vocabulary. Anyway....
Then I wrote a Perl program to pull down each chat message, lowercase it, split it into words, create a unique word entry if the word hasn't been encountered before, and associate that word id with that chat id. An apostrophe that does not occur at the beginning or end of a word was the only non-A-Z character I allowed in a word; everything else defined the boundary of a word. After watching four Futurama episodes, it still wasn't done, so I went to bed. As if by magic, it was done in the morning. (This process would go much, much faster if I put the directory-recursing code in the Objective-C program instead of doing the recursing in the shell. But I really don't care.)
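The word-splitting rule above can be sketched in a few lines of Python (the original was Perl, and this is not that program). The names `split_words` and `index_message`, and the use of an in-memory dict and list in place of database tables, are assumptions made for the example.

```python
import re

# A word is a run of A-Z letters, optionally joined by single interior
# apostrophes. Anything else, including a leading or trailing apostrophe,
# acts as a word boundary.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def split_words(message):
    """Return the lowercased words found in one chat message."""
    return [word.lower() for word in WORD_RE.findall(message)]


def index_message(chat_id, message, word_ids, links):
    """Assign an id to each new word and record (chat_id, word_id) pairs.

    word_ids is a dict of word -> id and links is a list of pairs; both
    are stand-ins for the database tables.
    """
    for word in split_words(message):
        word_id = word_ids.setdefault(word, len(word_ids) + 1)
        links.append((chat_id, word_id))


if __name__ == "__main__":
    print(split_words("Don't go to http://example.com :-\\"))
    # prints ["don't", 'go', 'to', 'http', 'example', 'com']
```

Splitting on everything that isn't a letter is why fragments like "http" and "com" show up as words in their own right.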
Then I wrote some SQL to extract the top 100 most used words by people I chat with, and the top 100 most used words by me. The results can be seen in the table that follows.
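For the curious, a top-N query of this kind might look roughly like the sketch below. The table and column names (`chats`, `words`, `chat_words`, `direction`) are guesses made for illustration, not the actual schema, and SQLite's in-memory database stands in for Postgres so the example runs on its own.

```python
import sqlite3


def top_words(conn, direction, limit=100):
    """Return (word, uses) rows for one direction, most used first."""
    return conn.execute(
        """
        SELECT w.word, COUNT(*) AS uses
        FROM chat_words cw
        JOIN words w ON w.id = cw.word_id
        JOIN chats c ON c.id = cw.chat_id
        WHERE c.direction = ?
        GROUP BY w.word
        ORDER BY uses DESC, w.word ASC
        LIMIT ?
        """,
        (direction, limit),
    ).fetchall()


def demo_connection():
    """Build a tiny in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chats (id INTEGER PRIMARY KEY, direction TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE chat_words (chat_id INTEGER, word_id INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO chats VALUES (?, ?)",
        [(1, "sent"), (2, "recv"), (3, "sent")],
    )
    conn.executemany(
        "INSERT INTO words VALUES (?, ?)",
        [(1, "the"), (2, "i"), (3, "yeah")],
    )
    # One row per word occurrence, so a word used twice in a chat has two rows.
    conn.executemany(
        "INSERT INTO chat_words VALUES (?, ?)",
        [(1, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 1)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(top_words(conn, "sent"))
    print(top_words(conn, "recv"))
```

Running the same query once per direction gives the two columns of a sent-versus-received comparison.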
The people I talk with apparently tend to talk about themselves more than I talk about myself (I save that for my blog, I guess):
"I" is #1 vs #2
"my" is #15 vs #18
"me" is #22 vs #26
and about me more than I talk about them (with the possible exception that it could just be lots of instances of speaking of oneself in the third person):
"you" is #5 vs #9
"your" is #43 for both
and apparently a lot of people send me links because "http" made it in as #90.
Update: I found this site, wordcount.org (please try to ignore its interface, which tries to make the information as useless as possible), which ranks English words by popularity of use. What's interesting is that its top ten is very different. "I" doesn't even make it into the top ten. I guess instant messaging isn't entirely similar to their British National Corpus data source.
| # | word |
| 1 | the |
| 2 | of |
| 3 | and |
| 4 | to |
| 5 | a |
| 6 | in |
| 7 | that |
| 8 | it |
| 9 | is |
| 10 | was |
Okay, here's the table. Top-100 words by use, separated into incoming vs outgoing.
| # | sent | count (%) | received | count (%) | ||
| 1 | the | 54961 (3.94%) | i | 38623 (3.85%) | ||
| 2 | i | 49867 (3.58%) | the | 35158 (3.50%) | ||
| 3 | to | 42962 (3.08%) | to | 28671 (2.85%) | ||
| 4 | a | 33516 (2.41%) | a | 22164 (2.21%) | ||
| 5 | that | 29770 (2.14%) | you | 18797 (1.87%) | ||
| 6 | and | 27304 (1.96%) | it | 18576 (1.85%) | ||
| 7 | it | 26669 (1.91%) | and | 16629 (1.66%) | ||
| 8 | of | 24041 (1.73%) | that | 14723 (1.47%) | ||
| 9 | you | 22118 (1.59%) | is | 14295 (1.42%) | ||
| 10 | is | 16816 (1.21%) | of | 11547 (1.15%) | ||
| 11 | in | 15278 (1.10%) | in | 10946 (1.09%) | ||
| 12 | for | 13483 (0.97%) | for | 9729 (0.97%) | ||
| 13 | have | 11649 (0.84%) | have | 8609 (0.86%) | ||
| 14 | on | 11580 (0.83%) | on | 7996 (0.80%) | ||
| 15 | so | 11487 (0.82%) | my | 7858 (0.78%) | ||
| 16 | be | 11330 (0.81%) | so | 7332 (0.73%) | ||
| 17 | was | 10089 (0.72%) | but | 6901 (0.69%) | ||
| 18 | my | 9809 (0.70%) | be | 6805 (0.68%) | ||
| 19 | not | 9561 (0.69%) | was | 6706 (0.67%) | ||
| 20 | but | 9344 (0.67%) | not | 6606 (0.66%) | ||
| 21 | with | 8650 (0.62%) | with | 6000 (0.60%) | ||
| 22 | just | 7988 (0.57%) | me | 5837 (0.58%) | ||
| 23 | i'm | 7756 (0.56%) | yeah | 5811 (0.58%) | ||
| 24 | if | 7214 (0.52%) | just | 5728 (0.57%) | ||
| 25 | at | 6934 (0.50%) | do | 5589 (0.56%) | ||
| 26 | me | 6790 (0.49%) | this | 5306 (0.53%) | ||
| 27 | that's | 6481 (0.47%) | i'm | 5252 (0.52%) | ||
| 28 | they | 6394 (0.46%) | like | 5103 (0.51%) | ||
| 29 | this | 6211 (0.45%) | are | 4998 (0.50%) | ||
| 30 | do | 6139 (0.44%) | they | 4993 (0.50%) | ||
| 31 | one | 6118 (0.44%) | what | 4921 (0.49%) | ||
| 32 | don't | 5984 (0.43%) | it's | 4743 (0.47%) | ||
| 33 | as | 5757 (0.41%) | if | 4662 (0.46%) | ||
| 34 | or | 5750 (0.41%) | at | 4582 (0.46%) | ||
| 35 | up | 5745 (0.41%) | get | 4377 (0.44%) | ||
| 36 | can | 5727 (0.41%) | has | 4271 (0.43%) | ||
| 37 | out | 5338 (0.38%) | no | 4246 (0.42%) | ||
| 38 | what | 5274 (0.38%) | one | 4165 (0.41%) | ||
| 39 | all | 5245 (0.38%) | or | 4146 (0.41%) | ||
| 40 | like | 5177 (0.37%) | now | 4054 (0.40%) | ||
| 41 | about | 4998 (0.36%) | out | 3987 (0.40%) | ||
| 42 | it's | 4992 (0.36%) | can | 3689 (0.37%) | ||
| 43 | your | 4919 (0.35%) | your | 3680 (0.37%) | ||
| 44 | are | 4913 (0.35%) | about | 3654 (0.36%) | ||
| 45 | then | 4892 (0.35%) | don't | 3611 (0.36%) | ||
| 46 | think | 4810 (0.35%) | up | 3608 (0.36%) | ||
| 47 | would | 4765 (0.34%) | all | 3588 (0.36%) | ||
| 48 | get | 4752 (0.34%) | he | 3541 (0.35%) | ||
| 49 | yeah | 4731 (0.34%) | know | 3487 (0.35%) | ||
| 50 | no | 4715 (0.34%) | well | 3377 (0.34%) | ||
| 51 | he | 4714 (0.34%) | would | 3372 (0.34%) | ||
| 52 | there | 4647 (0.33%) | think | 3288 (0.33%) | ||
| 53 | going | 4444 (0.32%) | we | 3131 (0.31%) | ||
| 54 | time | 4396 (0.32%) | there | 3121 (0.31%) | ||
| 55 | well | 4234 (0.30%) | an | 3067 (0.31%) | ||
| 56 | an | 4179 (0.30%) | ok | 3051 (0.30%) | ||
| 57 | some | 4127 (0.30%) | from | 2895 (0.29%) | ||
| 58 | know | 4025 (0.29%) | will | 2868 (0.29%) | ||
| 59 | from | 3880 (0.28%) | that's | 2839 (0.28%) | ||
| 60 | good | 3853 (0.28%) | as | 2821 (0.28%) | ||
| 61 | when | 3800 (0.27%) | good | 2815 (0.28%) | ||
| 62 | very | 3710 (0.27%) | oh | 2802 (0.28%) | ||
| 63 | them | 3671 (0.26%) | how | 2773 (0.28%) | ||
| 64 | work | 3493 (0.25%) | when | 2754 (0.27%) | ||
| 65 | i'll | 3454 (0.25%) | then | 2688 (0.27%) | ||
| 66 | much | 3415 (0.25%) | really | 2680 (0.27%) | ||
| 67 | probably | 3380 (0.24%) | gone | 2583 (0.26%) | ||
| 68 | more | 3349 (0.24%) | going | 2530 (0.25%) | ||
| 69 | had | 3245 (0.23%) | lol | 2513 (0.25%) | ||
| 70 | any | 3238 (0.23%) | work | 2442 (0.24%) | ||
| 71 | which | 3210 (0.23%) | offline | 2419 (0.24%) | ||
| 72 | really | 3144 (0.23%) | go | 2416 (0.24%) | ||
| 73 | something | 3018 (0.22%) | did | 2398 (0.24%) | ||
| 74 | see | 2922 (0.21%) | right | 2376 (0.24%) | ||
| 75 | could | 2892 (0.21%) | some | 2355 (0.23%) | ||
| 76 | right | 2874 (0.21%) | time | 2316 (0.23%) | ||
| 77 | i've | 2860 (0.21%) | more | 2293 (0.23%) | ||
| 78 | did | 2815 (0.20%) | see | 2221 (0.22%) | ||
| 79 | now | 2784 (0.20%) | them | 2216 (0.22%) | ||
| 80 | got | 2761 (0.20%) | need | 2210 (0.22%) | ||
| 81 | should | 2742 (0.20%) | got | 2156 (0.21%) | ||
| 82 | thing | 2626 (0.19%) | had | 2154 (0.21%) | ||
| 83 | want | 2621 (0.19%) | com | 2089 (0.21%) | ||
| 84 | go | 2608 (0.19%) | too | 2065 (0.21%) | ||
| 85 | oh | 2547 (0.18%) | want | 2040 (0.20%) | ||
| 86 | how | 2518 (0.18%) | could | 2010 (0.20%) | ||
| 87 | back | 2515 (0.18%) | she | 1969 (0.20%) | ||
| 88 | we | 2489 (0.18%) | am | 1947 (0.19%) | ||
| 89 | has | 2400 (0.17%) | i'll | 1910 (0.19%) | ||
| 90 | only | 2337 (0.17%) | http | 1909 (0.19%) | ||
| 91 | you're | 2320 (0.17%) | something | 1770 (0.18%) | ||
| 92 | need | 2286 (0.16%) | should | 1765 (0.18%) | ||
| 93 | people | 2285 (0.16%) | yes | 1724 (0.17%) | ||
| 94 | she | 2258 (0.16%) | by | 1723 (0.17%) | ||
| 95 | didn't | 2247 (0.16%) | only | 1715 (0.17%) | ||
| 96 | by | 2165 (0.16%) | people | 1714 (0.17%) | ||
| 97 | sure | 2147 (0.15%) | haha | 1634 (0.16%) | ||
| 98 | other | 2143 (0.15%) | back | 1630 (0.16%) | ||
| 99 | way | 2113 (0.15%) | which | 1615 (0.16%) | ||
| 100 | make | 2110 (0.15%) | make | 1595 (0.16%) |
Shown on a logarithmic scale, the word counts make it apparent how quickly frequency drops off:
[Figure: word counts on a logarithmic scale]
I have a ton more email history than chat history. I'll see about parsing through all of that when I get the time. Got to figure out how to avoid the inline Base64-encoded attachments.
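One possible way to skip the attachments, sketched with Python's standard `email` package: walk the MIME parts and keep only the plain-text ones. This is untested against real mailboxes. The function name `text_body` is made up, and it assumes each message is available as a raw RFC 822 string with sane charset labels.

```python
from email import message_from_string


def text_body(raw_message):
    """Return the plain-text parts of a raw message, skipping attachments."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        # Multipart containers and non-text parts (images, Base64 blobs)
        # are ignored.
        if part.get_content_type() != "text/plain":
            continue
        # A text file sent as an attachment is skipped too.
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # undoes Base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)
```

The returned text could then go through the same word-splitting step as the chat messages.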