Essays
Here is an example of a spam that arrived while I was writing this article. The fifteen most interesting words in this spam are:
qvp0045 indira mx-05 intimail $7500 freeyankeedom cdo bluefoxmedia jpg unsecured platinum 3d0 qves 7c5 7c266675
The words are a mix of stuff from the headers and from the message body, which is typical of spam. Also typical of spam is that every one of these words has a spam probability, in my database, of .99. In fact there are more than fifteen words with probabilities of .99, and these are just the first fifteen seen.
Unfortunately that makes this email a boring example of the use of Bayes' Rule. To see an interesting variety of probabilities we have to look at this actually quite atypical spam.
The fifteen most interesting words in this spam, with their probabilities, are:
madam            0.99
promotion        0.99
republic         0.99
shortest         0.047225013
mandatory        0.047225013
standardization  0.07347802
sorry            0.08221981
supported        0.09019077
people's         0.09019077
enter            0.9075001
quality          0.8921298
organization     0.12454646
investment       0.8568143
very             0.14758544
valuable         0.82347786
This time the evidence is a mix of good and bad. A word like "shortest" is almost as much evidence for innocence as a word like "madam" or "promotion" is for guilt. But still the case for guilt is stronger. If you combine these numbers according to Bayes' Rule, the resulting probability is .9027.
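The combination step can be reproduced with a short sketch. It assumes the usual naive form of Bayes' Rule, in which the per-word probabilities are multiplied together and normalized against the product of their complements, treating the words as independent. The function name `combine` is made up for illustration; the numbers are the fifteen listed above.

```python
from math import prod

# Spam probabilities of the fifteen most interesting words, as listed above.
PROBS = [
    0.99,         # madam
    0.99,         # promotion
    0.99,         # republic
    0.047225013,  # shortest
    0.047225013,  # mandatory
    0.07347802,   # standardization
    0.08221981,   # sorry
    0.09019077,   # supported
    0.09019077,   # people's
    0.9075001,    # enter
    0.8921298,    # quality
    0.12454646,   # organization
    0.8568143,    # investment
    0.14758544,   # very
    0.82347786,   # valuable
]


def combine(probs):
    """Combine per-word spam probabilities, assuming the words are independent."""
    spam = prod(probs)                  # evidence for guilt
    ham = prod(1.0 - p for p in probs)  # evidence for innocence
    return spam / (spam + ham)


print(round(combine(PROBS), 4))
```

Run on these fifteen values, the result should land close to the .9027 quoted above. Three words at .99 outweigh half a dozen words below .1, which is why the case for guilt still wins.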
"Madam" is obviously from spams beginning "Dear Sir or Madam." They're not very common, but the word "madam" never occurs in my legitimate email, and it's all about the ratio.
"Republic" scores high because it often shows up in Nigerian scam emails, and also occurs once or twice in spams referring to Korea and South Africa. You might say that it's an accident that it thus helps identify this spam. But I've found when examining spam probabilities that there are a lot of these accidents, and they have an uncanny tendency to push things in the right direction rather than the wrong one. In this case, it is not entirely a coincidence that the word "Republic" occurs in Nigerian scam emails and this spam. There is a whole class of dubious business propositions involving less developed countries, and these in turn are more likely to have names that specify explicitly (because they aren't) that they are republics.[3]
On the other hand, "enter" is a genuine miss. It occurs mostly in unsubscribe instructions, but here is used in a completely innocent way. Fortunately the statistical approach is fairly robust, and can tolerate quite a lot of misses before the results start to be thrown off.
For comparison, here is an example of that rare bird, a spam that gets through the filters. Why? Because by sheer chance it happens to be loaded with words that occur in my actual email:
perl       0.01
python     0.01
tcl        0.01
scripting  0.01
morris     0.01
graham     0.01491078
guarantee  0.9762507
cgi        0.9734398
paul       0.027040077
quite      0.030676773
pop3       0.042199217
various    0.06080265
prices     0.9359873
managed    0.06451222
difficult  0.071706355
There are a couple pieces of good news here. First, this mail probably wouldn't get through the filters of someone who didn't happen to specialize in programming languages and have a good friend called Morris. For the average user, all the top five words here would be neutral and would not contribute to the spam probability.
Second, I think filtering based on word pairs (see below) might well catch this one: "cost effective", "setup fee", "money back" -- pretty incriminating stuff. And of course if they continued to spam me (or a network I was part of), "Hostex" itself would be recognized as a spam term.
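As a rough illustration of what filtering on word pairs would need, here is a sketch that turns a message into adjacent word pairs so that each pair can be counted like a single token. The tokenizing regex, the helper name `word_pairs`, and the sample sentence are all assumptions made for the example, not taken from the filter itself.

```python
import re


def word_pairs(text):
    """Return adjacent lowercase word pairs, e.g. 'money back'."""
    # Hypothetical tokenizer: letters, digits, dollar signs, apostrophes, dashes.
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


# Invented sample text, used only to show the shape of the output.
print(word_pairs("Money back guarantee, no setup fee"))
```

Each pair would then get its own spam probability, so "money back" could score as incriminating even if "money" and "back" are unremarkable on their own.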
Finally, here is an innocent email. Its fifteen most interesting words are as follows:
continuation   0.01
describe       0.01
continuations  0.01
example        0.033600237
programming    0.05214485
i'm            0.055427782
examples       0.07972858
color          0.9189189
localhost      0.09883721
hi             0.116539136
california     0.84421706
same           0.15981844
spot           0.1654587
us-ascii       0.16804294
what           0.19212411
Most of the words here indicate the mail is an innocent one. There are two bad smelling words, "color" (spammers love colored fonts) and "California" (which occurs in testimonials and also in menus in forms), but they are not enough to outweigh obviously innocent words like "continuation" and "example".
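In each of the listings with probabilities, the words appear in order of how far their probability lies from a neutral 0.5, which is presumably what "interesting" means here. A minimal sketch of that selection, assuming the probabilities are held in a plain dict; the helper name `most_interesting` is invented for illustration, as is the entry for "the".

```python
def most_interesting(word_probs, n=15):
    """Return the n (word, probability) pairs farthest from a neutral 0.5.

    sorted() is stable even with reverse=True, so words that tie keep the
    order in which they were first seen.
    """
    return sorted(
        word_probs.items(),
        key=lambda item: abs(item[1] - 0.5),
        reverse=True,
    )[:n]


# A few words from the innocent email above, deliberately out of order,
# plus one invented neutral word ("the") that should never be selected.
sample = {
    "what": 0.19212411,
    "color": 0.9189189,
    "continuation": 0.01,
    "the": 0.5,
    "california": 0.84421706,
    "describe": 0.01,
    "hi": 0.116539136,
}

for word, prob in most_interesting(sample, n=3):
    print(word, prob)
```

The ranking looks only at distance from 0.5, so a strongly innocent word such as "continuation" competes for a slot on equal terms with a strongly incriminating one such as "color".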