Essays
- Programming Bottom-Up
- Lisp for Web-Based Applications
- Beating the Averages
- Java's Cover
- Being Popular
- Five Questions about Language Design
- The Roots of Lisp
- The Other Road Ahead
- What Made Lisp Different
- Why Arc Isn't Especially Object-Oriented
- Taste for Makers
- What Languages Fix
- Succinctness is Power
- Revenge of the Nerds
- A Plan for Spam
- Design and Research
- Better Bayesian Filtering
- Why Nerds are Unpopular
- The Hundred-Year Language
- If Lisp is So Great
- Hackers and Painters
- Filters that Fight Back
- What You Can't Say
- The Word "Hacker"
- The Python Paradox
- Great Hackers
- The Age of the Essay
- What the Bubble Got Right
- Bradley's Ghost
- Made in USA
- What You'll Wish You'd Known
- How to Start a Startup
- A Unified Theory of VC Suckage
- Undergraduation
- Writing, Briefly
- Return of the Mac
- Why Smart People Have Bad Ideas
- The Submarine
- Hiring is Obsolete
- What Business Can Learn from Open Source
- After the Ladder
- Inequality and Risk
- What I Did this Summer
- Ideas for Startups
- The Venture Capital Squeeze
- How to Fund a Startup
- Web 2.0
- How to Make Wealth
- Good and Bad Procrastination
- How to Do What You Love
- Are Software Patents Evil?
- The Hardest Lessons for Startups to Learn
- How to Be Silicon Valley
- Why Startups Condense in America
- The Power of the Marginal
- The Island Test
- Copy What You Like
- How to Present to Investors
- A Student's Guide to Startups
- The 18 Mistakes That Kill Startups
- Mind the Gap
- How Art Can Be Good
- Learning from Founders
- Is It Worth Being Wise?
- Why to Not Not Start a Startup
- Microsoft is Dead
- Two Kinds of Judgement
- The Hacker's Guide to Investors
- An Alternative Theory of Unions
- The Equity Equation
- Stuff
- Holding a Program in One's Head
- How Not to Die
- News from the Front
- How to Do Philosophy
- The Future of Web Startups
- Why to Move to a Startup Hub
- Six Principles for Making New Things
- Trolls
- A New Venture Animal
- You Weren't Meant to Have a Boss
It's interesting that "describe" rates as so thoroughly innocent. It hasn't occurred in a single one of my 4000 spams. The data turns out to be full of such surprises. One of the things you learn when you analyze spam texts is how narrow a subset of the language spammers operate in. It's that fact, together with the equally characteristic vocabulary of any individual user's mail, that makes Bayesian filtering a good bet.
Appendix: More Ideas

One idea that I haven't tried yet is to filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. For example, in my current database, the word "offers" has a probability of .96. If you based the probabilities on word pairs, you'd end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.
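The word-pair idea above can be sketched by applying the essay's per-token counting scheme to bigrams instead of single words. This is a minimal illustration, not the essay's actual code; the corpus inputs and the rare-token threshold are assumptions.

```python
from collections import Counter

def token_probs(good_texts, bad_texts, ngram=2):
    """Estimate per-token spam probabilities for word n-grams, using the
    same counting scheme the essay applies to single words."""
    def ngrams(text):
        words = text.lower().split()
        return [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]

    good, bad = Counter(), Counter()
    for t in good_texts:
        good.update(ngrams(t))
    for t in bad_texts:
        bad.update(ngrams(t))

    ngood, nbad = max(len(good_texts), 1), max(len(bad_texts), 1)
    probs = {}
    for token in set(good) | set(bad):
        g = 2 * good[token]   # good counts doubled, to bias against false positives
        b = bad[token]
        if g + b < 5:         # ignore tokens seen too rarely to trust
            continue
        p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
        probs[token] = max(0.01, min(0.99, p))  # clamp to [.01, .99]
    return probs
```

A bigram like "special offers" seen only in spam ends up near .99, while "approach offers" seen only in legitimate mail ends up near .01, giving the sharper separation the paragraph describes.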
The reason I haven't done this is that filtering based on individual words already works so well. But it does mean that there is room to tighten the filters if spam gets harder to detect. (Curiously, a filter based on word pairs would be in effect a Markov-chaining text generator running in reverse.)
Specific spam features (e.g. not seeing the recipient's address in the to: field) do of course have value in recognizing spam. They can be considered in this algorithm by treating them as virtual words. I'll probably do this in future versions, at least for a handful of the most egregious spam indicators. Feature-recognizing spam filters are right in many details; what they lack is an overall discipline for combining evidence.
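Treating such features as virtual words might look like the sketch below: each detected indicator becomes a synthetic token that is counted and scored exactly like an ordinary word. The message fields and token names here are hypothetical, chosen only for illustration.

```python
def feature_tokens(msg):
    """Turn specific spam indicators into 'virtual word' tokens so they can
    be scored alongside ordinary tokens. msg is a dict with hypothetical
    fields: 'to_addr', 'recipient_addrs', 'html'."""
    tokens = []
    # e.g. the recipient's own address not appearing in the to: field
    if msg.get("to_addr") not in msg.get("recipient_addrs", []):
        tokens.append("*recipient-not-in-to*")
    if msg.get("html", False):
        tokens.append("*contains-html*")
    return tokens
```

Because the virtual tokens flow through the same probability machinery as real words, the combining discipline the paragraph says feature-based filters lack comes for free.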
Recognizing nonspam features may be more important than recognizing spam features. False positives are such a worry that they demand extraordinary measures. I will probably in future versions add a second level of testing designed specifically to avoid false positives. If a mail triggers this second level of filters it will be accepted even if its spam probability is above the threshold.
I don't expect this second level of filtering to be Bayesian. It will inevitably be not only ad hoc, but based on guesses, because the number of false positives will not tend to be large enough to notice patterns. (It is just as well, anyway, if a backup system doesn't rely on the same technology as the primary system.)
Another thing I may try in the future is to focus extra attention on specific parts of the email. For example, about 95% of current spam includes the url of a site they want you to visit. (The remaining 5% want you to call a phone number, reply by email or to a US mail address, or in a few cases to buy a certain stock.) The url is in such cases practically enough by itself to determine whether the email is spam.
Domain names differ from the rest of the text in a (non-German) email in that they often consist of several words stuck together. Though computationally expensive in the general case, it might be worth trying to decompose them. If a filter has never seen the token "xxxporn" before it will have an individual spam probability of .4, whereas "xxx" and "porn" individually have probabilities (in my corpus) of .9889 and .99 respectively, and a combined probability of .9998.
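The decomposition and the combined probability can be sketched as follows. The greedy longest-match splitter is a deliberate simplification (real decomposition is harder, as the paragraph notes); the combining formula is the standard naive-Bayes combination of individual token probabilities.

```python
from math import prod

def decompose(domain, vocab):
    """Greedily split a domain into known tokens, preferring longer matches.
    A simplistic sketch; characters no known token covers are skipped."""
    parts, i = [], 0
    while i < len(domain):
        for j in range(len(domain), i, -1):
            if domain[i:j] in vocab:
                parts.append(domain[i:j])
                i = j
                break
        else:
            i += 1
    return parts

def combine(probs):
    """Combine individual token probabilities: prod(p) / (prod(p) + prod(1-p))."""
    s = prod(probs)
    return s / (s + prod(1 - p for p in probs))
```

With the corpus probabilities quoted in the text, combine([0.9889, 0.99]) comes out just under 1, far above the .4 an unknown token like "xxxporn" would get on its own.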
I expect decomposing domain names to become more important as spammers are gradually forced to stop using incriminating words in the text of their messages. (A url with an ip address is of course an extremely incriminating sign, except in the mail of a few sysadmins.)
It might be a good idea to have a cooperatively maintained list of urls promoted by spammers. We'd need a trust metric of the type studied by Raph Levien to prevent malicious or incompetent submissions, but if we had such a thing it would provide a boost to any filtering software. It would also be a convenient basis for boycotts.
Another way to test dubious urls would be to send out a crawler to look at the site before the user looked at the email mentioning it. You could use a Bayesian filter to rate the site just as you would an email, and whatever was found on the site could be included in calculating the probability of the email being a spam. A url that led to a redirect would of course be especially suspicious.
One cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam. A large, clean corpus is the key to making Bayesian filtering work well. Bayesian filters could actually use the corpus as input. But such a corpus would be useful for other kinds of filters too, because it could be used to test them.