Essays
Добавить в закладки К обложке
- Programming Bottom-Up - Страница 1
- Lisp for Web-Based Applications - Страница 3
- Beating the Averages - Страница 6
- Java's Cover - Страница 12
- Being Popular - Страница 14
- Five Questions about Language Design - Страница 24
- The Roots of Lisp - Страница 28
- The Other Road Ahead - Страница 29
- What Made Lisp Different - Страница 44
- Why Arc Isn't Especially Object-Oriented - Страница 45
- Taste for Makers - Страница 46
- What Languages Fix - Страница 52
- Succinctness is Power - Страница 53
- Revenge of the Nerds - Страница 57
- A Plan for Spam - Страница 65
- Design and Research - Страница 72
- Better Bayesian Filtering - Страница 76
- Why Nerds are Unpopular - Страница 82
- The Hundred-Year Language - Страница 90
- If Lisp is So Great - Страница 97
- Hackers and Painters - Страница 98
- Filters that Fight Back - Страница 105
- What You Can't Say - Страница 107
- The Word "Hacker" - Страница 114
- The Python Paradox - Страница 117
- Great Hackers - Страница 118
- The Age of the Essay - Страница 125
- What the Bubble Got Right - Страница 131
- Bradley's Ghost - Страница 136
- Made in USA - Страница 137
- What You'll Wish You'd Known - Страница 140
- How to Start a Startup - Страница 147
- A Unified Theory of VC Suckagepad - Страница 159
- Undergraduation - Страница 161
- Writing, Briefly - Страница 166
- Return of the Mac - Страница 167
- Why Smart People Have Bad Ideas - Страница 169
- The Submarine - Страница 173
- Hiring is Obsolete - Страница 177
- What Business Can Learn from Open Source - Страница 183
- After the Ladder - Страница 189
- Inequality and Risk - Страница 190
- What I Did this Summer - Страница 194
- Ideas for Startups - Страница 198
- The Venture Capital Squeeze - Страница 203
- How to Fund a Startup - Страница 205
- Web 2.0 - Страница 217
- How to Make Wealth - Страница 222
- Good and Bad Procrastination - Страница 233
- How to Do What You Love - Страница 236
- Are Software Patents Evil? - Страница 242
- The Hardest Lessons for Startups to Learn - Страница 248
- How to Be Silicon Valley - Страница 255
- Why Startups Condense in America - Страница 260
- The Power of the Marginal - Страница 267
- The Island Test - Страница 275
- Copy What You Like - Страница 276
- How to Present to Investors - Страница 278
- A Student's Guide to Startups - Страница 282
- The 18 Mistakes That Kill Startups - Страница 290
- Mind the Gap - Страница 297
- How Art Can Be Good - Страница 305
- Learning from Founders - Страница 310
- Is It Worth Being Wise? - Страница 311
- Why to Not Not Start a Startup - Страница 316
- Microsoft is Dead - Страница 324
- Two Kinds of Judgement - Страница 326
- The Hacker's Guide to Investors - Страница 327
- An Alternative Theory of Unions - Страница 336
- The Equity Equation - Страница 337
- Stuff - Страница 339
- Holding a Program in One's Head - Страница 341
- How Not to Die - Страница 344
- News from the Front - Страница 347
- How to Do Philosophy - Страница 350
- The Future of Web Startups - Страница 357
- Why to Move to a Startup Hub - Страница 362
- Six Principles for Making New Things - Страница 364
- Trolls - Страница 366
- A New Venture Animal - Страница 368
- You Weren't Meant to Have a Boss - Страница 371
Better Bayesian Filtering
(This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)
The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.
Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].
When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.
When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.
So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.
One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.
But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.
Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.
Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.
Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.
TokensAnother project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
- 46
- 47
- 48
- 49
- 50
- 51
- 52
- 53
- 54
- 55
- 56
- 57
- 58
- 59
- 60
- 61
- 62
- 63
- 64
- 65
- 66
- 67
- 68
- 69
- 70
- 71
- 72
- 73
- 74
- 75
- 76
- 77
- 78
- 79
- 80
- 81
- 82
- 83
- 84
- 85
- 86
- 87
- 88
- 89
- 90
- 91
- 92
- 93
- 94
- 95
- 96
- 97
- 98
- 99
- 100
- 101
- 102
- 103
- 104
- 105
- 106
- 107
- 108
- 109
- 110
- 111
- 112
- 113
- 114
- 115
- 116
- 117
- 118
- 119
- 120
- 121
- 122
- 123
- 124
- 125
- 126
- 127
- 128
- 129
- 130
- 131
- 132
- 133
- 134
- 135
- 136
- 137
- 138
- 139
- 140
- 141
- 142
- 143
- 144
- 145
- 146
- 147
- 148
- 149
- 150
- 151
- 152
- 153
- 154
- 155
- 156
- 157
- 158
- 159
- 160
- 161
- 162
- 163
- 164
- 165
- 166
- 167
- 168
- 169
- 170
- 171
- 172
- 173
- 174
- 175
- 176
- 177
- 178
- 179
- 180
- 181
- 182
- 183
- 184
- 185
- 186
- 187
- 188
- 189
- 190
- 191
- 192
- 193
- 194
- 195
- 196
- 197
- 198
- 199
- 200
- 201
- 202
- 203
- 204
- 205
- 206
- 207
- 208
- 209
- 210
- 211
- 212
- 213
- 214
- 215
- 216
- 217
- 218
- 219
- 220
- 221
- 222
- 223
- 224
- 225
- 226
- 227
- 228
- 229
- 230
- 231
- 232
- 233
- 234
- 235
- 236
- 237
- 238
- 239
- 240
- 241
- 242
- 243
- 244
- 245
- 246
- 247
- 248
- 249
- 250
- 251
- 252
- 253
- 254
- 255
- 256
- 257
- 258
- 259
- 260
- 261
- 262
- 263
- 264
- 265
- 266
- 267
- 268
- 269
- 270
- 271
- 272
- 273
- 274
- 275
- 276
- 277
- 278
- 279
- 280
- 281
- 282
- 283
- 284
- 285
- 286
- 287
- 288
- 289
- 290
- 291
- 292
- 293
- 294
- 295
- 296
- 297
- 298
- 299
- 300
- 301
- 302
- 303
- 304
- 305
- 306
- 307
- 308
- 309
- 310
- 311
- 312
- 313
- 314
- 315
- 316
- 317
- 318
- 319
- 320
- 321
- 322
- 323
- 324
- 325
- 326
- 327
- 328
- 329
- 330
- 331
- 332
- 333
- 334
- 335
- 336
- 337
- 338
- 339
- 340
- 341
- 342
- 343
- 344
- 345
- 346
- 347
- 348
- 349
- 350
- 351
- 352
- 353
- 354
- 355
- 356
- 357
- 358
- 359
- 360
- 361
- 362
- 363
- 364
- 365
- 366
- 367
- 368
- 369
- 370
- 371
- 372
- 373
- 374