Essays
- Programming Bottom-Up
- Lisp for Web-Based Applications
- Beating the Averages
- Java's Cover
- Being Popular
- Five Questions about Language Design
- The Roots of Lisp
- The Other Road Ahead
- What Made Lisp Different
- Why Arc Isn't Especially Object-Oriented
- Taste for Makers
- What Languages Fix
- Succinctness is Power
- Revenge of the Nerds
- A Plan for Spam
- Design and Research
- Better Bayesian Filtering
- Why Nerds are Unpopular
- The Hundred-Year Language
- If Lisp is So Great
- Hackers and Painters
- Filters that Fight Back
- What You Can't Say
- The Word "Hacker"
- The Python Paradox
- Great Hackers
- The Age of the Essay
- What the Bubble Got Right
- Bradley's Ghost
- Made in USA
- What You'll Wish You'd Known
- How to Start a Startup
- A Unified Theory of VC Suckage
- Undergraduation
- Writing, Briefly
- Return of the Mac
- Why Smart People Have Bad Ideas
- The Submarine
- Hiring is Obsolete
- What Business Can Learn from Open Source
- After the Ladder
- Inequality and Risk
- What I Did this Summer
- Ideas for Startups
- The Venture Capital Squeeze
- How to Fund a Startup
- Web 2.0
- How to Make Wealth
- Good and Bad Procrastination
- How to Do What You Love
- Are Software Patents Evil?
- The Hardest Lessons for Startups to Learn
- How to Be Silicon Valley
- Why Startups Condense in America
- The Power of the Marginal
- The Island Test
- Copy What You Like
- How to Present to Investors
- A Student's Guide to Startups
- The 18 Mistakes That Kill Startups
- Mind the Gap
- How Art Can Be Good
- Learning from Founders
- Is It Worth Being Wise?
- Why to Not Not Start a Startup
- Microsoft is Dead
- Two Kinds of Judgement
- The Hacker's Guide to Investors
- An Alternative Theory of Unions
- The Equity Equation
- Stuff
- Holding a Program in One's Head
- How Not to Die
- News from the Front
- How to Do Philosophy
- The Future of Web Startups
- Why to Move to a Startup Hub
- Six Principles for Making New Things
- Trolls
- A New Venture Animal
- You Weren't Meant to Have a Boss
Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.
Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.
``A Plan for Spam'' uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case.
Now I have a more complicated definition of a token:
- Case is preserved.
- Exclamation points are constituent characters.
- Periods and commas are constituents if they occur between two digits. This lets me get IP addresses and prices intact.
- A price range like $20-25 yields two tokens, $20 and $25.
- Tokens that occur within the To, From, Subject, and Return-Path lines, or within URLs, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.)
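As a sketch, the rules above might look like this in Python (the function and variable names are mine, not the filter's, and this is only one plausible rendering of the rules):

```python
import re

# Constituents: letters, digits, dashes, apostrophes, dollar signs,
# and exclamation points. A period or comma counts only when it sits
# between two digits, so IP addresses and prices survive intact.
TOKEN_RE = re.compile(r"(?:[A-Za-z0-9$'!-]|(?<=\d)[.,](?=\d))+")

def tokenize(text, context=None):
    """Split text into tokens; `context` is a marked field name
    such as "Subject", "To", "From", "Return-Path", or "Url"."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # a price range like $20-25 yields two tokens, $20 and $25
        m = re.fullmatch(r"(\$\d+)-\$?(\d+)", tok)
        if m:
            tokens.append(m.group(1))
            tokens.append('$' + m.group(2))
        else:
            tokens.append(tok)
    if context:
        tokens = [context + '*' + t for t in tokens]
    return tokens
```

So `tokenize("FREE!!", context="Subject")` would yield `["Subject*FREE!!"]`, with case and exclamation points preserved and the context marked.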
Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, ``free'' in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.
Here are some of the current probabilities [6]:
Subject*FREE   0.9999
free!!         0.9999
To*free        0.9998
Subject*free   0.9782
free!          0.9199
Free           0.9198
Url*free       0.9091
FREE           0.8747
From*free      0.7636
free           0.6546
In the Plan for Spam filter, all these tokens would have had the same probability, .7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.
The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.
One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for ``Subject*free!'', I look for probabilities for ``Subject*free'', ``free!'', and ``free'', and take whichever one is farthest from .5.
Here are the alternatives [7] considered if the filter sees ``FREE!!!'' in the Subject line and doesn't have a probability for it.
Subject*Free!!!
Subject*free!!!
Subject*FREE!
Subject*Free!
Subject*free!
Subject*FREE
Subject*Free
Subject*free
FREE!!!
Free!!!
free!!!
FREE!
Free!
free!
FREE
Free
free
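Degeneration and the farthest-from-.5 lookup might be sketched like this (a hypothetical rendering, with the 0.4 unseen-token default taken from ``A Plan for Spam''):

```python
def degenerate(token):
    """Generate less specific variants of a token: vary case, then
    trailing exclamation points, then drop the context marker."""
    if '*' in token:
        ctx, word = token.split('*', 1)
        contexts = [ctx + '*', '']
    else:
        contexts = ['']
        word = token
    stem = word.rstrip('!')
    n = len(word) - len(stem)
    # exclamation-point variants: as given, then one, then none
    suffixes = ['!' * n]
    if n > 1: suffixes.append('!')
    if n > 0: suffixes.append('')
    # case variants: as given, initial cap, all lowercase (deduped)
    cases = []
    for w in (stem, stem.capitalize(), stem.lower()):
        if w not in cases:
            cases.append(w)
    out = []
    for c in contexts:
        for s in suffixes:
            for w in cases:
                t = c + w + s
                if t != token and t not in out:
                    out.append(t)
    return out

def best_probability(token, probs):
    """Exact match if known; otherwise the known degenerate form whose
    probability is farthest from .5; otherwise 0.4, the default for
    unseen tokens in ``A Plan for Spam''."""
    if token in probs:
        return probs[token]
    known = [probs[t] for t in degenerate(token) if t in probs]
    if not known:
        return 0.4
    return max(known, key=lambda p: abs(p - 0.5))
```

On ``Subject*FREE!!!'' this produces exactly the seventeen alternatives listed above, in the same order.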
If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative mood, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of ``Act'' is 98% and for ``act'' only 62%.
If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of ``same''. Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.
Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from .5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.
For example, the token ``dalco'' occurs 3 times in my spam corpus and never in my legitimate corpus. The token ``Url*optmails'' (meaning ``optmails'' within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of .99.
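A minimal sketch of the selection and combination steps (the combining rule and the 0.4 unseen-token default come from ``A Plan for Spam''; ties here break by input order, which is exactly the arbitrariness described above):

```python
from math import prod  # Python 3.8+

def spam_probability(tokens, probs, n=15, unknown=0.4):
    """Rank tokens by how far their probability lies from .5, keep
    the n most interesting, and combine them naive-Bayes style."""
    interesting = sorted(
        dict.fromkeys(tokens),  # dedupe, preserving order
        key=lambda t: abs(probs.get(t, unknown) - 0.5),
        reverse=True,           # stable sort: ties keep input order
    )[:n]
    ps = [probs.get(t, unknown) for t in interesting]
    return prod(ps) / (prod(ps) + prod(1 - p for p in ps))
```

With a fixed cap like n=15, two tokens both capped at .99 are indistinguishable to this ranking, which is why treating some tokens as more interesting than others helps.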