Essays
Добавить в закладки К обложке
- Programming Bottom-Up - Страница 1
- Lisp for Web-Based Applications - Страница 3
- Beating the Averages - Страница 6
- Java's Cover - Страница 12
- Being Popular - Страница 14
- Five Questions about Language Design - Страница 24
- The Roots of Lisp - Страница 28
- The Other Road Ahead - Страница 29
- What Made Lisp Different - Страница 44
- Why Arc Isn't Especially Object-Oriented - Страница 45
- Taste for Makers - Страница 46
- What Languages Fix - Страница 52
- Succinctness is Power - Страница 53
- Revenge of the Nerds - Страница 57
- A Plan for Spam - Страница 65
- Design and Research - Страница 72
- Better Bayesian Filtering - Страница 76
- Why Nerds are Unpopular - Страница 82
- The Hundred-Year Language - Страница 90
- If Lisp is So Great - Страница 97
- Hackers and Painters - Страница 98
- Filters that Fight Back - Страница 105
- What You Can't Say - Страница 107
- The Word "Hacker" - Страница 114
- The Python Paradox - Страница 117
- Great Hackers - Страница 118
- The Age of the Essay - Страница 125
- What the Bubble Got Right - Страница 131
- Bradley's Ghost - Страница 136
- Made in USA - Страница 137
- What You'll Wish You'd Known - Страница 140
- How to Start a Startup - Страница 147
- A Unified Theory of VC Suckagepad - Страница 159
- Undergraduation - Страница 161
- Writing, Briefly - Страница 166
- Return of the Mac - Страница 167
- Why Smart People Have Bad Ideas - Страница 169
- The Submarine - Страница 173
- Hiring is Obsolete - Страница 177
- What Business Can Learn from Open Source - Страница 183
- After the Ladder - Страница 189
- Inequality and Risk - Страница 190
- What I Did this Summer - Страница 194
- Ideas for Startups - Страница 198
- The Venture Capital Squeeze - Страница 203
- How to Fund a Startup - Страница 205
- Web 2.0 - Страница 217
- How to Make Wealth - Страница 222
- Good and Bad Procrastination - Страница 233
- How to Do What You Love - Страница 236
- Are Software Patents Evil? - Страница 242
- The Hardest Lessons for Startups to Learn - Страница 248
- How to Be Silicon Valley - Страница 255
- Why Startups Condense in America - Страница 260
- The Power of the Marginal - Страница 267
- The Island Test - Страница 275
- Copy What You Like - Страница 276
- How to Present to Investors - Страница 278
- A Student's Guide to Startups - Страница 282
- The 18 Mistakes That Kill Startups - Страница 290
- Mind the Gap - Страница 297
- How Art Can Be Good - Страница 305
- Learning from Founders - Страница 310
- Is It Worth Being Wise? - Страница 311
- Why to Not Not Start a Startup - Страница 316
- Microsoft is Dead - Страница 324
- Two Kinds of Judgement - Страница 326
- The Hacker's Guide to Investors - Страница 327
- An Alternative Theory of Unions - Страница 336
- The Equity Equation - Страница 337
- Stuff - Страница 339
- Holding a Program in One's Head - Страница 341
- How Not to Die - Страница 344
- News from the Front - Страница 347
- How to Do Philosophy - Страница 350
- The Future of Web Startups - Страница 357
- Why to Move to a Startup Hub - Страница 362
- Six Principles for Making New Things - Страница 364
- Trolls - Страница 366
- A New Venture Animal - Страница 368
- You Weren't Meant to Have a Boss - Страница 371
Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered of a lot of words I hadn't thought of.
When I said at the start that our filters let through less than 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives. Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.
You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters.
Content-based spam filtering is often combined with a whitelist, a list of senders whose mail can be accepted with no filtering. One easy way to build such a whitelist is to keep a list of every address the user has ever sent mail to. If a mail reader has a delete-as-spam button then you could also add the from address of every email the user has deleted as ordinary trash.
I'm an advocate of whitelists, but more as a way to save computation than as a way to improve filtering. I used to think that whitelists would make filtering easier, because you'd only have to filter email from people you'd never heard from, and someone sending you mail for the first time is constrained by convention in what they can say to you. Someone you already know might send you an email talking about sex, but someone sending you mail for the first time would not be likely to. The problem is, people can have more than one email address, so a new from-address doesn't guarantee that the sender is writing to you for the first time. It is not unusual for an old friend (especially if he is a hacker) to suddenly send you an email with a new from-address, so you can't risk false positives by filtering mail from unknown addresses especially stringently.
In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, mailer versions, and protocols.
_ _ _If I thought that I could keep up current rates of spam filtering, I would consider this problem solved. But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs.
I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.
Still, anyone who proposes a plan for spam filtering has to be able to answer the question: if the spammers knew exactly what you were doing, how well could they get past you? For example, I think that if checksum-based spam filtering becomes a serious obstacle, the spammers will just switch to mad-lib techniques for generating message bodies.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
- 46
- 47
- 48
- 49
- 50
- 51
- 52
- 53
- 54
- 55
- 56
- 57
- 58
- 59
- 60
- 61
- 62
- 63
- 64
- 65
- 66
- 67
- 68
- 69
- 70
- 71
- 72
- 73
- 74
- 75
- 76
- 77
- 78
- 79
- 80
- 81
- 82
- 83
- 84
- 85
- 86
- 87
- 88
- 89
- 90
- 91
- 92
- 93
- 94
- 95
- 96
- 97
- 98
- 99
- 100
- 101
- 102
- 103
- 104
- 105
- 106
- 107
- 108
- 109
- 110
- 111
- 112
- 113
- 114
- 115
- 116
- 117
- 118
- 119
- 120
- 121
- 122
- 123
- 124
- 125
- 126
- 127
- 128
- 129
- 130
- 131
- 132
- 133
- 134
- 135
- 136
- 137
- 138
- 139
- 140
- 141
- 142
- 143
- 144
- 145
- 146
- 147
- 148
- 149
- 150
- 151
- 152
- 153
- 154
- 155
- 156
- 157
- 158
- 159
- 160
- 161
- 162
- 163
- 164
- 165
- 166
- 167
- 168
- 169
- 170
- 171
- 172
- 173
- 174
- 175
- 176
- 177
- 178
- 179
- 180
- 181
- 182
- 183
- 184
- 185
- 186
- 187
- 188
- 189
- 190
- 191
- 192
- 193
- 194
- 195
- 196
- 197
- 198
- 199
- 200
- 201
- 202
- 203
- 204
- 205
- 206
- 207
- 208
- 209
- 210
- 211
- 212
- 213
- 214
- 215
- 216
- 217
- 218
- 219
- 220
- 221
- 222
- 223
- 224
- 225
- 226
- 227
- 228
- 229
- 230
- 231
- 232
- 233
- 234
- 235
- 236
- 237
- 238
- 239
- 240
- 241
- 242
- 243
- 244
- 245
- 246
- 247
- 248
- 249
- 250
- 251
- 252
- 253
- 254
- 255
- 256
- 257
- 258
- 259
- 260
- 261
- 262
- 263
- 264
- 265
- 266
- 267
- 268
- 269
- 270
- 271
- 272
- 273
- 274
- 275
- 276
- 277
- 278
- 279
- 280
- 281
- 282
- 283
- 284
- 285
- 286
- 287
- 288
- 289
- 290
- 291
- 292
- 293
- 294
- 295
- 296
- 297
- 298
- 299
- 300
- 301
- 302
- 303
- 304
- 305
- 306
- 307
- 308
- 309
- 310
- 311
- 312
- 313
- 314
- 315
- 316
- 317
- 318
- 319
- 320
- 321
- 322
- 323
- 324
- 325
- 326
- 327
- 328
- 329
- 330
- 331
- 332
- 333
- 334
- 335
- 336
- 337
- 338
- 339
- 340
- 341
- 342
- 343
- 344
- 345
- 346
- 347
- 348
- 349
- 350
- 351
- 352
- 353
- 354
- 355
- 356
- 357
- 358
- 359
- 360
- 361
- 362
- 363
- 364
- 365
- 366
- 367
- 368
- 369
- 370
- 371
- 372
- 373
- 374