1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
|
1. Title: SPAM E-mail Database
2. Sources:
(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
(b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
(c) Generated: June-July 1999
3. Past Usage:
(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given email is spam or not.
(c) ~7% misclassification error.
False positives (marking good mail as spam) are very undesirable.
If we insist on zero false positives in the training/testing set,
20-25% of the spam passed through the filter.
4. Relevant Information:
The "spam" concept is diverse: advertisements for products/web
sites, make money fast schemes, chain letters, pornography...
Our collection of spam e-mails came from our postmaster and
individuals who had filed spam. Our collection of non-spam
e-mails came from filed work and personal e-mails, and hence
the word 'george' and the area code '650' are indicators of
non-spam. These are useful when constructing a personalized
spam filter. One would either have to blind such non-spam
indicators or get a very wide collection of non-spam to
generate a general purpose spam filter.
For background on spam:
Cranor, Lorrie F., LaMacchia, Brian A. Spam!
Communications of the ACM, 41(8):74-83, 1998.
5. Number of Instances: 4601 (1813 Spam = 39.4%)
6. Number of Attributes: 58 (57 continuous, 1 nominal class label)
7. Attribute Information:
The last column of 'spambase.data' denotes whether the e-mail was
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or
character was frequently occuring in the e-mail. The run-length
attributes (55-57) measure the length of sequences of consecutive
capital letters. For the statistical measures of each attribute,
see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD,
i.e. 100 * (number of times the WORD appears in the e-mail) /
total number of words in e-mail. A "word" in this case is any
string of alphanumeric characters bounded by non-alphanumeric
characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR,
i.e. 100 * (number of CHAR occurences) / total characters in e-mail
1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0),
i.e. unsolicited commercial e-mail.
8. Missing Attribute Values: None
9. Class Distribution:
Spam 1813 (39.4%)
Non-Spam 2788 (60.6%)
Attribute Statistics:
Min: Max: Average: Std.Dev: Coeff.Var_%:
1 0 4.54 0.10455 0.30536 292
2 0 14.28 0.21301 1.2906 606
3 0 5.1 0.28066 0.50414 180
4 0 42.81 0.065425 1.3952 2130
5 0 10 0.31222 0.67251 215
6 0 5.88 0.095901 0.27382 286
7 0 7.27 0.11421 0.39144 343
8 0 11.11 0.10529 0.40107 381
9 0 5.26 0.090067 0.27862 309
10 0 18.18 0.23941 0.64476 269
11 0 2.61 0.059824 0.20154 337
12 0 9.67 0.5417 0.8617 159
13 0 5.55 0.09393 0.30104 320
14 0 10 0.058626 0.33518 572
15 0 4.41 0.049205 0.25884 526
16 0 20 0.24885 0.82579 332
17 0 7.14 0.14259 0.44406 311
18 0 9.09 0.18474 0.53112 287
19 0 18.75 1.6621 1.7755 107
20 0 18.18 0.085577 0.50977 596
21 0 11.11 0.80976 1.2008 148
22 0 17.1 0.1212 1.0258 846
23 0 5.45 0.10165 0.35029 345
24 0 12.5 0.094269 0.44264 470
25 0 20.83 0.5495 1.6713 304
26 0 16.66 0.26538 0.88696 334
27 0 33.33 0.7673 3.3673 439
28 0 9.09 0.12484 0.53858 431
29 0 14.28 0.098915 0.59333 600
30 0 5.88 0.10285 0.45668 444
31 0 12.5 0.064753 0.40339 623
32 0 4.76 0.047048 0.32856 698
33 0 18.18 0.097229 0.55591 572
34 0 4.76 0.047835 0.32945 689
35 0 20 0.10541 0.53226 505
36 0 7.69 0.097477 0.40262 413
37 0 6.89 0.13695 0.42345 309
38 0 8.33 0.013201 0.22065 1670
39 0 11.11 0.078629 0.43467 553
40 0 4.76 0.064834 0.34992 540
41 0 7.14 0.043667 0.3612 827
42 0 14.28 0.13234 0.76682 579
43 0 3.57 0.046099 0.22381 486
44 0 20 0.079196 0.62198 785
45 0 21.42 0.30122 1.0117 336
46 0 22.05 0.17982 0.91112 507
47 0 2.17 0.0054445 0.076274 1400
48 0 10 0.031869 0.28573 897
49 0 4.385 0.038575 0.24347 631
50 0 9.752 0.13903 0.27036 194
51 0 4.081 0.016976 0.10939 644
52 0 32.478 0.26907 0.81567 303
53 0 6.003 0.075811 0.24588 324
54 0 19.829 0.044238 0.42934 971
55 1 1102.5 5.1915 31.729 611
56 1 9989 52.173 194.89 374
57 1 15841 283.29 606.35 214
58 0 1 0.39404 0.4887 124
This file: 'spambase.DOCUMENTATION' at the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
|