A few months ago Steve and I noted that our respective sites were getting tons of hits from Samara Oblast, an obscure(?) territory in Russia. Russian search engine maybe? Cybercriminals? Proxy for the American or Chinese or Syrian electronic armies? Who really cares? Only port 80 should be open, and it's doing nothing fancy.
But since this kilroy thing has gotten pretty lengthy, I was scoping the possibility of doing some sort of 'top content' thing based on hits. So I pulled my server logs and looked through them to see how hard they'd be to parse.
Attack surface
Well this is fun:
91.200.13.119 "GET /kilroy/archive/2008/04/index.html HTTP/1.0"...
91.200.13.119 "GET /kilroy/2008/01/leader-board-r.html HTTP/1.0"...
91.200.13.119 "GET /kilroy/2008/01/index.php HTTP/1.0"...
91.200.13.119 "GET /2008/01/index.php HTTP/1.0"...
91.200.13.119 "GET /kilroy/2008/01/index.php HTTP/1.0"...
91.200.13.119 "GET /2008/01/index.php HTTP/1.0"...
91.200.13.119 "GET /kilroy/2008/01/index.php HTTP/1.0"...
91.200.13.119 "GET /2008/01/index.php HTTP/1.0"...
How am I going to count hits for 2008/01/index.php when there is no anything.php?
Eight sequential hits from the same person, within 10 seconds. That's what I call quick on the mouse. Whois says it's from Ukraine. I'm going to stop me right here: this is my first time actually looking at HTTP traffic, and this is old hat to 80% of the world. Okay, let's continue.
- First they hit an index page that actually works.
- Then another page which may or may not be linked from the first page.
- Now six sequential hits looking for index.php in various legitimate and illegitimate directories.
Maybe they're just guessing about the site map, but probably they're looking to have some fun with PHP.
Another interesting one:
POST /cgi-bin/php?
%2D%64+%61%6C%6C%6F%77%5F%75%72%6C%5F%69%6E%63%6C%75%64%65
%3D%6F%6E+%2D%64+%73%61%66%65%5F%6D%6F%64%65%3D%6F%66%66+%2D%64+%73%75%68%6F
%73%69%6E%2E%73%69%6D%75%6C%61%74%69%6F%6E%3D%6F%6E+%2D%64+%64%69%73%61%62
%6C%65%5F%66%75%6E%63%74%69%6F%6E%73%3D%22%22+%2D%64+%6F%70%65%6E%5F%62%61
%73%65%64%69%72%3D%6E%6F%6E%65+%2D%64+%61%75%74%6F%5F%70%72%65%70%65%6E%64
%5F%66%69%6C%65%3D%70%68%70%3A%2F%2F%69%6E%70%75%74+%2D%64+%63%67%69%2E%66
%6F%72%63%65%5F%72%65%64%69%72%65%63%74%3D%30+%2D%64+%63%67%69%2E%72%65%64
%69%72%65%63%74%5F%73%74%61%74%75%73%5F%65%6E%76%3D%30+%2D%6E HTTP/1.1
Looking to do injection or overflow or something? Not really my wheelhouse, but it was kind of a fun digression.
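For the curious, the percent-encoded query string above can be decoded with a few lines of Python. This is a minimal sketch: the `ENCODED` constant is just the logged query string rejoined into one piece, and the variable names are mine.

```python
from urllib.parse import unquote_plus

# Query string from the logged POST /cgi-bin/php? request, rejoined
# from the wrapped log lines into a single string.
ENCODED = (
    "%2D%64+%61%6C%6C%6F%77%5F%75%72%6C%5F%69%6E%63%6C%75%64%65"
    "%3D%6F%6E+%2D%64+%73%61%66%65%5F%6D%6F%64%65%3D%6F%66%66+%2D%64+%73%75%68%6F"
    "%73%69%6E%2E%73%69%6D%75%6C%61%74%69%6F%6E%3D%6F%6E+%2D%64+%64%69%73%61%62"
    "%6C%65%5F%66%75%6E%63%74%69%6F%6E%73%3D%22%22+%2D%64+%6F%70%65%6E%5F%62%61"
    "%73%65%64%69%72%3D%6E%6F%6E%65+%2D%64+%61%75%74%6F%5F%70%72%65%70%65%6E%64"
    "%5F%66%69%6C%65%3D%70%68%70%3A%2F%2F%69%6E%70%75%74+%2D%64+%63%67%69%2E%66"
    "%6F%72%63%65%5F%72%65%64%69%72%65%63%74%3D%30+%2D%64+%63%67%69%2E%72%65%64"
    "%69%72%65%63%74%5F%73%74%61%74%75%73%5F%65%6E%76%3D%30+%2D%6E"
)

# unquote_plus turns %XX escapes into characters and '+' into spaces.
decoded = unquote_plus(ENCODED)
print(decoded)
```

The result is a string of `-d name=value` switches: `allow_url_include=on`, `safe_mode=off`, `auto_prepend_file=php://input` and friends, ending in `-n`. That looks like an attempt to pass command-line options to a php-cgi binary through the query string, which matches the pattern of the PHP-CGI argument injection bug (CVE-2012-1823) rather than an overflow.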
Classifier
So I wrote some code to classify site traffic into one of the following categories:
- Attacks: malicious traffic like the stuff depicted above.
- Bots: googlebot, baidubot, hedonismbot, etc.
- Visits: everything else, I assume, is either a person with a browser or a bot trying to emulate one. They are probably equally entertained.
Some of it was pretty easy: bots tend to declare themselves in the user agent string and hit robots.txt first. Malicious stuff sends PUTs and looks for files that aren't .html/.jpg/etc. And, of course, sequential traffic from the same IP can be classified together. This is important because an attack might hit numerous legit links, but it's still not visit traffic.
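Those rules can be sketched roughly like this. It is not the actual code behind the numbers here, just a minimal illustration: the `Hit` record, the `BOT_MARKERS` and `STATIC_EXTENSIONS` lists, and the choice to treat anything other than GET/HEAD as hostile are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import groupby
from posixpath import splitext


@dataclass
class Hit:
    """One parsed log line (hypothetical record layout)."""
    ip: str
    method: str
    path: str
    user_agent: str


# Assumed heuristics; a real list would need tuning against real logs.
BOT_MARKERS = ("bot", "spider", "crawl")
STATIC_EXTENSIONS = {"", ".html", ".jpg", ".png", ".gif", ".css", ".js", ".xml"}
SAFE_METHODS = {"GET", "HEAD"}


def classify_hit(hit):
    """Label a single request as 'bot', 'attack', or 'visit'."""
    ua = hit.user_agent.lower()
    if hit.path == "/robots.txt" or any(marker in ua for marker in BOT_MARKERS):
        return "bot"
    # Drop any query string, then look at the file extension.
    ext = splitext(hit.path.split("?", 1)[0])[1].lower()
    if hit.method not in SAFE_METHODS or ext not in STATIC_EXTENSIONS:
        return "attack"
    return "visit"


def classify_run(hits):
    """Label a run of hits as a unit: one bad request taints the whole run."""
    labels = {classify_hit(h) for h in hits}
    if "attack" in labels:
        return "attack"
    if "bot" in labels:
        return "bot"
    return "visit"


def classify_log(hits):
    """Return (label, hit) pairs, grouping consecutive hits from the same IP."""
    labelled = []
    for _, run in groupby(hits, key=lambda h: h.ip):
        run = list(run)
        label = classify_run(run)
        labelled.extend((label, h) for h in run)
    return labelled
```

Grouping by consecutive IP is the part that matters: in the eight-hit sequence from earlier, the first two requests are for pages that exist, and only the run as a whole gives the game away. A fuller version would probably also split runs on a time gap, which this sketch ignores.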
Data
Logs go back about a year. Here's some excel because easy.
I get indexed about twice as much as I get visited. There have been more than 20,000 malicious HTTP requests.
Google, Baidu, and Majestic 12 (a distributed indexing project) turned up most. But there are quite a few bots out there.
So the top visited content, the main reason for this whole endeavor:
Pages
Images
Labels - which are now just links to search
Data skew: some content has been around longer. On the other hand, the logs are only from about a year back.
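Once every request carries a label, the 'top content' tally itself is short. A minimal sketch, assuming the input is a list of `(label, path)` pairs and that images can be told apart by extension; the `top_content` name and the `IMAGE_EXTENSIONS` set are made up for the example.

```python
from collections import Counter
from posixpath import splitext

# Assumed set of image extensions; adjust to whatever the site serves.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def top_content(labelled_paths, n=10):
    """Count visit-only hits per path, split into pages and images."""
    pages, images = Counter(), Counter()
    for label, path in labelled_paths:
        if label != "visit":
            continue  # bots and attacks don't count as readership
        path = path.split("?", 1)[0]  # ignore query strings
        ext = splitext(path)[1].lower()
        if ext in IMAGE_EXTENSIONS:
            images[path] += 1
        else:
            pages[path] += 1
    return pages.most_common(n), images.most_common(n)
```

This counts raw hits over the whole log window, so it inherits the skew described above: it makes no attempt to correct for how long a piece of content has existed.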
When I get some more fun-coding time I'll see about putting this in the sidebar.