Random thoughts from Jeffrey
# Sunday, March 07, 2010

Here's the workflow I used for analyzing the logs from this website:

  1. Wait until end of day.
  2. Copy the day's log file to a temp directory.
  3. Run the log loading utility. This also applies the geolocation lookups, so sometimes the GeoIP databases need to be refreshed from www.maxmind.com.
  4. After a bit (usually 3-20 minutes, depending heavily on the level of traffic), the log entries are all in a SQL Server database.
  5. The database has a view that filters out bots, crawlers, spammers, and internal traffic.
  6. I view the external user records by querying the view.

That view has a horribly complicated SELECT statement, which I found out this week had some bugs, so not all results were being returned correctly. And by "horribly complicated" I mean that it has thousands of conditions being evaluated.

So after wasting a bunch of time trying to chase down where the problems were, I decided to scrap that approach and come up with a better one.

What came to mind was developing some sort of "how-likely-is-it-that-this-record-should-be-hidden" score. The more pieces of "evidence" that a particular request came from a bot/crawler/spammer/etc., the higher the score.

So now I've got a basic implementation going. It's written in C# 4.0 (hey, have to play with the new stuff sometime!) and operates as a separate external utility that persists the score as another field on each log entry's record. I took that massive SELECT and refactored it down into 45 separate rule sets (classes)...much more manageable! At the moment the scores from each rule are kind of arbitrary, and will probably need to be redone/tweaked in the future. Right now I'm basically taking everything that didn't match any rule (score = 0) and treating that as legitimate external traffic...which seems to be working fairly well, but isn't really as fine-grained as I originally envisioned.
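To make the scoring idea concrete, here is a minimal sketch in Python (not the C# the utility is actually written in). Everything in it is an assumption made for illustration: the `LogEntry` fields, the two rule classes, and the score values are hypothetical and are not taken from the real rule sets.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    # Hypothetical fields; the real log schema is not shown in this post.
    ip: str
    user_agent: str
    path: str


class Rule:
    """One rule set. score() returns 0 when the entry does not match."""

    def score(self, entry: LogEntry) -> int:
        raise NotImplementedError


class BotUserAgentRule(Rule):
    # Illustrative rule: the user agent mentions a crawler-like keyword.
    def score(self, entry: LogEntry) -> int:
        agent = entry.user_agent.lower()
        keywords = ("bot", "crawler", "spider")
        return 50 if any(word in agent for word in keywords) else 0


class InternalTrafficRule(Rule):
    # Illustrative rule: the request came from a private address range.
    def score(self, entry: LogEntry) -> int:
        return 100 if entry.ip.startswith("192.168.") else 0


def hide_score(entry: LogEntry, rules) -> int:
    """Add up the evidence from every rule; more evidence means a higher score."""
    return sum(rule.score(entry) for rule in rules)


def is_external_user(entry: LogEntry, rules) -> bool:
    """Treat an entry that matched no rule (score == 0) as legitimate traffic."""
    return hide_score(entry, rules) == 0
```

With a setup like this, a finer-grained cutoff would just mean comparing `hide_score` against a threshold instead of checking for exactly zero.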

Also, at some point (soon) I need to add more complex conditions. A couple of bots operate in such a way that if you look at any one individual request to the web server, that request is legitimate. But as soon as you see, say, 4 requests, repetitive patterns start to emerge and it becomes obvious that some sort of crawling is going on. So having an automated way to catch these would be nice...but also more complicated...probably just haven't thought about it enough yet...
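One possible way to catch that kind of multi-request pattern is sketched below in Python. It is only one heuristic among many and not something the utility does today: it groups requests by IP address and flags any client with at least four requests spaced at near-constant intervals. The function name, the `(ip, timestamp)` input shape, and the default thresholds are all assumptions.

```python
from collections import defaultdict


def find_regular_clients(requests, min_requests=4, tolerance=1.0):
    """Return the set of IPs whose requests arrive at near-constant intervals.

    requests: iterable of (ip, timestamp_in_seconds) pairs (hypothetical shape).
    tolerance: largest allowed difference, in seconds, between the longest
    and shortest gap for a client to count as "regular".
    """
    times_by_ip = defaultdict(list)
    for ip, timestamp in requests:
        times_by_ip[ip].append(timestamp)

    suspicious = set()
    for ip, times in times_by_ip.items():
        if len(times) < min_requests:
            continue  # not enough requests to see a pattern
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if max(gaps) - min(gaps) <= tolerance:
            suspicious.add(ip)
    return suspicious
```

A result like this could feed back into the per-record score, so that every request from a flagged IP picks up extra points even though each one looks legitimate on its own.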

Coolest parts of doing the new implementation: LINQ to SQL, and using LINQ + reflection to automatically discover all the rule sets. Just a couple lines of code to do such complex things! And it's so much simpler with that syntax!
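The C# code isn't shown here, but the automatic-discovery idea can be sketched with a rough Python analogue: instead of LINQ over reflected types, it walks the subclasses of a base class and instantiates each one. The `Rule` base class, the two example rules, and the dictionary-shaped log entry are all hypothetical.

```python
class Rule:
    """Base class that every rule set inherits from."""

    def score(self, entry):
        raise NotImplementedError


class EmptyUserAgentRule(Rule):
    def score(self, entry):
        return 25 if not entry.get("user_agent") else 0


class KnownSpammerRule(Rule):
    # Placeholder address from a range reserved for documentation.
    SPAMMER_IPS = {"198.51.100.7"}

    def score(self, entry):
        return 100 if entry.get("ip") in self.SPAMMER_IPS else 0


def discover_rules(base=Rule):
    """Instantiate every class that inherits from base, directly or indirectly."""
    found = []
    for subclass in base.__subclasses__():
        found.append(subclass())
        found.extend(discover_rules(subclass))
    return found
```

The appeal is the same as in the C# version: adding a new rule set only means writing a new class, with no central list to keep in sync.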

Now playing: In-Flight Safety – We Are An Empire, My Dear – 05 Torches

Sunday, March 07, 2010 04:19:02 UTC
IT
About the author
Jeffrey Stults
Jeffrey Stults is a software developer currently in Portland, Oregon. He is contactable at:
stultsj@ntldr.net
Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2012
Jeffrey Stults, Jr.