smugmug statistics again
rutt
Registered Users Posts: 6,511 Major grins
I was playing with webalizer (a linux application that digests apache logs) and noticed that it gives a count of sites. Here is what the webalizer README says:
Sites
Each request made to the server comes from a unique 'site', which can
be referenced by a name or ultimately, an IP address. The 'sites'
number shows how many unique IP addresses made requests to the server
during the reporting time period. This DOES NOT mean the number of
unique individual users (real people) that visited, which is impossible
to determine using just logs and the HTTP protocol (however, this
number might be about as close as you will get).
This is exactly what I want. I'd like to see sites reported for each gallery, subcategory, category, and overall. Seeing a history of this would also be great.
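A minimal sketch of the "sites" count described in the README excerpt, broken down per gallery. It assumes Apache Common Log Format lines (client address first, request path as the seventh whitespace-separated field) and a hypothetical `/gallery/<id>/...` URL layout; the function name, the sample addresses, and the URL layout are all made up for illustration and are not smugmug's actual scheme.

```python
from collections import defaultdict


def count_sites(log_lines):
    """Return (overall unique IPs, {gallery id: unique IPs}) for the given log lines."""
    overall = set()
    per_gallery = defaultdict(set)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed or truncated lines
        ip, path = fields[0], fields[6]
        overall.add(ip)
        segments = path.strip("/").split("/")
        # Hypothetical URL layout: /gallery/<gallery id>/<photo id>
        if len(segments) >= 2 and segments[0] == "gallery":
            per_gallery[segments[1]].add(ip)
    return len(overall), {gid: len(ips) for gid, ips in per_gallery.items()}


# Invented sample data, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2004:13:55:36 -0700] "GET /gallery/101/1 HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2004:13:56:01 -0700] "GET /gallery/101/2 HTTP/1.0" 200 1500',
    '10.0.0.2 - - [10/Oct/2004:13:57:12 -0700] "GET /gallery/101/1 HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2004:13:58:40 -0700] "GET /gallery/202/9 HTTP/1.0" 200 900',
]

overall_sites, gallery_sites = count_sites(SAMPLE_LOG)
print(overall_sites, gallery_sites)
```

Keeping a full set of addresses per gallery is the simple approach; its memory cost grows with the number of distinct visitors, which is the concern raised later in the thread.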
I know we've been over this before, but I don't understand the current status.
- Do people really think this is a bad idea?
- How hard is this for the smugmug team to do?
- Are there unsolved implementation / algorithmic problems in the way of getting it done?
- Is it just very low priority because I'm the only one who has ever asked for it?
If not now, when?
Comments
We do get other people who want more & better statistics, so you're definitely not alone. It's just that the number is much smaller than the requests for other things, some of which are pretty urgent.
Two of the more important ones may have been checked off last night; we'll see. One is a new move tool to arrange photos in a gallery. The existing ones were just too clunky.
The one we hear about the most is a new print-ordering interface with the ability to specify cropping, different cropping for each print size. There's a virtual chorus of customers banging our doors down for that and it turns out to be surprisingly difficult. If we could use a plugin or Flash, it wouldn't be so hard, but we got the message loud and clear that we can't require either.
And we hear about customization -- making galleries "skinnable", and doable by mere mortals. Big job. And there's a lot of other stuff. The other thing about logs is that we get so much traffic now that they're huge and take a lot of computing to process.
We're going to get to it; we're just getting some of the glaring big things off our plates first.
Thanks,
Baldy
Operating System Design, Drivers, Software
Villa Del Rio II, Talamban, Pit-os, Cebu, Ph
Speaking as someone who's been doing big sites for a long time now, you're right in that there is the potential for counter explosion but I think in practice you'll find that the count per user is usually very low - i.e. not many users will put up that much content. Only a small percentage (likely 1-5%) of users will be "big" users like that. Most will be small users, with only a handful of images - if that.
In any case it doesn't matter since somewhere in the database you already have one row per image. If you are running into a row count problem then you're going to hit it with or without the counters, and the counters add insignificant overhead: 4 bytes per counter per record, so probably on the order of 30-40 bytes for all the counters you really care about, which is inconsequential next to the size of the image data.
jimf@frostbytes.com
You can implement this without actually saving all the IP addresses. One old spelling-dictionary trick (from when computer memory was much more expensive and limited than now) is to represent the dictionary as a fairly large bit vector. Properly spelled words are entered into the dictionary by hashing them with multiple different hash functions and setting the resulting bits (perhaps 10 different hash functions are used). A word is in the dictionary if the corresponding bit for all 10 of the hash functions is set. A surprisingly small bit vector can represent a surprisingly large number of words with a very low likelihood of error (a false positive in this case). I think Webster's required something like 10 kB to get 99.99% accuracy, but my memory isn't really that good.
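The bit-vector trick described above is generally known as a Bloom filter. Here is a minimal sketch of it, not a tuned implementation: the class name, the default vector size of 8192 bits, and the use of salted SHA-256 to stand in for "10 different hash functions" are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit vector; each item sets num_hashes bits."""

    def __init__(self, num_bits=8192, num_hashes=10):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with an index.
        # (An assumption: any family of independent hash functions would do.)
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set. False positives are
        # possible; false negatives are not.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


words = BloomFilter()
for word in ["gallery", "category", "subcategory"]:
    words.add(word)

print("gallery" in words)  # always True: added items are never missed
```

For visitor counting, the items would be IP addresses rather than words, with one such vector per gallery.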
Baldy was quoted in USA Today as saying he had a little more than 6,000 smugmug subscribers. The competition has 100,000 and up. Also, a lot of users that I've seen on smugmug have pretty hefty galleries.
Catapultam habeo. Nisi pecuniam omnem mihi dabis, ad caput tuum saxum immane mittam
http://www.mcneel.com/users/jb/foghorn/ill_shut_up.au
But if it is the meaning, I guess we should understand the space requirements of the more standard database implementations. If each user/category/subcategory has a database row, we'd need a column for each unique IP address that visits smugmug. I don't have a good feeling for how many this would be, but probably Baldy could tell us.
My guess is that this would be quite a sparse table. Most visitors to smugmug won't visit most of the galleries. So I've been trying to figure out something that takes advantage of this. But I'll admit that I'm sort of an old dog and more in tune with data structure tricks than with database tricks.
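One way to take advantage of that sparseness, sketched under assumptions: store only the (gallery, IP) pairs that actually occur instead of a column per IP. In database terms this would be one row per observed pair. The class and method names below are hypothetical.

```python
from collections import defaultdict


class SparseVisitLog:
    """Keeps only the gallery/IP combinations that were actually seen."""

    def __init__(self):
        self._seen = defaultdict(set)  # gallery id -> set of visitor IPs

    def record(self, gallery_id, ip):
        self._seen[gallery_id].add(ip)

    def unique_visitors(self, gallery_id):
        return len(self._seen.get(gallery_id, ()))

    def stored_pairs(self):
        # Storage grows with observed pairs, not galleries x all known IPs.
        return sum(len(ips) for ips in self._seen.values())


log = SparseVisitLog()
log.record("gallery-1", "10.0.0.1")
log.record("gallery-1", "10.0.0.1")  # repeat visit, not stored twice
log.record("gallery-1", "10.0.0.2")
log.record("gallery-2", "10.0.0.1")

print(log.unique_visitors("gallery-1"), log.stored_pairs())
```

This is exact, unlike the bit-vector scheme, but its size per gallery is not constant.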
I don't think the dictionary scheme is necessarily the best, but it has some advantages:
- Constant size cost per gallery (user/subcategory/category/photo, whatever)
- Degrades gracefully. More false positives as the number of IPs grows. That just means it will show fewer unique visitors than there actually were, since a new IP is occasionally mistaken for one already seen. There is already some noise in the data due to DHCP, firewalls, proxies, caches, etc.
- Remember that 10 kB is large enough to hold Webster's Dictionary with very high accuracy. I'm guessing that we'd need a much smaller bit vector, as none of our sites would get nearly as many unique visitors as Webster's has words.
Anyway, it doesn't matter much. It was really just an existence proof of a clever representation that would hold this information efficiently.
I believe PBase has around 100,000 subscribers and Webshots may be over a million.
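To make the "degrades gracefully" point concrete, here is a small sketch using the standard approximation for the false-positive rate of a bit vector of m bits with k hash functions after n distinct items: p ≈ (1 − e^(−kn/m))^k. The vector size and visitor counts below are arbitrary illustrative choices, not measurements of any real gallery.

```python
import math


def false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate chance that an unseen item looks like it was already seen."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Illustrative assumption: a 1 kB vector (8192 bits) per gallery, 10 hashes.
NUM_BITS = 8192
NUM_HASHES = 10

for visitors in (100, 500, 1000, 5000):
    rate = false_positive_rate(NUM_BITS, NUM_HASHES, visitors)
    print(f"{visitors:>5} unique IPs -> false-positive rate ~ {rate:.6f}")
```

The rate rises smoothly as more IPs are recorded, so the undercount creeps in gradually rather than appearing all at once. With a fixed vector size, fewer hash functions can give a lower error rate once the number of items gets large, so 10 is not necessarily the best choice for every gallery.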
Our growth was very slow in the beginning because it takes time to become known and trusted. We're averaging 50 new customers/day and 200 GB of storage a week now. We don't know what the future holds but 500/day and 2 terabytes/week isn't hard to imagine anymore. :yikes
Nah, it was more like: Good lord, what are they talking about! Here are a couple more smiley speaks that might capture the feeling.
Catapultam habeo. Nisi pecuniam omnem mihi dabis, ad caput tuum saxum immane mittam
http://www.mcneel.com/users/jb/foghorn/ill_shut_up.au