smugmug statistics again

rutt Registered Users Posts: 6,511 Major grins
edited March 6, 2004 in SmugMug Support
I was playing with webalizer (a linux application that digests apache logs) and noticed that it gives a count of sites. Here is what the webalizer README says:
Sites

Each request made to the server comes from a unique 'site', which can
be referenced by a name or ultimately, an IP address. The 'sites'
number shows how many unique IP addresses made requests to the server
during the reporting time period. This DOES NOT mean the number of
unique individual users (real people) that visited, which is impossible
to determine using just logs and the HTTP protocol (however, this
number might be about as close as you will get).
This is exactly what I want. I'd like to see sites reported for each gallery, subcategory, category, and overall. Seeing a history of this would also be great.

I know we've been over this before, but I don't understand the current status.
  1. Do people really think this is a bad idea?
  2. How hard is this for the smugmug team to do?
  3. Are there unsolved implementation / algorithmic problems in the way of getting it done?
  4. Is it just very low priority because I'm the only one who has ever asked for it?
If not now, when?

Comments

  • Baldy Registered Users, Super Moderators Posts: 2,853 moderator
    edited March 4, 2004
    rutt wrote:
    I know we've been over this before, but I don't understand the current status.
    1. Do people really think this is a bad idea?
    2. How hard is this for the smugmug team to do?
    3. Are there unsolved implementation / algorithmic problems in the way of getting it done?
    4. Is it just very low priority because I'm the only one who has ever asked for it?
    Good pointer, rutt. Anything we can do to leverage open source code to ease the workload is great.


    We do get other people who want more & better statistics, so you're definitely not alone; it's just that the number is much smaller than the number of requests for other things, some of which are pretty urgent.


    Two of the more important ones may have been checked off last night, we'll see. One is a new move tool to arrange photos in a gallery. The existing ones were just too klunky.


    The one we hear about the most is a new print-ordering interface with the ability to specify cropping, different cropping for each print size. There's a virtual chorus of customers banging our doors down for that and it turns out to be surprisingly difficult. If we could use a plugin or Flash, it wouldn't be so hard, but we got the message loud and clear that we can't require either.


    And we hear about customization -- making galleries "skinable" and for mere mortals. Big job. And there's a lot of other stuff. The other thing about logs is we get so much traffic now they're huge and require a lot of computing to process them.


    We're going to get to it, it's just getting some of the glaring big things off our plates first.


    Thanks,
    Baldy
  • rutt Registered Users Posts: 6,511 Major grins
    edited March 4, 2004
    Baldy wrote:
    Good pointer, rutt. Anything we can do to leverage open source code to ease the workload is great.

    The other thing about logs is we get so much traffic now they're huge and require a lot of computing to process them.
    OK, this sounds like an implementation/algorithm issue, instead of just haggling over price. I would keep these counts incrementally somehow. I don't know enough about your system to design this, but I'd try to spend a small amount of CPU power keeping the statistics up to date all the time instead of crunching giant things every so often.
    If not now, when?
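A minimal sketch of the incremental idea in the post above, assuming a hypothetical `record_hit` hook that the web application calls once per request. The class and method names are illustrative only and are not part of smugmug or webalizer. This version stores every IP exactly, so it trades memory for simplicity.

```python
from collections import defaultdict


class IncrementalStats:
    """Keeps per-gallery hit and unique-IP counts current as requests arrive."""

    def __init__(self):
        self.hits = defaultdict(int)      # gallery id -> total requests
        self.seen_ips = defaultdict(set)  # gallery id -> IPs seen so far

    def record_hit(self, gallery_id, ip):
        # A small, constant amount of work per request; no batch log crunching.
        self.hits[gallery_id] += 1
        self.seen_ips[gallery_id].add(ip)

    def sites(self, gallery_id):
        # "Sites" in the webalizer sense: the number of unique IP addresses.
        return len(self.seen_ips.get(gallery_id, ()))


stats = IncrementalStats()
requests = [("g1", "10.0.0.1"), ("g1", "10.0.0.2"), ("g1", "10.0.0.1"), ("g2", "10.0.0.1")]
for gallery, ip in requests:
    stats.record_hit(gallery, ip)

print(stats.hits["g1"], stats.sites("g1"))
```

Category and overall totals could be kept the same way by recording each hit under several keys (gallery, category, whole site).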
  • cmr164 Registered Users Posts: 1,542 Major grins
    edited March 4, 2004
    rutt wrote:
    OK, this sounds like an implementation/algorithm issue, instead of just haggling over price. I would keep these counts incrementally somehow. I don't know enough about your system to design this, but I'd try to spend a small amount of CPU power keeping the statistics up to date all the time instead of crunching giant things every so often.
    I use the Webalizer tool sometimes, and the problem is that the database required to divvy up the stats in the way that folks are asking Baldy to do implies a pretty complex and big DB. Let's say that he has 1000 subscribers (could be 10k or even more) and each of those customers has 400 online images (I have 419 just for dgrin). Add in the different size images and navigation pages and Baldy has more than 1 million counters that he needs to keep track of in order to just give folks a counter for each image.
    Charles Richmond IT & Security Consultant
    Operating System Design, Drivers, Software
    Villa Del Rio II, Talamban, Pit-os, Cebu, Ph
  • jimf Registered Users Posts: 338 Major grins
    edited March 4, 2004
    cmr164 wrote:
    I use the Webalizer tool sometimes, and the problem is that the database required to divvy up the stats in the way that folks are asking Baldy to do implies a pretty complex and big DB. Let's say that he has 1000 subscribers (could be 10k or even more) and each of those customers has 400 online images (I have 419 just for dgrin). Add in the different size images and navigation pages and Baldy has more than 1 million counters that he needs to keep track of in order to just give folks a counter for each image.

    Speaking as someone who's been doing big sites for a long time now, you're right in that there is the potential for counter explosion but I think in practice you'll find that the count per user is usually very low - i.e. not many users will put up that much content. Only a small percentage (likely 1-5%) of users will be "big" users like that. Most will be small users, with only a handful of images - if that.

    In any case it doesn't matter since somewhere in the database you already have one row per image. If you are running into a row count problem then you're going to hit it with or without the counters, and the counters add insignificant overhead: 4 bytes per counter per record, so probably on the order of 30-40 bytes for all the counters you really care about, which is inconsequential next to the size of the image data.
    jim frost
    jimf@frostbytes.com
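A back-of-the-envelope check of the two estimates above, using the figures quoted in the posts (1000 subscribers, 400 images each, 4 bytes per counter, 30-40 bytes of counters per image). The number of size variants per image is not given in the thread, so `image_variants = 3` is purely an assumption for illustration.

```python
# Figures taken from the posts above, except where marked as assumptions.
subscribers = 1000
images_per_subscriber = 400
image_variants = 3             # assumption: e.g. thumbnail, medium, original
counter_bytes_per_image = 40   # upper end of the 30-40 byte estimate

images = subscribers * images_per_subscriber
counters_if_per_variant = images * image_variants
overhead_mb = images * counter_bytes_per_image / 1_000_000

print(f"{images:,} images")
print(f"{counters_if_per_variant:,} counters if each size variant is counted")
print(f"{overhead_mb:.1f} MB of counter columns at 40 bytes per image row")
```

Under these assumptions both posts hold at once: the counter count passes one million, yet the storage for them stays in the tens of megabytes.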
  • rutt Registered Users Posts: 6,511 Major grins
    edited March 4, 2004
    It would be nice to have statistics per image, but I was only imagining per gallery (as now.) Unfortunately, the information we want requires both a counter and some representation of the IP addresses that have visited the gallery (subcategory, category, photographer site).

    You can implement this without actually saving all the IP addresses. One old spelling dictionary trick (from when computer memory was much more expensive and limited than now) is to represent the dictionary as a fairly large bit vector. Properly spelled words are entered into the dictionary by hashing them with multiple different hash functions and setting the resulting bits (perhaps 10 different hash functions are used). A word is in the dictionary if the corresponding bit for all 10 of the hash functions is set. A surprisingly small bit vector can represent a surprisingly large number of words with a very low likelihood of error (a false positive in this case). I think Webster's required something like 10kB to get 99.99% accuracy, but my memory isn't really that good.
    If not now, when?
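The spelling-dictionary trick described above is what is usually called a Bloom filter. Here is a minimal sketch of it applied to unique-visitor counting. The class names, the 8192-bit vector size, and the use of salted SHA-256 to stand in for "10 different hash functions" are all assumptions made for illustration, not anything smugmug is known to use.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit vector; membership tests may give false positives only."""

    def __init__(self, num_bits=8192, num_hashes=10):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Simulate several hash functions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Present only if every hashed bit is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class GalleryVisitors:
    """Counts unique IPs for one gallery without storing the IPs themselves."""

    def __init__(self):
        self.seen = BloomFilter()
        self.unique_count = 0

    def record(self, ip):
        # A false positive here means a new visitor goes uncounted.
        if ip not in self.seen:
            self.seen.add(ip)
            self.unique_count += 1


visitors = GalleryVisitors()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]:
    visitors.record(ip)
print(visitors.unique_count)
```

Each gallery costs a fixed 1 KB here regardless of traffic, and the count can only err on the low side.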
  • gus Registered Users Posts: 16,209 Major grins
    edited March 5, 2004
    eek7.gif
  • wxwax Registered Users Posts: 15,471 Major grins
    edited March 5, 2004
    Humungus wrote:
    eek7.gif


    nod.gif

    rolleyes1.gif

    Baldy was quoted in USA Today as saying he had a little more than 6,000 smugmug subscribers. The competition has 100,000 and up. Also, a lot of users that I've seen on smugmug have pretty hefty galleries.
    Sid.
    Catapultam habeo. Nisi pecuniam omnem mihi dabis, ad caput tuum saxum immane mittam
    http://www.mcneel.com/users/jb/foghorn/ill_shut_up.au
  • rutt Registered Users Posts: 6,511 Major grins
    edited March 5, 2004
    wxwax wrote:
    nod.gif

    rolleyes1.gif

    Baldy was quoted in USA Today as saying he had a little more than 6,000 smugmug subscribers. The competition has 100,000 and up. Also, a lot of users that I've seen on smugmug have pretty hefty galleries.
    I don't really speak smiley, so I can only guess that the last two messages mean that people are worried about the space usage of the spelling dictionary implementation. If this isn't the meaning, nevermind.

    But if it is the meaning, I guess we should understand the space requirements of the more standard database implementations. If each user/category/subcategory has a database row, we'd need a column for each unique IP address that visits smugmug. I don't have a good feeling for how many this would be, but probably Baldy could tell us.

    My guess is that this would be quite a sparse table. Most visitors to smugmug won't visit most of the galleries. So I've been trying to figure out something that takes advantage of this. But I'll admit that I'm sort of an old dog and more in tune with data structure tricks than with database tricks.

    I don't think the dictionary scheme is necessarily the best, but it has some advantages:
    1. Constant size cost per gallery (user/subcategory/category/photo, whatever)
    2. Degrades gracefully. More false positives as the number of IPs grows. That just means it will show fewer unique visitors than there actually were. There is already some noise in the data due to DHCP, firewalls, proxies, caches, etc.
    3. Remember that 10KB is large enough to hold Webster's Dictionary with very high accuracy. I'm guessing that we'd need a much smaller bit vector, as no one of our sites would get nearly as many unique visitors as Webster's has words.
    Anyway, it doesn't matter much. It was really just an existence proof of a clever representation that would hold this information efficiently.
    If not now, when?
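A rough sizing check for the bit-vector scheme, using the standard Bloom filter approximations rather than anything measured on smugmug. The function names are made up for this sketch. By these formulas a 10 kB vector holds roughly 4,000 items at a 0.01% false-positive rate, which looks comfortable for per-gallery visitor counts, although a full dictionary at that accuracy would need a larger vector than the figure recalled above.

```python
import math


def false_positive_rate(num_bits, num_hashes, num_items):
    # Standard approximation: p ~ (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def bits_per_item(target_rate):
    # Bits needed per stored item when the number of hashes is chosen optimally.
    return -math.log(target_rate) / (math.log(2) ** 2)


vector_bits = 10 * 1024 * 8  # a 10 kB bit vector
capacity = int(vector_bits / bits_per_item(0.0001))

print(f"bits per item at 0.01% error: {bits_per_item(0.0001):.1f}")
print(f"items a 10 kB vector can hold at that rate: {capacity}")
print(f"error rate with 10 hashes at that load: "
      f"{false_positive_rate(vector_bits, 10, capacity):.6f}")
```

The same two functions show the graceful degradation from point 2: raising `num_items` past the capacity raises the false-positive rate smoothly, with no hard failure.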
  • Baldy Registered Users, Super Moderators Posts: 2,853 moderator
    edited March 5, 2004
    wxwax wrote:
    nod.gif

    rolleyes1.gif

    Baldy was quoted in USA Today as saying he had a little more than 6,000 smugmug subscribers. The competition has 100,000 and up. Also, a lot of users that I've seen on smugmug have pretty hefty galleries.
    Yeah, we're actually around 7,000 with 2.7 million photos = 385 photos/user. It varies, because rutt, for example, has 418 galleries with 16,849 photos (you can see how many photos users have by using the search form).

    I believe PBase has around 100,000 subscribers and Webshots may be over a million.

    Our growth was very slow in the beginning because it takes time to become known and trusted. We're averaging 50 new customers/day and 200 GB of storage a week now. We don't know what the future holds but 500/day and 2 terabytes/week isn't hard to imagine anymore. :yikes
  • wxwax Registered Users Posts: 15,471 Major grins
    edited March 6, 2004
    rutt wrote:
    I don't really speak smiley, so I can only guess that the last two messages mean that people are worried about the space usage of the spelling dictionary implementation. If this isn't the meaning, nevermind.

    Nah, it was more like: Good lord, what are they talking about! Here are a couple more smiley speaks that might capture the feeling.
    headscratch.gif ne_nau.gif
    Sid.
    Catapultam habeo. Nisi pecuniam omnem mihi dabis, ad caput tuum saxum immane mittam
    http://www.mcneel.com/users/jb/foghorn/ill_shut_up.au