User-generated content can be used on your web sites for almost anything: surveys, comments, ratings, and so on. The CQ5 OOTB collab components (forums, forms, ratings, comments) generate content which is stored under /content/usergenerated; then (if reverse replication is configured) that data is pulled back by the Author via polling. The content created by these components then has to be moderated and published back out before it is available to end users. I get that if you are running a web site where there are strict controls over every change on your site, even those made by your end users.
But if you look at modern sites which handle large groups of users interacting with the site via likes and comments, do you really need to moderate them all? And if you don’t need to moderate them all, what mechanism would you use to get them replicated across all your CQ5 publish instances? I’ve spoken to a few people (including those at Adobe) and it seems the lazy answer is some combination of reverse replication and workflows to auto-publish them back out. In my humble opinion, that sounds rather clunky. It is also a disaster waiting to happen: with large sets of user-generated data flowing from the publishers to the author and back out, it can block normal publish jobs in the replication queues and cause unnecessary dispatcher flushing. It doesn’t gel with the philosophy of keeping everything simple and lightweight that Sling and CQ5 developers espouse.
So what’s the alternative, you ask? I’m afraid I have no easy answers there either, because no matter how you slice this thing you need to solve four things:
1. Capture your user-generated content in a fast and easy way that ties that data to the resource in CQ5 (e.g. comments on a specific page, or comments on comments?)
2. Optionally process that data to get rid of spam or protect against abusive requests
3. Surface that content in the same consistent manner no matter which publish server the user ends up hitting.
4. Do all of this without affecting the performance of authoring operations or the rendering of content from CQ5 (especially not flushing dispatcher caches unnecessarily)
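To make the first two requirements a bit more concrete, here is a minimal sketch of what capturing could look like. Everything in it is hypothetical: the `UgcCapture` and `Comment` names, the in-memory map standing in for a real store, and the spam check passed in as a plain predicate. It only illustrates the idea of keying user content by the path of the CQ5 resource it belongs to and filtering it on the way in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

/** Hypothetical capture layer: comments are keyed by the path of the CQ5 resource they belong to. */
final class UgcCapture {

    /** One comment; parentId is null for a top-level comment, else the id of the comment replied to. */
    static final class Comment {
        final String id;
        final String parentId;
        final String author;
        final String text;

        Comment(String id, String parentId, String author, String text) {
            this.id = id;
            this.parentId = parentId;
            this.author = author;
            this.text = text;
        }
    }

    // In-memory stand-in for whatever store ends up holding the data (an assumption).
    private final Map<String, List<Comment>> byResourcePath = new ConcurrentHashMap<>();
    private final Predicate<Comment> looksLikeSpam;

    UgcCapture(Predicate<Comment> looksLikeSpam) {
        this.looksLikeSpam = looksLikeSpam;
    }

    /** Stores the comment unless the filter rejects it; returns true if it was stored. */
    boolean submit(String resourcePath, Comment comment) {
        if (looksLikeSpam.test(comment)) {
            return false;
        }
        byResourcePath
            .computeIfAbsent(resourcePath, p -> new CopyOnWriteArrayList<>())
            .add(comment);
        return true;
    }

    /** Returns a read-only snapshot of the comments recorded for the given resource path. */
    List<Comment> commentsFor(String resourcePath) {
        List<Comment> found = byResourcePath.get(resourcePath);
        if (found == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(new ArrayList<>(found));
    }
}
```

Comments on comments fall out of the `parentId` field, so the threading lives in the data rather than in the repository structure. A real spam check would obviously be more involved than a predicate over the comment text.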
With these requirements, we can try a number of options but there is so much more to answer:
1. Maybe pick a store which IS good at auto replication across data centers (most clients I work with have data centers on different continents). How does MongoDB or Cassandra perform for this? Or, say we don’t need replication, can we use data-center-specific databases and shard the data?
2. Check what data model you’d need for your user-generated content. For stuff like comments on comments, a JSON document store like MongoDB might be OK, but what if your data is tons of flat lists with relations? We definitely don’t want to re-invent the wheel there. Can we use SQL databases that support sharding?
3. Do we store data in those systems directly through an API outside of CQ5, or do we write servlets/components in CQ5 which would proxy to the appropriate data store?
4. When we read the content, should we read directly from those systems or add caching/search indexes for them?
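For the last two questions, one way to keep the options open is to hide the store behind a small interface that a CQ5 servlet or component could delegate to, and to layer caching on top of it. The sketch below is only an assumption about how that might look: `UgcStore`, `InMemoryUgcStore` and `CachingUgcStore` are made-up names, and the in-memory class merely stands in for MongoDB, Cassandra or a SQL database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical abstraction a CQ5 servlet or component could proxy to. */
interface UgcStore {
    void append(String resourcePath, String entry);
    List<String> read(String resourcePath);
}

/** Stand-in for an external store; counts reads so the effect of caching is visible. */
final class InMemoryUgcStore implements UgcStore {
    private final Map<String, List<String>> data = new ConcurrentHashMap<>();
    private final AtomicInteger reads = new AtomicInteger();

    @Override
    public void append(String resourcePath, String entry) {
        data.computeIfAbsent(resourcePath, p -> new CopyOnWriteArrayList<>()).add(entry);
    }

    @Override
    public List<String> read(String resourcePath) {
        reads.incrementAndGet();
        List<String> entries = data.get(resourcePath);
        return entries == null ? new ArrayList<>() : new ArrayList<>(entries);
    }

    int readCount() {
        return reads.get();
    }
}

/** Cache-aside wrapper: reads are served from a local cache, writes invalidate the cached entry. */
final class CachingUgcStore implements UgcStore {
    private final UgcStore backing;
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    CachingUgcStore(UgcStore backing) {
        this.backing = backing;
    }

    @Override
    public void append(String resourcePath, String entry) {
        backing.append(resourcePath, entry);
        // Invalidates only this instance's cache; other publish instances would
        // need an expiry time or an invalidation message to see the new entry.
        cache.remove(resourcePath);
    }

    @Override
    public List<String> read(String resourcePath) {
        return cache.computeIfAbsent(
            resourcePath, p -> Collections.unmodifiableList(backing.read(p)));
    }
}
```

The design choice here is that the component code only ever sees `UgcStore`, so swapping the backing store or dropping the cache is a wiring change. Note that this cache is local to one publish instance, which is exactly the cross-publisher consistency problem from the requirements above, so it does not make that problem go away.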
I guess the point I’m trying to make is that the sites/applications we are writing on CQ5 are starting to need features and scalability normally found in the bigger custom-developed social networking sites. Is there a way we can have poor man’s versions of those without breaking CQ5? For some of these use cases, we need to rethink the architecture of one author and multiple publishers all containing the same code and content.