Sergey's BI Space

Blog

    November 30

    Internet companies and Data Crisis

    After researching startup companies in the Bay Area that are developing Web 2.0 (SaaS) applications or social networking portals, I've decided to put down some comments that, in my opinion, define the 'Data Crisis'.
    Startup companies do not care about well-thought-out data models. Why?
    The nature of growing the business is to build quick interfaces to get as many users and as many pages as possible, especially if the company makes its $ on a CPM advertising model: more web pages, more potential 'ad displays'.
    During this growth of the user base, the requirements for data accuracy, a single point of truth, uniqueness of records and overall data processing, as well as data modeling, are completely neglected.
    That leads to redundancy, inaccurate data, duplicates, ineffective data processing, wrong data solutions (quick, not well thought out, and not prepared to be enhanced by ADDING features rather than 'redeveloping'), etc.
    Cost and time to market drive 'dirty' development.
    The next step in the game is that the company cannot add features and therefore cannot grow the business, not to mention that analysis of how the business is doing becomes impossible or, if possible, a very expensive exercise.
    Sometimes, in fact most of the time, the company will continue operations by hiring more and more developers, then contractors (to be blamed for the complete mess), or by outsourcing projects to India, China or Russia, and... if no VC funds can be secured to keep supporting the applications, then most likely it will slowly bleed $ and soon go 'belly up', as we say.
    That story goes on and on, as companies try to make a quick buck on other people's money by selling complete BS to large corporations... Why are large corporations 'buying' it?
    Most of the time because they are in a 'Data Crisis' situation of their own, for different reasons: old software, old technologies, so-called 'technology debt', the huge expense of building new software solutions and integrating them into ERP.
    Lots of money, lots of hassle, lots of failure and, of course, NEW political directions/initiatives that need to be supported
    by the lower echelon: managers, developers, etc. Why not buy a startup? But that is another story.
    The point is that the 'Data Crisis' is happening and growing in most industries as they deal with all of the above.
    And I'd like to emphasize the common problem: the lack of a normalized data model and the lack of a data platform architecture lead to an inability to run the business effectively.
    In a 'Data Crisis' situation startups are eager to find a 'buyer', and large corporations are looking for startups that have some kind of new-idea prototype. A win-win situation? Not exactly. Politically it is. Technology-wise, not at all.
    Repeating the same mistakes won't breathe new life into technologies. Creating a 'Data Crisis' situation is repeating the same mistakes... for different verticals or industries.
    Take a look at social networking, for example. There is huge potential to collect data and make it work for improving sales and for analysis of users and their behavior on the internet (besides porn and the stock market), etc.
    But most social networks survive on AdSense alone, that is, on display advertisement.
    Why? Because they can't process the data and build solutions to analyze it.
    Why? Because of all of the above... and, of course, one small problem: the lack of a standard, base technology to scale data.
      
    I have researched 169 companies over the last half year in the Bay Area and San Francisco.
    I talked, chatted and met with 27 startups, asking questions, researching and going through two-way interviews, trying to find a company that had paid attention to its data platform and had architected solutions to scale data and be able to grow, and at the same time to prepare data repositories with raw data for analysis to improve marketing, ROI, monetization, user base and data quality, avoid fraud and spam, etc.
    Two of the companies I researched, Quantcast and Zvents, went in the right direction (my personal opinion), having a clear understanding of the need to develop a DATA PLATFORM 'in-house' to scale, store and query data with volumes in the range of 1 billion+ transactions a day.
    Unbelievable but true. ONLY 2 of 169!!!
    Why?
    Because a software architecture process does not exist and has not even been considered as a direction, not to mention data architecture, data modeling, architecture for data processing, etc.
    I talked to one very popular company that runs apps on Facebook. Its hundreds of millions of users can't be converted into $ except through AdSense. The reason is very simple: they did not see the need to invest in developing a data platform and a data warehouse.
    Why? The VP of marketing sarcastically mentioned: "At eBay we spent millions of dollars to be able to provide reports, and only 2 users are using the DW."
    So they went with ad-hoc queries on terabytes of data. Good luck.
    This is a failure of management, not of technology or a lack of technology. It is an example of how a 'Data Crisis' situation can also be created 'by design', as a result of a lack of technical vision or maybe a 'bad' experience. Once you've been burned, you never try again. Not me, though... No wonder eBay needed to buy money processing software, as it can't fix the bleeding PayPal.
    I could continue sharing examples of what not to do and whom not to hire to avoid a 'Data Crisis', but I think it would be better to go through an example of how to deal with this situation.

    I'd like to give a simple example of the components that are supposed to be part of the architecture. Most companies have neglected
    the abstract principles of architecture... and therefore have been paying a high price for it... basically, in my words, they are in a mess of data and data operations that is defined here as the 'data crisis'...

    The following is a simplified architecture approach to building web-based applications:
    Let's say it is for the marketing/advertisement business on the internet or for hosting SaaS applications.
    Let's define the simple bricks (components) of the software foundation that need to be architected into the software application:
    1. Front end (browser/client components)
    2. Middle tier business components (server side, to support the UI front end components)
    3. Application server components (business rules execution, data connections, data integration, data collection, queries, data returned by queries, data manipulation to serve the front end or the back end (ODS))
    4. Back end components (data repositories to process and organize (model) data for business needs, for TRANSACTIONAL processing)
    5. DW/BI and analysis/data mining components (aggregated data to provide business analysis and reporting)
    I've simplified the definitions a little bit.
    The point is that 4 of the 5 SYSTEM COMPONENTS represent the data repositories and data processing parts of any internet advertisement business.
    Therefore, architecting the front end (UI) component and not architecting/modeling data for the four other software layers is a huge mistake.
    But unfortunately, in most startups cost and other factors such as
     time to market requirements
     frequent changes of the UI
     frequent changes of features
     lack of an architect role/position as a 'gatekeeper'
     lack of resources to develop and test
     etc.
    lead to simplifying the architecture down to front end and back end components.

    Therefore, data is denormalized based on UI functions, and all data processing is partitioned based on
    UI functionality. Data processing, data analysis, cross-functional reporting, data mining, etc. become very
    challenging if not impossible tasks. The lack of an abstract normalized data model (sometimes called master data for the business) really is the 'data crisis'.
    This is the #1 problem:
    the lack of a software architecture based on an abstract business model and an abstract normalized data model.

    On the data side this problem brings an inability:
     to optimize data processing
     to speed up transaction processing
     to normalize and optimize data storage repositories (DBs or filers) for OLTP
     to build a dimensional model (star schema) for analysis and data mining (forecasting) as well as fraud protection analysis
     to integrate external data sources
     to modify and enhance
     to do cross-functional reporting
     to monitor data and the performance of the business
     etc.
    It brings a snowball of support costs, not to mention the inability to grow the business by adding 'new features'...

    Yes, the lack of data architecture and data modeling by design is the 'data crisis'.
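
    To make the star schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Every table and column name in it (dim_account, dim_ad, dim_geo, fact_ad_event, cost_cents) is a hypothetical choice for illustration, not a model taken from any of the companies mentioned. One fact table holds a row per user action, and the same raw facts can be rolled up along any dimension.

```python
import sqlite3

# Hypothetical star schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE dim_account (account_id INTEGER PRIMARY KEY, advertiser TEXT NOT NULL UNIQUE);
CREATE TABLE dim_ad      (ad_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE dim_geo     (geo_id INTEGER PRIMARY KEY, country TEXT NOT NULL UNIQUE);
CREATE TABLE fact_ad_event (
    event_id   INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES dim_account(account_id),
    ad_id      INTEGER NOT NULL REFERENCES dim_ad(ad_id),
    geo_id     INTEGER NOT NULL REFERENCES dim_geo(geo_id),
    cost_cents INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dim_account VALUES (1, 'Acme Hotels')")
    conn.executemany("INSERT INTO dim_ad VALUES (?, ?)",
                     [(10, "Summer sale"), (11, "Winter sale")])
    conn.executemany("INSERT INTO dim_geo VALUES (?, ?)", [(1, "US"), (2, "DE")])
    conn.executemany(
        "INSERT INTO fact_ad_event (account_id, ad_id, geo_id, cost_cents) "
        "VALUES (?, ?, ?, ?)",
        [(1, 10, 1, 5), (1, 10, 2, 5), (1, 11, 1, 7)],
    )
    return conn


def spend_by(conn, dimension_table, key_column, label_column):
    """Total cost per value of one dimension.

    The identifiers are expected to come from code, never from user input.
    """
    sql = (
        f"SELECT d.{label_column}, SUM(f.cost_cents) "
        f"FROM fact_ad_event f "
        f"JOIN {dimension_table} d ON d.{key_column} = f.{key_column} "
        f"GROUP BY d.{label_column}"
    )
    return dict(conn.execute(sql).fetchall())
```

    With the demo rows, spend_by(conn, "dim_geo", "geo_id", "country") and spend_by(conn, "dim_account", "account_id", "advertiser") answer two different business questions from the same fact table, which is the kind of cross-functional reporting a UI-shaped data model makes hard.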
     
    The #2 problem that can define the DATA crisis is the lack of technologies to scale large volumes of data.
    As the business grows, larger data volumes need to be processed and stored. That requires 'special treatment' from the architect: coming up with a platform that can scale data and at the same time meet the performance requirements for querying the data repositories.
    The issue is that each of the software components needs to be 'ready', i.e. architected for scale and fast query execution.
    In most startup companies that is not the case, for many reasons... some defined above.
    The #3 problem is the inability to foresee, or to accept and deal with, Problems #1 and #2.

    I have not had much luxury of stepping into a company when development from scratch was just starting.
    Most of the time I have served as a 'fireman' for companies that are in a 'data crisis' situation.

    Cover up or fix the 'data crisis'?
    Hire contractors to do hardcoded solutions?
    Hire contractors to blame for failure?
    Fire/hire full-time employees?
    Restructure/fire/hire managers?
    Start redevelopment as a NEW project by adding a new data platform development group?
    Hire more managers and developers for permanent positions to continue supporting the snowball of problems?
    Start new development and have a strategic plan to move the 'old' business flows into the new one step by step?

    Restructure the groups and set up ownership for certain features/applications/systems?
    Outsource to India or China to reduce cost?
    Outsource support and start new development by hiring or retraining resources?
    Start building a data platform/technology inside the company to secure the next 4-7 years of business growth?

    All of the above has been happening in the industry...
    What do companies do to solve the above problems or to find tradeoffs?

    Some come up with an architecture that pre-aggregates data to reduce volumes, thereby
    losing the raw data that would be needed for analysis and a clear understanding of business success.
    It also makes it impossible to compete with other companies by adding new features based on analysis of raw data.
    An example is 'targeting' ads based on the user's behavior.
    If no raw data has been saved in the data repositories, no analysis can be done and no 'targeting' feature can be applied to get more $, improve customer satisfaction, etc.
    Pre-aggregated data creates lots of challenges and adds the cost of keeping track of data relationships per transaction (user action).
    For example, when data is pre-aggregated at the account (advertiser) level to calculate the amount of money left to continue marketing
    of ads by the publisher, then analysis of what/who/where (geo/demo) views the ad will not be possible, as the raw data per user action (transaction) is aggregated to the level of the account,
    which is a higher hierarchy level than the ad... (company --> account --> campaign --> order --> order item --> ad)
    etc.
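
    The account-level example can be sketched in a few lines of Python. The event fields (account, ad, geo, cost_cents) and the sample values are made up for illustration. The point is only that the account totals can always be derived from raw events, while the per-ad and per-geo breakdown cannot be recovered once only the totals are kept.

```python
from collections import Counter, defaultdict

# Hypothetical raw events: one row per user action (field names are illustrative).
RAW_EVENTS = [
    {"account": "acme", "ad": "ad-1", "geo": "US", "cost_cents": 5},
    {"account": "acme", "ad": "ad-1", "geo": "DE", "cost_cents": 5},
    {"account": "acme", "ad": "ad-2", "geo": "US", "cost_cents": 7},
    {"account": "globex", "ad": "ad-9", "geo": "US", "cost_cents": 3},
]


def spend_per_account(events):
    """Pre-aggregation to the account level: enough for budget tracking."""
    totals = defaultdict(int)
    for event in events:
        totals[event["account"]] += event["cost_cents"]
    return dict(totals)


def views_per_ad_and_geo(events):
    """Targeting-style analysis: only possible while raw events still exist."""
    return Counter((event["ad"], event["geo"]) for event in events)


account_totals = spend_per_account(RAW_EVENTS)
ad_geo_views = views_per_ad_and_geo(RAW_EVENTS)
```

    If only account_totals were stored, there would be no way to compute ad_geo_views afterwards, because the ad and geo fields are gone at that level of the hierarchy.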
    The lack of business data model hierarchies and of a normalized data model will cost lots of money in the long run...
    some companies tens of millions, some companies hundreds of millions. I have worked for a company that spent 1 billion over 4 years and still failed in the end... losing money... guess which one?
    Very expensive data processing, staging, mapping, matching and cleansing software solutions need to be built as a temporary tradeoff.
    But in the long run the problem of keeping raw data for analysis, and of optimizing how the data is processed and aggregated, won't disappear.

    Some companies use staging repositories and place raw data there for the queries that need to be performed for analysis, basically 'hardcoding' the elements that need to be stored in a separate repository...
    Those repositories can then be queried ad hoc to do on-the-fly data analysis.
    It requires lots of work and investment to keep those queries running as the data grows, and again, tradeoffs need to be made:
    limiting the time range of the data queried, partitioning data to allow a longer history to be queried, etc.
    Lots of work for ETL and database engineers to keep it going...
    etc.

    I could continue to list the issues and temporary solutions... a long list... and each time a unique implementation... that I have participated in when dealing with 'Data Crisis' situations...
    But the point is: why create this crisis when any rational mind says: find the right architect, spend $ on architecting, and... you will be in a win-win situation any time after 6 months of development effort (the life cycle for a first version :) )...
    Maybe my 'rational architect' personality just can't understand the survive-one-more-day mentality, with no future for technology and no progress for the company...

    Solutions?
    Of course, there are temporary solutions: working with existing vendors to scale by partitioning data flows.
    But again, that brings inflexibility to change, cost to add/modify features, cost of hardware, network traffic jams, etc.

    What to do to avoid a DATA CRISIS versus FIXING it?
    Build Distributed Data Processing Platforms to be able to scale the OLTP, and use vendors for DW and data mining to report on/query the data.
    ARCHITECT, ARCHITECT, ARCHITECT.
    Cloud computing, or a grid of computers, is a new technology to scale and to speed up queries, but the technology (a list of vendors and what they are up to :) is in my previous blog posts) has not been developed to the point of satisfying any customer/application with a single vendor's software package.
    No 'perfect' mathematical algorithm has yet been developed to optimize parallel execution of a query, which is the tactic for scaling the data and getting the best query performance.
    Therefore, problem #2 needs to be dealt with by architecting solutions for the company based on the company's business model and data processing needs.
    No 'golden egg' vendor software can be found nowadays to solve Problem #2.
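
    The basic shape of parallel query execution over horizontally partitioned data is scatter-gather: run the same partial aggregation on every partition, then merge the partial results. The sketch below uses Python's concurrent.futures on in-memory lists that stand in for nodes. The partition contents and function names are assumptions for illustration, and it deliberately ignores the hard parts mentioned above (query optimization, data skew, failover).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical horizontal partitions: each list stands in for rows on one node.
PARTITIONS = [
    [("US", 3), ("DE", 1)],
    [("US", 2), ("FR", 4)],
    [("DE", 5)],
]


def partial_sum(rows):
    """Scatter step: runs on one partition and sums amounts per key."""
    out = {}
    for key, amount in rows:
        out[key] = out.get(key, 0) + amount
    return out


def merge(partials):
    """Gather step: combine the per-partition results into one answer."""
    total = {}
    for partial in partials:
        for key, amount in partial.items():
            total[key] = total.get(key, 0) + amount
    return total


def parallel_sum(partitions, max_workers=4):
    """Run partial_sum on every partition concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return merge(pool.map(partial_sum, partitions))
```

    This works because SUM can be split into partial sums and merged. Aggregates that do not decompose this easily (distinct counts, medians) are part of why a general-purpose parallel query engine is so much harder than this sketch.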
     
    What to do?


    Some suggestions:
    Hire an experienced (hands-on) data architect/technology visionary to review each application flow and start building the plan to reorganize the data operations/data platforms so as to be ready in 4 years for the new technology boom.
    Develop in-house data processes (a data platform) for transactional (raw) data.
    Develop a unified data model based on an abstract normalized data model for your business.
    Improve data processes based on this unified (master data) model.
    Identify the critical data processes (data elements) that allow the company to grow.
    Start building custom data analysis systems (BI, monitoring, DW, reporting) based on
       - requirements on latency
       - frequency of change for data elements (richness)
       - dynamicity of hierarchies (relationships between entities and data elements)
       - etc.

    What technologies to use for data to be scaled?

    Again, it is up to the architect to make a decision and mitigate the risks. Each business is very different in its IT situation and in the business rules
    and data that need to be reviewed from the bottom up: from data dictionaries and data models, through a review of existing systems, to a transition plan.
    I have my thoughts based on personal experience, but it is always great to have a data SME for the company's business be part of the brainstorming group that decides
    the steps for the architecture. There are too many vendors currently trying to fill the 'scale with computer grids' space, i.e. distributed processing, distributed DW, distributed networks...
    There are several open source projects that can help to start building a distributed data platform based on the MapReduce mechanism.
    Some DFS systems are already in use, but open source mostly offers only a very basic start to build on top of: Hadoop (Java based), Kosmos (I prefer it because it is based on C++ libraries), etc.
    I have researched a list of vendors and open source players in the distributed/cloud computing space and put it in my previous blog posts a couple of months ago...
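
    For readers who have not seen the MapReduce mechanism, here is a single-process sketch of its three phases in plain Python. It does not use Hadoop or Kosmos APIs. The record layout (ad_id, user) and the function names are assumptions for illustration; a real framework runs the map and reduce functions on many nodes and handles the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record: (ad_id, 1) per view."""
    ad_id, _user = record
    yield ad_id, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in one process over an iterable of records."""
    pairs = chain.from_iterable(map_phase(record) for record in records)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

    For example, map_reduce([("ad-1", "u1"), ("ad-2", "u1"), ("ad-1", "u2")]) counts views per ad from raw view records.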

    What is the working architecture that I would build for more than 1 bln transactions a day with low-latency reporting requirements (10 minutes or less)?
    I'd try to build on top of Kosmos and Hypertable for data processing, then aggregate the data and put it into a distributed file system to be batched into vendors' OLAP products (Oracle, MSAS, SAP) for querying the aggregated data.
    Ideally, I'd like to see a (metadata-driven) framework to support parallel execution of jobs, with a failover mechanism, a mechanism to support late-arriving data, a mechanism to build a queue and change the priorities of the queue on the fly, a mechanism to work with more than one cluster (all the monitoring features for network traffic, and load distribution across nodes not only in one cluster but in several), etc.
    I do have a full list... that I'd like to continue to work on...
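
    One item from that wish list, a queue whose priorities can be changed on the fly, can be sketched with Python's heapq module. The JobQueue class and the job names are hypothetical, and this covers only the re-prioritization part, not failover, late-arriving data or multiple clusters. It follows the common pattern of marking the old heap entry as stale instead of searching the heap for it.

```python
import heapq
import itertools


class JobQueue:
    """Priority queue of job names; a lower number means higher priority."""

    def __init__(self):
        self._heap = []
        self._entries = {}                 # job name -> its live heap entry
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def __len__(self):
        return len(self._entries)

    def add(self, job, priority):
        """Add a job, or change its priority if it is already queued."""
        if job in self._entries:
            self._entries[job][2] = None   # mark the old entry as stale
        entry = [priority, next(self._counter), job]
        self._entries[job] = entry
        heapq.heappush(self._heap, entry)

    def pop(self):
        """Return the highest-priority live job, skipping stale entries."""
        while self._heap:
            _priority, _count, job = heapq.heappop(self._heap)
            if job is not None:
                del self._entries[job]
                return job
        raise KeyError("pop from an empty JobQueue")
```

    A backfill job for late-arriving data could be queued with a low priority and then bumped to the front with a second add() call once the data shows up, without rebuilding the queue.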
    The good news is that the financial crisis gives an opportunity to slow down on throwing money at the kind of startups I have mentioned above
    and maybe to concentrate time and effort on developing technologies for data processing on distributed networks and, which would be top of the line, on finding/developing the algorithm for parallel query execution as the base for all distributed data processing.

    So far I'd say that we are in a transition period for technologies that need to deal with huge volumes of data.
    A Data Crisis is bad news for any company.
    But at the same time the 'Data Crisis' is good news for the industry, as it will drive, and has already been driving, progress in the technology of parallel query processing and distributed data storage repositories.
    It is cool to be a data architect and come up with solutions to challenge DATA CRISIS situations.
     





    November 07

    Financial Crises - What to do?


    The media and news are very contradictory and misleading in explaining the current critical condition of the collapsed USA credit market.
    I found that this article by Igo Baskin includes some interesting, simple examples and an explanation of how the credit system supported pyramids with no 'real money', secured by FNM, or basically by the US government. Unsecured loans, etc. The collapsed credit system needs to be replaced, not to mention the global crisis. Part of it is that investment in US banks and government paper was considered secure by other countries' national banks. However, buying paper from institutions that went broke by issuing unsecured loans is not secure, etc.
    I am not sure that I agree completely with the 'What to do?' suggestions, but overall I like the simplicity with which the article presents the info.
    The article is a little bit scary, with no hope, cruel but rational and realistic - that is what the Russian mentality is about.
    One small thing, though - you have to read it in Russian :)

    We are entering a period of global economic crisis. For the overwhelming majority of people it will be a time of the collapse of their familiar, fairly comfortable way of life and of huge material losses. Real estate will fall catastrophically in price, all the money invested in securities will be lost, and pension savings will melt like snow in the spring sun.

    A Jewish saying: "The horse is dead - time to get off."
    It would seem that everything is clear, but...
    there is no need to persuade yourself that there is still hope
    no need to beat the horse harder
    it won't help that "we have always ridden this way"
    no need to revive dead horses or organize events to revive them
    no need to gather analysts to analyze the dead horse
    no need to "revive" what has died
    no need to hire specialists who will help other horses die
    and so on.
    THE HORSE IS DEAD!!!

    TIME TO GET OFF...

    A new global financial system needs to be built. The old one has died.
    Get rid of the 'dead horse' and raise a strong young one - that is the conclusion of Mr. Baskin's article.
    Start growing the young financial system...
    I am confident that it will happen SOONER than the middle class in the USA starts to melt down...










    SSAS 2008 MDX changes

    Vidas Matelis has posted the MDX syntax changes as well as a list of changes for MSAS 2008.
    Don't miss his poll "What are your plans for MSAS 2008?". Great idea!
    You can find lots of info about MSAS at Vidas's company site.

    October 22

    Microsoft Analysis services 2008 unleashed

    A new book on Analysis Services 2008 is coming out.
    Great news and a pleasant feeling, as I worked with Irina, Sasha and Edward at Microsoft.
    I have the first book signed by them and will catch them in Seattle to sign the second one.
    I look forward to reading and reviewing it.
    October 18

    Forecasting and analytics :SPSS or SAS? Windows or Unix?

    Please find below a very subjective opinion; I'd like to learn more about both packages by implementing models in practice. I've decided to put together a quick overview of what I've experienced so far. Which statistical package to use, and on what platform? SAS or SPSS?
    For small datasets, or for a novice in statistical modeling, I guess SPSS might be the better choice. The Windows version is easier to use than the UNIX one.
    SPSS is easier for entering data by the user.
    The MS Windows version of it is much faster than the Unix version running in X-Windows.
    For researchers with large datasets and more complex statistical analyses, SAS may be the better package.
    Running under either MS Windows or Unix, SAS is currently more powerful than SPSS, as well as more complicated.
    On both systems, SAS now has better graphing capabilities.
    For general data management, SAS possesses certain advantages over SPSS.
    With SAS it is easier to merge and concatenate datasets, and easier to pipe the output from one dataset into another, than with SPSS.
    It is easier with SAS to take the output of one statistical procedure and feed it into the input of another statistical procedure.
    SPSS value labels are easier to form than SAS variable formats.
    SPSS is more modular and less flexible in its data management than SAS.
    But for data entry, I think, SPSS for Windows allows for easier input. The number, power, and flexibility of SAS statistical procedures are generally better than those of SPSS.
    For categorical data analysis, SAS offers more tests than does SPSS.
    SAS also contains a wider variety of regression and ANOVA (analysis of variance) procedures than does SPSS.
    SAS Graph far exceeds the current capabilities of SPSS Chart.
    Based on my trial use, preparing to use the SAS system properly, with its greater variety of options, involves much more homework than for SPSS.
    Overall, it is commonly held that SPSS is more user-friendly, but for the advanced user or statistician, SAS may be more powerful than SPSS.


    October 16

    Financial crisis - analytics

    "pigs will be slaughtered" - that is how money managers talk about 'entities' which are failing.
    Nowadays Iceland is the first pig, isn't it?
    Who is the next one? Take a look at a chart and make your guess.
    Unfortunately, high-tech startups will also see lots of 'reduction in force'.
    TechCrunch keeps the latest info at its Deadpool.

    October 10

    Microsoft BI Conference 2008

    I did not go to the MS BI conference in Seattle this year.
    Some materials are here. It is an interesting 'battlefield' for technologies that deal with large volumes of data (100+ million rows, hundreds of terabytes per day).
    Open source projects and Linux vendors versus Microsoft's MSAS, Kilimanjaro (SQL Server 2010 = SQL Server 2008 + Zoomix + DATAllegro) and Gemini.
    The challenge for BI/DW/decision support systems of working with large volumes (collect, scale, aggregate, store, query, display) is here. BI/DW success will really depend on the competency of resources in the company (architects/devs/PMs) rather than on a single technology to rely on.
    Based on rumors and a limited understanding of what Gemini (MSAS 2010) is, it sounds like a mechanism for caching cubes in memory to speed up MDX execution against large cubes (100 mln rows in the fact table).
    That would be a huge improvement for performance; I am not sure about reliability: failover and persistence-of-updates mechanisms.
    How great it would be if MSAS also had the CONCEPT of scalability (parallel execution of MDX queries against several nodes holding horizontally partitioned records of the cubes) as a new feature for improving query performance.
    I would love to see the ability to connect to data files and process data into cubes directly from files residing on one machine... several machines...
    Wouldn't it be great?! ...From several files residing on several nodes (machines) - it would be awesome!!!

    Looks like Amir Netz is 'back in the BI business', and I hope to see his 'new baby' become a killer technology for BI engines!

    October 01

    CEOs - survival of the UNFITTEST

    What a great metaphor by Mr. Icahn!
    Survival of the UNFITTEST! (Icahn Report)
    "He [the CEO] would never have anyone underneath him as his assistant that’s brighter than he is because that might constitute a threat.
    So therefore, with many exceptions, we have CEOs becoming dumber and dumber and dumber.
    We can all see where this is going.
    It would almost be funny if it wasn’t such a threat to our ability to compete and to our economy in general." Strong, but very true.

    Look at the financial crisis and how many millions of $ the CEOs of failed financial institutions, some of them only 5 days (!!!!) on the job, are getting on departure.
    Unfortunately, the software industry has plenty of such 'paid for failure' examples as well.

    September 30

    Yahoo's AMP - Advertisers-Publishers Exchanges platform - failure or success?

    I'd like to collect some "news available on the internet" in one post in order to answer all your questions about:
     1. What are Panama, Right Media, APEX, AMP, ATP?
        It is a project that has been going on since 2002.
        The goals were:
           a) to build a better system for advertisers'/publishers' e-marketing campaigns on the internet
           b) to build DW/BI/analytics/targeting platforms and solutions to work with very large volumes of data (petabytes)
           c) to improve ROI for advertisers and publishers
           d) to find new monetizing techniques for Yahoo's business

    Simplifying all of the above: to build the ad serving and data processing platforms to compete with Google AdSense and AdWords.

     2. Success or failure?

    September 2008 --> APT VIDEO
    June 2008 --> APEX + Right Media
    April 2008 --> AMP
    April 2008 --> APEX
    May 2006 --> Panama directions

    These news items are not my personal opinion of the APEX team or of Yahoo's executives during the attempt to sell off the company.
    However, I can say that I have been through a very positive experience building scalable solutions with very talented software engineers, as well as negative experiences during the 'transition period' of Yahoo's attempted sell-off and downsizing.
    It is great to see some results of the hard work in APT VIDEO.
    I really wish the best to the Panama (ATP) team, or what is left of it, in contributing to Yahoo's goals and objectives in scaling data and in building DW applications that will help to optimize advertisers'/publishers' marketing efforts.



    September 04

    Web Analytics sites - Compete, Alexa, Quantcast

    Compete looks like it has more precise traffic data than Alexa. But I really like Quantcast, which is catching up with Compete and Alexa on stats and, I think, has started on user behavior analytics: suggesting the CATEGORY of the site and the USER BEHAVIOR - what other sites the users visited as well as what keywords the users of this site searched for.
    Compete also displays the keyword that drives the most traffic to the site; however, that does not help in targeting the right users with the right ad campaigns, as a keyword can belong to so many market segments!!! Quantcast shows stats about users by demographics and age, and has started to touch on PROFILING THE USER, trying to assign the sites users will likely visit after visiting this domain/site to CATEGORIES (market segmentation), therefore making it possible to assign the users of this site to certain market segments in order to target those users with the right advertising content. Take "Audience also likes", for example: I checked marketmetrix.com.
    Its business model is servicing the hospitality industry.
    And Quantcast gave an exact match for the business vertical, suggesting 3 categories to assign this site to: hotels, airlines, car rentals... which exactly identifies the marketmetrix business vertical. Cool! "Audience also visits" gives even more information: the domain names that users of this site have visited - again a very close match to the marketmetrix business vertical. "Audience also search" - this part looks a little bit "off", as the keywords shown on Quantcast are not really related to the hospitality vertical. Quantcast has definitely shown the BEGINNING OF stats/analytics that help to understand user behavior on a site and attempt to assign the site to certain market segments based on user behavior/traffic, not just displaying stats on the number of users (traffic) or returning users...

    Where are former Yahoo execs now?

    TechCrunch posted the list of former execs who left Yahoo here
    August 20

    Analytics and AB testing

    How to improve click rate?
    How to bring more users ?
    How to optimize ad serving ?
    How to segment Users per Market and products for better ROI?
    What attributes to include in stats modeling for user profiling?
    How to build better User experiences to attract more site visitors?

    A/B testing helps to run the numbers and make the right decisions based on experiments (part of web analytics/BI/DW).
    There is a great white paper from the MS Experimental Platform (A/B testing) on "Seven Pitfalls to Avoid when Running Controlled Experiments on the Web"
    as well as a great blog on A/B testing by Andy Edmonds, former MS Live scientist.
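
    As a rough illustration of what 'running the numbers' can look like for the click rate question, here is a two-proportion z-test using only the Python standard library. This is a generic textbook test, not the method of the white paper or of any platform mentioned here, and the function and parameter names are made up. It assumes independent users and samples large enough for the normal approximation.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare click rates of variants A and B.

    Returns (z, p_value) for a two-sided test of "both rates are equal".
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

    For example, two_proportion_z_test(200, 10000, 260, 10000) compares a 2.0% click rate against a 2.6% one; a small p-value suggests the difference is unlikely to be noise. Note that the inputs are raw per-variant counts, which is exactly the kind of data that has to be collected and kept by the data platform.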

    the ability to experiment easily is a critical factor for Web-based applications. The online world is never static. There is a constant flow of new users, new products and new technologies. Being able to figure out quickly what works and what doesn’t can mean the difference between survival and extinction. – Hal Varian (Varian, 2007)
    And it is a huge challenge for mid-size or small companies to build Experimental Data Platforms to drive business decisions.
    A data platform means scalable solutions to process and aggregate data; therefore, it is a DW platform. What young entrepreneurs in small startups do not clearly understand is that data analysis is possible only on top of a working data platform: data structures, data flows, data processes, data aggregations, tools and apps to query/display the results of analysis, as well as tools for data mining and stats modeling...
    Again, I'd like to emphasize that Analytics and BI can't deliver FULLY accurate results as ad-hoc queries against the data that supports business applications, for example social network UIs, widgets, ad serving and other operational data repositories for business/UI needs.
    And this is the major misdirection of vision: building analysis applications with the same architecture approach as the UI apps that drive the Web display part of the business.
    In order to be able to compete and survive, the DW platform needs to be built; therefore, technologies/solutions need to be put in place for data processing,
    data aggregation and data querying, as well as modeling.
    Most of the companies, small startups in the social networking space that I have had the privilege to talk/chat/interview with, are facing the issue of not being able to query volumes of data ad hoc, not to mention the challenge of collecting the growing volumes; therefore, they need data platform solutions to scale data in order to continue to grow the business. And this means $ and resources. But very few of the startups that I have talked to in the recent 6 months, I would say 2 out of 18, have the understanding and vision to build the foundation for data processing and aggregations, i.e. a Data Warehouse.
    Another challenge is that Agile approaches to building a web site and frequently adding features to it have led to highly denormalized back-end data structures, and making such 'structures' or data models work for analysis is a very difficult, almost impossible task without cleansing/formatting/matching/mapping, etc., to normalize the data and avoid duplicates/redundancy of information.
    Only after such a data effort can the aggregations be built, based on the Questions that will identify the DW solutions: data mining/BI.
    And when this foundation is built, data analytics/business intelligence/stats can provide CORRECT results that will help to build and grow the business....
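    To make the 'cleanse first, aggregate second' point concrete, here is a minimal sketch in Python. The rows, field names and helper functions are all hypothetical, invented for illustration; a real DW would do this in ETL jobs over far larger volumes. It only shows how duplicates in denormalized operational data distort an aggregate until the data is normalized.

```python
from collections import defaultdict

# Hypothetical denormalized rows, as an operational UI table might hold them.
raw_rows = [
    {"user": "Alice@Example.com ", "country": "us", "page_views": 3},
    {"user": "alice@example.com", "country": "US", "page_views": 3},  # same user, duplicated
    {"user": "bob@example.com", "country": "DE", "page_views": 5},
]


def normalize(row):
    """Bring the matching keys into one canonical format."""
    return {
        "user": row["user"].strip().lower(),
        "country": row["country"].strip().upper(),
        "page_views": row["page_views"],
    }


def deduplicate(rows):
    """Keep the first record seen for each normalized user key."""
    seen = {}
    for row in map(normalize, rows):
        seen.setdefault(row["user"], row)
    return list(seen.values())


def aggregate_by_country(rows):
    """Sum page views per country."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["page_views"]
    return dict(totals)


# Aggregating without deduplication double counts the duplicated user.
naive_totals = aggregate_by_country(normalize(r) for r in raw_rows)

# Cleansing first gives the figure that analysis should be based on.
clean_totals = aggregate_by_country(deduplicate(raw_rows))
```

    Here the naive aggregate counts the duplicated user twice, which is exactly the kind of inaccuracy that ad-hoc queries over operational data tend to produce.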
    This post is about best practices with AB testing, which is part of the data analytics/mining task set built on top of the DW platform.
    That means a DEDICATED budget and resources, as well as ownership.
    Ownership is a separate topic, as many startups with some millions of dollars are throwing them at consulting companies instead of building in-house expertise.
    The success of AB testing experiments and the display of results will always rely on a solid DW platform/data foundation that needs to be in place.
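    As a minimal sketch of what 'running the numbers' for an AB test can look like once clean aggregates exist, the Python below compares the click rates of two variants with a two-proportion z-test. The function name and the sample counts are assumptions made up for illustration; this is one common textbook test, not the method of the white paper mentioned above.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare click rates of variants A and B with a pooled z-test.

    Returns (rate_a, rate_b, z, p_value), where p_value is two-sided.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis that A and B are the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical aggregated counts: clicks and page views per variant.
rate_a, rate_b, z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"A={rate_a:.4f} B={rate_b:.4f} z={z:.2f} p={p_value:.4f}")
```

    The inputs are already aggregated counts per variant, which is the point of the post: the test itself is a few lines, while producing trustworthy counts is the DW work.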



    August 18

    How is the USA economy doing? How is the USA job market in IT doing?


    I implement 'sample analytics'... just kidding :). No sampling or attribution models...
    The logic is very simple: go to some popular 'IT recruiting' user group, like "Resumes-in-IT", and follow the pattern. And? How is the USA job market in IT doing? The answer is in RECENT ACTIVITIES:
    Post Activity
          Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
    2007                                                     89  509
    2008  537  319  311  425  135   30   39   33



    August 16

    Vendors with DW solutions to scale data


    Podcasts are here
    Denomo
    Truviso
    IBM Swashup
    Dataupia
    Intel MashMaker
    StreamBase.
    PS: I wouldn't call StreamBase a DW vendor, but rather an implementor of event-driven programming.
    In theory, an Event is the data structure that serves as the INPUT format into the processing pipe.
    The richness of such a data structure will have a direct impact on the performance of the pipe.
    Therefore, in reality, an Event may correspond to several data structures in order to optimize the performance of
    data pipes or to support different DW consumer needs for low latency, high availability, traffic peaks, critical data flows, etc...
    Therefore, the theory that a Unified Data structure can be created from an Event and processed
    in the data pipe as is (no need for vertical partitioning) might stay as an architecture goal, but it is very difficult to achieve,
    as, I'd like to say it again, the richness and volumes of data, as well as frequent changes of requirements on latency, performance, aggregations, etc.,
    may lead to designing several data structures based on the Event.
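    A minimal sketch of that idea in Python: one rich Event is vertically partitioned into a narrow structure for the low-latency pipe and a wider one for the attribute-heavy consumers. All class and field names here are hypothetical and are not taken from StreamBase or any other vendor.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RichEvent:
    """Hypothetical unified event as it arrives at the pipe input."""
    event_id: str
    timestamp: float
    user_id: str
    event_type: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class CoreEvent:
    """Narrow structure for the low-latency, high-volume path."""
    event_id: str
    timestamp: float
    event_type: str


@dataclass
class EventDetails:
    """Wide structure for slower consumers that need the rich attributes."""
    event_id: str
    user_id: str
    attributes: Dict[str, str]


def split_event(event: RichEvent) -> Tuple[CoreEvent, EventDetails]:
    """Vertically partition one event; event_id links the two parts."""
    core = CoreEvent(event.event_id, event.timestamp, event.event_type)
    details = EventDetails(event.event_id, event.user_id, dict(event.attributes))
    return core, details


event = RichEvent("e-1", 1218844800.0, "u-42", "ad_click", {"campaign": "summer"})
core, details = split_event(event)
```

    The shared event_id lets a DW consumer join the two parts back together when the full record is needed, while the hot path only ever carries the narrow one.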






    August 09

    Analytics and Video Ads

    Somehow, the simple idea of surveys as a SOURCE of data FOR ANALYTICS is kind of forgotten in the current boom of social networks with video ads. Before trying to build complex data collection/cleansing/processing/aggregation applications while trying to figure out monetization paths for video ads...
    why not crunch figures from direct polls of users/publishers/advertisers, versus trying to 'fish' for PV/users/IP addresses, exclude fraudulent robot clicks, cleanse inaccurate data from log files, disturb "privacy laws" while matching users' demo/gender stats to results, etc.?

    I am applauding the TubeMogul blog: simple and clear ANALYTICS based on a direct user survey!!!!
    It is not yet a statistical analysis or modeling/attribution for profiling, or optimization techniques for publishing video ads, etc., but it is a clear path to CLARIFYING the attributes that might help to define the next steps: identifying product classes, narrowing down groups of Users/Publishers, building market segmentation, etc., TO OPTIMIZE ROI for publishing video content. Clear business model.
    This is what Business Intelligence/Statistics/Analytics is for, as a first step of course: GET NUMBERS from questions and answers (SURVEYS) directly from CUSTOMERS and then, based on tracking numbers of PV/users, etc., start building the suggestions/optimizations, etc....
    The numbers from tracking PV/Users (demo, geo, properties) are too wide for any stat modeling, or even for suggesting any sampling techniques with probabilities extended to the overall data volumes.
    There is plenty of work to do for Data Modeling, Data Mining and Data Analysis teams to build tools to track and crunch numbers
    in order to help companies target the right audience with the right ads, etc.,
    which is the end result of any ANALYTICS tool/application: to increase revenue and improve sales.
    And again, it is a process, which is part of the DW effort, which is not flexible, as I have mentioned in some previous blogs, and requires skills, time and $ signs to design/build/support/maintain.
    Web Analytics has a long way to go and, in my opinion, it is neither quick ad-hoc querying nor simple linear algebra for calculating means or averages based on stats from PV/users/Clicks (CPC)/Display (CPM).
    Modeling and stats techniques for marketing and profiling are required, and this is the next step, which takes some time to comprehend, for quickly growing startups building ad publishing campaigns/advertisement businesses on social networks.




    August 08

    System Design patterns - MVC or Singleton or ?

    There are different SYSTEM ARCHITECTURE DESIGNS that I have worked through while trying to keep theory close
    to the implementation and enhancement cycles... it is very challenging to preserve a single architecture technique/theory when
    building web portals/applications with a multi-tier design. In large corporations especially, it is very difficult to keep ONE PATTERN
    of architecture, MVC versus Singleton. The Microsoft .NET framework also looks like it is trying to move toward the MVC model. Mostly, it is very much a mixture of apps, technologies, integration solutions, etc...
    However, there are some theoretical directions toward having MVC frameworks as the solid base of architecture for multi-tier software products.
    But in reality, factors like
    dynamicity of changes/enhancements
    integration of old technologies with new applications
    time to market
    quality of service
    performance requirements
    data volumes and richness of data
    cross platforms implementations
    cross vendors implementations
    and many other software and company structure/infrastructure requirements (ownership, networks, data repositories, etc.)

    drive the architecture toward solutions based on implementation and application business function (artifact) requirements, with a custom solution for the components/objects in the OOD of each APPLICATION rather than one for a set of applications.
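    For readers less familiar with the two patterns named above, here is a minimal Python sketch of both in one toy application. Every class name is hypothetical and invented for illustration; it is not taken from any framework mentioned in this post. It also shows how easily the two get mixed in practice: the MVC controller reaches for a Singleton.

```python
class Settings:
    """Singleton: every call to Settings() returns the same shared object."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.page_title = "Report"
        return cls._instance


class VisitModel:
    """Model: owns the data and nothing else."""
    def __init__(self):
        self._visits = []

    def add_visit(self, page):
        self._visits.append(page)

    def count(self):
        return len(self._visits)


class VisitView:
    """View: turns values into a display string, holds no data."""
    def render(self, title, count):
        return f"{title}: {count} visits"


class VisitController:
    """Controller: routes input to the model and model data to the view."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def record(self, page):
        self.model.add_visit(page)

    def show(self):
        return self.view.render(Settings().page_title, self.model.count())


controller = VisitController(VisitModel(), VisitView())
controller.record("/home")
controller.record("/about")
summary = controller.show()
```

    Even in this toy case the 'pure' MVC layering already depends on a process-wide Singleton for configuration, which is the kind of mixture that real multi-tier applications end up with.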

    Today a couple of my colleagues and I discussed the architecture of an application/data flow for a system that needs to be scalable and at the same time very flexible on back-end storage technologies and data querying technologies for DW/reporting needs. Can any single design pattern be followed: a framework, or strict objects/classes that support some functionality for certain consumers/applications? Performance? What open source projects? What vendors can be used, etc.? And it can go on for a while... The JAJAH guys look like they are hitting the same hot point on architecture and design patterns while looking into a CRM software package, and Amichay Oren shared his thoughts on his blog on Theoretical Design Patterns versus Real Life Development patterns.


    August 05

    Java API layer to access MSAS OLAP CUBES - open source

    Looks like one of the questions that I had been researching for a while got answered:
    How to execute MDX from Java to access and retrieve data sets from MSAS cubes?
    The open source project olap4j.org
    Thank you, Chris (Webb), for publishing the blog post about it!

    Cloud Computing and Elastra Cloud Server

    Amazon invested 12mln into Elastra
    Elastra Whitepaper
    Does that mean a "NEW STANDARD SOFTWARE SOLUTION" for Grid Computing to scale on Unix/Linux machines?

    Podcasts for cloud computing vendors: RightScale, Elastra and 3Tera, which present scaling solutions and technical approaches to scale, monitor, control resources, configure nodes, failover, etc. for grid computing.

    Administration, and automation of server administration, is another one of the BIG questions for cloud computing.
    Puppet (http://reductivelabs.com/trac/puppet) is an open source project that sounds like it has an implementation in the Elastra solution.