Sunday, February 23, 2014

Week 5 Reflections


This week's lecture dealt with Web Analytics, which involves collecting data about the visitors who come to your website and what they do there, and then analyzing that data in meaningful ways.

Specifically, we looked at Google Analytics (GA). GA is popular because it is easy to use: you simply insert a small snippet of JavaScript code into the page(s) you want to track. It is also FREE, which is probably the main reason it's one of the leading web analytics tools.

The diagram below (from http://www.ohcpi.com/analytics.html) illustrates how GA works.


  1. Users connect to your website and download site content
  2. When their browser renders the HTML, JavaScript embedded in the page makes a call to the Google servers
    • Since this call originates from the visitor's computer and not your server, it carries information (e.g. cookies) that is specific to Google's domain, as well as information about the page (URL) in which it was embedded
  3. Google assimilates the data it collects from these requests
  4. Google provides reports and tools that can be used to slice and dice the collected data

The lecture discussed the web analytics cycle, emphasizing that analytics is not a one-shot view of things but instead a continuous cycle:

  • set goals -- decide what data you are going to be collecting, what you hope the results of analysis will be
  • measure -- collect the data
  • report -- organize the data into a format that can be analyzed
  • analyze -- examine the data to determine how it measures up to your goals
  • optimize -- make changes as necessary to deal with shortcomings or issues seen
  • repeat! 

The lecture then discussed the Five Ws of web analytics: what, who, when, where, and why.

  • What - the actions visitors perform on your website: which links they click on, etc.
  • Who - the audience and its demographics
  • When - the time of day, days of the week, time of year, and how long visitors stay on a particular page
  • Where - the geographical areas visitors come from
  • Why - are they buying products? reading blogs? contributing reviews? all of the above?

I think that these are useful guidelines and I found them helpful when organizing the GA report I did for my client.

The lecture then discussed some of the measures available and (in the second module) went through some examples of information one can get via GA.

What About Ethics?

I think that one thing that was missing from this week's lecture was a discussion around the ethics of web analytics. At what point are we crossing the line when we analyze information about the visitors to our websites?

In one respect, one could say that when users visit your website, you have the right to track what they do and where they go. All web servers log requests, so the fact that someone accessed the website, and everything they downloaded (HTML files, JavaScript, images, documents), is already recorded in the web server log.
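
To make this concrete, here is a minimal Perl sketch of the kind of analysis you can do with nothing but the server's own log, no GA required. It assumes Apache's "combined" log format and a hypothetical /var/log/apache2/access.log path:

use strict;
use warnings;

# Summarize what a plain web server log already records, assuming Apache's
# "combined" log format and a hypothetical log path.
my %hits_per_path;
open my $log, '<', '/var/log/apache2/access.log' or die "Cannot open log: $!";
while (my $line = <$log>) {
    # IP - user [timestamp] "METHOD /path HTTP/x.x" status bytes "referer" "user-agent"
    if ($line =~ m{^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3})}) {
        my ($ip, $method, $path, $status) = ($1, $2, $3, $4);
        $hits_per_path{$path}++;
    }
}
close $log;

# Most-requested resources -- no cookies or third-party calls involved.
for my $path (sort { $hits_per_path{$b} <=> $hits_per_path{$a} } keys %hits_per_path) {
    print "$hits_per_path{$path}\t$path\n";
}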

However, GA (and other analytics tools) goes one step further: it has you add JavaScript code to your page(s), and the visitor's browser then makes a call to Google's servers. For security reasons, web browsers only send cookies back to the domain that set them, but this JavaScript technique works around that "limitation". The call goes from the user's web browser to Google's server, so while Google doesn't get any cookies that your site may have set (those are private to your domain), it does get cookies that Google has set in some other connection. The returning data can also set cookies in the browser that belong to Google. That means cookies that were set because you did Google searches, or logged into your Gmail account, or visited another site that uses GA are all sent along to Google. Of course, these cookies are not human-readable "Google Search Was Here" notes; they are hexadecimal strings that only have meaning when matched against entries on Google's servers.

For example, using a web developer plugin, one can easily see the cookies set for the google.com domain. All of these cookies would be sent along when you navigate to a page that includes the GA JavaScript snippet.

What this means is that Google can track a person's behavior across many sites on the Internet. What do they do with this data? We can glean a little information by looking at Google's privacy policy (http://www.google.com/intl/en/policies/privacy), but in reality what they are saying is "trust us".

Ultimately, if one is collecting PII (personally identifiable information) then a line has probably been crossed (at least in some countries).

The following cartoon (from Measuring Success, 2013) compares internet traffic to road traffic and the kinds of information that can be collected in each case.


This highlights a big difference between doing business online in the European Union (EU) versus the US: the EU has strong privacy laws that require that sites get consent from users before collecting much more than the basic, non-individual data. The data that can be collected without consent is basically the same data you can get by analyzing the web server logs.

So what can we do? What should we do? Well, as individuals, we can choose to mess up GA (or other analytics) by deleting cookies...
You don't have to delete all your cookies; you can selectively delete cookies from particular domains (though this can end up being a lot of work). Personally, I delete cookies pretty regularly, and it's just become a habit of mine. The downside to deleting cookies is that you can lose things like "remember my login" on pages you frequent. Bank of America is always asking for my state and for me to confirm my computer!

As designers or architects of web sites, or as people who help guide those who are, we have to decide whether we want to be part of the big information-gathering machine that is Google (or, again, any other analytics system; I don't mean to pick on Google).

References

"Online Privacy – The Good, the Bad, the Ugly." Measuring Success RSS. N.p., 6 Mar. 2013. Web. 23 Feb. 2014. <http://www.advanced-web-metrics.com/blog/2013/03/06/online-privacy-the-good-the-bad-the-ugly/>

Saturday, February 15, 2014

MicroStrategy "Tutorials"

O M G

Those MicroStrategy tutorials were THE WORST TUTORIALS EVER. Seriously. Is one supposed to learn something from the ADHD "click here, now click here, now click here" approach? And the so-called "test" for each "lesson" (I'm using quotes very deliberately here!) is somehow supposed to test your knowledge? About 80% of the time, I click where I'm supposed to (I think? it's the only place that responds to a click) and I get a server error:


UPDATE: So, I tried this out using a different browser and there are pop-ups that guide you through the lesson. OK, so it's slightly better than just randomly clicking where the big blue arrow points. It would have been *nice* for them to mention that pop-ups were required, or to do browser detection to let me know that I wasn't using a supported browser! At least now there's some context to where I'm clicking, and the "tests" direct me where to click (rather than me just clicking around on the screen). I'm still getting the 500 server errors, though. I still think that these are the worst tutorials ever; they just aren't as completely useless as I originally found them.


UPDATE 2: Yay! Done with the "tutorials". I don't think I really learned anything, even though I scored 100% on most of the "tests". Hopefully I'll learn more from the actual assignment.


UPDATE 3: Actually working with MicroStrategy has been much more valuable... It's a very powerful tool, and maybe if I had a month or two to use it I'd actually be able to learn how to use it effectively. The "tutorials", other than pointing out the different buttons, etc., didn't really teach me anything. It's as if someone taught you how to drive a car by going over all the different parts (here is the steering wheel, here is the brake pedal, etc.) and then said, "OK, now drive to the store". Frustrating, to say the least.

Sunday, February 9, 2014

Week 3 Reflections

The first lecture this week expanded on the theme established in week 2: namely, the design of star and snowflake schemas using fact and dimension tables. For fact tables, we learned about the different types of facts (additive, semi-additive, and non-additive), and what was really interesting was the concept of factless fact tables. Factless fact tables can be used to record events or conditions in the data, as described by Datawarehouse Concepts (2012). The basic idea is that a factless fact table ties various dimension tables together to record some event or condition, so it contains nothing but the foreign keys of those dimension tables. Very interesting stuff!
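
To illustrate the idea, here is a minimal Perl sketch of a factless fact table; the attendance scenario, dimension names, and keys are all made up for illustration:

use strict;
use warnings;

# A factless fact table: each "row" is nothing but foreign keys into
# dimension tables (date, student, class here are made-up examples).
# There is no numeric measure -- the event itself is the fact.
my @attendance_fact = (
    { date_key => 20140203, student_key => 101, class_key => 7 },
    { date_key => 20140203, student_key => 102, class_key => 7 },
    { date_key => 20140204, student_key => 101, class_key => 9 },
);

# Analysis of a factless fact table is usually just counting rows,
# e.g. "how many attendance events did each class have?"
my %attendance_by_class;
$attendance_by_class{ $_->{class_key} }++ for @attendance_fact;

for my $class_key (sort { $a <=> $b } keys %attendance_by_class) {
    print "class $class_key: $attendance_by_class{$class_key} attendance events\n";
}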

With respect to dimensions, we learned about different types of dimensions used in data warehouses:

  • degenerate dimensions
  • role-playing dimensions
  • junk dimensions
  • slowly changing dimensions

Degenerate dimensions are interesting, because they are dimensions that occur in fact tables! While this may seem counter-intuitive, it makes sense when you think about it. These types of dimensions are used to provide information about a particular transaction.
Degenerate dimensions commonly occur when the fact table’s grain is a single transaction (or transaction line). Transaction control header numbers assigned by the operational business process are typically degenerate dimensions, such as order, ticket, credit card transaction, or check numbers. These degenerate dimensions are natural keys of the “parents” of the line items. (Becker, 2003)
For me, one of the really interesting ideas was that of "slowly changing dimensions". I suppose this is because I currently work for a company that produces versioning software, so the idea of tracking how dimensions change over time is inherently interesting. I guess the real challenge here is identifying which dimensions are important to version, how many versions are worth keeping around, etc. Choosing incorrectly, especially failing to version a dimension, might prove problematic in the future when someone decides that they do want a historical view of that dimension. Of course, versioning everything would be ideal, but is it realistic, especially when considering the storage requirements of versioned dimensions? Margy Ross (2013) outlines various techniques for dealing with different types of slowly changing dimensions; in a solution that needs to deal with them, a combination of these techniques would likely be used.
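
As an illustration of one of those techniques, here is a minimal Perl sketch of Type 2 handling (expire the current row and add a new one); the customer dimension layout and field names are my own invention, not anything from the lecture:

use strict;
use warnings;

# Type 2 handling of a slowly changing dimension: instead of overwriting a
# changed attribute, expire the current row and add a new one with a fresh
# surrogate key. The layout and field names here are made up.
my @customer_dim = (
    { customer_sk => 1, customer_id => 'CUST-100', city => 'Boston',
      effective_date => '2013-01-01', expiry_date => '9999-12-31', is_current => 1 },
);

sub apply_type2_change {
    my ($dim, $customer_id, $new_city, $change_date) = @_;
    my ($current) = grep { $_->{customer_id} eq $customer_id && $_->{is_current} } @$dim;
    return if !$current || $current->{city} eq $new_city;    # nothing to do

    # Expire the existing row...
    $current->{expiry_date} = $change_date;
    $current->{is_current}  = 0;

    # ...and add a new row, with a new surrogate key, carrying the new value.
    push @$dim, {
        customer_sk    => scalar(@$dim) + 1,
        customer_id    => $customer_id,
        city           => $new_city,
        effective_date => $change_date,
        expiry_date    => '9999-12-31',
        is_current     => 1,
    };
}

# The customer moves; both the old and new values are preserved with dates.
apply_type2_change(\@customer_dim, 'CUST-100', 'Seattle', '2014-02-09');
printf "%s city=%-8s %s .. %s\n", @{$_}{qw(customer_id city effective_date expiry_date)}
    for @customer_dim;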

In the first lecture, we also learned about surrogate keys, and how they are useful because, unlike natural (business) keys, they have no "embedded intelligence". An example of embedding intelligence in a key can be seen in the Customer table below. Here, while the CustomerID values are unique (and so could serve as the primary key), they have "G" and "S" embedded in them, indicating the type of customer: "Gold" or "Silver".

Customer ID (PK) | First Name | Last Name
G100             | Mary       | Jones
S100             | Bob        | Smith
G101             | Yvette     | Lancaster
S101             | Sonja      | Spenser
G102             | Matt       | Dawson
S102             | Larry      | Melrose

Instead, by adding another column with a surrogate key, we decouple the embedded meaning from the primary key used in operations. For example, introducing a surrogate key into the above table might result in something that looks like this:

CID (PK) | Customer ID | First Name | Last Name
0001     | G100        | Mary       | Jones
0002     | S100        | Bob        | Smith
0003     | G101        | Yvette     | Lancaster
0004     | S101        | Sonja      | Spenser
0005     | G102        | Matt       | Dawson
0006     | S102        | Larry      | Melrose

Other benefits of using surrogate keys outlined in the lecture are that they increase operational efficiency and reduce the impact of changes to the 'real' business key (for example, if Matt Dawson in the table above is demoted from Gold to Silver).
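
Here is a minimal Perl sketch of how an ETL step might assign surrogate keys like the CID column above; the helper name and zero-padded formatting are just illustrative:

use strict;
use warnings;

# Assign opaque, sequential surrogate keys during ETL. The natural key
# ("G100", "S100", ...) keeps its embedded meaning, but joins and fact rows
# use the surrogate key instead.
my %surrogate_for;    # natural key -> surrogate key
my $next_sk = 1;

sub surrogate_key {
    my ($natural_key) = @_;
    $surrogate_for{$natural_key} //= sprintf '%04d', $next_sk++;
    return $surrogate_for{$natural_key};
}

for my $row (['G100', 'Mary', 'Jones'], ['S100', 'Bob', 'Smith'], ['G102', 'Matt', 'Dawson']) {
    my ($customer_id, $first, $last) = @$row;
    print join("\t", surrogate_key($customer_id), $customer_id, $first, $last), "\n";
}

# Fact rows would then reference the surrogate key, so the "intelligence"
# embedded in G100/S100 stays out of joins and reporting.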

The second lecture focussed on data quality, specifically on the use of data profiling to help determine data quality. It outlined the basic data profiling process and the main steps involved:

  1. Creating The Profiling Plan -- planning on how the data will be analyzed -- understanding the nature of the data (tables, columns) and determining how to examine for primary keys, foreign keys, and business rule violations.
  2. Interpreting the Profiling Results -- determining whether the data is high or low in quality and what needs to be done with the data (i.e. cleaning).
  3. Cleansing the Data -- preparing the data for ETL.

Here, I think it is easy to fall into the trap of becoming too reliant on the particular tools being used. Instead, it is good to keep the 'big picture' in mind and to remember that a combination of tools and techniques will likely be needed to profile the data. The process may also be iterative: the profiling results may lead one to re-run the data profiling software (or use different software) on the datasets to look at the data from a different vantage point.
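
As a toy example of the kind of checks a profiling pass performs, here is a minimal Perl sketch that profiles a hypothetical customers.csv: it counts null/blank and distinct values per column and flags candidate primary keys (a real profiling tool does far more than this, of course):

use strict;
use warnings;

# Profile a hypothetical customers.csv: per column, count null/blank values
# and distinct values, and flag columns that could be candidate primary keys.
# (Naive comma splitting; a real job would use a proper CSV parser.)
open my $fh, '<', 'customers.csv' or die "Cannot open customers.csv: $!";
chomp(my $header = <$fh>);
my @columns = split /,/, $header;

my (%nulls, %values);
my $row_count = 0;
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    $row_count++;
    for my $i (0 .. $#columns) {
        my $value = $fields[$i] // '';
        $value eq '' ? $nulls{ $columns[$i] }++ : $values{ $columns[$i] }{$value}++;
    }
}
close $fh;

for my $col (@columns) {
    my $distinct = scalar keys %{ $values{$col} || {} };
    my $null     = $nulls{$col} || 0;
    my $note     = ($row_count && $null == 0 && $distinct == $row_count) ? ' (candidate key)' : '';
    print "$col: $distinct distinct, $null null/blank$note\n";
}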

The lecture then went on to discuss some examples of data profiling, demonstrating how one would go through the process of identifying primary keys, spotting strange data that needs to be cleaned, and performing referential integrity and business rule checks. For the initial analysis, tools and automation will likely be needed, as the process can be quite time-consuming and error-prone. In the lecture, the Gartner Magic Quadrant analysis of various data profiling tools was presented. Each tool has strengths and weaknesses, so some investigation would be needed to determine which tool(s) would be best for a given analysis.

One of the main takeaways from this week's materials is that the quality of the data is paramount. You need to be able to distinguish good data from bad data (and it's not always as easy as just looking for the one in the turtleneck)...
The primary reason that 40% of business initiatives fail is due to poor quality data. Data inconsistencies, lack of completeness, duplicate records, and incorrect business rules often result in inefficiencies, excessive costs, compliance risks and customer satisfaction issues. Therefore improving the quality of your enterprise data will have a huge impact on your business. (IBM Whitepaper, 2012)
The lecture concluded with a discussion of Master Data Management (MDM) and how data quality analysis is a part of this larger process. MDM is defined as "the processes, governance, policies, standards and tools that consistently defines and manages the critical data of an organization to provide a single point of reference" (see http://en.wikipedia.org/wiki/Master_data_management). How an organization approaches MDM can vary greatly, as organizations of different sizes face different central challenges (as described by Graham, 2010):

Organization Size | Central Challenge
Small             | Small amounts of master data. Data integration is not a top priority.
Mid-size          | Data integration starts to become difficult for an organization. Data stewards can be clearly defined.
Large             | Huge amounts of master data and system integration. Mostly homogeneous data silos with relatively consistent attributes. Data stewards may now have a full-time role.
Conglomerate      | Many disparate businesses that may create many groups of data (i.e., multiple product lines, general ledgers, and so on).

So one question that arises is how MDM fits into the "big picture" of information management within an organization. In his blog article, Weigel (2013) emphasizes Information Governance as a "discipline that oversees the management of your enterprise’s information". He goes on to describe how business goals and information management initiatives can be aligned under Information Governance:


Master Data Management is a key initiative for the success of overall information governance. Ultimately, the business needs to be able to rely on the data in order to make strategic decisions. If the data cannot be trusted, can the decisions based on that data be?

References

Becker, Bob. "Fact Table Core Concepts." Kimball Group. N.p., 3 June 2003. Web. 9 Feb. 2014. <http://www.kimballgroup.com/2003/06/03/design-tip-46-another-look-at-degenerate-dimensions/>.
"Datawarehouse Concepts." What is a Factless Fact Table? Where We Use Factless Facts. N.p., 4 Aug. 2012. Web. 9 Feb. 2014. <http://dwhlaureate.blogspot.com/2012/08/factless-fact-table.html>.
"Garbage in, quality out. Now that's different." IBM, Oct. 2012. Web. 9 Feb. 2014. <http://www-01.ibm.com/software/info/rte/bdig/ii-5-post.html>.
Graham, Tyler. "Organizational Approaches to Master Data Management." Microsoft, 1 Apr. 2010. Web. 9 Feb. 2014. <http://msdn.microsoft.com/en-us/library/ff626496.aspx>.
Harris, Jim. "Adventures in Data Profiling." OCDQ Blog. N.p., 3 Aug. 2009. Web. 9 Feb. 2014. <http://www.ocdqblog.com/adventures-in-data-profiling/>.
Ross, Margy. "Dimension Table Core Concepts." Kimball Group. N.p., 5 Feb. 2013. Web. 9 Feb. 2014. <http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7/>.
Weigel, Niels. "Data Profiling and Data Cleansing - Use Cases and Solutions at SAP." SAP Community Network. N.p., 12 June 2013. Web. 9 Feb. 2014. <http://scn.sap.com/community/enterprise-information-management/blog/2013/06/12/data-profiling-and-data-cleansing--use-cases-and-solutions-at-sap>.

Sunday, February 2, 2014

Week 2 Reflections



Dimensional Modeling

This week's textbook reading on Dimensional Modeling was quite eye-opening. At first, my reaction to denormalized database tables (with respect to the dimension tables) was "noooooooo!", not that I'm a huge fan of crazy ER diagrams...
"my head hurts"

However, the star schema approach described makes a lot of sense. When one focusses on the ease of data access and the ability to accommodate changing requirements, the simplicity of the Fact and Dimension tables and their relationships is compelling. 

The pitfalls outlined at the end of the chapter also seem particularly vital. As a technologist, I find it easy to become excited by the technologies involved and focus on the back-end processing. It's good to be reminded that technologies are just tools and that ease-of-use and front-end processing are more important. After all, you can build the most awesome data warehouse, but if nobody uses it because it is too cumbersome or slow, then what's the point?

The Four Step Process

The four-step process outlined in the lecture materials really helped solidify the concepts of dimensional modeling. Steps 1 and 2 (selecting the business process and declaring the grain) seem especially important, as they guide the rest of the process.

One surprising thing in the lecture was the idea of a "Date/Time Dimension" being present in almost every data warehouse, and that it is one of the most important dimensions in a data mart. I understand why date/time is important, but I would have thought that having a table to hold the values described would be unnecessary. To me, the data in this dimension can be easily and quickly calculated, so actually storing it in a table seemed like a waste of space. It is easy to convert from UNIX epoch time to things like the week number in Perl:

use POSIX ();
my $weekNum = POSIX::strftime("%V", localtime($someTimeValue));

Of course, I will admit that a 'holiday indicator' would be a little more difficult to determine programmatically. After the discussion on slicing and dicing, it does make a little more sense to have this information ready and available for use in queries (although I'm not totally sold on the idea yet!).
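
For what it's worth, here is a minimal Perl sketch of what pre-computing a week of date-dimension rows might look like (the column choices are mine, not the lecture's); it just extends the strftime one-liner above:

use strict;
use warnings;
use POSIX qw(strftime mktime);

# Pre-compute date-dimension rows: date key, day name, ISO week number,
# month, quarter, and a weekend flag, calculated once at load time so
# queries can slice on them directly. A holiday indicator would still need
# a hand-maintained lookup table.
my $start = mktime(0, 0, 12, 1, 0, 114);    # noon, 1 Jan 2014 (month is 0-based, year is years since 1900)
for my $day_offset (0 .. 6) {               # just one week, as an illustration
    my @t = localtime($start + $day_offset * 86_400);
    my $date_key   = strftime('%Y%m%d', @t);
    my $day_name   = strftime('%A', @t);
    my $week_num   = strftime('%V', @t);
    my $month_name = strftime('%B', @t);
    my $quarter    = 'Q' . int($t[4] / 3 + 1);            # $t[4] is the 0-based month
    my $is_weekend = ($t[6] == 0 || $t[6] == 6) ? 1 : 0;  # $t[6] is the day of week, 0 = Sunday
    print join("\t", $date_key, $day_name, $week_num, $month_name, $quarter, $is_weekend), "\n";
}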

OLAP Operations

I enjoyed the section on data cube operations (slice, dice, roll-up, drill-down, and pivot) and found that the examples really helped me understand the concepts.
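
Just to cement the ideas for myself, here is a minimal Perl sketch of a slice and a roll-up over a tiny, made-up sales fact set, using plain hashes:

use strict;
use warnings;

# Two cube operations on a tiny, made-up sales fact set: a "slice" fixes one
# dimension value, and a "roll-up" aggregates a fine grain (city) to a
# coarser one (region).
my @sales_fact = (
    { region => 'West', city => 'Seattle',  quarter => 'Q1', amount => 100 },
    { region => 'West', city => 'Portland', quarter => 'Q1', amount =>  80 },
    { region => 'East', city => 'Boston',   quarter => 'Q1', amount => 120 },
    { region => 'East', city => 'Boston',   quarter => 'Q2', amount =>  90 },
);

# Slice: keep only the Q1 "layer" of the cube.
my @q1_slice = grep { $_->{quarter} eq 'Q1' } @sales_fact;

# Roll-up: aggregate the Q1 slice from city grain up to region grain.
my %sales_by_region;
$sales_by_region{ $_->{region} } += $_->{amount} for @q1_slice;

print "Q1 sales, rolled up to region:\n";
print "  $_: $sales_by_region{$_}\n" for sort keys %sales_by_region;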

Overall Impressions

The approach to dimensional modeling reminds me a lot of the principles outlined in agile software development (e.g. see Armson, 2012), specifically the ability of the system to accommodate change. Approaching dimensional modeling as an iterative process also echoes the agile methodology.

I think that the publishing metaphor at the beginning of the textbook, where the responsibilities of the data warehouse manager are framed in terms of a publisher's role (chapter 1, page 6), was especially apt. The responsibilities outlined really help drive home the point that data warehouses are there to support the needs of the business and thus need to be business-user focussed.

References

Armson, Kathryn. "The Agile Method Explained: Beginners Guide & Summary of Benefits." Linchpinseo.com, 5 July 2012. Web. <http://www.linchpinseo.com/the-agile-method>.