Tenon Research first glimpse: The best & worst of content management systems

Since February 2018, Tenon.io has been performing accessibility research 24/7/365 and publishing the raw numbers publicly on our site. The data gathering is ongoing but slow. We analyze logs (from customers who’ve given us permission) as well as crawl the web on our own. We take the logs and analyze the technologies in use on the page’s domain. The ultimate goal, at this point, is simply to gather the data and determine whether any of it is useful.

There are well over 1.3 billion websites on the Web and many thousands of different technologies in use. That alone means that there are tens – or even hundreds – of thousands of different possible combinations of technologies in use. When you add in the different versions of each technology that could be in use and the possible differences in configurations of those technologies, you’re certainly well into the millions of possible combinations. For this reason, we don’t know if this research will ever produce anything of value. Our slow rate of data gathering may delay useful results even further.

Nevertheless, we think it is worth it to try. Gathering the data itself is relatively cheap and involves no human labor beyond the development time needed to make sense of the data. We have already managed to learn a few things about what we’re doing right and what we need to fix to make this better, and we thought we would take a few moments to share some early findings by providing a comparison of content management systems.

This blog post should not be considered conclusive evidence of which content management system is most accessible. We do not have enough data for such conclusions. If you read carefully, you’ll see that this blog post is also filled with disclaimers and exceptions that should be heeded. There’s already too much flawed research in academia on the topic of accessibility. If anything, this post should be considered a discussion of possible flaws.

The data we discuss later in this post is available as a Google Spreadsheet.

Methods

Whenever an API call is made to Tenon to test a page, we also check the user’s preferences to determine whether they’ve provided consent for anonymous statistics gathering. If they have, we mark the test record for later analysis. Behind the scenes, another process plucks those marked records, determines what technologies are used on the tested site, and maps them to our list of technologies. This process presents our first two challenges: just because a domain has a specific technology in use does not mean that the page itself uses that technology, nor does it mean that the technology has any specific accessibility impact. We hope that a growing body of data will add reliability in such cases.
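The record-marking and mapping process described above can be sketched roughly as follows. This is an illustrative sketch only – the record fields, lookup structure, and function names are our invention, not Tenon’s actual implementation.

```python
# Hypothetical sketch of the consent-marking and technology-mapping pipeline.
# All names and data structures here are illustrative, not Tenon's real code.

KNOWN_TECHNOLOGIES = {"wordpress", "drupal", "sitecore"}

def mark_for_research(test_record, user_prefs):
    """Flag a test record for later analysis only if the user consented."""
    test_record["research"] = bool(user_prefs.get("allow_anonymous_stats"))
    return test_record

def map_technologies(records, domain_tech_lookup):
    """Pluck marked records and attach the technologies detected on each domain."""
    for rec in records:
        if not rec.get("research"):
            continue  # skip records without consent
        detected = domain_tech_lookup.get(rec["domain"], set())
        rec["technologies"] = sorted(detected & KNOWN_TECHNOLOGIES)
        yield rec

records = [
    {"domain": "example.com", "research": True},
    {"domain": "other.org", "research": False},
]
lookup = {"example.com": {"wordpress", "nginx"}}
print(list(map_technologies(records, lookup)))
```

Note that the sketch carries the same caveat as the text: the mapping is per-domain, so a technology attached to a record may not actually be in use on the specific page that was tested.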

The data we currently track is limited to the following:

  • Errors per page
  • Tests passing per page
  • Tests failing per page
  • WCAG Conformance at each Level (A, AA, AAA)
  • Issue Density

Of the above list, we feel that some data points are more important than others. The most important item on that list is Issue Density, closely followed by Tests Failing per Page. The others are not especially useful in our view. For instance, Tests Passing can be misleading because it does not take into account the number of tests that are irrelevant on that page. WCAG conformance is a poor measure of accessibility, and Errors per Page is far less relevant than Issue Density. As we’ll discuss later, there’s a lot more data we can track that will enable us to make more informed judgments.
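To make the distinction concrete, here is a minimal sketch of the per-page metrics. The issue-density formula shown (issues per kilobyte of tested markup) is our reading of why density beats a raw error count – a huge page with 12 errors is in better shape than a tiny page with 12 errors – and the exact definition Tenon uses may differ.

```python
# Illustrative per-page metrics. The issue-density definition here
# (errors per KB of page source) is an assumption for demonstration.

def page_metrics(errors, tests_passed, tests_failed, source_bytes):
    kb = source_bytes / 1024
    return {
        "errors_per_page": errors,
        "tests_passing": tests_passed,
        "tests_failing": tests_failed,
        # density normalizes the error count by page size
        "issue_density": round(errors / kb, 2) if kb else 0.0,
    }

print(page_metrics(errors=12, tests_passed=80, tests_failed=5, source_bytes=48_000))
```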

Preliminary glimpse at usefulness

Among the most important things we realized is that comparing any specific technology against the Global data is not useful. Global data has massive deviation, whereas comparing similar technologies should be more reliable. To that end, we decided to take a look at content management systems.

The content management systems we decided to compare were:

Technology               | Rank on Web  | Rank on Alexa Top 10k | Popularity on Tenon
Adobe Experience Manager | #24          | #5                    | #5
Drupal                   | #6           | #2                    | #2
Kentico                  | #14          | #43                   | #9
Magento                  | below top 50 | below top 50          | #4
Sitecore                 | #20          | #6                    | #7
Telerik Sitefinity       | #4           | #24                   | #10
Zope                     | below top 50 | below top 50          | #11
Ektron                   | below top 50 | below top 50          | #8
WordPress                | #1           | #1                    | #1

For giggles, we decided to add two more: WordPress w/ JetPack and WordPress w/ Genesis Framework, since they’re also very frequently seen on Tenon.

Scoring

For each of the data points we listed earlier, we tracked:

  • Minimum
  • Maximum
  • Mean
  • Median
  • Mode
  • Range
  • Standard Deviation
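All seven of these summary statistics are straightforward to compute; a sketch for one data point (issue density across the pages seen for one CMS, with made-up sample values) looks like this:

```python
import statistics

# Sketch: the seven descriptive statistics listed above, computed for
# one data point across a sample of pages. The values are invented.
densities = [0.4, 0.9, 1.1, 0.9, 2.5, 0.7]

summary = {
    "min": min(densities),
    "max": max(densities),
    "mean": round(statistics.mean(densities), 2),
    "median": statistics.median(densities),
    "mode": statistics.mode(densities),        # most frequent value
    "range": max(densities) - min(densities),
    "stdev": round(statistics.stdev(densities), 2),  # sample std deviation
}
print(summary)
```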

Of those, we then took note of which technologies were: Best, Average or Better-than-average, Worse-than-average, and Worst for the Minimum, Maximum, Mean, Median, and Mode. From there, each CMS was scored based on the number of times they were placed in each category:

  • Best: 2 points
  • Average or better-than-average: 1 point
  • Worse than average: -1 point
  • Worst: -2 points
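The scoring scheme above reduces to a simple lookup-and-sum. The point values are taken from the list; the placement data below is invented for illustration.

```python
# Minimal reimplementation of the scoring scheme described above.
# Category labels and point values come from the post; the sample
# placements are made up.
POINTS = {
    "best": 2,
    "average_or_better": 1,
    "worse_than_average": -1,
    "worst": -2,
}

def score(placements):
    """Sum points over a CMS's category placements across all measures."""
    return sum(POINTS[p] for p in placements)

placements = ["best", "average_or_better", "average_or_better", "worse_than_average"]
print(score(placements))  # 2 + 1 + 1 - 1 = 3
```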

Final Scores and Ranking

The final rank and scoring are listed below, from best to worst:

  1. WordPress w/ Genesis Framework: 45 points
  2. Telerik Sitefinity: 38 points
  3. Zope: 30 points
  4. Ektron: 26 points
  5. WordPress w/ Jetpack: 24 points
  6. Kentico: 18 points
  7. Drupal: 14 points
  8. Adobe Experience Manager: 11 points
  9. WordPress: 7 points
  10. Sitecore: -17 points
  11. Magento: -21 points

Observations

This method of scoring seems to work well

Skimming through the data, the rankings seem to make sense. WordPress w/ Genesis Framework had the most instances where it scored best on the various data points and was never “Worst” or “Worse-than-average”. The converse is true for the lowest-ranked CMS, Magento, which was never “Best” and never “Better-than-average”. The distribution of ranks 2 through 6 makes sense, since their scores cluster around the averages, and the distribution of ranks 8 through 10 also makes sense, as they were consistently below average or worst in some areas. Adobe Experience Manager, for instance, was never “Best” and never “Worst”, but was closely split between “Worse-than-average” and “Average” in each measure.

Challenges

Sample Size

We have not yet seen enough of any of these technologies for this data to be statistically significant. The product we’ve seen the most is WordPress, and we’d like to see it roughly twice as many times before we can consider the data significant – and the more the better. The others haven’t been seen anywhere near enough times for us to have much confidence in the data.

Quality/experience of site developers is unknown

Although we know the overall popularity of the specific technologies, we aren’t yet tracking our own data closely enough to know the popularity ranking of any of the specific sites in our sample. For instance, just because WordPress is the most popular CMS among the Alexa top 10k, that doesn’t mean our sample includes any of those top 10k sites. This is important to keep in mind because a site’s popularity has some bearing on how professional its developers are likely to be. With a CMS like WordPress or Drupal, the site could’ve been built by a hobbyist, a small development shop, or a major company. With Adobe Experience Manager, it was almost assuredly professionally built, because the product is expensive. That doesn’t mean professional developers are any more likely to know much about accessibility, but the expectation is that their overall skill level would (hopefully) be higher and that they would be more familiar with the technologies they’re using.

Variations caused by templates/themes/plugins/modules

The accessibility of a website is more closely tied to its front end than to its back end. There is no inherent difference between the ASP.NET that drives Telerik Sitefinity and the PHP that drives Drupal or WordPress. Ultimately, each CMS delivers a website rendered in HTML, CSS, and JavaScript, and those technologies are what matter. The themes and templates developed and used by the site’s developers are where a lot of accessibility issues reside. Plugins and modules often modify the existing templates or may have templates of their own.

There’s no way for us to track that information in a way that is reliable or accurate. Developers might create their own templates from scratch, might use a theme starter, or might use a theme straight out of the box. Modern content management systems are so flexible that some might say that it is not reasonable for us to form any conclusions from this data. However, we think that over time we will be able to see patterns emerge as sample sizes increase. For instance, WordPress used to be rather well-known for putting unnecessary title attributes everywhere (Drupal still does). WordPress also has a bad habit of automatically linking to larger versions of images (the image link does not provide a way to add a text alternative for the larger view). Similar patterns may arise over time for other technologies.
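Patterns like the redundant title attributes mentioned above are exactly the kind of thing automation can surface as sample sizes grow. As a sketch, a trivial parser can count links carrying a title attribute; a real accessibility test would go further and compare the title text against the link text to flag redundancy.

```python
from html.parser import HTMLParser

# Sketch: counting title attributes on links, one of the CMS-specific
# patterns mentioned above. Illustrative only, not one of Tenon's tests.
class TitleCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titled_links = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a" and any(name == "title" for name, _ in attrs):
            self.titled_links += 1

html = '<a href="/post" title="My Post">My Post</a> <a href="/about">About</a>'
parser = TitleCounter()
parser.feed(html)
print(parser.titled_links)  # 1
```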

No indication of specific patterns or issue types by technology

Ultimately, this research would be much more useful if we were able to indicate what the most common errors are. For instance, when it comes to content management systems, if certain error types happen most frequently, potential customers can do their own research to determine whether those error patterns can be circumvented in other ways or are inherent in the system.

The limitations of automatic testing

No discussion of this type of research can escape the reality that there are limitations of what can be tested and how. Any given technology – especially with content management systems – could have accessibility issues that are undetectable via automation but completely block users with disabilities. This is especially true when it comes to error handling on forms or focus management issues, neither of which can be reliably tested via automation.

Nonetheless, with a large enough sample size, we hope to gather enough detail to warrant some reasonable assumptions, especially when it comes to outliers. By that, we mean that technologies that perform significantly better or worse than their peers in our automated testing are likely to perform similarly under manual testing. Fundamentally, automated accessibility testing is a test of quality. The extent to which a site fails automated testing is an indicator of how it would perform under manual review.

We don’t know if the technology itself is the direct cause of accessibility problems

While this particular comparison focused on content management systems, we’re also tracking scores of other technologies. In all cases, we can’t truly know whether a particular technology is causing the problems discovered. All we can know (with a sufficient sample size) is the correlation between errors and the technology.

Next steps

Prior to this focused comparison of content management systems, all we really had were numbers. We had no idea if any of this was just taking up disk space. Now, it looks like it might be useful as long as we can gather up enough data. A couple of things need to be improved.

Complete deletion of all research data and a fresh start

We’re very close to releasing a large pile of updates to Tenon’s accessibility tests. We will have almost 3x as many tests as we have now, more coverage for ARIA, and some important bug fixes. The extent of these changes means the new results would conflict with all of the current data, so the current data has to go to ensure accuracy.

Global statistics should only include those API stats for which the technologies are known

We didn’t include a comparison of these technologies vs. global data because the global data is all over the place. It includes data from partial page tests and data from domains for which we don’t know the technologies. As stated earlier, it also has a massive amount of deviation. On this latter point, there’s not much we can do: the data is what it is. We can, however, make the measurement fairer by only including tests of whole pages and only comparing logs that have known technologies.

Further granularity of data

Finally, we want to make the data more useful. Raw issue counts – even issue counts by WCAG Level – aren’t actionable. What types of problems are they? How bad are they? We have this data, but we aren’t tracking it in a usable way. We hope to be able to track the following:

  • Research stats by technology should have issue breakdown by priority
  • Research stats by technology should have issue breakdown by certainty chart
  • Research stats by technology should have issues broken down by content category
  • Research stats by technology should have issues broken down by specific test
  • Research stats by technology should have issue breakdown by WCAG SC
  • Research stats by technology should have top errors on domains that use that technology

Come back this summer for a post on E-Commerce Platforms

Once we’ve deployed our new tests and wiped the previous research data, we’re going to start anew, aggressively testing the Alexa Top 1 Million sites. We anticipate publishing another blog post in about 3 months, this time about E-Commerce Platforms.
