So, you want an accessibility score?

We’re often asked if the platform has the ability to give a “grade”. Currently, it does not, largely because I personally have hang-ups about how to do so in a way that accurately reflects how usable a product is for users with disabilities. Unless a grade accurately reflects how a system performs for people with disabilities, it has no real value beyond vanity metrics.

Creating a grade for something is extremely simple: Divide the “passed things” by the “total things”, multiply that quotient by 100, then apply the following grouping to the result:

A: 90-100

B: 80-89

C: 70-79

D: 60-69

F: 59 and lower

If you subject your web site to 20 accessibility tests and you pass 15 of them, you get a 75%, which falls under a C grade. Done.


In terms of grading something for accessibility, there are a ton of things wrong with the above idea.

What is the basis for measuring pass vs. fail?

Currently, most automated testing tools are unable to give a reliable score because they track nothing but failures. Most testing tools have no concept of passing other than by virtue of not failing. In other words, a “pass” condition is created either by not failing the test or by the test being irrelevant.

While there is value in getting a score based on the extent (or lack thereof) of your accessibility errors, it lacks context.

Getting a useful score requires knowing:

  1. What tests were relevant?
  2. Of those tests that were relevant, which ones passed, and which ones failed?

While some may claim that irrelevant things are a “pass”, I find this to be spurious logic. An irrelevant thing can neither pass nor fail because it doesn’t meet the criteria to do either.  To use a computer programming analogy, an irrelevant test would be `null` as an irrelevant thing cannot be `true` (pass) or `false` (fail).

We built this capability into Tenon and continue to move in this direction: each test has specific criteria that determine whether it is applicable, and specific instructions for determining whether the applicable portions have passed or failed. Without this, any “grade” supplied will be inaccurate.

What is the effect of user impact on the grade?

A raw pass-vs-fail score is fine if everything you’re testing for has the same impact, but accessibility is different: individual issues can have very different levels of impact on users.

This is very hard to gauge with automation. As I so often say when discussing overlays, it is easy to find images without text alternatives, but it is much harder to determine whether a text alternative is accurate and informative. To make things worse, in cases where the text alternative is wrong, how wrong is it? What is the negative impact of that wrong text alternative? Does it cause the user to miss important information that isn’t conveyed any other way on the page, or is its absence not really a big deal?

In addition, some issues impact multiple user types, and those impacts may also vary. How does that play into a score? Should the relative severity of the problem across user types be additive or multiplicative?

At the moment, we do not factor this into the Accessibility Grade but rather into the Prioritization score for each issue (Mortise and Tenon use the same Prioritization scheme). In other words, our approach has been to consider any issue that impacts a user as a failure; the Priority score is simply a measure of the urgency with which you should fix each issue, so that your remediation efforts have a high positive impact for users quickly. That said, I remain open to the idea that this portion of our priority scoring should be its own metric that contributes to the Accessibility Grade, but that brings its own set of challenges that I’ll skip for now.
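As a purely hypothetical sketch of the additive-vs-multiplicative question, here is one way a multiplicative priority could look. The factors and formula are my own illustration, not Tenon's actual Prioritization scheme:

```python
def priority(severity: float, affected_groups: int, ease_of_fix: float) -> float:
    """Hypothetical multiplicative priority: impact multiplied across
    severity and the number of affected user groups, scaled by how easy
    the fix is (so quick, high-impact wins sort first)."""
    return severity * affected_groups * ease_of_fix
```

A multiplicative scheme means an issue that is both severe and broad dominates the queue, whereas an additive scheme would let many minor factors pile up to the same urgency.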

Should we consider the volume of issues?

At its most basic, the more issues a system has, the lower its quality. In the context of accessibility, the same is true: the more accessibility problems a system has, the lower its accessibility grade should be. However, raw issue count isn’t useful without additional context. This is where Defect Density comes in. Quite simply, it takes into consideration the number of issues relative to the size of the page.

The logic for Density’s importance is pretty straightforward: a simple web page with a lot of issues is worse than a complex web page with the same number of issues. Imagine, for a moment, that you tested a small, simple homepage and got 100 issues, and then tested a much larger, more complex page and also got 100 issues. Based solely on issue count vs. page size, the homepage performs worse.

Tenon was the first accessibility testing tool to provide Density as a metric for Web Accessibility. In traditional QA, the Defect Density is based on the lines of code and is measured per 1000 lines of code (KLOC). Because Web pages may have many blank lines, we use the Kilobytes of source code as the comparison.
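Based on the description above, a density calculation might look like the following. The exact formula Tenon uses isn't given in this post, so treat the percentage scaling here as an assumption:

```python
def defect_density(issue_count: int, source_bytes: int) -> float:
    """Defect Density as issues per kilobyte of page source, expressed
    as a percentage. One plausible reading of the metric described above;
    the precise formula is an assumption."""
    kilobytes = source_bytes / 1024
    return issue_count / kilobytes * 100

# Same issue count, different page sizes: the smaller page is denser
small_page = defect_density(100, 100 * 1024)   # 100 issues in 100 KB
large_page = defect_density(100, 500 * 1024)   # 100 issues in 500 KB
```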

In practice, we’ve found a strong correlation between Density and usability: pages that exceed 50% Density are significantly more difficult for users to deal with. As density increases, so does the likelihood that users will be completely unable to use the content and features of that page, which raises the question of whether Density is the true metric on which we should base a grade.

Should we consider the comparison between pages?

At this point, Tenon has assessed millions of pages on the Web and logged tens of millions of errors. This is more than enough data to calculate any data point we want with a statistically significant sample size, a confidence level of 99%, and a confidence interval of 1. Given that, we could provide users with a comparison of their performance against all other Web pages ever tested.

One way to do that is to provide a grade based on the norm; put another way, a grade in comparison against all of the other pages that have ever been tested. One common example of this is grading “on a curve”.

Unfortunately, the “normal” page is pretty bad. Take a look at these error stats from Tenon’s data:

  • Min Errors: 0
  • Max Errors: 4841
  • Average Errors: 83
  • Min Density: 0%
  • Max Density: 460%
  • Average Density: 14.7%

In addition to the average of 83 issues per page, the average density of 14.7% suggests that most pages on the Web are quite bad. When it comes to grading for accessibility, it doesn’t seem useful to base a grade on a norm when that norm is, itself, not acceptable.
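To see why curve-based grading misleads, consider this sketch using the average density quoted above. A page with 10% density would grade "better than average" on a curve while still presenting dozens of real barriers:

```python
AVERAGE_DENSITY = 14.7  # average density from the stats above

def curved_assessment(density: float) -> str:
    """Relative judgment against the norm: flattering when the norm
    itself is unacceptable."""
    return "better than average" if density < AVERAGE_DENSITY else "worse than average"

# A clearly problematic page still beats the curve
print(curved_assessment(10.0))
```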

How do we score a project, as a whole?

There are several layers to consider in a scoring scenario:

  • The component: an individual feature of a page or application screen, such as its navigation.
  • The page: the entire page or application screen and all of its components.
  • The product: the entire collection of pages or screens that make up the product.

Getting a grade on a component (or, better, a series of components) is extremely useful in determining the urgency with which you need to make repairs. Getting a grade on a page is a bit less useful, in my opinion, without any specific means of identifying the “value” of the page. A per-page grade is, of course, simple, but an “A” grade on an inconsequential page is less important than getting “A” grades on pages that see the most traffic from users (including any pages with features, documentation, or help specific to users with accessibility concerns).

Identifying the relative importance of a page can be quite useful, though I’m not sure whether we’d want that as part of the grade or part of the priority. Adding the page’s importance to Priority would allow us to make smarter decisions on which errors should be fixed sooner whereas adding it to a score does not feel as useful.
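One hypothetical way to roll page grades up into a product-level score would be to weight by traffic, following the "pages that see the most traffic" idea above. The aggregation itself is my assumption, not anything Tenon ships:

```python
def product_score(pages) -> float:
    """pages: list of (page_score_0_to_100, visits) tuples.
    Traffic-weighted mean, so high-traffic pages dominate the grade."""
    total_visits = sum(visits for _, visits in pages)
    if total_visits == 0:
        return 0.0
    return sum(score * visits for score, visits in pages) / total_visits

# A busy page scoring 90 outweighs a rarely-visited page scoring 50
overall = product_score([(90.0, 900), (50.0, 100)])
```

A plain unweighted average of the same two pages would be 70; weighting by traffic moves the product score much closer to what most users actually experience.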

This assumes we have a complete set of relevant tests

Whether the assessment being run is automated or manual, the relevance of the grade is directly tied to the completeness and relevance of the test set. In the context of automated testing, it is already well known that automated testing tools cannot test for every possible accessibility best practice. It definitely pays to use a product that has a large number of tests; for example, Tenon has 189 tests in production. Using a product with fewer tests means you lose the ability to generate an accurate and relevant grade.

The target grade must be an “A”

Getting a grade that you can look at and immediately understand where your system stands regarding accessibility is an attractive idea. Provided you’re using the right data in the right ways, it should be relatively straightforward to get a grade that is useful.

Accessibility is too often seen as a compliance domain that needs to be tracked. As a result, organizations race to the bottom: their target becomes whatever bare-minimum grade they need to attain in order to stop being concerned about it. For instance, if an organization happens to regard a “B” as good enough, then that will be their target and they will pursue accessibility no further.

This approach to a “score” is misleading and dangerous. A score’s value should be solely in measuring your distance from a goal and that goal should be full compliance with WCAG.

Conformance to a standard means that you meet or satisfy the ‘requirements’ of the standard. In WCAG 2.0 the ‘requirements’ are the Success Criteria. To conform to WCAG 2.0, you need to satisfy the Success Criteria, that is, there is no content which violates the Success Criteria.

The at-a-glance ability to see a score and intuitively understand how far away you are from getting a perfect grade is super valuable. Getting a score and choosing a less-than-perfect grade as “good enough” is dangerous when it comes to Accessibility.

Ultimately there’s only One True Metric

There is, however, a much more important metric when it comes to measuring accessibility: Will users with disabilities *want* to use the product?

The WCAG standard itself states:

Although these guidelines cover a wide range of issues, they are not able to address the needs of people with all types, degrees, and combinations of disability.

The real measure requires interacting with the real users, watching them use your product, and asking them one of three questions:

  1. If you are not a current user of this product, would you want to use it?
  2. If you are a current user of this product, would you want to continue to use it?
  3. If you are a former user of this product, would you come back to use it?

Automated and manual testing are extremely useful in finding potential problems in your product. Only usability testing with real users can tell you if you’ve gotten it right.

Human Resources Concerns: Accessibility of Job Sites

One of the most impactful ways that we can work towards achieving total accessibility on the Web is by improving employment opportunities for people with disabilities. The online job application process should ensure ease of use and comprehension for all of your applicants. Website accessibility policies should always extend to your career pages, application forms, and the like.

If your job applicants experience any accessibility barriers during the online application process, you run the risk of violating Title I of the Americans with Disabilities Act (ADA), which prohibits employment discrimination against qualified individuals with disabilities.


Tips & Tricks for Testing Accessibility with Assistive Technologies

There are many ways to perform testing for accessibility, each with its own strengths and weaknesses. Testing with assistive technologies is a great way to get a clear understanding of how your system behaves for real users – assuming that the tester is able to effectively use the assistive technologies they’re testing with. As a kickoff point, here are some valuable tips and tricks we’ve discovered during our own testing experiences.


Important change to API2 contract

There is an important, breaking change coming soon to the contract for Version 2 of our Test API that we want to alert users to regarding the callbackUrl parameter.

Per the current documentation, callbackUrl is “(an)URL at which results will be POSTed to when testing has been completed”. Real-world usage shows that this feature could be improved and, while we normally avoid breaking changes, we’ve decided to do so in this case in order to avoid confusion for future users of the API. The following details what will be changed:

  1. The callback will be run twice: when the initial POST is made to the API and when the accessibility testing is complete. (Though not documented, this is happening now)
  2. callbackUrl is going away and will be replaced with callback
  3. callback must be an object. For example:

    "callback": {
        "url": "",
        "method": "POST",
        "headers": {
            "X-My-Header": "Value here"
        }
    }
  4. callback, if supplied, must be an object. If it isn’t an object, the API will return a 400 response
  5. callback, if supplied, must contain url. All other properties
    are optional
  6. If callback.method is not supplied, it will default to POST
  7. callback.method will support POST, PUT, PATCH, and DELETE methods.
  8. If Content-Type header is not supplied, we will default it to application/json
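The rules above can be sketched as a small helper that assembles a valid `callback` object. This is my own illustration of the new contract, not official client code; the example URL is a placeholder and the header default simply mirrors what rule 8 says the API will do:

```python
import json

ALLOWED_METHODS = {"POST", "PUT", "PATCH", "DELETE"}  # rule 7

def build_callback(url: str, method: str = "POST", headers: dict = None) -> dict:
    """Assemble a callback object satisfying rules 3-8 above.
    `url` is required; method defaults to POST (rule 6)."""
    if method not in ALLOWED_METHODS:
        raise ValueError("unsupported callback method: " + method)
    headers = dict(headers or {})
    headers.setdefault("Content-Type", "application/json")  # rule 8 default
    return {"callback": {"url": url, "method": method, "headers": headers}}

# Hypothetical endpoint, serialized for the request body
payload = json.dumps(build_callback("https://example.com/hook"))
```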

If you’re currently using callbackUrl in your API2-consuming code, the only thing you need to change is:

"callbackUrl": ""

to:

"callback": {
    "url": ""
}

We anticipate that these changes will be deployed to production in one week; however, the timeline depends on changing some of our own existing code in other areas to support this. To be notified when these changes are deployed to production, give us a shout at

Introducing HTML email testing

If you do email marketing, you know how hard-won each of your email subscribers is. Each email address represents a person who is interested in your product and is either a paying customer or a potential customer. Think of how upset you’d be if you lost 15% of your subscribers instantly. That’s what’s happening when your emails aren’t accessible!


Start your free trial of Tenon today!