Morgan Ames
Distributed Mentor Project 2003

Final Report

NOTE: A paper including research conducted after my summer internship officially ended was submitted to CHI 2004 in October 2003. Check my research page for more information!
Usability studies are an important part of the software development process [15]. Some applications are built for groups of people who are in remote or distributed locations. The travel involved in doing in-person evaluation with the remote users of these systems is often prohibitively costly. One common alternative is to find local "representative users" who match the demographics of the remote users, and use them as participants in local usability studies. However, this may not be feasible if the interface is for specialists, or if the culture of the target users is sufficiently different from the local culture. In these cases, remote usability studies, where the study practitioner and participant are not co-located but interact either directly or indirectly over a network, could be conducted instead.

Remote usability studies have the benefit of often being in the participant's normal setting, thus giving a more realistic test of the interface. In addition, they can lower the costs of evaluation and can potentially provide data from large numbers of participants [11, 12, 16, 17]. However, a remote study practitioner lacks subtle cues and contextual information such as facial expressions [11, 16], and it can be more difficult to establish trust between practitioner and participant remotely than it is in person [2].

Remote usability studies can be divided into synchronous and asynchronous studies. Synchronous remote usability studies simulate local studies, where a participant and study facilitator are directly communicating in real-time. In contrast, in asynchronous studies participants provide comments or consent to have their actions logged, and have little or no contact with those running the study.

Though synchronous studies are often done in the field, they have not been as well researched as asynchronous studies. In this paper, we will present one step toward verifying that synchronous remote studies produce results comparable to local studies, and outline possible future work in the area. First, to frame our study we will discuss other research that has been done in both synchronous and asynchronous remote usability. Then we will describe the method of our study. We will follow that with a presentation and discussion of the results, and then discuss future directions of the work.
Related Work
Remote usability encompasses a variety of practices including synchronous video conferencing (simulating a local study as closely as possible), asynchronous user-initiated critical-incident reporting, remote questionnaires, and automated data collection and mining. As with local usability studies, remote usability studies can vary in many ways: participants and tasks can be real or representative, problems can be identified by participant or evaluator, the study location can be controlled or real, and the equipment used and types of data collected can vary [4].

For the purposes of framing our study, we choose to separate remote usability methods by synchronicity. We will first briefly survey the research areas in asynchronous remote evaluation methods, including critical incident reporting and automated data collection. Then we will discuss the work done in synchronous remote evaluation methods, and outline the motivation for this study.

Asynchronous remote usability studies separate the study facilitators from participants in time as well as space, and use a variety of automated logging options and participant-initiated feedback tools. Many methods are especially well-suited to webpage evaluation [6, 15, 16, 17]. There have been a number of studies done on various forms of asynchronous techniques.

User-initiated reporting is one popular form of asynchronous studies, enabling study participants to provide subjective comments while they use the system in question. Dunckley et al. [7] have shown that using a written think-aloud (or "write-along") method in an asynchronous remote study provides similar results to think-aloud in synchronous remote studies, in terms of the average number of usability problems found per participant and participant satisfaction.

Hartson et al. compared user-initiated critical-incident reporting, in which users identify and report problems themselves, to expert analysis of recorded usability studies, and found that experts identified few problems beyond those the participants reported [13]. They also found that critical-incident reporting is further bolstered by including contextual information such as screenshots with a report [5].

Automated data collection is another popular form of asynchronous remote usability studies, and has the benefit of collecting data from a potentially large number of users. Systems such as WebQuilt [15], WebVIP and VisVIP [18], Vividence [16], and NetRaker [6] provide tools to facilitate automated data collection and analysis. However, various researchers [19, 20, 14] have found that data collected automatically can be difficult to contextualize and is often too low-level to be useful, and suggest that it should supplement other study techniques.

Synchronous remote usability studies are accomplished with a variety of media, including telephone, recorded audio or video, networked audio or video, online text chat, and screen-sharing. Common tools for synchronous remote usability studies include VNC and other screen-sharing programs; WebEx, NetMeeting and other conferencing tools; and WebCat, a web-based category testing tool [11, 16, 18]. In contrast to the large number of comparative studies done on a variety of asynchronous methods, we are aware of only one comparative study of synchronous methods.

Hartson et al. [13] compared synchronous remote evaluation from a satellite usability lab via telephone, scan-conversion screensharing, and automated data capture with local (next-room) evaluation on the Kodak website. The eight participants in the study, who had no prior experience with web browsers, completed five tasks and answered nine Likert-scale questions while an evaluator noted usability problems. Hartson et al. found no significant difference in number of problems found or participant feedback between the two conditions.

We find it important to follow up on this work for two reasons. First, we conducted same-room local studies, using the Boren-Ramey think-aloud method [1], rather than next-room local studies, which in themselves add an element of remoteness. Second, in our remote studies we did not use any special hardware as Hartson et al. did, instead using off-the-shelf screensharing software and telephones.

Method
All studies evaluated a new graphical interface for UrbanSim [21], an open-source simulation tool for urban planners being developed at the University of Washington. We chose to do simulated remote studies in order to study the importance of certain characteristics of remote studies, namely communication over the telephone and screensharing, without the added complications of varying computer configurations, network speeds, and participant locations. To control for facilitator variation, the same facilitator was used for all studies in this set.

We recruited nine computer science students from the University of Washington and researchers from Intel Research, screened to have very little or no experience with UrbanSim's platform, Eclipse [9]. Four participants were randomly selected to start with a local study, and the other five started with the simulated remote study. Participants were provided $20 cash as compensation for two 1- to 1.5-hour studies.

Both the local and simulated remote studies took place in our labs. In the local study, the facilitator and observer were in the same room as the participant. In the simulated remote study, each participant communicated with the study facilitator, who was in another room in the same building, via conference telephone and shared their computer screen with Glance [10], a VNC-based screensharing program. The participant was greeted by another researcher, and had no face-to-face contact with the facilitator at any point during the remote study.

For all studies, pre-study setup and post-study debriefing were read from a script prepared during pilot runs, to ensure consistency between studies. At the beginning of each study, the facilitator greeted the participant and gave him/her a consent form to sign. Then the facilitator, following the Boren-Ramey think-aloud protocol [1], set the expectations of the study and coached the participant in how to think aloud, first providing a description and short example, and then having the participant practice on an easy sample task. The participant was then asked to complete an online pre-study survey and to signal to the facilitator when finished.

The facilitator then invited the participant to imagine that he/she was a technical intern for an urban planner who is investigating UrbanSim, and gave him/her a number of tasks, presented one at a time, to perform on UrbanSim and related software such as Microsoft Access. We had two sets of tasks: one for the first study, and one for the second. In each study, an observer helped the facilitator take notes on the participant's actions during these tasks, and the participant's voice and computer screen were recorded by a tape recorder and Camtasia Recorder [3], respectively.

During the tasks, the facilitator gave minimal responses in the form of "uh-huh" or "mm-hmm" to indicate attentiveness. Participants were reminded to continue thinking aloud with "and now?" if they fell silent for at least three seconds. If they expressed significant frustration, the facilitator intervened with a suggestion such as "Could I ask you to read the task description again?" If the system crashed, the facilitator explicitly suspended the study, resolved the problem, and continued the study by suggesting a starting point and explicitly resuming think-aloud. All reminders, interventions, and suspensions were recorded.

After the participant completed the tasks, the facilitator invited him/her to reflect on where each task was particularly difficult, and what could be improved. The participant was then asked to complete a post-study survey. In the second study, the participant was also asked to complete a survey comparing the local and remote studies. After these surveys, the participant was given final information and contact information for the researchers, and was given compensation for participating.

Results
The nine participants found, on average, nearly the same number of problems in the local study (4.3 per participant per study) and in the remote study (4.5 per participant per study; p=.87). An independent-samples t-test shows that mean task times, number of interventions, and number of suspensions were also very similar, as shown in Table 1. Table 2 shows that more participants successfully finished each task in the local case (85% vs. 65%, p=.02), and there were more preemptive finishes (where participants thought they were finished but were not) and slightly more system crashes in the remote case.

Variable Local mean Remote mean p
Startup time 6.0 6.0 1.0
Time per task 12.1 12.9 .81
Discussion time 6.2 8.1 .27
Total problems found 4.3 4.5 .87
Interventions .85 1.2 .57
Suspensions .15 .17 .87
Table 1. Average times, problems, and facilitator disruptions for local and remote tasks.
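The per-variable comparisons in Table 1 come from independent-samples t-tests. As a minimal sketch of that computation, the following implements Welch's t statistic in plain Python; the per-participant problem counts below are hypothetical stand-ins for illustration, not the study's raw data:

```python
import math

def welch_t(a, b):
    """Welch's independent-samples t statistic and Welch-Satterthwaite
    degrees of freedom for two lists of per-participant measurements."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    # Sample variances (n - 1 denominator).
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se2 = var_a / na + var_b / nb  # squared standard error of the difference
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / ((var_a / na) ** 2 / (na - 1)
                     + (var_b / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-participant problem counts, for illustration only:
local = [4, 5, 3, 6, 4, 5, 4, 3, 5]
remote = [5, 4, 4, 6, 5, 3, 5, 4, 5]
t, df = welch_t(local, remote)
```

In practice a library routine such as scipy.stats.ttest_ind (with equal_var=False) would give the p-value directly; the point here is only the shape of the computation.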

Finish status Local Remote p
Successful finish 85% 65% .02
Preemptive finish 11% 26% .28
System crash 3.7% 8.7% .56
Table 2. Percent of local and remote participants who finished successfully, finished preemptively, or were unable to finish because of a system crash.
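The paper does not state which test produced the p-values for the finish-rate comparison in Table 2; one common choice for comparing two proportions is a pooled two-proportion z-test, sketched below. The task counts here are hypothetical, chosen only to roughly match the reported 85% and 65% success rates:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z statistic for comparing two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pool the successes under the null hypothesis of equal rates.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: 46 of 54 local task attempts finished successfully
# versus 35 of 54 remote attempts (roughly 85% vs. 65%).
z = two_proportion_z(46, 54, 35, 54)
```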

Table 3 shows that participants who filled out the after-study survey did not strongly prefer one study to the other in terms of convenience, distractions, comfort talking, ease of think-aloud, contribution to the redesign of UrbanSim, and willingness to do a similar study in the future. Seven remote participants were "very comfortable" and one was "somewhat comfortable" sharing their computer screen during the remote study. However, 4 of the 9 participants found it easier in the local study to recall and discuss the tasks after the study, and 3 felt that the evaluator was more interested in what they were saying during the local study.

In the survey, we also asked participants what kind of monitoring they would accept during a remote study. Their answers are summarized in Table 4.

Question Local mean Remote mean p
Do you feel like you've contributed something important to the redesign of the UrbanSim interface in this study? (1="yes", 2="no") 1.2 1.0 .39
How comfortable were you with talking to the evaluator on the phone during this study? (1="very comfortable", 2="somewhat", 3="not very", 4="not comfortable") 1.4 1.6 .72
How easy was it to remember to keep 'thinking aloud' during this study? (1="very easy", 2="somewhat", 3="not very", 4="not easy") 1.7 2.0 .59
In the discussion at the end of the study, how easy was it to remember what you were thinking during each task? (1="very easy", 2="somewhat", 3="not very", 4="not easy") 2.1 2.6 .18
How convenient was this usability study for you, in terms of time spent, effort traveling, and setup? (1="very convenient", 2="somewhat", 3="not very", 4="not convenient") 1.4 1.4 1.0
How distracting was the environment where you did this usability study, in terms of difficulty in establishing and maintaining concentration on the task? (1="very distracting", 2="somewhat", 3="not very", 4="not distracting") 3.6 3.7 .75
How willing would you be to do a usability study of this kind in the future, either for UrbanSim or for other projects? (1="willing", 2="willing, with some reservations", 3="not very willing, but could", 4="not willing") 1.4 1.4 1.0
Table 3. Results from the post-local and post-remote surveys.

Type of monitoring Consent (N=9)
Recording the telephone call 8 (89%)
Remotely viewing your computer screen (as in this study) 9 (100%)
Recording your computer screen 9 (100%)
Recording input events (mouse movements, keystrokes) 8 (89%)
Eye tracking 6 (67%)
Recording your screen and hands with a video camera 7 (78%)
Recording your facial expressions with a video camera 5 (56%)
Table 4. Remote participants were asked, "If you participated in a remote study in the future, what kinds of monitoring during the study would you accept?"

Discussion
We found no significant differences in the number of usability problems found between local and simulated-remote studies. Participants finished tasks successfully more often in the local study, found it easier to recall the tasks afterward, and thought the facilitator was more interested in what they were saying, but these differences did not affect the number of problems found, study times, or participant preferences between remote and local studies. We plan to follow this study with a comparison of local and truly remote studies, to find out whether our simulation of remoteness was realistic.
References
  1. Boren, T., and Ramey, J. Thinking Aloud: Reconciling Theory and Practice. IEEE Transactions on Professional Communication, Special issue on usability research methods, pages 261-278. September 2000.
  2. Bos, N., Olson, J., Gergle, D., Olson, G., and Wright, Z. Effects of Four Computer-Mediated Communications Channels on Trust Development. In Proceedings of Conference on Human Factors and Computing Systems, 2000.
  3. Camtasia. TechSmith Corporation.
  4. Castillo, J.C., Hartson, H.R., and Hix, D. Remote Usability Evaluation at a Glance. Technical Report TR-97-12, Computer Science, Virginia Polytechnic Institute and State University. 1997.
  5. Castillo, J.C., Hartson, H.R., and Hix, D. The User-Reported Critical Incident Method At A Glance. Technical Report TR-97-13, Computer Science, Virginia Polytechnic Institute and State University. 1997.
  6. van Duyne, D., Landay, J.A., and Tarpy, M. NetRaker Suite: A Demonstration. In Extended Abstracts of Conference on Human Factors and Computing Systems, pages 518-519. 2002.
  7. Dunckley, L., Rapanotti, L., and Hall, J.G. Extending Low-Cost Remote Evaluation with Synchronous Communication. In Proceedings of HCI, September 2002.
  8. Ebling, M.R., and John, B.E. On the Contributions of Different Empirical Data in Usability Testing. In Proceedings of DIS, August 2000.
  9. Eclipse Project.
  10. Glance Networks.
  11. Gough, D. and Phillips, H. Remote Online Usability Testing: Why, How, and When to Use It.
  12. Hammontree, M., Weiler, P., and Nayak, N. Remote Usability Testing. Interactions, pages 21-25. July 1994.
  13. Hartson, H.R., Castillo, J.C., Kelso, J., and Neale, W.C. Remote Evaluation: The Network as an Extension of the Usability Laboratory. In Proceedings of Conference on Human Factors and Computing Systems, pages 228-235. 1996.
  14. Hilbert, D.M., and Redmiles, D.F. Separating the Wheat from the Chaff in Internet-Mediated User Feedback. ACM SIGGROUP Bulletin, Volume 20, Issue 1, pages 35-40. April 1999.
  15. Hong, J.I., Heer, J., Waterson, S., and Landay, J.A. WebQuilt: A Proxy-based Approach to Remote Web Usability Testing. In ACM Transactions on Information Systems. 2001.
  16. Olmsted, E. and Horst, D. Remote Usability Testing: Practices and Procedures. Workshop at Usability Professionals Association conference, June 2003.
  17. Ratner, J. Learning About the User Experience on the Web With the Phone Usability Method. Human Factors and Web Development, 2nd edition. October 2002.
  18. Scholtz, J. Adaptation of Traditional Usability Testing Methods for Remote Testing. In Proceedings of Hawaii International Conference on System Sciences. January 2001.
  19. Tamler, H. High-Tech vs. High-Touch: Some Limits of Automation in Diagnostic Usability Testing. User Experience, pages 18-22. Spring/Summer 2003.
  20. Tullis, T., Fleischman, S., McNulty, M., Cianchette, C., and Bergel, M. Empirical Comparison of Lab and Remote Usability Testing of Web Sites. In Proceedings of Usability Professionals Association Conference, July 2002.
  21. Waddell, P. UrbanSim: Modeling Urban Development for Land Use, Transportation and Environmental Planning. Journal of the American Planning Association, Vol. 68 No. 3, pages 297-314. Summer 2002.