The introduction and success of volunteered geographic information (VGI) have attracted the interest of national mapping agencies (NMAs). NMAs are looking to this volunteered information to maintain their topographic databases; however, their main concern is the quality of the data.
Globally, the demand for spatial data is rising while official mapping is declining, and NMAs are therefore constantly looking for ways to improve their mapping processes. Instead of competing with open initiatives, NMAs are realising that these initiatives present a great opportunity for collaboration. This study focuses on the integration of VGI into authoritative data in South Africa. OpenStreetMap (OSM) was chosen as the source of VGI because its vector data is easy to obtain at no cost. The Chief Directorate: National Geo-spatial Information (CD:NGI) was chosen as the source of authoritative geo-spatial data. The feasibility of integrating the two data sets was determined by assessing the quality of OSM data against the CD:NGI geo-spatial data standards.
For this investigation, the internal quality was measured by assessing the positional and semantic accuracy of roads, the geometric accuracy of amenity buildings and the completeness of roads. Internal quality refers to what the data is really like and does not depend on the perception of the user.
This article is divided into seven main sections. Section two highlights previous investigations into the quality of OSM data. Section three presents some of the problems that hinder the integration of VGI into authoritative data. The methodology used to assess the quality of OSM road and building data is given in section four. In section five the results are presented. Sections six and seven contain the implications and conclusions of the results, respectively.
Previous quality assessments of OpenStreetMap data
In 2009, one researcher investigated the positional accuracy of the OSM road network by comparing it to different authoritative data sets. Thereafter, other researchers assessed further quality elements, including completeness, geometric accuracy and attribute accuracy [5, 6]. The varied results obtained in these studies show that the geographic location and extent, the reference data set, the method and execution of testing, and the date of the OSM data extraction influence the results considerably. However, generally it was concluded that:
Two quality assessments of the South African OSM data were done in 2012 [8, 9]. However, the motivation for this study was to compare the OSM data specifically to the CD:NGI data and to increase the size and number of test areas.
Integrating VGI and authoritative data
Many researchers believe that any level of VGI ingestion will reduce map production costs. Others point out that there are costs involved in re-aligning the workflow to incorporate VGI, as well as other costs such as the internet access needed to obtain the latest updates and the training of staff in VGI usage. Another challenge is government mapping agencies’ acceptance of spatial data produced by non-experts. The public generally deems authoritative spatial information to be without error, even if this is not necessarily the case. The opposite holds for most, though not all, NMAs: they do not deem volunteered information to be of high quality. It has been said that government mapping agencies are reluctant to accept volunteered information because of the legal implications of disseminating incorrect information to the public.
Technical issues to consider when integrating VGI into NMA data include different reference systems, different representations of features, the handling of duplicate features and different file formats. The physical structure of spatial databases and naming conventions must also be considered. This discussion is limited to data structure issues.
There are distinct differences between the data models of authoritative and volunteered data, both in the way the data is acquired and stored and in the policies and standards governing the data. The differences between the CD:NGI and OSM data models are listed in Table 1.
Examples of integrating VGI and authoritative spatial data
Methodology for determining the quality of OSM data in South Africa
The aim of this investigation is to determine how well the OSM data compares with the CD:NGI data quality requirements. The CD:NGI standard for the Capture of Topographic Data states: i) features captured by photogrammetric methods must have a positional accuracy not exceeding 10 m at the 95% confidence level, and ii) features shall be correctly classified at the 90% confidence interval. The standard does not include accuracy statements for the other quality aspects. The OSM data was thus assessed with respect to the positional accuracy and semantic accuracy. In addition, the completeness of roads was assessed. Due to time and technical constraints, other quality elements (e.g. thematic accuracy) were not investigated. This section presents the method for determining the positional accuracy and semantic accuracy of roads, the geometric accuracy of amenity buildings and the completeness of roads.
Data sources and data preparation
The OSM data sets were obtained in shapefile format and covered the period from October 2006 to April 2012. The reference vector data (in shapefile format) and raster data were obtained from the CD:NGI. In order to overlay the OSM data on the CD:NGI data, the OSM data was projected to the Transverse Mercator projection. Twenty-seven test sites were chosen throughout South Africa, three per province (see Fig. 1). Each test area represented one settlement category: high urban density, low urban density or commercial. Because the quality assessment is used to determine the feasibility of integrating the CD:NGI and OSM data, the OSM data had to be preprocessed before the quality tests so that it resembled the CD:NGI data. The sections below describe the preprocessing of the OSM data.
Computing the positional accuracy of roads
The method by Goodchild & Hunter was chosen to compute the positional accuracy of OSM road features because: i) no assumption is made about the accuracy of the test data and ii) it is insensitive to outliers. According to this method, the positional accuracy may be determined by the percentage of the OSM road length that lies within a buffer around the corresponding CD:NGI road feature. For this investigation, the buffer width was set to the CD:NGI stated positional accuracy of 10 m (see Fig. 3).
Computing the geometric accuracy of polygons
The Hausdorff distance was computed to find the positional accuracy of polygons. The Hausdorff distance gives the “maximum distance of a set to the nearest point in the other set”. In addition, the ratio of the areas and the compactness were computed for shape comparisons:
Compactness measures how regular a polygon is, in other words, how much a polygon deviates from a predefined shape (e.g. a square):
Incorrect polygon matches were identified by a large Hausdorff distance. Each outlier was inspected manually in order to confirm an incorrect match.
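As an illustration of the three polygon measures, the sketch below computes a Hausdorff distance, an area ratio and a compactness value for two matched building outlines in projected coordinates (metres). It is not the implementation used in this study. Three assumptions are made: the Hausdorff distance is approximated from the vertices of one outline to the boundary of the other (the true value can be larger when the farthest point lies mid-edge), the area ratio is taken as OSM area divided by CD:NGI area, and compactness is taken to be the isoperimetric quotient 4πA/P², which may differ from the compactness measure cited above. All function names and the sample squares are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p on the line, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def ring_edges(ring):
    """Edges of a closed ring given as a list of vertices (first vertex not repeated)."""
    return list(zip(ring, ring[1:] + ring[:1]))


def directed_hausdorff(ring_a, ring_b):
    """Largest distance from any vertex of ring_a to the boundary of ring_b."""
    edges_b = ring_edges(ring_b)
    return max(
        min(point_segment_distance(p, a, b) for a, b in edges_b) for p in ring_a
    )


def hausdorff(ring_a, ring_b):
    """Symmetric, vertex-based approximation of the Hausdorff distance."""
    return max(directed_hausdorff(ring_a, ring_b), directed_hausdorff(ring_b, ring_a))


def area(ring):
    """Polygon area from the shoelace formula."""
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in ring_edges(ring))) / 2.0


def perimeter(ring):
    return sum(math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in ring_edges(ring))


def area_ratio(osm_ring, ref_ring):
    """Assumed convention: OSM area divided by reference (CD:NGI) area."""
    return area(osm_ring) / area(ref_ring)


def compactness(ring):
    """Assumed measure: isoperimetric quotient, 1.0 for a circle, pi/4 for a square."""
    return 4.0 * math.pi * area(ring) / perimeter(ring) ** 2


# Hypothetical matched pair: a 10 m square and the same square shifted 3 m east.
ref_building = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
osm_building = [(3.0, 0.0), (13.0, 0.0), (13.0, 10.0), (3.0, 10.0)]

print(hausdorff(osm_building, ref_building))
print(area_ratio(osm_building, ref_building))
print(compactness(osm_building), compactness(ref_building))
```

With this convention, an area ratio near one means the two outlines occupy similar areas, and a pair with an unusually large Hausdorff distance would be flagged for manual inspection as a possible incorrect match.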
Computing the semantic accuracy of roads
Semantic accuracy refers to how well volunteers are able to name or identify the features they digitise. In terms of the CD:NGI and OSM road classes, the semantic accuracy is determined by how often a feature is classified as the same type of feature in both data sets. The class names differed; therefore, it was necessary to determine which road classes defined the same type of features between the two data sets. The biggest data set, the Western Cape commercial data set, was chosen to determine how many times an OSM road class matched a certain CD:NGI road class for corresponding road features. Based on these results as well as the road class definitions, three road matches were selected (see Table 2). The positive road class matches were then used as a standard to assess the semantic accuracy of the other test areas.
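The class-matching procedure can be sketched as follows, assuming each matched road feature is available as a pair of (CD:NGI class, OSM class). The cross-tabulation corresponds to counting how often each OSM class coincides with each CD:NGI class, and the accepted pairs play the role of the positive road class matches. Only the pairing of “street” with the OSM residential class is taken from this article; the other class names, the sample data and the function names are hypothetical.

```python
from collections import Counter


def class_cross_tab(pairs):
    """Count how often each (reference class, OSM class) combination occurs."""
    return Counter(pairs)


def semantic_accuracy(pairs, accepted):
    """Percentage of features per reference class whose OSM class is the accepted match.

    pairs    -- iterable of (reference_class, osm_class) for matched road features
    accepted -- dict mapping a reference class to the OSM class regarded as equivalent
    """
    totals, hits = Counter(), Counter()
    for ref_class, osm_class in pairs:
        if ref_class not in accepted:
            continue  # class without an agreed equivalent is left out of the score
        totals[ref_class] += 1
        if accepted[ref_class] == osm_class:
            hits[ref_class] += 1
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


# Hypothetical matched features for one test area.
pairs = (
    [("street", "residential")] * 8
    + [("street", "tertiary")] * 2
    + [("main road", "primary")] * 1
    + [("main road", "residential")] * 3
)
accepted = {"street": "residential", "main road": "primary"}  # second pair is invented

print(class_cross_tab(pairs).most_common())
print(semantic_accuracy(pairs, accepted))
```

In this invented sample the “street” class scores 80% and the “main road” class 25%, which illustrates how a default tag such as residential can inflate one class while others remain low.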
Computing the completeness of roads
The completeness of a data set can be defined in terms of the omission of data and the presence of excess data. For this investigation, completeness refers to how much of the CD:NGI road network exists within the OSM data set for a given area. The completeness was computed by dividing the total length of the April 2012 OSM road data set by the total length of the CD:NGI road data as at April 2012.
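The length ratio described above can be written as a short sketch. It assumes both road data sets are lists of polylines in the same projected coordinate system (metres); the function names and sample coordinates are hypothetical.

```python
import math


def polyline_length(coords):
    """Planar length of a polyline given as a list of (x, y) vertices."""
    return sum(
        math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


def completeness_percent(osm_roads, ref_roads):
    """Total OSM road length as a percentage of total reference road length."""
    ref_length = sum(polyline_length(r) for r in ref_roads)
    if ref_length == 0.0:
        raise ValueError("reference data set contains no road length")
    osm_length = sum(polyline_length(r) for r in osm_roads)
    return 100.0 * osm_length / ref_length


# Hypothetical test area: 450 m of OSM roads against 500 m of reference roads.
osm_roads = [[(0.0, 0.0), (300.0, 0.0)], [(0.0, 0.0), (0.0, 150.0)]]
ref_roads = [[(0.0, 0.0), (500.0, 0.0)]]

print(completeness_percent(osm_roads, ref_roads))
```

Because the measure compares total lengths only, a value above 100% is possible whenever the OSM data set contains more road length than the reference.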
Positional accuracy of roads
The CD:NGI roads were buffered and the percentage of matching OSM roads within these buffers was computed for each test area. These percentages were then used to compute weighted averages, first per province and then per settlement category. The feature count was used as the weight in order to remove the bias caused by differing numbers of features. The weighted percentage overlap for each province is presented in Table 3. The percentages for the nine provinces range from 64,8% to 94,3%. Five provinces have a percentage greater than 80% and four of these five are within 10% of the CD:NGI requirement. Considering the 5% error within the CD:NGI data, these four provinces may have an absolute accuracy of 95% or greater. The bigger and more developed cities tend to have the highest percentages. The location, which is linked to the number of features, also influences the average percentage overlap. The remaining non-overlapping percentages are due to incorrect digitising in the OSM data set, omission of feature detail in the CD:NGI data set or missing road sections in the OSM data set. The weighted percentage overlap per settlement category is shown in Table 4. The percentages for the commercial and residential categories are very similar, while the percentage for low urban density is about 10% lower. The OSM road positional accuracy is best described by the percentage overlap per province, e.g. 93% of the OSM roads in Gauteng are within 10 m of the CD:NGI roads.
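A minimal sketch of the buffer comparison and the count-weighted average follows. It is an approximation under stated assumptions rather than the procedure used in this study: instead of constructing buffer polygons, the OSM line is cut into short pieces (1 m by default) and each piece is counted as inside the buffer when its midpoint lies within the buffer width of the reference line. Coordinates are assumed to be projected (metres), and all names and sample values are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_polyline(p, line):
    """Shortest distance from point p to any segment of the polyline."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def percent_within_buffer(test_line, ref_line, width=10.0, step=1.0):
    """Approximate percentage of test_line length within `width` of ref_line."""
    inside = total = 0.0
    for (ax, ay), (bx, by) in zip(test_line, test_line[1:]):
        seg_len = math.hypot(bx - ax, by - ay)
        if seg_len == 0.0:
            continue
        n = max(1, math.ceil(seg_len / step))  # number of pieces on this segment
        piece = seg_len / n
        for i in range(n):
            t = (i + 0.5) / n  # midpoint of piece i
            mid = (ax + t * (bx - ax), ay + t * (by - ay))
            if distance_to_polyline(mid, ref_line) <= width:
                inside += piece
            total += piece
    if total == 0.0:
        raise ValueError("test line has zero length")
    return 100.0 * inside / total


def weighted_mean(values, weights):
    """Mean of per-area percentages weighted by, for example, feature counts."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical road pair: the OSM line drifts away from the reference halfway along.
ref_road = [(0.0, 0.0), (100.0, 0.0)]
osm_road = [(0.0, 5.0), (50.0, 5.0), (50.0, 25.0), (100.0, 25.0)]
print(percent_within_buffer(osm_road, ref_road, width=10.0))

# Hypothetical test-area percentages weighted by their feature counts.
print(weighted_mean([90.0, 60.0], [300, 100]))
```

Weighting by feature count means that a test area with few roads cannot dominate the provincial or category average, which is the bias the weighting is meant to remove.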
Geometric accuracy of polygons
Only seven out of the twenty-seven test areas had OSM polygon data representing amenities. Fig. 4 shows the average Hausdorff distances for each test area. The average Hausdorff distances for the commercial test areas range from 9,90 m to 22,03 m with a weighted average of 11,29 m. The averages for the residential areas range from 11,34 m to 17,36 m with a weighted average of 12,54 m. Considering the size of buildings, these deviations from the 10 m requirement are not significant.
The compactness values were computed for every pair of corresponding polygons. The averages per province are presented in Table 5. The commercial compactness averages for both the CD:NGI and OSM polygons are mostly consistent between provinces. This is reflected by the low standard deviations of 0,02 and 0,03 for CD:NGI and OSM, respectively. There is a difference of 0,01 between the CD:NGI and OSM weighted averages. The polygons therefore have a similar compactness. The residential averages for the CD:NGI vary more between provinces than the commercial averages, with a greater standard deviation of 0,12. The OSM residential averages are somewhat more consistent, with a small standard deviation of 0,04. The weighted average difference is higher, with a value of 0,18. The CD:NGI polygons have a greater average compactness, thus the OSM polygons are less compact in residential areas. The CD:NGI weighted average compactness for commercial and residential polygons differs by 0,01, which means the CD:NGI compactness is generally consistent.
Table 5 also compares the area ratios for the commercial and residential areas. The area ratios range from 0,97 to 1,37 for the commercial category and from 0,97 to 1,34 for the residential category. The similarity in range is reflected in the equal standard deviations of 0,18. The weighted average ratios differ slightly between the two categories, with 1,06 for commercial and 1,27 for residential areas. Because the commercial weighted average is close to one, it can be said that the CD:NGI and OSM polygons occupy similar areas. Polygons in residential areas are slightly less similar in area.
Semantic accuracy of roads
Semantic accuracy is expressed here as the number of matches between CD:NGI and OSM road classes, which is presented in Table 6. The weighted means were computed for each road class within the three settlement categories. As before, the weight was used to remove the bias of different counts. The percentage matches for commercial areas were found to be the lowest. In both the residential and low urban density categories, the averages for the “street” class (the residential class in OSM) are very high. As would be expected, the count for this road class is high for residential and low urban density areas. This does not necessarily mean that the percentage match should be high. One explanation for the high percentages for the “street” class is that residential acts as a default classification: it appears that most of the time people will classify a road as a residential road. For residential and low urban density areas, this just happens to be the correct classification. The low percentages for the commercial category provide a better indication of how often people classify roads correctly.
Completeness of OSM data
The completeness is based on the initial assumption that the CD:NGI data set is complete. Thus, in cases where the completeness percentage exceeds 100%, the assumption no longer holds true. Because no ground truth data was used, it cannot be said with certainty that the percentages above 100% imply omission in the CD:NGI data set. Instead, they imply commission (excess data) in the OSM data set. The three graphs in Figs. 5, 6 and 7 show the completeness of OSM data from October 2006 to April 2012. In all three graphs, the OSM data reaches a peak and then levels off. The commercial category had the most sites reaching their peaks during the periods 2009 to 2010 and 2011 to 2012. The residential category had the most sites reaching their peaks during the period 2011 to 2012, and the low urban density category during the period 2010 to 2011. This shows that OSM South Africa received the most contributions from 2010 to 2012. Perhaps more people were exposed to volunteer mapping during this period. Specific events, like the 2010 FIFA Soccer World Cup, may have motivated citizens to participate in volunteer mapping in South Africa. For commercial areas, three of the test areas did not reach a completeness of 100%, although two of the three had a maximum in the 93% to 96% range. The residential category had five sites with a maximum completeness below 100%. In the low urban density category, only two sites reached a completeness of 100% or above. The results show that less developed areas have a lower level of completeness. The completeness graphs also show that commercial areas received contributions much more quickly than the other two categories and therefore levelled off sooner.
Analysis and discussion
The OSM roads do not meet the CD:NGI accuracy requirements. However, the overlap percentages are high for most provinces considering i) that OSM does not enforce accuracy, ii) that data is being generated by non-professionals and iii) the methods of data collection. In terms of the settlement categories, the percentages for the residential and commercial areas are close to the CD:NGI quality requirements, while for the low urban density areas they are not.
The OSM semantic accuracy of roads is only high for one of the road categories, which gives a bad impression of the non-expert’s understanding of feature classification. Overall, the semantic accuracy of OSM data is not high. This confirms a previous statement that freedom in OSM tagging results in semantic interoperability problems. Enforcing pre-set attribute ranges would limit the users’ tagging options and could lead to more stable classification.
The average Hausdorff distances for the OSM amenity buildings compare well with the CD:NGI’s stated positional accuracy of 10 m. The commercial areas have a higher positional accuracy. The shape comparisons show that polygons in commercial areas compare well with the CD:NGI polygons, whereas those in residential areas are less consistent.
Commercial and residential areas generally have a high level of completeness and in many cases exceed the CD:NGI data set. Low urban density areas are in most cases not complete. The completeness of low urban density areas may increase as more people become involved with volunteer mapping.
The investigation has demonstrated that, for the purpose of integration, pre-processing of OSM data cannot be avoided and may be time-consuming, especially for larger areas. Moreover, the quality of OSM data was found to be heterogeneous across South Africa. More developed areas receive more contributions than low urban density areas.
The data meets only some of the CD:NGI quality requirements. This means that the integration of OSM data into the CD:NGI data may be complicated. The degree to which these complications can be overcome will determine the level of integration. An integration workflow has been proposed, but is not presented here. Another option is to use the OSM data only for detecting changes to the landscape. Future work will investigate the modalities of integrating VGI data into CD:NGI data.
This paper was presented at AfricaGEO 2014 and is republished here with permission.
[1] K McDougall: ‘The Potential of Citizen Volunteered Spatial Information for Building SDI’, in GSDI-11 Conference. Rotterdam, p. 10, 2009.
[2] R McLaren: ‘Between a rock and a hard location’, Geospatial World, 2(12), pp. 38–40, 2012.
[3] MA Mostafavi, G Edwards and R Jeansoulin: ‘An ontology-based method for quality assessment of spatial data bases’, in Third International Symposium on Spatial Data Quality. Austria, pp. 49–66, 2004.
[4] O Kounadi: Assessing the quality of OpenStreetMap data, University College of London, 2009.
[5] M Haklay and C Ellul: ‘Completeness in volunteered geographical information – the evolution of OpenStreetMap coverage in England (2008-2009)’, Journal of Spatial Information Science, 2, p. 18, 2010.
[6] J Girres and G Touya: ‘Quality assessment of the French OpenStreetMap dataset’, Transactions in GIS, 14(4), pp. 435–459, available at: http://doi.wiley.com/10.1111/j.1467-9671.2010.01203.x, 2010.
[7] L Siebritz, G Sithole and S Zlatanova: ‘Assessment of the homogeneity of Volunteered Geographic Information in South Africa’, in XXII International Society for Photogrammetry & Remote Sensing Congress. Melbourne, p. 6, 2011.
[8] N du Plooy: Assessment of OpenStreetMap Data Quality Across Different Urban Hierarchic Levels: A Comparative Study, University of Pretoria, 2012.
[9] L Hankel: Assessment of the quality of OpenStreetMap data in Woodhill, Kameeldrift and Mamelodi for the use of location based services, University of Pretoria, 2012.
[10] PA Johnson and RE Sieber: ‘Situating the Adoption of VGI by Government’, in Sui, Elwood and Goodchild (eds.), Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Dordrecht: Springer, pp. 65–82, 2013.
[11] F Ramm, J Topf and S Chilton: OpenStreetMap, England: UIT Cambridge Ltd, 2011.
[12] OpenStreetMap Foundation, 2013, viewed January 17, 2014, http://wiki.osmfoundation.org.
[13] OpenStreetMap Main Page, 2013, viewed January 21, 2013, http://wiki.openstreetmap.org.
[14] DJ Coleman, NJ Nkhwanana and B Sabone: ‘Volunteering Geographic Information to authoritative databases: Linking contributor motivations to program characteristics’, Geomatica, 64(1), pp. 383–396, 2010.
[15] E Thomas, O Hedberg, B Thompson, A Rajabifard and Victorian Spatial Council: ‘A Strategy Framework to Facilitate Spatially Enabled Victoria’, in GSDI 11 World Conference. Rotterdam, Netherlands, p. 19, available at: www.gsdi.org/gsdi11/papers/pdf/167.pdf, 2009.
[16] A Pourabdollah, J Morley, S Feldman and M Jackson: ‘Towards an Authoritative OpenStreetMap: Conflating OSM and OS OpenData National Maps’ Road Network’, ISPRS International Journal of Geo-Information, 2(3), pp. 704–728, 2013.
[17] MF Goodchild and GJ Hunter: ‘A simple positional accuracy for linear features’, International Journal of Geographical Information Science, 11(3), pp. 299–306, 1997.
[18] N Gregoire and M Bouillot: ‘Hausdorff distance between convex polygons’, Web Project for the Course CS 507 Computational Geometry, available at: http://cgm.cs.mcgill.ca/~godfried/teaching/cg-projects/98/normand/main.html, 1998.
[19] SC Lee, Y Wang and ET Lee: ‘Compactness measure of digital shapes’, in 2004 IEEE Region 5 Conference: Annual Technical & Leadership Workshop. Oklahoma, pp. 103–105, available at: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1300173, 2004.
[20] Chief Directorate: National Geo-Spatial Information: Chief Directorate: National Geo-spatial Information Standard for the Capture of Topographical Data, Cape Town, 2013.
[21] A Baglatzi, M Kokla and M Kavouras: ‘Semantifying OpenStreetMap’, in The 11th International Semantic Web Conference. Boston, pp. 39–48, 2012.
[22] L Siebritz: Assessing the accuracy of OpenStreetMap data in South Africa for the purpose of integrating it with authoritative data, University of Cape Town, 2014.
Contact Lindy-Anne Siebritz, Department of Rural Development and Land Reform, firstname.lastname@example.org