Courrier des statistiques N7 - 2022

The seventh issue and third birthday for the review since its relaunch. The ambition is always to address a wide range of the issues affecting Official Statistics. On an educational level, it addresses statisticians, whether beginners or experts, students and teachers, as well as citizens whom the “manufacture” of statistics concerns.

The first two articles cover the integration of mixed-mode data collection into the surveys, addressing the issues of which methods and tools to use to take advantage of this new approach to data collection. One major statistical operation is modernising: the agricultural census is now collected on a mixed-mode basis. Comprehensive administrative sources are more accessible, but are they easy to use? One example is the granular analysis of household property holdings.

Data may set the tone of this issue but the latter still extensively covers the instruments that allow that data to be used and heard. A good command of cloud computing and IT development techniques is put forward as a way to ensure the quality of statistical output. Statisticians must also be able to work in conjunction with other academic disciplines, such as psychometrics in the assessment of students’ abilities. Finally, the development of a classification of crimes demonstrates how useful it is to adopt a common framework to store, classify and analyse data.

Courrier des statistiques
Published on: 19/02/2024
Mathias André and Oliver Meslin, research officers, General Economic Studies Department, INSEE
Courrier des statistiques- February 2024

Property wealth of households: lessons from the combined use of comprehensive administrative data sources

The redistributive effects of the property tax according to the standard of living are poorly understood, in particular because of a lack of appropriate data. The analysis of these effects therefore required the creation of an exhaustive database on household property assets, based on multiple administrative sources.

This type of approach is likely to develop for the statistician, whether he or she works in the academic or official statistics world: the recent period has seen a significant increase in the accessibility of administrative data, as well as their number and variety. However, their use is often complex and requires solving many questions due to the very characteristics of the files: large size, varied formats, missing information or information of variable reliability.

To reconcile exhaustive administrative data, some of which has only recently become available, the project was carried out in four major phases, from data recovery to the structuring of the statistical database. In so doing, it allows us to draw some more general lessons: here, the statistician is no longer in control of the process of producing the information he uses. He must therefore acquire new reflexes and face new challenges.

Two questions lie behind the project presented in this article. Firstly, an economic question (Figure 1): what is the redistributive profile of the property tax on housing? Does it weigh more heavily on rich, median or modest households? Few studies have addressed this issue, with the exception of one recent publication (Carbonnier, 2019), which focuses on the main residence of households. This work may also complement the study of the tax and benefit system.

Figure 1. One question behind the project

 


Secondly, a statistical question arises: is it possible to accurately estimate the property wealth of households based on administrative data and supplement survey data? Data from INSEE surveys do not provide answers to all questions, although some of them are vitally important, such as the granular geographical distribution of property wealth, concentration at the ends of the distribution or the role of tax schemes such as property investment companies (sociétés civiles immobilières – SCI).

Addressing these two questions required the construction of a new statistical database, using administrative data sources and systematically combining information on households and on the real estate assets they own. Once the connection between the dwellings and the households that own them is established, the subsequent stage will be to estimate the market value of each dwelling in order to obtain a measure of households’ gross property wealth. Finally, once the study work is completed, the issue of ensuring the sustainability of this new data source, in the form of ongoing statistical production, will come to the forefront.

Nevertheless, a substantial amount of work is required to progress from raw administrative information to a relevant and robust statistical result. This article narrates the journey of that project. This process of calculating property wealth faced numerous challenges inherent in administrative data: statisticians who use such sources constantly find themselves in the paradoxical position of endeavouring to address statistical inquiries using data that were not originally designed and collected for that purpose. From this perspective, this project is part of the broader use of administrative data. The insights derived from this approach are even more extensive as it has resulted in the development of a database for generating regular statistical indicators. In a sense, it illustrates the potential pitfalls and best practices concerning the use of administrative data by Official Statistics.

Administrative data: a landscape that has recently evolved

The history of statistics, in terms of both uses and methods, is intrinsically linked to the data available to statisticians (Rivière, 2020). The sources used to produce statistics and conduct studies are highly diverse:

  • results of statistical surveys;
  • administrative data;
  • data from private sector activities, such as scanner data from major retailers (Leclair, 2019) or mobile phone data (Cousin and Hillaireau, 2019);
  • tax and benefit rate parameters in microsimulation models, as in the case of the INES model (Fredon and Sicsic, 2020), etc.

In addition, key statistical outputs such as GDP, poverty rate or inflation rate are also based on combinations of these different types of sources.

Administrative data have long been used in Official Statistics for various purposes. Today, INSEE thus produces statistics on household incomes and wages using, among other sources, tax data. Administrative data are also used to set up sampling frames for drawing survey samples: this is the case with the income tax and housing tax files which, once reprocessed and included in the FIDELI file, form the sampling frame for household surveys (Sillard et al., 2020). Administrative data can also be used to supplement or even replace survey data. For example, the Tax and Social Incomes Survey (enquête sur les Revenus fiscaux et sociaux – ERFS) or National Survey on the Resources of Young Adults (enquête sur les Ressources des jeunes – ENRJ) collect income and benefit information through the data matching of tax and social data sources. The interviewers thus focus on collecting statistically relevant information that is in principle absent from the administrative files, such as the Professions and Socioprofessional Categories (professions et catégories socioprofessionnelles – PCS) or activity status.

Data access has gradually increased and become easier

However, with the increase in IT capabilities and the increasing computerisation of public policies, recent years have been characterised by an increase in the number of administrative databases, together with a greater availability of those databases. This translates into a greater diversity of uses. The dissemination of administrative data in open format, free of charge, was a major transformation in access to official data. This dissemination is generally done using open formats, such as CSV (Comma-Separated Values), or via APIs (Application Programming Interfaces). However, increasing accessibility is not a guarantee of quality: both statisticians and citizens can quickly find themselves disoriented in the face of data that, while plentiful, are not well arranged.

Simultaneously, some confidential sources have been made available for research purposes more systematically, notably through the Secure Access Data Centre (Gadouche, 2019). Legal and technical developments thus complement the data exchange systems which have been operated for several years through agreements (with the Directorate-General for Public Finance (see below) or the Banque de France, for example). They accentuate, or even accelerate, a movement that has been ongoing since 1986 (and Article 7 bis of the 1951 Act), through which Official Statistics enriches its output, its studies and its publications with administrative data: for example, the formation of the “all employees” panel, the use of the Nominative Social Declaration (déclaration sociale nominative) and the forthcoming use of the system are part of a long tradition of using social data within INSEE.

Thus, in just a few years, we have moved from a world with frequently used administrative data, but only for a few specific purposes, to one where the use of administrative databases is becoming increasingly widespread and diverse.

Use by statisticians, facilitated by technical developments

Beyond the increasing accessibility of administrative data, the major new feature from a statistician’s point of view is the technical developments that accompany this movement.

First, it is now easier to match these sources with each other and cross-reference them longitudinally. The presence of individual identifiers in the databases facilitates the reconciliation of files. Even when identifiers are not available for the entire database or are not included, the exhaustive nature of the files still allows for data linkage “based on identity traits” (see below).

Statisticians therefore have a greater number of sources that are more detailed and more accessible at their fingertips, that they can “easily” cross-reference. Accordingly, they are able to take advantage of such sources more easily and do not hold back: thus, by bringing together administrative information, new statistical databases are being created for new uses and to meet new challenges.

A supply that responds to demand (or that induces it?)

This increasingly important role played by administrative databases is part of a context in which Official Statistics is facing new questions that surveys cannot really answer. For instance, typical survey sample sizes are insufficient for in-depth cross-analysis at highly granular geographical levels or for the precise examination of the tails of concentrated distributions like income or wealth. Official Statistics bodies are continuously pursuing enhanced efficiency: increasing the sample size or frequency of surveys would impose an excessive burden on households and demand substantial resources.

Social demand, as voiced in France through the National Council for Statistical Information (Conseil national de l’information statistique – CNIS) (Anxionnaz and Maurel, 2021), now necessitates more extensive utilization of comprehensive data. One reason is that such data are more readily accessible. However, they also have other advantages.

For instance, administrative data can respond to new needs, such as the faster or infra-annual publication of information. In 2020, the unprecedented health crisis prompted INSEE to address the need for sub-quarterly forecasts. Nowcasting methods can be based on survey data, as is the case with the INES model and ERFS data. The collection time is, in principle, shorter for administrative databases, which is an advantage for producing statistical results more quickly.

Finally, the ability to cross-reference information from different sources contributes to the increased diversity of ways in which Official Statistics can use them. This is not as straightforward with data from household surveys, which are often compartmentalized and relatively isolated. INSEE’s major surveys are thematic in nature as they are designed to gather pertinent data on a major topic: employment, wealth, housing, etc. While this thematic approach can answer many questions on the topic concerned accurately and limit the response burden on the surveyed households, it nevertheless has the disadvantage of limiting the cross‑referencing of personal information collected across different surveys.

In a complementary manner, reconciliation of administrative sources makes it possible to bring together information that is usually kept separate. This is the case, for example, in respect of household information, on the one hand, and information on companies, on the other. It would be valuable to unveil companies, which involves establishing the connections between companies and the households that own them. This is what is being done, in particular, for SCIs in the project presented in detail hereinafter (Box 2).

Matching multiple administrative sources: where different worlds collide

Multiple administrative data sources were used to answer the two questions that formed the starting point of this project, with one objective: to create an exhaustive database on the property wealth of households. This project was not the implementation of a pre-established methodology, but rather a trial and error process to construct a method addressing the needs and the questions that emerged as the work progressed. In practice, conducting such a project amounted to a long sequence of problems to solve and technical difficulties to overcome. The following paragraphs describe the solutions chosen, in chronological and programmatic order. In the end, this project will have gone through four main phases (Figure 2) and had just as many challenges to overcome.

Figure 2. Match, homogenise and unify... to create a new statistical source

 

(Phase 1) Retrieving the relevant data

A prerequisite for being able to construct a new statistical database is verifying the existence of relevant data and gathering the data that will ultimately be useful in the operation. In this case, the backbone of the project consists of income and household location data, which were already available at INSEE and well documented in the demographic files on dwellings and individuals (FIDELI). Land register data are the second key source for the project and are essential for recreating the link between property and households. Finally, other sources such as the National Register of Commerce and Companies, data on elements of local direct taxation (REI) and data on property transactions (DVF) were gradually integrated into the project as a result of exchanges with researchers and statisticians who are subject experts (Box 1).

While it may seem self-evident, this work to identify and centralise sources has often proven complex. Indeed, it is not only, or even mainly, a case of going through a specific legal procedure with a specific institutional contact point. Instead, the difficulty involved verifying the existence of administrative data that could be useful for the project, then ensuring that said information corresponded to what was needed based on its documentation, where it existed, and finally determining the best way to obtain it. One assumption that guided the search for sources was that the declarations made to the tax administration “necessarily” had to be integrated into the information system: it was therefore sufficient to find the corresponding centralised database.

More specifically, this search for sources took different forms, which reflect the diversity of situations encountered by statisticians. First of all, for the data already available to INSEE, such as land register data and the FIDELI files, it was necessary to identify the unit in possession of the data and then to highlight the relevance of the project in order to justify the request for access. Similarly, for data held by the tax administration, the main difficulty was in identifying the team in possession of the data within the Directorate-General for Public Finance (DGFiP) and then establishing a data transfer agreement. Then, it was during a university conference that the research officers learned that the data from the National Register of Commerce and Companies (registre national du commerce et des sociétés – RNCS) were made available by the French National Institute of Industrial Property (Institut national de la propriété industrielle – INPI). Finally, data from the Identification of Elements of Taxation (Recensement des éléments d’imposition – REI) on the rates of property tax voted by local authorities are available on the DGFiP website.

The final stage before moving on to processing was to specify the legal framework for the processing of personal data, in accordance with the General Data Protection Regulation (GDPR). To do this, it was necessary to prepare a processing declaration and to draw up a dossier on compliance with the protection of personal data (Dossier de conformité à la protection des données personnelles – DC-POD), with the support of the INSEE legal unit.

(Phase 2) Responding to the challenge of data volume and heterogeneity

Once the raw data were obtained, the second challenge was to set up an environment suitable for processing them. The standard IT tools used by statisticians were put to the test and discussions with IT infrastructure managers were necessary, both in advance and throughout the process. Both the confidentiality and the volume (300 GB of raw data) of the data required the use of INSEE’s secure servers, with substantial storage space. To carry out the processing, SAS® software was the initial solution offered to the INSEE research officers. However, in the end, the project combined the use of SAS® for the major statistical production stages with the use of R for processing each individual source.

Box 1. The five sources of the project

  • The standard land files, referred to as Majic files, compiled by the Directorate-General for Public Finance based on information from the land register. They identify built and non-built properties and their owners (52 million premises in 2017).
  • FIDELI (demographic files on dwellings and individuals) describes the dwellings and their occupants. It is created by INSEE based on tax data on individuals (72 million individuals over several years). See (Lamarche and Lollivier, 2021).
  • The National Register of Commerce and Companies (registre national du commerce et des sociétés – RNCS), created by the clerks of the commercial courts, contains information on companies and the natural persons representing them (9 million representatives of 4.5 million companies).
  • Requests for Land Values (demandes de valeurs foncières – DVF) data describe property transactions (3.5 million transactions over the period 2015–2019). It is an enriched version compared to that available in open data format (in particular, it contains a land register identifier).
  • Data from the Identification of Elements of Taxation (recensement des éléments d’imposition – REI) describe local taxation at the level of each municipality (around 36,000 observations in 2017).

The difficulty involved in processing administrative data is not only due to their volume, but also to the formats in which they are provided. Indeed, they can be provided in the form of flat files, sometimes in great number, which are often large and difficult to use directly. It was therefore necessary to structure them in a format suitable for the planned processing operations. For example, the land register data is provided in the form of 216 flat files, identifying all properties constructed in France and their owners. The first stage of the processing of this data was to structure them into seven homogeneous national files in the form of SAS® tables.

This structuring of the data also had to overcome difficulties related to the heterogeneity of the data formats, even within sources with content which is supposed to be similar. Thus, the raw data from the RNCS are derived from flat files (268 files), with the exception of those on Alsace-Moselle and the French overseas departments, which are available in XML format for historical reasons. It was therefore necessary to distinguish between two different processing operations, in order to recreate a single and coherent national file.
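As a rough illustration of the two processing paths described above, the following Python sketch reads a delimited flat file and an XML extract into one common record layout. The field names, the `;` delimiter and the `<company>` element are assumptions made for the example; the actual RNCS layouts are not described in this article, and the project itself used SAS® and R.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical target schema shared by both input formats.
FIELDS = ("siren", "name", "municipality")


def read_flat(text, delimiter=";"):
    """Parse a delimited flat file (with a header row) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{f: row[f].strip() for f in FIELDS} for row in reader]


def read_xml(text):
    """Parse an XML extract in which each <company> element is one record."""
    root = ET.fromstring(text)
    return [
        {f: (company.findtext(f) or "").strip() for f in FIELDS}
        for company in root.iter("company")
    ]


# Tiny in-memory samples standing in for the real files.
flat_sample = "siren;name;municipality\n123456789;SCI DES ACACIAS;Nice\n"
xml_sample = (
    "<companies><company><siren>987654321</siren>"
    "<name>SCI DU RHIN</name><municipality>Strasbourg</municipality>"
    "</company></companies>"
)

# Both readers return the same layout, so the results can simply be stacked.
national_file = read_flat(flat_sample) + read_xml(xml_sample)
```

Keeping one reader per input format and a single target schema means that any later step only ever sees the unified layout.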

At the end of this phase, the data were organised in a homogeneous format and it was time to reprocess the content.

(Phase 3) Unifying the repositories, definitions and concepts (if possible)

The third challenge concerned the content of the databases used: the information within these data is intended for administrative purposes, and it may not always match statistical requirements. Statisticians, therefore, needed to conduct various checks and reprocessing operations in order to standardize concepts reflected in the data as much as possible. In this case, conceptual issues such as standardisation and definitions of variables, differences in coverage, etc. were more common than data quality concerns.

To address these issues, standardization procedures were developed to recode variables into a format suitable for statistical analysis. For example, in the RNCS, the address of the SCI owners appears in a non-standardised text field (“12, Allée des Acacias 06 000 Nice”): it was necessary to reprocess it to standardize the address and the municipality of residence. Similarly, in the land register data, the owners’ municipality of birth is stored as unstandardized text (“Vierzon 18 100”, “Paris 01”), which was then standardised according to the Official Geographical Code (Code officiel géographique – COG).
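To give an idea of what such reprocessing can look like, here is a minimal Python sketch that splits the two kinds of free-text fields quoted above. The regular expressions and function names are illustrative assumptions, not the rules used in the project, and the final lookup of a municipality in the COG is left out because it requires a reference table.

```python
import re

# A postcode written as five digits, possibly split by a space ("06 000").
POSTCODE = re.compile(r"\b(\d{2})\s?(\d{3})\b")


def split_address(raw):
    """Split a free-text address into street, postcode and municipality."""
    matches = list(POSTCODE.finditer(raw))
    if not matches:
        return None
    match = matches[-1]  # assume the postcode is the last such group
    return {
        "street": raw[:match.start()].strip(" ,"),
        "postcode": match.group(1) + match.group(2),
        "municipality": raw[match.end():].strip(" ,").upper(),
    }


def split_birthplace(raw):
    """Split 'Vierzon 18 100' into a name and the digits that follow it."""
    match = re.match(r"\s*(\D+?)\s*([\d\s]*)$", raw)
    if match is None:
        return None
    return {
        "name": match.group(1).upper(),
        "digits": re.sub(r"\s+", "", match.group(2)),
    }


address = split_address("12, Allée des Acacias 06 000 Nice")
vierzon = split_birthplace("Vierzon 18 100")
paris = split_birthplace("Paris 01")
```

Under these assumptions, the first call yields the postcode `06000` and the municipality `NICE`, while the birthplace fields yield (`VIERZON`, `18100`) and (`PARIS`, `01`). Real data would need many more cases, such as missing postcodes or street numbers that resemble postcodes.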

Furthermore, new variables were defined or existing classifications reworked. For instance, the legal form of the owner in the land register data was initially codified according to a detailed classification with hundreds of categories. This classification was revised to group owners into three main categories: natural persons, SCIs and other legal persons. Additionally, the gender of the owners in the RNCS data, used to render the SCIs transparent (Box 2), was not originally included. It had to be deduced from their first names, using the first names file (fichier des prénoms) published by INSEE. In this stage, the property tax associated with each property was calculated, based on the information in the land register and the property tax rates voted by local authorities, available in the REI.
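The property tax step can be pictured as a lookup of each premises’ municipality in the table of local rates. The sketch below assumes, as a deliberate simplification, that the tax equals a tax base multiplied by a single municipal rate; the actual calculation is not detailed in this article, and all field names, codes and values are hypothetical.

```python
from decimal import Decimal

# Hypothetical REI extract: municipality code -> locally voted rate.
rates_by_municipality = {
    "M001": Decimal("0.20"),
    "M002": Decimal("0.25"),
}

# Hypothetical land register extract: one row per premises.
premises = [
    {"premises_id": "A1", "municipality": "M001", "tax_base": Decimal("3000")},
    {"premises_id": "B7", "municipality": "M002", "tax_base": Decimal("1200")},
    {"premises_id": "C3", "municipality": "M999", "tax_base": Decimal("800")},
]


def property_tax(row, rates):
    """Return tax_base * rate, or None when the municipality has no rate."""
    rate = rates.get(row["municipality"])
    if rate is None:
        return None  # keep unmatched premises visible instead of guessing
    return row["tax_base"] * rate


taxes = {row["premises_id"]: property_tax(row, rates_by_municipality)
         for row in premises}
```

Returning `None` for premises whose municipality is absent from the rate table makes coverage gaps between the two sources easy to count.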

The process of rendering property investment companies (sociétés civiles immobilières – SCIs) transparent also depended on various administrative sources. Land register data contain different information on individuals, depending on whether they own premises in their own name or via an SCI: when a property is owned via an SCI, the land register data contain only the company’s name and SIREN identifier and not the civil status of the owners of that company. This difference between natural and legal persons, which is understandable in terms of administrative uses, is an obstacle for statisticians who want information on the ultimate owners of properties, regardless of whether or not they are owned through SCIs. The property owners file was therefore matched with the National Register of Commerce and Companies, in order to obtain the civil status of the SCI owners.
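The matching described above amounts to a join on the SIREN identifier. The following Python sketch shows the idea on invented records. The field names are assumptions, and treating every person listed in the register for an SCI as an ultimate owner is a simplification.

```python
from collections import defaultdict

# Hypothetical land register rows: owner is a natural person or an SCI.
land_register = [
    {"premises_id": "P1", "owner_type": "natural",
     "owner": "DUPONT MARIE", "siren": None},
    {"premises_id": "P2", "owner_type": "SCI",
     "owner": "SCI DES ACACIAS", "siren": "123456789"},
]

# Hypothetical RNCS rows: natural persons attached to each company.
rncs_persons = [
    {"siren": "123456789", "person": "MARTIN PAUL"},
    {"siren": "123456789", "person": "MARTIN JEANNE"},
]


def unveil(land_register, rncs_persons):
    """Replace each SCI by the natural persons attached to its SIREN."""
    persons_by_siren = defaultdict(list)
    for row in rncs_persons:
        persons_by_siren[row["siren"]].append(row["person"])

    ultimate_owners = []
    for row in land_register:
        via_sci = row["owner_type"] == "SCI"
        if via_sci:
            persons = persons_by_siren.get(row["siren"], [])
        else:
            persons = [row["owner"]]
        for person in persons:
            ultimate_owners.append(
                {"premises_id": row["premises_id"],
                 "person": person,
                 "via_sci": via_sci}
            )
    return ultimate_owners


owners = unveil(land_register, rncs_persons)
```

Keeping a `via_sci` flag on each output row preserves the distinction between direct and indirect ownership for later analysis.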

Finally, the sources may contain incorrect or outdated information, which must be identified and adjusted prior to the statistical processing. For example, the property owners’ file obtained from the land register contains numerous records related to individuals who have recently passed away (due to the file’s update frequency). These records were identified by cross‑referencing this file with the “deceased persons file” (fichier des personnes décédées) published by INSEE. Similarly, discrepancies in the SIREN identifier and the legal form of legal entities were detected within the land register data. Adjustments were made to these variables using the SIRENE directory, with a particular focus on accurately identifying property investment companies.
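Cross-referencing with the deceased persons file can be sketched as an anti-join: owners whose identity traits appear in the reference file are set aside. The choice of surname, first name and birth date as the key is an assumption made for this Python example, and all records are invented.

```python
def identity_key(person):
    """Build a comparison key from hypothetical identity traits."""
    return (
        person["surname"].strip().upper(),
        person["first_name"].strip().upper(),
        person["birth_date"],
    )


owners = [
    {"surname": "Dupont", "first_name": "Marie", "birth_date": "1931-04-02"},
    {"surname": "Martin", "first_name": "Paul", "birth_date": "1975-01-30"},
]
deceased = [
    {"surname": "DUPONT", "first_name": "MARIE", "birth_date": "1931-04-02"},
]

# A set makes each membership test constant time on average.
deceased_keys = {identity_key(p) for p in deceased}

flagged = [o for o in owners if identity_key(o) in deceased_keys]
remaining = [o for o in owners if identity_key(o) not in deceased_keys]
```

Flagged records are kept in a separate list rather than silently dropped, so that the adjustment can be reviewed.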

This phase thus made it possible to homogenise the administrative data, in respect of both their format and their content, in order to prepare the final phase, which forms the basis of the statistical analysis.

Box 2. Two good reasons to “unveil SCIs”

Taking into consideration property owned through the intermediary of SCIs is of particular importance for the study of property wealth, for two reasons.

First, the use of SCIs is much more common among households owning a large number of dwellings: 7% of households owning 2 to 4 dwellings own at least one dwelling via an SCI, compared with 31% for households owning 5 or more dwellings or over 66% for those owning 20 or more dwellings.

Second, SCIs are commonly used to share property ownership among several natural or legal persons: 50% of dwellings owned via an SCI are owned by two or more households, compared with 13% of the dwellings owned by people in their own names.

For these two reasons, it is vital to take SCIs into account in order to correctly measure the concentration of property ownership and the phenomena of co-ownership. Doing so, however, is complicated due to the intermediation caused by the use of an SCI. Indeed, when a property is owned by natural persons through an SCI, the land register data contain only the name and address of the SCI (which is the legal owner of the property) and not the identity of the natural persons associated with that SCI.

In order to study the property wealth of households, it is therefore necessary to unveil SCIs, i.e. to determine the civil status of the natural persons associated with the SCIs. This operation is carried out by reconciling land register data with data from the National Register of Commerce and Companies, which contains information on companies and the natural persons who represent them.

(Phase 4) Changing the unit of analysis to create a new statistical source

To complete the process, it was necessary to restructure the data to create an actual statistical database. Administrative data are typically structured to meet the needs of the collecting administration, and the observation unit rarely aligns with the requirements of statisticians.

In the context of property wealth, the administrative unit of management is the property, also referred to as the premises. Land register data are therefore organised in a way that facilitates access to information about each premises: each premises is identified by a unique identifier. It is thus quick to find out the list of the owners of a given property or to calculate the property tax due on it. Conversely, land register data are not well-suited for an individual-based approach: they do not contain a unique national identifier of individual property owners, instead only their civil status. Therefore, compiling a list of properties owned by an individual is a complex task. However, from a statistician’s perspective, transforming administrative sources into statistical sources involves shifting from the administrative unit of management (here, the premises) to the unit of statistical interest (here, the household). This transition from the administrative premises to the statistical household is a pivotal point in the transformation of administrative into statistical sources. This change in the unit of analysis necessitates significant reprocessing using FIDELI, which serves as the reference source for individuals and households.

The process began by assigning a unique identifier for each natural person to the property owners’ file to create a comprehensive list of properties owned by each individual. To achieve this, a matching process involving identity traits was carried out between the property owners’ file and the individuals’ table in the FIDELI directory, which encompasses all adult individuals known in the tax sources. This matching operation involved seeking out an individual in FIDELI whose civil status and address are identical to, or closely resemble, those recorded in the property owners’ file of the land register. As a result, all the owners who were natural persons were identified for 94% of premises, and at least one of the owners for 98% of premises.
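The principle of matching on identity traits can be illustrated with a short Python sketch. It compares normalised names within the same birth date and accepts the best candidate above a similarity threshold. The choice of traits, the use of `difflib.SequenceMatcher` and the 0.85 threshold are all assumptions for the example; the article does not specify the matching rules actually applied, which also relied on addresses.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(text):
    """Upper-case, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.upper().split())


def match_owner(owner, reference, threshold=0.85):
    """Return the person_id of the closest reference record, or None."""
    owner_name = normalise(owner["name"])
    best_id, best_score = None, 0.0
    for person in reference:
        # Birth date acts as a blocking key: only same-date records compete.
        if person["birth_date"] != owner["birth_date"]:
            continue
        score = SequenceMatcher(
            None, owner_name, normalise(person["name"])
        ).ratio()
        if score > best_score:
            best_id, best_score = person["person_id"], score
    return best_id if best_score >= threshold else None


# Invented reference table standing in for the individuals' table.
reference = [
    {"person_id": 1, "name": "Dupont Hélène", "birth_date": "1960-05-12"},
    {"person_id": 2, "name": "Martin Paul", "birth_date": "1975-01-30"},
]

exact = match_owner({"name": "DUPONT HELENE", "birth_date": "1960-05-12"}, reference)
close = match_owner({"name": "MARTIN PAULE", "birth_date": "1975-01-30"}, reference)
missing = match_owner({"name": "DURAND LUC", "birth_date": "1980-09-09"}, reference)
```

Here the first owner matches exactly once accents are removed, the second is accepted despite a spelling variant, and the third finds no candidate. On exhaustive files, a blocking key such as the birth date is what keeps the number of comparisons manageable.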

Subsequently, individuals were grouped into households so as to establish a list of premises owned by each household. This grouping was carried out using the FIDELI directory, which determines the location of persons: as a convention, individuals who share the same principal residence belong to the same household.
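The grouping convention above (same principal residence, same household) can be sketched in a few lines of Python. The identifiers and field names are hypothetical; in this example the dwelling identifier of the principal residence doubles as the household key.

```python
from collections import defaultdict

# Hypothetical individuals' table: person -> principal residence.
individuals = [
    {"person_id": 1, "principal_residence": "D100"},
    {"person_id": 2, "principal_residence": "D100"},
    {"person_id": 3, "principal_residence": "D200"},
]

# Hypothetical result of the previous step: (person_id, premises_id) pairs.
ownership = [(1, "P1"), (2, "P1"), (2, "P2"), (3, "P3")]


def premises_by_household(individuals, ownership):
    """Group owned premises by household, using the principal residence."""
    household_of = {
        row["person_id"]: row["principal_residence"] for row in individuals
    }
    grouped = defaultdict(set)
    for person_id, premises_id in ownership:
        household = household_of.get(person_id)
        if household is not None:
            # A set avoids counting premises co-owned within a household twice.
            grouped[household].add(premises_id)
    return dict(grouped)


household_premises = premises_by_household(individuals, ownership)
```

In this example, persons 1 and 2 form one household that owns premises P1 and P2, with P1 counted once even though both of them own it.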

It is at the end of this phase of changing the unit of analysis, and only then, that the reprocessed data (finally) constitute a statistical source. It then becomes possible to build statistical indicators and conduct studies. In particular, it is at this point that we define the coverage of the property tax study: dwellings and outbuildings situated in French territory and owned by the households residing in them, either with full ownership or usufruct thereof, in their own name or through a property investment company. Thanks to the processing carried out during the four successive phases, and particularly thanks to the matching between sources, it is ultimately possible to use such a precise definition, with such wide coverage.

The newly established source empowers us to address the initial inquiries, such as investigating the redistributive impacts of property tax or the distribution of property (André, Arnold and Meslin, 2021). It offers a myriad of potential applications, underscoring the depth of possibilities through the use of administrative data (Figure 3).

Figure 3. The new database answers questions that were previously unanswered

 

What challenges were unveiled by this use of administrative data?

Encouraged by this experience, we aimed to distil valuable insights for statisticians embarking on a similar journey. These lessons can be summarised in three distinctive characteristics of administrative sources and four key questions that must be asked when administrative sources are used for statistical purposes.

In our view, administrative sources exhibit three distinctive characteristics:

  • they are exhaustive… but relate to a certain field, the definition of which is relevant with regard to the objectives pursued by the administration that produces them. For example, the housing tax file includes all tax households subject to that tax, but does not include individuals living in collective housing who are not subject to housing tax;
  • their content reflects the management work carried out by the administrations and is the result of the application of administrative procedures (declarations from tax subjects, issuing of tax notices, etc.). This content is therefore constantly evolving (creation and deletion of records, updating of information);
  • their content is not the product of a formal collection process in the statistical sense. Therefore, information relevant for performing statistical analysis may be absent from the administrative data and the metadata available on these databases may be incomplete, or even non-existent.

Ultimately, statisticians must contend with administrative data rather than directly shaping them. This finding leads us to propose four subjects for consideration in advance, which are essential for constructing a database from these particular sources.

When should the information be observed?

The first issue is the need to “harvest” data on a certain date, in order to be able to use them: on what date should the data be “frozen”? Should the sources be updated, to take into account the fact that the administration may update its data with a delay?

Statisticians then face the difficulties of using “bi-dated” data, i.e. data including an event date (such as a marriage) and an entry date into the administrative data (the recording of the marriage in the files). Indeed, administrative databases can change every day, according to their harvesting and updating processes. Some files may be annual but with amounts corrected in several waves, as with income tax, so that delays and undue payments are taken into account only several months later. In practice, the reference dates of the administrative data used by statisticians are more often determined by the administration’s processing date than by the date of the socio-economic events they describe: that is why statisticians may use the income declaration database at the end of the third income tax issue (“POTE 2017, third issue”), the 6-month file for social data, the extraction from the National Register of Commerce and Companies on 6 May 2017, etc.
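The effect of the freeze date on bi-dated data can be made concrete with a small Python sketch. Each record carries an event date and a recording date, both invented for the example. The same reference period then yields different results depending on when the file is extracted.

```python
from datetime import date

# Hypothetical bi-dated records: when the event happened and when the
# administration entered it into its files.
records = [
    {"id": "M1", "event_date": date(2016, 12, 20), "recorded_on": date(2017, 1, 15)},
    {"id": "M2", "event_date": date(2016, 11, 3), "recorded_on": date(2017, 6, 2)},
    {"id": "M3", "event_date": date(2017, 2, 1), "recorded_on": date(2017, 2, 10)},
]


def snapshot(records, freeze_date):
    """Records visible in the file when it is extracted on freeze_date."""
    return [r for r in records if r["recorded_on"] <= freeze_date]


def events_up_to(records, reference_date, freeze_date):
    """Events dated on or before reference_date, as seen at freeze_date."""
    return [
        r["id"]
        for r in snapshot(records, freeze_date)
        if r["event_date"] <= reference_date
    ]


end_2016 = date(2016, 12, 31)
early_extract = events_up_to(records, end_2016, date(2017, 5, 6))
late_extract = events_up_to(records, end_2016, date(2017, 12, 31))
```

With the earlier extraction, only M1 is counted among the 2016 events. With the later one, M2 appears as well, because it was recorded with a delay. This is why the extraction date needs to be documented alongside the reference period.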

What type of exhaustiveness are we talking about?

Second, administrative databases are prized for the exhaustiveness of the information they contain. However, the exhaustiveness of a data source is a difficult concept to pin down statistically, as it can only be defined in relation to the intended coverage of the source, which itself depends on the way the source is used. For example, some social files are exhaustive in their coverage of social benefits or minimum social benefits, in that they include all recipients, but a person who does not receive such benefits will not appear in the database. Exhaustiveness is therefore relative to the coverage of the administrative source. It will depend on:

  • the limits of the database set by the administration: Metropolitan France? Resident households?
  • the collection process and the management purpose: incomes for paying taxes, careers for receiving pensions or unemployment benefits, etc.

Let us not forget, the information is only present for management purposes. The definitions used may therefore differ from the usual statistical definitions. With no statistical purpose, an administration usually collects solely the information it needs to fulfil its missions: collect taxes, pay benefits, etc. As a result, essential information for statistical analysis may be missing. This is the case, for example, with disposable income, which is absent from tax sources. Consequently, an administrative database alone is not always sufficient for statistical purposes and it becomes necessary to match several databases before getting a source that gathers all the relevant information.

In this respect, the existence of repositories common to several databases, which are often essential for the work of administrations, can facilitate the statistical work of cross‑referencing information, by making accurate links between all or part of different databases possible. This is the case, for example, with the tax identifier for the various DGFiP sources or the National Registration Number (numéro d’inscription au répertoire – NIR), which is commonly referred to as the social security number, for the CNAF. It is also possible to carry out matching based on identity traits in the absence of a common repository, as was the case in this project. The collection process for an administrative database differs greatly from the one used for a statistical survey. This is why a third point must challenge statisticians: the reliability of the information.
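
The two matching routes mentioned here (a common identifier, or identity traits when none is available) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the matching procedure used in the project: the field names (`tax_id`, `surname`, `first_name`, `birth_date`), the crude normalisation rule and the sample records are all hypothetical.

```python
import unicodedata


def normalise(text):
    """Upper-case, strip accents and collapse spaces (a deliberately crude rule)."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.upper().split())


def trait_key(person):
    """Identity traits used when no common identifier is available."""
    return (normalise(person["surname"]),
            normalise(person["first_name"]),
            person["birth_date"])


def match(left, right, id_field="tax_id"):
    """Match on the common identifier first, then on unambiguous identity traits."""
    by_id = {r[id_field]: r for r in right if r.get(id_field)}
    by_traits = {}
    for r in right:
        by_traits.setdefault(trait_key(r), []).append(r)

    matched, unmatched = [], []
    for rec in left:
        hit = by_id.get(rec[id_field]) if rec.get(id_field) else None
        how = "identifier"
        if hit is None:
            candidates = by_traits.get(trait_key(rec), [])
            # Accept a trait-based match only when exactly one candidate exists.
            hit = candidates[0] if len(candidates) == 1 else None
            how = "traits"
        if hit is None:
            unmatched.append(rec)
        else:
            matched.append((rec, hit, how))
    return matched, unmatched


# Illustrative records only.
owners = [
    {"tax_id": "T1", "surname": "Durand", "first_name": "Émilie", "birth_date": "1970-03-02"},
    {"tax_id": None, "surname": "LEFÈVRE", "first_name": "Jean", "birth_date": "1955-11-20"},
    {"tax_id": None, "surname": "Martin", "first_name": "Paul", "birth_date": "1980-01-01"},
]
persons = [
    {"tax_id": "T1", "surname": "DURAND", "first_name": "EMILIE", "birth_date": "1970-03-02"},
    {"tax_id": "T2", "surname": "Lefevre", "first_name": " jean", "birth_date": "1955-11-20"},
    {"tax_id": "T3", "surname": "Martin", "first_name": "Paul", "birth_date": "1980-01-01"},
    {"tax_id": "T4", "surname": "MARTIN", "first_name": "PAUL", "birth_date": "1980-01-01"},
]

matched, unmatched = match(owners, persons)
```

In this toy example the first owner is linked through the identifier, the second through identity traits despite differences in case and accents, and the third is left unmatched because two candidates share the same traits. Keeping the unmatched and ambiguous cases visible, rather than forcing a link, is a design choice of the sketch; real matching tools generally rely on more elaborate comparison rules.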

How reliable is administrative information?

A key difference between administrative sources and survey data is that the former are not weighted. Each administrative unit of management (household or company, for example) represents only itself and corresponds in a way to the “true” observation, rather than to an estimator of an entire population by a representative sample. Therefore, the question of representativeness is not so much about the correction of non-response but about the coverage of the administrative source (see above) and the reliability of the information it contains.

However, the reliability of a variable in an administrative source often relies on the importance of that information in the management process: an income variable on which a tax calculation is based will usually be highly reliable because it is subject to appeal by the taxpayer and carefully collected and checked by the administration. Similarly, in the land register data, the address of the owner to whom the property tax notice is sent is probably more reliable than the address of any potential co-owners, because only this address has a direct effect on the collection of property tax. Conversely, a less important variable, such as a person’s age or the general condition of a property, may be of lower quality or be updated less frequently, due to its reduced importance in the management process. In the same way, declarative variables (such as property wealth in the property wealth tax (impôt sur la fortune immobilière – IFI)) and pre‑completed variables (income declared by employers) will have different degrees of accuracy.

Assessing reliability raises key questions for statisticians: what is the administrative definition of this variable? Who provides the information? Is it up to date? Has the administration checked its quality? Answering these questions requires a detailed understanding of the administrative process that led to these data and a frequent dialogue with the data-producing administrations. Finally, a last question arises for statisticians who reconcile or rely on administrative databases.

What are the relevant coverage and unit of analysis for statistical use?

This is the unavoidable stage for statisticians, because it is by addressing this question that they turn administrative databases into databases for actual statistical use. In a survey process, this thought process takes place upstream, . For the use of administrative sources, it is part of each stage, both when searching for sources and in the background at all stages of the production of the final database. In this project, it was necessary to answer the following questions, in particular:

  • what types of properties should be studied? Only dwellings, or also outbuildings (garages, cellars, etc.) and even industrial and commercial premises?
  • what types of ownership should be included? Only properties people own in their own name or also those owned via a company (.)?
  • what ownership rights should be considered? Only full ownership? What about usufructuaries and bare owners?
  • what unit of property wealth should be studied? At the level of individuals, tax households or households?
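
The effect of these choices on the resulting figures can be made concrete with a small Python sketch. Everything in it is hypothetical (property values, holders, ownership shares, the company `SCI-1` and its partners, the household composition), and it simplifies by treating every right as a share of full ownership, without distinguishing usufruct from other rights. It only illustrates how the coverage of ownership types and the unit of analysis change the measured wealth.

```python
from collections import defaultdict

# Hypothetical inputs for illustration only.
properties = {"P1": 200_000.0, "P2": 300_000.0}            # property -> value
rights = [                                                  # (holder, property, share)
    ("alice", "P1", 0.5),
    ("bob", "P1", 0.5),
    ("SCI-1", "P2", 1.0),
]
company_shares = {"SCI-1": {"alice": 0.75, "carol": 0.25}}  # company -> partners' stakes
household_of = {"alice": "H1", "bob": "H1", "carol": "H2"}  # person -> household


def wealth_by_person(look_through_companies):
    """Property wealth per individual, with or without company-held property."""
    wealth = defaultdict(float)
    for holder, prop, share in rights:
        value = properties[prop] * share
        if holder in company_shares:
            if not look_through_companies:
                continue  # company-held property is left outside the coverage
            for partner, stake in company_shares[holder].items():
                wealth[partner] += value * stake
        else:
            wealth[holder] += value
    return dict(wealth)


def wealth_by_household(person_wealth):
    """Aggregate individual wealth to the household level."""
    totals = defaultdict(float)
    for person, value in person_wealth.items():
        totals[household_of[person]] += value
    return dict(totals)


direct_only = wealth_by_household(wealth_by_person(look_through_companies=False))
with_companies = wealth_by_household(wealth_by_person(look_through_companies=True))
```

With these made-up inputs, household H1 holds 200,000 when only directly owned property is counted and 425,000 once company-held property is attributed to the partners, while household H2 only appears in the second case, with 75,000.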

Thus, the various “nuances in exhaustiveness”, the lack of statistical production purposes and the differences in the definitions of observation units highlight the importance of carefully carrying out .

What lessons should be taken away from the project?

Three main lessons can be learned from this work:

  • simply having access to administrative sources is by no means sufficient for rigorous statistical operations. Starting from an initial idea, many stages are needed to progress from rich but heterogeneous and scattered initial databases to a coherent statistical database. As in other statistical work, overcoming the various difficulties amounted to following a path towards ever-greater homogeneity, since administrative data can use distinct conventions, definitions and classifications. It is then the statistician’s job to reconcile and harmonise them with patience and precision;
  • the systematic use of administrative data does amount to a fully-fledged collection method for Official Statistics. However, there is one major change in comparison with other methods: statisticians no longer control the data production process. This change requires developing specific expertise: understanding how the administrations that produce the data work, interacting with the producers and anticipating changes. It also highlights the limitations imposed by the very nature of these data, which may lack variables important for the study, or whose reliability may vary with their importance in the day-to-day activities of administrations;
  • the use of administrative data opens up new possibilities for Official Statistics for at least two reasons. First, exhaustiveness makes it possible to study rare phenomena, such as the highest property wealth or the granular cross‑referencing of several variables, and to perform analyses at a very detailed geographical level, which is difficult to achieve with surveys. Second, innovative productions are becoming possible thanks to data matching operations: matching between land register data and the FIDELI directory makes it possible to recreate the joint distribution of income and property ownership, which did not exist in any source with this level of detail. Similarly, unveiling SCIs makes it possible to go beyond the usual distinction between household data and companies' data and thus to study poorly documented economic phenomena, such as SCI usage behaviour in accordance with the composition of the household wealth and income level. This work could lead to the introduction of a module on property wealth in the FIDELI files.

The use of administrative data is both a challenge and an opportunity. It is a challenge because it requires statisticians to develop new skills and to renew their methods. It is an opportunity because it broadens the range of information available to Official Statistics, providing a source of great wealth to fulfil its mission of measuring and understanding economic and social phenomena.

The authors of the article work in the General Economic Studies Department and this project was carried out in direct connection with the Housing Division of the Directorate of Demographic and Social Statistics, which produces the Housing and Individual Demographic Files (Fichiers démographiques sur les logements et les individus – FIDELI).

INES is the acronym of “INSEE-DREES”, the two bodies that jointly develop the model. CNAF has also co-managed the model with INSEE and DREES since 2016.

See reference at the end of the article.

An interested party need only consult the open platform of French official data (https://www.data.gouv.fr/fr/) regularly and look at the number of data sets available (almost 40,000 as of the date of writing this article).

In recent years, the Nominative Social Declaration has brought about a major simplification of the procedures for the declaration of wages and income paid by an employer. See (Humbert-Bottin, 2018).

The Admission of Other Income (Passage des revenus autres – PASRAU) system is a continuation of the work to simplify and streamline the social declarations made via the DSN, which it supplements for “replacement income”.

In this respect, the Statistical Directory of Individuals and Housing (répertoire statistique individus et locaux d’habitation – RESIL) and the Non-Identifying Statistical Code (Code statistique non signifiant – CSNS) projects carried out at INSEE illustrate the central role of data matching in statistical work.

For example, an anonymised panel of customers from certain banks, see (Bonnet, Loisel and Olivia, 2021).

For example, tax and social data to measure standards of living.

See (Lamarche and Lollivier, 2021) for further information on FIDELI.

In this case, the Updating of Land Register Information (mise à jour des informations cadastrales – MAJIC) files of the Directorate-General for Public Finance (Direction générale des finances publiques – DGFiP).

Identification of elements of local direct taxation.

Land valuation requests (Demandes de valeurs foncières), a notary and land register data source.

INSEE retrieved the data in 2019 during the project. This version is richer than that available in open data format, in particular because it contains a land register identifier.

See reference at the end of the article.

The FANTOIR repository (a computerised directory of streets and place names in France, formerly the RIVOLI file, managed by the Directorate-General for Public Finance (DGFiP)) for the street, and the Official Geographical Code (INSEE) for the municipality.

Business Register Identification System (Système informatique pour le répertoire des entreprises et des établissements – SIRENE), an administrative directory of companies managed by INSEE.

National Family Benefits Fund (Caisse nationale des allocations familiales – CNAF).

In the Generic Statistical Business Process Model (GSBPM), these aspects are addressed in the design phases, at the very beginning of the process.

Property investment company (société civile immobilière – SCI), public limited company (société anonyme – SA), limited liability company (société à responsabilité limitée – SARL) or simplified joint stock company (société par actions simplifiée – SAS).

Bare ownership is the right that grants the holder thereof, called the bare owner, the ability to dispose of a piece of property (by selling it, giving it away, bequeathing it, etc.), while usufructuaries only have a right to have use of the property.

See also the experience on matching reported in Midy (2021).

Further reading

ANDRÉ, Mathias, ARNOLD, Céline and MESLIN, Olivier, 2021. 24% of households account for 68% of privately owned housing. In: France, Portrait Social. [online]. 25 November 2021. Insee références, édition 2021, pp. 91-104. [Accessed 15 December 2021].

ANXIONNAZ, Isabelle and MAUREL, Françoise, 2021. The National Council for Statistical Information (Conseil national de l’information statistique) - The quality of Official Statistics also depends on consultation. In: Courrier des statistiques. [online]. 8 July 2021. Insee. N°N6, pp. 123-142. [Accessed 15 December 2021].

BONNET, Odran, LOISEL, Tristan and OLIVIA, Tom, 2021. Impact of the health crisis on an anonymised panel of La Banque Postale customers. Most customers’ incomes were affected in a limited and temporary manner. [online]. 3 November 2021. Insee Analyses n°69. [Accessed 15 December 2021].

CARBONNIER, Clément, 2019. The Distributional Impact of Local Taxation on Households in France. In: Économie et Statistique / Economics and Statistics. [online]. 11 July 2019. Insee. N°507-508, pp. 31–52. [Accessed 15 December 2021].

COUSIN, Guillaume and HILLAIREAU, Fabrice, 2019. Can Mobile Phone Data Improve the Measurement of International Tourism in France? In: Économie et Statistique / Economics and Statistics. [online]. 11 April 2019. Insee. N°505-506, pp. 89–107. [Accessed 15 December 2021].

FREDON, Simon et SICSIC, Michaël, 2020. INES, the Model that Simulates the Impact of Tax and Benefit Policies. In: Courrier des statistiques. [online]. 29 June 2020. Insee. N°N4, pp. 42-61. [Accessed 15 December 2021].

GADOUCHE, Kamel, 2019. The Secure Data Access Centre (CASD), a Service for Datascience and Scientific Research. In: Courrier des statistiques. [online]. 19 December 2019. Insee. N°N3, pp. 76-92. [Accessed 15 December 2021].

HUMBERT-BOTTIN, Élisabeth, 2018. The Nominative Social Declaration. A New Reference for Employment Data Exchanges from Firms to Government Agencies. In: Courrier des statistiques. [online]. 6 December 2018. Insee. N°N1, pp. 25-34. [Accessed 15 December 2021].

LAGARDE, Sylvie, 2008. La nouvelle exploitation exhaustive des DADS. In: Courrier des statistiques. [online]. June 2008. N°85-86, pp. 65-69. [Accessed 15 December 2021].

LAMARCHE, Pierre and LOLLIVIER, Stéfan, 2021. FIDÉLI, The integration of tax sources into social data. In: Courrier des statistiques. [online]. 8 July 2021. Insee. N°N6, pp. 28-46. [Accessed 15 December 2021].

LECLAIR, Marie, 2019. Using Scanner Data to Calculate the Consumer Price Index. In: Courrier des statistiques. [online]. 19 December 2019. Insee. N°N3, pp. 61-75. [Accessed 15 December 2021].

MIDY, Loïc, 2021. A matching tool using indirect identifiers - The example of the information system on the integration of young people (système d’information sur l’insertion des jeunes). In: Courrier des statistiques. [online]. 8 July 2021. Insee. N°N6, pp. 82-99. [Accessed 15 December 2021].

RIVIÈRE, Pascal, 2020. What is Data? - Impact of External Data on Official Statistics. In: Courrier des statistiques. [online]. 31 December 2020. Insee. N°N5, pp. 114-131. [Accessed 15 December 2021].

SILLARD, Patrick, FAIVRE, Sébastien, PALIOD, Nicolas and VINCENT, Ludovic, 2020. INSEE is Updating its Household Survey Samples. In: Courrier des statistiques. [online]. 29 June 2020. Insee. N°N4, pp. 81-100. [Accessed 15 December 2021].