Courrier des statistiques N1 - 2018

The first issue has already been published. It includes an article by the General Director of INSEE on the administrative structure of the French official statistical system based on his presentation delivered at the World Statistics Congress of the International Statistical Institute in 2017.
A four-part dossier then examines the use of administrative sources in statistics, with, among others, a presentation of the French electronic reporting system designed for employers known as the DSN (standing for déclaration sociale nominative, or Nominal Social Declaration) by the Director of the French Public Interest Group (GIP) ‘Modernisation of Social Declarations’. The focus then turns to another topic altogether: the implementation of the global system enabling unique identification of legal entities participating in financial transactions (known as the Legal Entity Identifier, or LEI) and the role played by INSEE in this area. Finally, the last article provides an informative overview of the notion of official statistics in its various facets at both the French and European levels.

Courrier des statistiques
Published on: 06/12/2018
Pascal Rivière, Head of the General Inspectorate, INSEE
Courrier des statistiques - December 2018

Using Administrative Declarations for Statistical Purposes

Pascal Rivière, Head of the General Inspectorate, INSEE

Whether to complement survey data or to replace it, the use of administrative data by statisticians is becoming commonplace. And yet, strangely, the phenomenon has not been subjected to much general methodological investigation. The particular case of data from administrative declarations is interesting because the related production process bears some similarities to the statistical process. A particular party is involved: the organisation in charge of managing data flows, which is often distinct from the government entities that use them. That organisation saves the statisticians work and in some ways facilitates it, even though it means they lose control over a part of the operations. Ultimately, the data file put together is the result of a statistico-administrative co-construction. To optimise that construction, statisticians need an ongoing, formalised relationship with the entity that manages the data flows, which gives them more possibilities for intervention than they might imagine.

Wealth of administrative sources, weakness of methodology

In the statistician's armoury, administrative sources occupy a growing place for a variety of reasons. They enable survey data to be complemented and enriched, as long as matching is possible: the introduction of the future "hashed NIR" will, from this point of view, offer entirely new perspectives. Such data also sometimes replace surveys, in a context combining an explosion in the possibilities of data use (Elbaum, 2018) with, more classically, falling survey response rates and budget restrictions. Finally, they can constitute a new source, different from those usually explored by the statistical system.

And yet, statisticians who use these data know that things are far from straightforward, in particular because statistical and administrative production processes differ. There is a vast methodological corpus dedicated to survey data (see, for example, Lyberg et alii, 2012), but there is nothing of the kind for data that come from administrative sources. Admittedly, there is a sizeable literature on how to deal with one particular source or another, the problems encountered and the solutions put forward (cf. all the articles of Session 20 ("Matching – administrative files") of the 2018 Statistical Methodology Days (Journées de méthodologie statistique)). However, it is difficult to find a cross-cutting article in the French-language literature that provides a general, operational methodological framework – with the exception perhaps of Rouppert (2005) – in the way the theory of surveys does, for example. In the English-language literature, particular mention should be made of Hand's article (2018), which gives an overview of the issues raised by administrative data.

This lack of a solid conceptual framework sometimes leads to a form of home-made organisation of production, and hence to a methodological deficit. At the same time, hardly anyone asks what is specific about administrative data: they are approached, without further ado, as a whole, just a data file to be processed.

The aim of this article is to examine the specificities of a particular set of administrative data: those that come from administrative declarations. We do in fact see some welcome similarities with statistical production (see Box), but at the same time some limitations of which it is essential to be aware.

 

A specific status and usages

When we combine the adjective "administrative" with the word "data", what do we actually mean? This would be a vast subject to explore. Here we will limit ourselves to three specificities: status, usage, movement.

First of all, the term "administrative" amounts to qualifying the data by their origin and, more than that, by their status. It means that they come from the administrative sphere in general: government entities, more generally public bodies, and everything that comes under the duties of the Government. This often involves a high degree of centralisation, at least in France, which does facilitate the statistician's job.

The status of the data also tells us about their usage: they are embedded in the management processes of one or more organisations (what we will call "government entities" in what follows), which have their own agendas and their own objectives. Unlike statistical data, whose only purpose is to inform, administrative data owe their existence to the actions that they allow to be initiated. They are not neutral. For example, the data item "amount of the retirement pension" for a given individual allows the amount actually due to be calculated and the payment to be made, i.e. concrete actions in the real world.

The usage can go beyond the government entity concerned. So for example, the information in the National Joint Social Protection Register (RNCPS) can be used to identify inconsistencies between social benefits of various kinds; the Nominative Social Declaration (DSN) is used by the French Tax Office, the pension organisations, the Central Agency for Social Security Bodies (ACOSS) for their own administrative usages; the data from SIRENE (register of businesses) serve as a reference, as proof for enterprises, and they are also used by the chambers of commerce and industry and by the registries of the commercial courts.

Data freezing, the key to administrative statistics

The fact that administrative data result from a business process also has operational consequences: they are found in operational databases, which are always liable to be updated. In a way, they are living data, which can be modified according to events, internal or external. These events can occur at any time and some of them are totally exogenous, beyond the government entity's control and unpredictable.

This is not without consequences for the statistician, who is used, with surveys, to retrieving a set of data that is fixed in time and connected to precisely defined statistical units. Admittedly, the values of these data may vary due to data editing and imputation, but basically it is the same data that we are talking about. The continuously changing nature of administrative data is therefore unsuitable for the statistician, who can only work on fixed data, in file form. That file will necessarily be a snapshot at a given moment in time, but also a selection of the variables and units of interest. It is necessary to go from a world in motion to a freeze of these data over a given period.

But beware: the question of temporality is not a simple subject. It is thus necessary to distinguish at least two levels of dates: the date when the information was obtained, first, and its reference date, second. For example, in March 2018 (date obtained) we retrieve the number of employees a company had on 31 December 2017 (reference date). Things start to get complicated when you find that the reference date can become a reference period (e.g. the year 2017), and that this can involve a whole calculation, a whole reconstitution of the data: for example, the company's average full-time equivalent workforce over the year 2017.
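To make this distinction concrete, here is a minimal sketch (in Python, with hypothetical field names and figures): a value retrieved at one date, the date obtained, can summarise a whole reference period.

```python
from datetime import date

# Illustrative monthly workforce snapshots, in full-time equivalents (FTE),
# covering the reference period 2017. Names and values are hypothetical,
# not the DSN's actual schema.
monthly_fte = {
    date(2017, month, 1): fte
    for month, fte in enumerate(
        [52.0, 52.0, 53.5, 54.0, 54.0, 55.5,
         55.5, 54.0, 56.0, 57.0, 57.0, 58.0],
        start=1,
    )
}

# Retrieved, say, in March 2018 (the date obtained), but describing
# the whole of 2017 (the reference period).
average_fte_2017 = sum(monthly_fte.values()) / len(monthly_fte)
print(f"Average FTE over 2017: {average_fte_2017:.2f}")
```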

 

The notion of the administrative declaration

Administrative data can have several origins. They often result from internal management processes in the government entity concerned. But they may also originate in administrative declarations.

What does this mean? An administrative declaration is an obligation for a certain number of entities (individuals, enterprises, public bodies) to provide information in a certain form, by certain methods (internet, paper forms) and within a certain time frame. For example, the various tax returns are very precisely documented systems; they are compulsory, to be submitted within a given period and at specific intervals (income tax returns are annual), for the individuals or enterprises liable for such taxes. In the business world, the Nominative Social Declaration (DSN) is a monthly requirement (apart from events-based declarations), and the rules to be followed can be found in a versioned technical booklet that is accessible online (see C. Renne's article on the use of the DSN in the same issue), as it is essential for declarations to be standardised.

The existence of a form of obligation for administrative declarations, often imposed by legislation and regulations and backed up by a highly standardised body of documentation, is an advantage to the statistician. This obligation is backed by various forms of coercive power, such as the possibility of taking those who do not comply to court, which considerably reduces, without eliminating altogether, the risk of non-declaration.

Note, however, that the administrative declaration itself is not, strictly speaking, an administrative data source: it is not a file, it is a flow. And the way of building a fixed data file from that flow is, moreover, a subject in itself. In particular, producing such a file is not limited to simply piling up declarations. Thus, for the DSN, a company can make corrective declarations, declarations relating to earlier periods, but also events-related declarations, which may lead to data being modified. From this point of view, the declaration process differs from the survey process as there is nothing to prevent modifications being made during the "collection" period at the declarant's initiative and linked to the life of the company. Work, sometimes complex work, to reconstitute and even consolidate the data is therefore necessary to build an administrative source, based on all the declarations. In certain cases, the situation is further complicated by the use of several types of administrative declarations.
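To fix ideas, here is a minimal consolidation sketch, under deliberately simplified assumptions: the latest filing for a given company and reference period supersedes earlier ones, and the field names are hypothetical. Real DSN consolidation, with events-based declarations and corrections across periods, is considerably more involved.

```python
from dataclasses import dataclass

@dataclass
class Declaration:
    company_id: str
    reference_period: str   # e.g. "2018-03"
    filed_at: str           # filing timestamp, ISO 8601
    payload: dict

def consolidate(flow: list[Declaration]) -> dict[tuple[str, str], Declaration]:
    """Keep, for each (company, reference period), the latest filing:
    a corrective declaration supersedes the one it corrects."""
    snapshot: dict[tuple[str, str], Declaration] = {}
    for decl in sorted(flow, key=lambda d: d.filed_at):
        snapshot[(decl.company_id, decl.reference_period)] = decl
    return snapshot

flow = [
    Declaration("123", "2018-03", "2018-04-05T10:00", {"employees": 41}),
    Declaration("123", "2018-03", "2018-05-06T09:00", {"employees": 42}),  # correction
]
print(consolidate(flow)[("123", "2018-03")].payload)  # {'employees': 42}
```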

 

The concentrating organisation, a declaration "hub"

This flow management logic leads to a fundamental distinction between two entities: on the one hand, the administrative agency, using the data for its own management purposes, and on the other the concentrating organisation. The latter is an administrative data hub that manages the declaration process overall: it creates the necessary information system, documents it, runs the governance bodies and working groups, organises the contact with the declarants, manages the classifications, communicates and monitors the process. It concentrates all the flows, incoming from the declarants, outgoing to the user government agencies. But this hub is not necessarily a part of those administrative agencies, as its role as a flow manager is a separate one.

This is typically the role of the GIP-MDS (Public Interest Grouping for the Modernisation of Social Declarations) with the net-entreprises platform, where social declarations like the Nominative Social Declaration (DSN) can be filed. But others that can be mentioned include the ATIH (Technical Agency for Hospitalisation Information) with the e-PMSI platform (https://www.epmsi.atih.sante.fr/), which is designed to collect and analyse information about hospitalisation, or the SANDRE (National Service for Water Data and Reference-dataset Management), which guarantees the interoperability of water-related information systems (even if in both these cases the requirement to make declarations is less formal than for the DSN). Managing such data collection on an industrial scale is a professional activity in its own right, requiring a high level of technicality (IT, project management, law), a tried and tested, responsive organisation, and a governance that includes user and contributor organisations. The user entities are, for example, the Tax or Social Contributions Authorities in the case of social declarations, the National Health Insurance Organisation (CNAM) and the hospitals in the case of the PMSI data, and the French Biodiversity Agency (AFB) and the Water boards in the case of the SANDRE.

With the declaration, we are also in a logic of standardised flows, clearly positioned in time and linked to an obligation to declare information to given entities. If the declaration process is rigorously managed by the concentrating organisation, we know what happens to the data from the single primary source of the information. In particular, data checks are carried out at one place only and they are documented.

 

Similarities with survey statistics: the scope, the variables...

The concentrating organisation manages a process that has some points in common with the statistical process (see Box), starting with the scope/variables/legal framework trio.

The scope is as a rule formally identified and known: it is all the persons, or entities, liable to make the declaration. It is the subject of a definition, with a general framework, but also exemptions and special cases: for example, enterprises and firms "with difficulties achieving the configuration on the complementary organisations part" were exempted from the DSN in phase 3.

The list of data to be collected may be the subject of discussions between the parties involved, as it is necessary to take account of all the parties' needs without multiplying the data to be collected (this was precisely the problem with the DADS (Annual Declaration of Social Data)). This process of consultation, based on the law and seeking to achieve mutualisation via a common conceptual model, makes the decisive difference between the DADS and the DSN. The data to be collected are defined and documented, as is the way the exchanges are standardised: here we are thinking of the technical booklet for the DSN, but also the very rich documentation on the SANDRE (data dictionary, data model, exchange scenarios, organisational and technical rules for the reference data). In addition, the concentrating organisation draws up, maintains and disseminates the reference classifications, which is a substantial task: thus, for the ATIH, the International Classification of Diseases (ICD) or the Common Classification of Medical Interventions (CCAM).

 

... and the way the collection system works

Now the data have to be collected. First of all, it should be noted that the data collection can look very much like an online questionnaire. It is generally exhaustive... for the scope in question. Subject to this reservation, exhaustivity is achieved by rigorously following up those who fail to make the declaration, which is generally done much more seriously than the follow-up of non-respondents in surveys: the coercive power resulting from the obligation to declare is out of all proportion to the very limited power statistical institutions have over non-respondents; everybody is aware of the consequences of failing to file a tax return late or not at all. And yet we see a strange under-use (and even non-use) by statisticians of the very comprehensive information produced by the follow-up of non-declarants. This is the case for the DSN, for example.

Like statistical questionnaires, administrative forms are also subject to data editing, although the logic is slightly different from that of data editing in the statistical production process. When a DSN declaration is filed and a check reveals something that is not satisfactory, the declaration will not "go through": it is blocked, as if it had not been sent. The declarant then has to repeat the operation until the declaration "goes through". The edits include checks on membership of a pre-defined list (e.g. the official list of country codes, postcodes), on the formal consistency and nature of the data (numerical, alphanumerical, date, etc.), and even on its fine structure (cf. the complex rules concerning the form of e-mail addresses). But these automatic verifications are limited to "hard" edits, which absolutely must be satisfied: care is taken not to implement too many checks, as this would seize up the declaration process.
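As an illustration of such blocking edits, here is a short sketch; the field names, code list and simplified e-mail rule are ours, not the DSN's. A declaration failing any "hard" edit is rejected outright.

```python
import re

COUNTRY_CODES = {"FR", "BE", "DE"}  # stand-in for the official code list
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simplified

def hard_edits(record: dict) -> list[str]:
    """Return the list of blocking errors; an empty list means the declaration 'goes through'."""
    errors = []
    if record.get("country") not in COUNTRY_CODES:
        errors.append("unknown country code")
    if not str(record.get("postcode", "")).isdigit():
        errors.append("postcode must be numeric")
    if not EMAIL_RE.match(record.get("contact_email", "")):
        errors.append("malformed e-mail address")
    return errors

record = {"country": "FR", "postcode": "75A14", "contact_email": "payroll@firm.fr"}
errors = hard_edits(record)
if errors:
    # Blocked, as if never sent: the declarant must correct and resubmit.
    print("Declaration rejected:", errors)
```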

 

On the other hand, a loss of control for the statistician

The way in which the collection of information is prepared and organised, although there are significant similarities with the statistical process (basically the first four phases of the GSBPM), does, however, have one major disadvantage: all the phases thus described are under the responsibility of the administrative data hub and therefore out of the hands of the statistical service. The scope, data and form used are all things that can change over time, without the statistician being able to exercise the slightest control over any of these aspects. For example, if a category of the population is no longer liable for one tax or another and therefore disappears from the scope, the statistician has no power over that decision. Nor does he have any say in the variables collected or in their very definition: the latter is the subject of much discussion within the governance bodies, where the Ministerial Statistical Departments are not those with the most weight. The statistician therefore loses control over the concepts and the framework of the data collection.

He also loses responsibility for the collection itself to a third party, the flow management organisation. The statistical production services also see an essential part of their work disappear along with that: the direct contact with the data collection units and the possibility of checking the information immediately with the persons interviewed. This interaction with those in the field, this anchoring in reality, vanishes.

In place of a living, dynamic collection process, there is just a data file, a static and soulless projection of an outsourced process.

More generally, the statistician does not decide how these data are obtained: the organisation of the declaration process, the relations with the declarants (FAQ, communication), the information system as a whole. Finally and especially, this whole organisation can also change over time, as a result of new legislation, without him being consulted.

 

In reality, a statistico-administrative co-construction

Administrative declarations do not provide the statistical offices with ready-to-use information. The statistician still has a lot of work to do downstream, as the data are often unusable as they are. To arrive at a clean individual data file suited to their uses, statisticians therefore have to set in train a second process after the purely administrative process. The result is therefore a co-construction between these two worlds. Statisticians therefore have to carry out several operations:

  • Extra checks, which have not been carried out on the administrative side: typically credibility checks, which verify the plausibility of a combination of data on subjects that are sensitive from a statistical point of view. For example, it will be checked whether the name of the profession declared is consistent with the PCS code, or whether the amounts of income are consistent with each other.
  • Data transformation: data from administrative declarations are to a certain extent pivot data, which each government entity uses for its own purposes. This also applies to the statistical offices: to obtain the real data useful for studies, it is sometimes necessary to calculate new data derived from the declaration. These "transformations" can take very simple forms (e.g. calculating age from a date of birth, or an age group from the age; see the sketch after this list). They can also result from a more complex calculation (e.g. determining periods of activity from start and end dates on separate declarations) or require transformations of classifications, whose level of detail is sometimes of no use for the purposes in question (e.g. you do not always need all the details of the industry classifications). Whether simple or not, these transformations must be rigorously defined so that they can be applied systematically.
  • Change to different statistical units, which involves a higher degree of complexity; for example the DSN "reasons" in units like the employee, the contract, whereas statisticians are more likely to be interested in the job in some situations. It is therefore necessary to reconstitute the corresponding data, which requires a lot of rigour.
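By way of illustration, a minimal sketch of such derivations (in Python; the field names and the age grouping are ours, purely hypothetical):

```python
from datetime import date

def age_at(reference: date, born: date) -> int:
    """Age in completed years at the reference date."""
    before_birthday = (reference.month, reference.day) < (born.month, born.day)
    return reference.year - born.year - before_birthday

def age_group(age: int) -> str:
    """Coarser derived classification (hypothetical grouping)."""
    for upper, label in [(25, "under 25"), (40, "25-39"), (55, "40-54")]:
        if age < upper:
            return label
    return "55+"

born = date(1975, 6, 14)               # declared date of birth
age = age_at(date(2018, 12, 31), born) # derived variable
print(age, age_group(age))             # 43 40-54
```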

Through all these operations, the data are formatted and brought into line with statistical quality standards, replacing the original administrative formatting and standards.

In the end, the concentrating organisation performs a large part of the statistical production process in the place of the statistical office (collection, interaction with the declarants, following up non-declarants, a large part of the data editing) and the latter completes it with additional data editing, adjustments, transformation and bringing into line with statistical standards. With one significant proviso, the loss of control: nothing is imposed on the organisation in question.

 

Data quality, a key question

Using administrative data for statistical purposes requires "quality" data… David J. Hand's article on the subject mentions "quality problems" with administrative data on several occasions. But what are we talking about exactly? Issues relating to the non-quality of data in general are the subject of a vast literature (for example, McCallum, 2012), but for statistical uses they come into play on several levels: whether the population matches the desired scope, the validity of the data semantics (which may differ from what the statistician wants), the existence of adequate edit rules (certain data considered of little importance are not even checked), problems with dates, statistical units and so on.

In the case of data from administrative declarations, the fact that checks have already been carried out as part of the declaration process, and that they are accurately documented and versioned, already constitutes a large step, a first-level guarantee. Essentially each edit rule corresponds to a property that the declared data must respect: belong to a list, be numerical, be present if such an item is present... But these are guarantees of form or structure that do not ensure sufficient quality.

For beyond the syntax, it is necessary to agree on common, clearly defined semantics. In the examples cited above, such as the DSN or the water information system, there is therefore, well before any consideration of standardising exchanges, a common conceptual data model applying the standards of the Unified Modelling Language (UML). This leads to a rigorous definition of the meanings of the data, the concepts and the links between concepts. On these questions, it should be emphasised that Belgium played a precursor role when it set up, back in 1990, the "Crossroads Bank for Social Security", a public social security institution in charge of exchanging data between the different social security institutions – and a typical example of a concentrating organisation (or administrative data hub). With the Crossroads Bank, which was a real source of inspiration for the DSN, a conceptual data model was gradually developed, first shared (Robben et alii, 2006) and then made the subject of a law rendering it legally enforceable. As a result, all the data flows from companies use the same semantics and the same, regularly versioned classifications, the entire system being fully documented and placed online. This system, which was well ahead of its time, is still in operation and efficient.

 

Closed world assumption and backtracking

The semantics that we are talking about here characterise a world that is not unchanging. Accordingly, combinations of data that are considered anomalies at a given moment may well no longer be so at a future point in time, because the world has changed (e.g. same-sex married couples, a situation that was impossible at one time and possible later). Assuming that data quality can be defined by a list of formal properties, verifiable by formal edit rules (e.g. checking that the two people in a couple are of different sexes), amounts to implicitly applying the "closed world assumption" – whereas, as we have just seen, that world inevitably changes.
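As a toy illustration of this point, here is a sketch in which each edit rule carries a validity period, so that the same record is an anomaly at one date and legitimate at another. The rule, field names and cut-off date are illustrative (in France, same-sex marriage became legal in May 2013).

```python
from datetime import date

# Each rule is only applied while it is legally valid.
RULES = [
    {
        "name": "spouses must be of different sexes",
        "valid_until": date(2013, 5, 17),  # illustrative cut-off
        "check": lambda couple: couple["sex_1"] != couple["sex_2"],
    },
]

def anomalies(record: dict, at: date) -> list[str]:
    """Return the names of the rules the record breaches at the given date."""
    return [
        rule["name"]
        for rule in RULES
        if at <= rule["valid_until"] and not rule["check"](record)
    ]

couple = {"sex_1": "M", "sex_2": "M"}
print(anomalies(couple, date(2010, 1, 1)))  # ['spouses must be of different sexes']
print(anomalies(couple, date(2018, 1, 1)))  # []
```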

How can we give these developments an operational framework? The answer comes to us from Belgium again: in what has become a reference work based on her thesis, Boydens (2000) provides a highly original insight into the subject with the notion of "layered temporalities", taken from the work of the historian Fernand Braudel. The data are subject to three temporalities: legal temporality, the temporality of databases and the temporality of the real world. The three are not necessarily aligned or synchronous, far from it: the law may adjust to the real world with a delay, and database structures do not have the flexibility needed to adapt automatically to change.

To bring these temporalities closer together, failing the ability to synchronise them, the backtracking method (Redman, 1996; Boydens, 2000, 2018) offers an effective operational approach, which the author has applied to administrative data, more specifically to social security data.

This method consists of identifying, from the breaches of "business rules" most frequently observed at the source within the database, their structural causes, and of remedying them by tracing the flows of information back upstream, which enables action to be taken much more quickly and reliably, since the verification is done at the source. What is the principle? Data flows, whether administrative or statistical, are naturally subjected to data checks. For example, if a given benefit cannot be paid above a certain level of income, the consistency between the "existence of the benefit" and "income" data will be checked, and if an inconsistency is found between the two, it will generate an anomaly. So far, there is nothing especially original about this.

The innovative element (among others) is considering the anomalies as objects of interest in their own right, and monitoring changes to them over time by type of anomaly. And as soon as one of these types of anomaly starts to become frequent, this provides us with information, potentially, on the evolution of the underlying reality (for example, in the case mentioned, the existence of significant exemptions concerning the income ceiling). The next step is therefore to come back to the source to understand it better and take action with the data providers, or change the edit rules.
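A schematic sketch of this monitoring idea, under assumed inputs: a log of (period, anomaly type) pairs, with an arbitrary doubling threshold for flagging a type whose share is rising.

```python
from collections import Counter

def anomaly_shares(log: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
    """log holds (period, anomaly_type) pairs; return each type's share per period."""
    per_period = Counter(period for period, _ in log)
    per_pair = Counter(log)
    return {
        (period, a_type): count / per_period[period]
        for (period, a_type), count in per_pair.items()
    }

def rising_types(shares, earlier: str, later: str, factor: float = 2.0) -> list[str]:
    """Anomaly types whose share more than doubled between the two periods:
    a signal to go back to the source or revise the edit rule."""
    return [
        a_type
        for (period, a_type), share in shares.items()
        if period == later and share > factor * shares.get((earlier, a_type), 0.0)
    ]

log = (
    [("2018-01", "benefit/income mismatch")] * 2
    + [("2018-01", "missing postcode")] * 98
    + [("2018-02", "benefit/income mismatch")] * 10
    + [("2018-02", "missing postcode")] * 90
)
shares = anomaly_shares(log)
print(rising_types(shares, "2018-01", "2018-02"))  # ['benefit/income mismatch']
```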

This technique, used in Belgium with the government entities, genuinely allows the quality of the data to be improved, by means of these pertinent exchanges with the providers, with "an ROI that reaches almost 50% in terms of the reduction in the share of presumed anomalies to be dealt with" (Boydens, 2018). The process has even gone much further, since the backtracking method was made official in a Royal Decree.

 

Another way of looking at the statistician's job

In conclusion, administrative declarations offer many advantages for statisticians, in particular because of similarities in the production process, but, like administrative data in general, they have one major drawback: the loss of control by the statistical system, which no longer has decision-making powers over major aspects (scope, data collected).

To overcome this problem, or at least contain it, it is crucial for the statistician to work in a regular and structured way with the organisation concentrating the flows, going far beyond the simple provision of data: in the design of the data and the edit rules, the updating of shared reference documentation or the management of exchanges with the data providers, in the spirit of backtracking. This represents a cultural change, but it is a way of limiting the misunderstandings between the statistics world and the administrative world. Such cooperation can trigger a virtuous circle that is conducive to a sustained and controlled improvement in the quality of data – and the statistics produced from them.

The use of administrative data for statistics, seen here through the specific case of declarations and therefore of the industrialised management of flows, thus opens up new horizons for statistics: other ways of working, new jobs and even new ways of thinking about our activity, which oblige us to break away from the usual models.

Box - The standard statistical production process of a survey

Used as a reference by all the National Statistical Institutes (NSIs), the Generic Statistical Business Process Model (GSBPM)* provides a standard breakdown into activities based on 8 main phases: specify needs, design, build, collect, process, analyse, disseminate, evaluate. The GSBPM also mentions the processing of administrative data sources, but here we will present more specifically the process based on a survey, in an extremely simplified way, in its broad principles.

Let us begin with the basics. In order to carry out a survey, the public statistician essentially needs three things to get started: a legal framework, a list of the data he wants to collect and a list of the entities from which he can obtain them.

  • In France, a list of surveys is decided every year and published in an Order issued by the Ministry of the Economy and Finance, after the National Council for Statistical Information (CNIS) has given its approval ("avis d’opportunité") and the Quality Label Committee has declared it compliant, these two approvals officialising the survey and in particular formalising the obligation to respond.
  • Determining the data of interest means having clarified their meaning and agreed on the information needs, on a suitable way of asking the question and on whether it is even possible to obtain the data. Gradually, and by means of user committees in particular, a questionnaire is drawn up, then a collection medium is decided upon (paper questionnaire, electronic questionnaire for the investigator to use, or on the internet…), and controls are also added to the latter.
  • Defining what entities to question, first of all means clarifying the scope, which amounts to defining an intended set: for example, enterprises in the trade sector that were active in 2018, in Metropolitan France; or households living in border regions. Then an extended list is put together, which will be the survey frame (for example, the master sample, or any sampling frame taken from the SIRENE register), and after that the sample is drawn, which will correspond to all the entities that will actually be questioned.

Now that the three tools – the legal framework, the collection medium and the sample – are in place, the collection process can begin. This takes place over a clearly defined period, the statistical department contacting the collection units. Depending on the survey, the modes of data collection may differ (face-to-face, telephone, internet) and in some cases, be combined.

The data collected are subjected to automatic checks for inconsistencies, which enables doubtful data or combinations of data to be picked up, for example a person who states both that they are retired and that they are 20 years old. The individual checks carried out afterwards are connected to these potential inconsistencies picked up automatically. They may require interaction with the persons surveyed: directly during data collection by the investigators, or afterwards by the managers, by telephoning or sending out reminder letters in the case of surveys of enterprises. It should be noted that the statistical collection process assumes that there will be non-responses, either total or partial.

After this, there is a processing and analysis phase: handling non-responses (imputation, re-weighting), extra checks (macro-checks in particular), tabulation, calculation of changes over time, etc. Then come the publications and studies, i.e. the dissemination phase, and finally, evaluation.

 

* See the Eurostat site.

Notes

"Plunging response rates to household surveys worry policymakers", The Economist, 24 May 2018.

Dealing with sample drawing, marginal calibration, variance calculation, treatment of non-response, etc.

Alain Desrosières wrote on this subject in 2004: "no general theory of error yet exists".

Desrosières (2004): "An administrative source comes from an institution whose purpose is not to produce such information, but whose administrative activities imply the keeping, according to general rules, of individual files or records, the aggregation of which is only a by-product".

Public Interest Grouping for the Modernisation of Social Declarations.

The PMSI (Information Systems Medicalisation Programme), which became mandatory in 1996, is intended to define the activity of public hospital departments in order to calculate their budget allocations.

National Service for Water Data and Reference-dataset Management.

Cf. the numerous governance structures associated with the DSN, and before that the DADS, which involved all types of social protection organisations, tax departments – and statistical offices.

Royal Decree of 2 February 2017 amending Section IV of the Royal Decree of 28 November 1969 implementing the Law of 27 June 1969 revising the legislative Order of 28 December 1944 concerning workers' social security.

Further reading

Boydens I., "Informatique, normes et temps", Éditions Bruylant, 2000.

Boydens I., "Data Quality & "backtracking": depuis les premières expérimentations à la parution d’un Arrêté Royal", Smals Research, mai 2018.

Desrosières A., "Enquêtes versus registres administratifs : réflexions sur la dualité des sources statistiques", Courrier des statistiques, N°111, September 2004.

Elbaum M., "Les enjeux des nouvelles sources de données", Chroniques, N°16, CNIS, September 2018.

Hand D.J., "Statistical challenges of administrative and transaction data", J. R. Statist. Soc. A 181, Part 3, pp. 555–605, 2018.

Journées de méthodologie statistique 2018, session 20 ("Appariements – fichiers administratifs") http://jms-insee.fr/programmejms2018/.

Lyberg L., Biemer P., Collins M., De Leeuw E., Dippo C., Schwarz N. and Trewin D., "Survey Measurement and Process Quality", Wiley, 2012.

McCallum E. Q., "Bad data handbook", O’Reilly Media, 2012.

Redman T., "Data quality for the information age", Artech house, 1996.

Robben F., Desterbecq T. and Maes P., "L’expérience de la Banque-carrefour de la Sécurité sociale en Belgique", Revue des politiques sociales et familiales, n° 86, "La nouvelle administration. L’information numérique au service du citoyen", pp. 19-31, 2006.

Rouppert B., "Modélisation du processus de traitement d’une source administrative à des fins statistiques", document de travail SGI, Insee, 2005.