Courrier des statistiques N6 - 2021

In this sixth issue, the Courrier des statistiques (Statistics Courier) examines four data sources, two methods and one institution, while remaining open to the outside world, both in France and abroad.

With the 2021 redesign, the Labour Force Survey is modernising its data collection methods and complying with European requirements. Fidéli, a demographic file on dwellings and individuals, has become indispensable, particularly as a pivotal tool for social studies. The permanent demographic sample, with its extended possibilities, brings temporal depth to the analysis of individual trajectories. Finally, the RGCU, a gigantic database on professional careers, designed by the main pension scheme in France (Caisse nationale de l’assurance vieillesse - CNAV), promises to become a valuable source for researchers.

But how can files be matched without a common identifier? The Directorate of Evaluation, Forecasting and Performance Monitoring (Direction de l’évaluation, de la prospective et de la performance - DEPP) presents its method, through its information system on the integration of young people into working life. Upstream, how can administrative databases be improved? To this end, Belgium has institutionalised and implemented an approach that favours preventive methods based on the analysis of anomalies.

The issue concludes by explaining how the National Council for Statistical Information (Conseil National de l’information Statistique - CNIS) organises dialogue between users and producers of official statistics, to ensure the relevance of statistical outputs and to improve them.

Courrier des statistiques
Published on: 02/10/2023
Loïc Midy, Director of the Measuring Youth Integration Project (Mesure de l’insertion des jeunes), DEPP
Courrier des statistiques- October 2023
A matching tool using indirect identifiers: the example of the information system on the integration of young people (système d’information sur l’insertion des jeunes)

The statistical office of the French Ministry of National Education (the DEPP) carries out two surveys on the labour market integration of students who have just completed an apprenticeship or school-based vocational training. However, these surveys do not allow statistics to be published at establishment level, as required by the 2018 Act for the freedom to choose one’s professional future. The DEPP and the DARES (the statistical office of the Ministry of Labour) have therefore designed a new information system, InserJeunes, based on the record linkage of administrative data sources. Record linkage is central to this system from the methodological, algorithmic and IT development standpoints. In InserJeunes, the record linkage process has five steps: data standardisation, indexing, similarity calculation, supervised classification and quality assessment. The methods are presented through a real production example from the InserJeunes information system. They were implemented in a record linkage tool developed by the InserJeunes team, which can be reused for other record linkage processes.

Better understanding of the effectiveness of establishments in integrating young people

The guidance of pupils is built up throughout their schooling, with key stages at the end of year 10, year 11 and year 13. Thus, career guidance can begin as early as the end of the year 10, with a choice between an apprenticeship or school-based vocational training. As integration in employment is the primary objective of vocational training, knowing the integration rates of initial training courses makes it possible to inform the choices of young people and their families.

Since the early 1990s, the Directorate of Evaluation, Forecasting and Performance Monitoring (direction de l’Évaluation, de la prospective et de la performance – DEPP) has conducted two surveys making it possible to monitor the entry into active life of those leaving apprenticeships and school-based vocational training. These operations provide valuable information, but do not allow for the publication of statistics at establishment level, given the response rates.

However, the 2018 Act for the freedom to choose one’s professional future provides for the publication of statistics by establishment on the educational pathway and integration into employment of young people in vocational training. In order to meet this need, DEPP and the Directorate of Research, Economic Studies and Statistics (direction de l’Animation de la recherche, des études et des statistiques – DARES) created a new information system, InserJeunes, which is based on the matching of comprehensive administrative sources relating to the schooling of pupils and apprentices, exam success, apprenticeship contracts and salaried contracts of the Nominative Social Declaration (déclaration sociale nominative - DSN). The first results were disseminated in early February 2021.

On the national education side, the pupil databases can be matched using a specific identifier, the National Pupil Identifier (identifiant national élève - INE). However, there is no common identifier to match the “schooling” databases of pupils and apprentices with the salaried contracts of the DSN. These matching operations can therefore only be performed indirectly, based on five variables: surname, forename(s), date of birth, place of birth and sex. Setting up an efficient, high-quality matching tool based on indirect identifiers is therefore a key challenge for InserJeunes. This issue, which is well known to statisticians, is the subject of a vast literature (see for example (Kilss and Alvey, 1985)).

This article presents the overall approach adopted, the main sources used, the legal framework to be respected and the choices made between the different methods and digital matching tools based on indirect identifiers.

The principles of the InserJeunes system

The main process is structured around several phases (Figure 1). Initially, for a given school year, the field of students in the final year of training is calculated by using three databases, each covering part of the InserJeunes field: apprenticeships, the school-based vocational route in an establishment of the French Ministry of National Education and the school-based vocational route in an establishment of the French Ministry of Agriculture. These databases contain the indirectly identifying variables, as well as the INE, and information on the establishment and the training followed.

In a second phase, the coverage of people leaving training is established, i.e. those who are no longer in training. To do this, we search, mainly on the basis of the INE, whether these pupils are still present the following school year in all the available pupil databases, i.e. the three databases already used in the previous phase as well as , in order . . This makes it possible to establish study continuation rates.

In the third phase, the pupil/apprentice databases are enriched with their exam passes (depending on the case, this matching is carried out using the INE or indirect identifiers), which makes it possible to calculate the rate of interruption during training.

Finally, in the fourth phase, the databases of those leaving school/apprenticeships are matched with the DSN, which makes it possible to measure a rate of salaried employment in France for those leaving training then the . The DSN contains detailed information on the employee contracts (type of contract, wage, working time, socio-professional category, etc.) as well as on the employing establishment (sector, commune in which it is located, etc.): as a result, InserJeunes can also be used to carry out statistical studies, for example, on training/job suitability.

In InserJeunes, the matching rate does not in itself provide any indication of the level of quality of the process. For example, when a person leaving training is not matched with the DSN, it is not possible to know whether this is because they are not in salaried employment or because of an error in the matching process. However, the system includes an annual matching based on indirect identifiers called “quality matching”, for which the theoretical rate is 100%: this involves reconciling the file listing apprentices with the DSN. Since apprentices hold an employment contract, each of them should in principle be found in the DSN. Thus, the actual matching rate obtained is an indicator of the level of quality of the matching process.

The InserJeunes statistical process involves a total of ten matching operations based on indirect identifiers for each school year. The issue of matching based on indirect identifiers is therefore key. This requires developing a general matching process and then implementing it digitally in a generic and fast manner.

Figure 1. The sources of the InserJeunes system

 

At the heart of the information system: a five-step matching process

In InserJeunes, , without double counting. The matching process chosen for InserJeunes (Figure 2) consists of five successive steps, as is also the case in Peter Christen’s presentation (Figure 3) (Christen, 2012).

First of all, the data are standardised. Then comes the indexing stage, which consists of establishing a reasonable sized list of “potentially interesting” pairs. A pair corresponds to the cross-tabulation of a row from the first table with a row from the second table. Each pair therefore consists of a/some surname(s), a/some forename(s), a date of birth, a place of birth and a sex variable from each of the two tables being matched. Thirdly, a similarity is calculated for each of the five pairs of indirect identifiers in each pair (e.g. pair of names, pair of dates of birth). Fourthly, each pair is classified: pairs assumed to be from the same individual (i.e. where the five similarities calculated in the previous step are sufficiently high) are accepted and the others are rejected. Finally, the quality of the matching process is assessed.

Figure 2. The process for matching two tables

 

Standardising the data

The indirect identifiers used in the matching are presented in heterogeneous formats in the various sources used in InserJeunes. Data standardisation, the first step in the matching process, consists in recoding the data according to a common structure in order to facilitate further processing.

For surnames and forenames, the following main processing operations are carried out:

  • ;
  • deletion of special characters;
  • ;
  • deleting one-letter and some two- and three-letter surnames/forenames, so as to remove uninformative terms such as DE and LE.

Furthermore, the day, month and year of each date of birth are stored in separate variables, so that calculations can be performed on birth dates even when they are only partially filled in.
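
The standardisation operations listed above can be sketched as follows. This is a minimal illustration, not the InserJeunes code: the upper-casing and accent stripping, the `PARTICLES` list (the article only cites DE and LE) and the `DD/MM/YYYY` input format are assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical list of uninformative particles; the article only cites DE and LE.
PARTICLES = {"DE", "LE"}


def standardise_name(raw):
    """Return the informative tokens of a surname or forename."""
    # Assumption: names are upper-cased and accents are stripped.
    text = unicodedata.normalize("NFKD", raw.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Deletion of special characters: anything outside A-Z becomes a separator.
    text = re.sub(r"[^A-Z]+", " ", text)
    # Drop one-letter tokens and the listed short particles.
    return [tok for tok in text.split() if len(tok) > 1 and tok not in PARTICLES]


def split_birth_date(raw):
    """Store day, month and year separately; unknown parts become None.

    Assumption: dates arrive as 'DD/MM/YYYY', with '00' or blanks for
    the parts that are not filled in.
    """
    parts = (raw.split("/") + ["", "", ""])[:3]

    def to_int(part):
        part = part.strip()
        return int(part) if part.isdigit() and int(part) > 0 else None

    day, month, year = (to_int(p) for p in parts)
    return {"day": day, "month": month, "year": year}


print(standardise_name("de L'Hôpital"))   # ['HOPITAL']
print(split_birth_date("00/07/2002"))     # {'day': None, 'month': 7, 'year': 2002}
```

Keeping the three date components apart means that a record with an unknown day can still be compared on month and year.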

The sources used in InserJeunes are of good quality. Indeed, there are no duplicates in the main sources, there are very few missing values for the indirect identifiers, and the code of the commune of birth, which is much more precise than the name of the commune, is generally provided. This is due to the fact that, for all pupils, registration in the National Register of Pupil, Student and Apprentice Identifiers has already required all identifying variables to be provided. Similarly, each employee in the DSN source is subject to a certification procedure for the associated , which also ensures a high level of quality of the identifying variables.

Indexing the data: the naive approach

For the second step, that of indexing the data, an initial naive approach consists in analysing all the possible cross-tabulations between the two tables. However, the processing time increases quadratically with the number of observations in the tables to be matched and therefore, in practice, this method is no longer applicable beyond a certain threshold.

In the case of quality matching, about 315,000 apprentices are reconciled to the 7.5 million employees with an active contract in December of the year under consideration. The naive approach would therefore have to examine about 2.4 trillion pairs (315,000 × 7.5 million).

Furthermore, there is no point in analysing all the pairs. Indeed, where the surnames or forenames or dates of birth are very different, it is extremely unlikely that the pair will be accepted.

The objective of the indexing stage is to establish a reasonable sized list of “potentially interesting” pairs. The indexation method chosen must combine two apparently contradictory objectives: drawing up a list of pairs that is as small as possible, while ensuring that as many pairs as possible relating to the same individual are included in the list.

Indexing the data: the traditional blocking key approach

The most common method used to index data is to keep only those pairs that share the same modality of one or more indirectly identifying variables, which are called blocking keys (Christen, 2012; Jabot and Treyens, 2018). For example, if the blocking key is the code of the commune of birth, this means that only pairs sharing this code will be retained.

This method was not chosen for InserJeunes, as it has several disadvantages. First of all, it produces a list of “potentially interesting” pairs that is still too large. Furthermore, it leads to some pairs being discarded incorrectly. For example, if the blocking key is the code of the commune of birth, and if this variable has been entered incorrectly by an individual, then that individual will never be matched. Finally, it is not possible to apply a blocking key to surnames or forenames, as the slightest typing or spelling error will result in a different blocking key. To solve this problem, the surnames and forenames can indeed be replaced with their phonetic version. There are numerous phonetic algorithms that can be used for this purpose (Box 1). Let’s use the example of a pair with the names “christina” and “kristina”. These two forenames will have the same phoneticised version when using Phonex (i.e. c623). In contrast, “peter” and “pedro” have the same phoneticised version when using Soundex (i.e. p360), which means that, depending on the algorithm, we will still have too large a list of “potentially interesting” pairs. Moreover, these phonetic algorithms were initially developed for the English language and not all of them have been adapted for the French language.

Box 1. Introduction to Phonetic Algorithms

A phonetic algorithm is an algorithm designed to index words based on their pronunciation. For example, the Soundex algorithm operates as follows:

1. It retains the first letter in the string.

2. It deletes all appearances of the letters: a, e, h, i, o, u, w and y (unless it is the first letter of the name).

3. It assigns a numerical value to the remaining letters in the following manner (in the version for English-language names):

  • 1 = B, F, P, V
  • 2 = C, G, J, K, Q, S, X, Z
  • 3 = D, T
  • 4 = L
  • 5 = M, N
  • 6 = R

4. If two (or more) letters with the same number are adjacent in the original name, or if there is only an h or a w between them, then only the first of these letters is retained.

5. It returns the first 4 elements. If there are fewer than 4 elements, they are supplemented with zeros.

Here are some examples of the application of phonetic algorithms:

 

 
Initial name   Phonetised name
               Soundex   Phonex   NYSIIS   Double Metaphone
christina      c623      c623     chra     krst
kristina       k623      c623     cras     krst
peter          p360      b360     pata     ptr
pedro          p360      b360     padr     ptr
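
The five Soundex steps described in this box can be written as a short function. This is a sketch of the English-language version only; the function name and the lower-case output (chosen to match the table) are choices made for the example.

```python
# Numerical values of step 3 (English-language version).
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    # Step 1: the first letter is retained as is.
    result = [letters[0].lower()]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        # Step 4: h and w do not separate two letters with the same number.
        if ch in "HW":
            continue
        code = CODES.get(ch, "")  # Step 2: vowels and y have no code.
        # Steps 3 and 4: keep the number unless it repeats the previous one.
        if code and code != previous:
            result.append(code)
        previous = code
    # Step 5: first 4 elements, supplemented with zeros.
    return ("".join(result) + "000")[:4]


for name in ("christina", "kristina", "peter", "pedro"):
    print(name, soundex(name))
```

Run on the four names of the table, the function returns the values of the Soundex column.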

Figure 3. The Five Steps Defined by Christen

 

Indexing the data: the approach chosen for InserJeunes

Taking into account the drawbacks of the traditional blocking key approach, a specific indexing method has been developed for InserJeunes.

First, an exact match between the two tables is made on all of the following fields: first surname, first forename, day, month and year of birth, code of the commune of birth and sex. As part of the quality matching, this step matches approximately 84% of apprentices with the DSN source. Once this is done, only about 50,000 apprentices remain to be matched with 7.2 million employees with an active contract in December. The volume of work is thus already considerably reduced.

Second, the union (without duplicates) of the following three lists of pairs is established:

  • pairs that have a small distance between the first surnames, a small distance between the first forenames, the same department of birth and the same year of birth;
  • pairs that have the same date of birth and the same department of birth;
  • pairs that have the same first surname and first forename.

InserJeunes uses the Levenshtein distance for surnames and forenames. This corresponds to the minimum number of characters that must be deleted, inserted or replaced to move from one surname/forename to another. The union of the different queries allows us to cover all frequently encountered cases and thus to keep almost all “potentially interesting” pairs. Moreover, as each query is relatively precise, the number of “potentially interesting” pairs retained is not too high. For quality matching, this indexing method leads to the retention of 1 million pairs, which is a reasonable number that can be processed sufficiently quickly in the later steps of the process.
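
The union of the three lists can be illustrated with a small sketch. It is not the InserJeunes implementation: the record layout, the `max_dist` threshold (the article only says “small distance”) and the nested loop are assumptions made for readability. A production version would express each list as an indexed query rather than compare every row with every other.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions or replacements from a to b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # replacement
        previous = current
    return previous[-1]


def candidate_pairs(table_a, table_b, max_dist=2):
    """Union (without duplicates) of the three lists of pairs.

    Each record is a dict with the hypothetical keys 'surname', 'forename',
    'birth' (a (day, month, year) tuple) and 'dept' (department of birth).
    """
    pairs = set()
    for i, a in enumerate(table_a):
        for j, b in enumerate(table_b):
            same_dept = a["dept"] == b["dept"]
            # List 1: close surnames and forenames, same department and year.
            close_names = (same_dept and a["birth"][2] == b["birth"][2]
                           and levenshtein(a["surname"], b["surname"]) <= max_dist
                           and levenshtein(a["forename"], b["forename"]) <= max_dist)
            # List 2: same date of birth and same department of birth.
            same_birth = same_dept and a["birth"] == b["birth"]
            # List 3: same first surname and first forename.
            same_names = (a["surname"] == b["surname"]
                          and a["forename"] == b["forename"])
            if close_names or same_birth or same_names:
                pairs.add((i, j))
    return pairs


apprentices = [{"surname": "MARTIN", "forename": "CHRISTINA",
                "birth": (3, 7, 2002), "dept": "75"}]
employees = [
    {"surname": "MARTIN", "forename": "KRISTINA", "birth": (4, 7, 2002), "dept": "75"},
    {"surname": "DURAND", "forename": "PAUL", "birth": (3, 7, 2002), "dept": "75"},
    {"surname": "MARTIN", "forename": "CHRISTINA", "birth": (1, 1, 1999), "dept": "13"},
    {"surname": "BERNARD", "forename": "LUC", "birth": (9, 9, 1990), "dept": "69"},
]
print(sorted(candidate_pairs(apprentices, employees)))  # [(0, 0), (0, 1), (0, 2)]
```

In this invented example, each of the first three employees is retained by a different list, and the fourth is discarded before any similarity is calculated.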

This indexing method works very well on the volume of InserJeunes data but might not be suitable for matching tables of several tens of millions of rows.

Calculating similarities

The third step consists of enriching each of the pairs determined during indexing, with five “similarities” variables relating to surname, forename, date and commune of birth and sex. Each similarity is a measure of the degree of similarity of the indirect identifiers considered. Three different forms of similarity are used in InserJeunes, depending on the nature of the variables used as indirect identifiers.

First of all, the Jaro-Winkler similarity is implemented for surnames and forenames. The latter is an adaptation of the Jaro similarity developed by the statistician Winkler, which adds a “bonus” when the two strings being compared begin with a common prefix. The algorithm for calculating the Jaro similarity between two strings of characters is as follows:

  • the first step is to calculate a maximum matching distance, equal to half the length of the longer string (rounded down) minus 1 (e.g. if we compare DWAYNE and DUANE, this distance is 6/2 − 1 = 2);
  • then, we establish the list of matching characters, i.e. the characters found in both strings at positions that differ by no more than the value calculated previously (if we compare DWAYNE and DUANE, the matching characters are D, A, N and E);
  • we must then calculate the number of transpositions between the matching characters, i.e. the number of times (divided by two) that the ith matching character in the first string is different from the ith matching character of the second string (in the above example, the number of transpositions is 0);
  • finally, the Jaro similarity is calculated as the weighted sum of the following three terms:
    • the number of matching characters divided by the length of the first string of characters (which is 4/6 in our example);
    • the number of matching characters divided by the length of the second string of characters (which is 4/5 in our example);
    • the number of matching characters minus the number of transpositions, divided by the number of matching characters (which is 1 in our example). The Jaro similarity in this case is therefore 1/3 × 4/6 + 1/3 × 4/5 + 1/3 × 1 = 0.822.
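
The algorithm above can be sketched in a few lines. The Jaro part follows the steps listed. The Winkler bonus uses the customary settings (a prefix of at most 4 characters and a scaling factor of 0.1), which are assumptions here since the article does not state the values used in InserJeunes.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, between 0 and 1."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Maximum distance between the positions of two matching characters.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matching characters of each string, in their order of appearance.
    seq1 = [ch for ch, ok in zip(s1, matched1) if ok]
    seq2 = [ch for ch, ok in zip(s2, matched2) if ok]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1, max_prefix=4):
    """Jaro similarity plus a bonus for a common prefix (assumed settings)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


print(round(jaro("DWAYNE", "DUANE"), 3))          # 0.822
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))  # 0.84
```

With these settings, the common prefix “D” raises the similarity of the DWAYNE/DUANE pair from 0.822 to 0.84.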

There are numerous other ways of measuring the similarity between surnames and forenames. For example, some are based on the comparison of between strings of characters, such as the Jaccard similarity coefficient. However, there does not appear to be a way that gives significantly better results than the Jaro-Winkler similarity for surnames and forenames (Christen, 2006).

Next, an InserJeunes-specific similarity was developed for birth dates. For example, when two dates of birth differ only in respect of the day of birth, the similarity is 0.9 if the difference is only in one of the two digits and 0.8 otherwise. If both dates of birth have the same year and the day of one date corresponds to the month of the other and vice versa (for instance, 03/07 and 07/03 of the same year), then the similarity is 0.65.

Finally, for the sex variable, a binary similarity is used. For the commune of birth, the similarity is 1 when the codes of the communes (COGs) are identical. If they are different, the similarity is 0.5 if the department code is identical and 0 otherwise.
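
The rules given for dates, sex and commune of birth can be sketched as follows. Only the cases described in the text are covered: the value 0.0 returned for every other pair of dates is a placeholder, partially filled-in dates are not handled, and the derivation of the department from the commune code (first two characters, three for codes starting with 97) is an assumption of this example.

```python
def date_similarity(d1, d2):
    """Similarity between two (day, month, year) tuples."""
    if d1 == d2:
        return 1.0
    (day1, month1, year1), (day2, month2, year2) = d1, d2
    # Dates differing only in the day of birth.
    if (month1, year1) == (month2, year2):
        differing_digits = sum(a != b for a, b in zip(f"{day1:02d}", f"{day2:02d}"))
        return 0.9 if differing_digits == 1 else 0.8
    # Same year, day and month swapped.
    if year1 == year2 and day1 == month2 and month1 == day2:
        return 0.65
    return 0.0  # Placeholder: the other cases are not given in the text.


def sex_similarity(s1, s2):
    """Binary similarity."""
    return 1.0 if s1 == s2 else 0.0


def department(commune_code):
    """Assumed rule: 3 characters for codes starting with 97, otherwise 2."""
    return commune_code[:3] if commune_code.startswith("97") else commune_code[:2]


def commune_similarity(code1, code2):
    """1 for identical commune codes, 0.5 for the same department, else 0."""
    if code1 == code2:
        return 1.0
    return 0.5 if department(code1) == department(code2) else 0.0


print(date_similarity((3, 7, 2002), (4, 7, 2002)))    # 0.9
print(date_similarity((12, 7, 2002), (21, 7, 2002)))  # 0.8
print(date_similarity((3, 7, 2002), (7, 3, 2002)))    # 0.65
print(commune_similarity("75056", "75101"))           # 0.5
```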

Classifying the pairs: a matter of machine learning?

The fourth step is to make a decision regarding each pair, using the similarities calculated in the previous step. When the similarities are high, i.e. close to 1, the pair is accepted. Otherwise, the pair is rejected.

A simple initial approach entails calculating an overall similarity for each pair, a strictly increasing function of the similarities of the different fields. The pairs with an overall similarity above a certain threshold are accepted, with the others being rejected. The function and threshold are selected empirically, by analysing a sample of pairs for which the status (accepted or rejected) has been annotated manually. This method has the benefit of simplicity; however, the selection of the function and the threshold remain arbitrary and, therefore, there is no guarantee that these choices are optimal.
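
As an illustration of this simple approach, the sketch below uses a weighted mean as the strictly increasing function. The weights and the threshold are purely hypothetical values chosen for the example; the article does not give the function or the threshold selected for InserJeunes.

```python
# Hypothetical weights (summing to 1) and threshold, for illustration only.
WEIGHTS = {"surname": 0.30, "forename": 0.30, "birth_date": 0.25,
           "commune": 0.10, "sex": 0.05}
THRESHOLD = 0.85


def overall_similarity(similarities):
    """Weighted mean: strictly increasing in each of the five similarities."""
    return sum(weight * similarities[field] for field, weight in WEIGHTS.items())


def is_accepted(similarities, threshold=THRESHOLD):
    """Accept the pair when its overall similarity is above the threshold."""
    return overall_similarity(similarities) > threshold


close_pair = {"surname": 0.82, "forename": 1.0, "birth_date": 0.9,
              "commune": 1.0, "sex": 1.0}
distant_pair = {"surname": 0.5, "forename": 0.5, "birth_date": 0.0,
                "commune": 0.5, "sex": 1.0}
print(is_accepted(close_pair), is_accepted(distant_pair))  # True False
```

In practice, both the weights and the threshold would be tuned on the manually annotated sample, as described above.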

In view of the limitations of the first approach, classifications using supervised machine learning algorithms have been tested (random forests and support vector machines (SVMs) or wide margin separators (Box 3)). The approach consists of training the algorithm on a sample of pairs the status of which has been entered manually, and then applying it to the other pairs. The optimal parameters of each algorithm are determined by cross-validation by maximising the f-measure metric.

In the case of quality matching, the simple approach, random forests and SVMs all gave similar and excellent results so, ultimately, the simple approach was selected for InserJeunes.

How can this result, which is surprising in principle, be explained? One way of presenting our problem is to view each pair as a point in a 5-dimensional space, with the dimensions being those of the 5 similarities (surname, forename, date of birth, commune of birth and sex). As the sex variable is not very selective, it could be eliminated from the analysis, which would reduce the space to 4 dimensions. In each dimension, the similarities are assigned values between 0 and 1.

The problem is therefore to find a separation boundary between accepted and rejected points/pairs in a space [0; 1]⁴, i.e. a space of very low dimension. Furthermore, the area “near” to the point (1,1,1,1) is the area in which almost all the pairs to be accepted are located. Thus, the points corresponding to the pairs to be accepted are not too “mixed in” with the points corresponding to the pairs to be rejected. It is therefore relatively easy to solve this type of problem, which explains why all the methods tested give equally good results.

Probabilistic classification, originally developed by (Fellegi and Sunter, 1969), is another method developed specifically for matching using indirect identifiers and frequently cited in the literature. This method has not been investigated by the InserJeunes team, as it is not possible, to the best of our knowledge, to implement it quickly once the volume of data is quite large.

Box 2. The Legal Approaches

Matching operations based on indirect identifiers are often carried out on personal data. However, since 2018, use of such data has been governed by the General Data Protection Regulation (GDPR)*.

The InserJeunes system has therefore been declared to the register of data processing operations monitored by the Data Protection Officer of the French Ministry of Education, in accordance with Article 30 of the Regulation. A data protection impact assessment (DPIA) has also been carried out. Carrying out a DPIA is mandatory, firstly, for certain types of processing (this list has been drawn up by the CNIL**) and, secondly, when at least two of a list of nine criteria apply to the processing. InserJeunes meets the following three criteria:

  • large-scale collection of personal data;
  • cross-referencing of data;
  • and vulnerable people (patients, elderly people, children, etc.).

Broadly speaking, a DPIA consists of three parts:

  • a description of the processing used;
  • an assessment of the necessity and proportionality of personal data collection;
  • an analysis of security risks and their potential impact on privacy.

The GDPR requires that the collection and retention of personal data be limited to what is strictly necessary for the purposes of the processing.

The InserJeunes team also applied to the CNIS for access to the sources of the Directorate-General for Education and Research*** and DARES used in the system, pursuant to Article 7bis of the French Law of 1951*.


* See the legal references at the end of the article.

** See https://www.cnil.fr/sites/default/files/atoms/files/liste-traitements-aipd-requise.pdf.

*** A Directorate of the French Ministry of Agriculture.

Assessment to confirm the choices made

Given the key nature of matches using indirect identifiers in InserJeunes, it was appropriate to assess their quality. This is the fifth and final step in the process.

The assessment requires having a sample of pairs the status of which (accepted or rejected) has been annotated manually and which has not been used in the classification step. For this sample, the prediction resulting from the supervised classification is compared with the true status of the pair, i.e. the one established manually, which makes it possible to obtain four quantities initially:

  • true positives (TPs);
  • false positives (FPs);
  • true negatives (TNs);
  • and false negatives (FNs).

For example, a false negative pair is a pair that has been rejected by the classification algorithm but accepted by the human who performed the annotation. Based on these four quantities, it is possible to establish several measures of overall quality.

The best known measure is accuracy, which is (TPs+TNs)/total number of pairs. However, like any measure that uses true negatives, it is not suitable. Why? Because the data is unbalanced: there are many pairs the true status of which is rejected and few pairs with a true status of accepted. In the case of quality matching, about 40,000 pairs are accepted in 1 million pairs, meaning that at least 950,000 pairs have the true status “rejected”. A naive classifier that rejects 100% of pairs therefore has an accuracy of at least (0+950,000)/(1,000,000) or 95%.

Three other measures have therefore been selected in the InserJeunes system:

  • precision, which is TPs/(TPs+FPs). For example, if the precision is 80%, that means that 80% of the accepted pairs are accepted correctly;
  • recall, which is TPs/(TPs+FNs): if the recall is 90% then this means that 90% of the true pairs were detected by the classification algorithm;
  • and the f-measure, which is the harmonic mean of precision and recall: it is therefore 2 × (precision × recall)/(precision + recall).
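
These measures are easy to compute from the four quantities. The sketch below also reproduces the naive classifier discussed earlier, which rejects every pair. Returning 0 when a denominator is zero is a convention chosen for the example, and the counts used for the naive classifier (950,000 rejected and 50,000 accepted pairs) are the bounds quoted in the text, not observed values.

```python
def ratio(numerator, denominator):
    """Division returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return ratio(2 * precision * recall, precision + recall)


def quality_measures(tp, fp, tn, fn):
    """Accuracy, precision, recall and f-measure from the four quantities."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
    }


# Naive classifier rejecting 100% of the pairs: high accuracy, zero recall.
naive = quality_measures(tp=0, fp=0, tn=950_000, fn=50_000)
print(naive["accuracy"], naive["recall"])  # 0.95 0.0

# f-measure for a precision of 95% and a recall of 99%.
print(round(f_measure(0.95, 0.99), 3))     # 0.97
```

The naive classifier shows why accuracy is misleading on unbalanced data: it scores 95% while detecting none of the true pairs.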

In the case of the quality matching, the precision is 95% and the recall is 99%. In total, 97% of apprentices are matched in quality matching (84% through the initial exact match and 13% through the subsequent similarity-based matching), a matching rate close to the theoretical rate of 100%.

Digital implementation: the selection of one specific tool

Several matching software packages were tested as part of the InserJeunes project: FEBRL by Peter Christen, matchID and two R libraries. In R, the best implementation seems to be the fastLink library, but it takes about 8 hours to match tables with 300,000 rows and it relies on the probabilistic classification of Fellegi and Sunter. Only the matchID tool met the needs, but its implementation proved to be relatively complex.

The InserJeunes team also studied the literature on other software. The Italian national statistics institute Istat has developed a tool called RELAIS which uses, in particular, probabilistic classification; but for 100,000 observations or more, the processing time is about 1.25 hours (Eurostat, 2009). In the United States, the Census Bureau has developed the bigMatch tool specifically to process large volumes; however, it seems to perform only the data indexing phase and it is written in C, which makes it difficult to integrate with tools.

As a result of this comparative work, it was decided to develop a specific tool that would meet four major needs of InserJeunes:

  • the matching operations must be fast: the InserJeunes tool performs quality matching in 15 minutes;
  • the matching tool must be generic, i.e. easily adaptable to all cases of matching using indirect identifiers. To do this, the specification for each matching operation (the fields compared, the similarity method chosen for each field, etc.) is described in XML, which is then interpreted by the tool. This implies that the statistician or data scientist who produces the XML must respect a type of formal grammar: they must describe their matching according to a specific formalism, which is very strongly inspired by the one presented in the work published by Peter Christen (Christen, 2012). Working in this manner also ensures full traceability of each matching operation, as the XML specifications are all saved;
  • given that the assessment of the matching process is essential and requires a sample of manually annotated pairs, an ergonomic pair annotation interface has been developed;
  • the tool is based on several open source libraries, which has made it possible to accelerate its development and to facilitate its maintenance.
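
To give an idea of what a specification-driven tool looks like, here is a deliberately simplified sketch. The XML vocabulary (`matching`, `field`, `similarity`) and the function names are entirely hypothetical: the article does not show the grammar actually used by the InserJeunes tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification: one <field> per indirect identifier compared.
SPEC_XML = """
<matching name="example_matching">
  <field name="sex" similarity="binary"/>
  <field name="commune" similarity="commune"/>
</matching>
"""


def binary(a, b):
    """1 when the two values are identical, 0 otherwise."""
    return 1.0 if a == b else 0.0


def commune(a, b):
    """1 for identical codes, 0.5 for the same first two characters, else 0."""
    if a == b:
        return 1.0
    return 0.5 if a[:2] == b[:2] else 0.0


# Registry linking the names used in the XML to Python functions.
SIMILARITIES = {"binary": binary, "commune": commune}


def load_spec(xml_text):
    """Interpret the XML: return (field name, similarity function) pairs."""
    root = ET.fromstring(xml_text)
    return [(field.get("name"), SIMILARITIES[field.get("similarity")])
            for field in root.iter("field")]


def compare(spec, record_a, record_b):
    """Calculate one similarity per field described in the specification."""
    return {name: function(record_a[name], record_b[name])
            for name, function in spec}


spec = load_spec(SPEC_XML)
print(compare(spec, {"sex": "F", "commune": "75056"},
              {"sex": "F", "commune": "75101"}))
# {'sex': 1.0, 'commune': 0.5}
```

With this kind of design, a new matching operation only requires a new XML file, and the saved file documents exactly what was run.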

The InserJeunes matching tool will be made available in open source in summer 2021. However, for matching projects concerning much larger tables, the matchID tool seems to be more suitable, particularly because it performs the indexing through Elasticsearch queries and not through SQL queries.

Sharing the experience gained during the project

In view of the experience gained during the InserJeunes project, what lessons can be learned?

Firstly, it is clear that the overall quality of the matching process depends very heavily on the quality of the indirect identifier variables. The context of InserJeunes was favourable, as these variables are correctly filled in in the databases used. Respecting each of the steps is also essential to the success of the operation, and none of them is superfluous: correctly standardising the data facilitates subsequent processing, and spending time determining the best way to calculate similarities (this work is called “feature engineering” in machine learning) significantly increases the quality of the classification. The assessment, when based on a sample of manually annotated pairs that were not used in the classification stage, makes it possible to guarantee the overall quality of the process and, in particular, to check that there has been no overfitting. For InserJeunes, the ability to carry out annual “quality matching” is an asset, but not all information systems will lend themselves to this.

From an IT point of view, the fact that the tool is based on several open source libraries has enabled developments to be carried out within a short time-frame. The decision to describe each matching specification in XML, respecting a specific matching language, made it possible to specify and then quickly integrate into the production chain all the matching operations needed by InserJeunes.

Box 3. Brief Introduction to Supervised Classifications in Machine Learning

Supervised classification algorithms in machine learning are first trained using labelled data, i.e. data for which the variable to be predicted is known (in the case of InserJeunes, a binary qualitative variable). The parameter setting retained is the one that maximises a statistical criterion chosen according to the problem addressed; in this case, it is the f-measure that is maximised. Then, the previously trained model is applied to new data for which the variable to be predicted is unknown. The main methodological challenge is to ensure that the algorithm has “learned correctly” during training so that it can then make correct predictions on the new data.

The supervised classification process is illustrated using the support vector machine (SVM) algorithm applied to a simplified example, in which there are only two dimensions (Figure A).

 

The first approach involves choosing the boundary (the blue and red lines) with the widest possible margin: all the points of one colour must lie on one side of the boundary and all the points of the other colour on the other side, while the no man’s land between the two lines, i.e. the area in which there are no points, must be as large as possible. This approach is also called a “wide margin separator”.

However, for some datasets this type of boundary does not exist. Furthermore, if the algorithm is required to separate 100% of the points, then it will “stick” too closely to the training data. There is then a risk of overfitting and making incorrect generalisations regarding the new data (Figure B).

 

To avoid this pitfall, it is necessary to accept that the SVM misclassifies a small percentage of the pairs. It is also necessary to assess the quality of the algorithm using labelled data that have not been used in the learning stage, which makes it possible to check that there has been no overfitting.
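For readers who want the underlying formula, this tolerance for a few misclassified points is usually written as the “soft margin” optimisation problem below. The notation is the standard textbook one and is an assumption of this sketch, not taken from the article: $x_i$ is the vector of similarities for pair $i$, $y_i \in \{-1, +1\}$ its label, $\xi_i$ the slack allowed for that pair and $C > 0$ the penalty on slack.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i ,
\quad \xi_i \ge 0 , \quad i = 1, \dots, n .
```

The width of the margin is $2 / \lVert w \rVert$, so minimising $\lVert w \rVert$ widens it. Forcing every $\xi_i$ to zero gives back the wide margin separator of the first approach, while a smaller $C$ tolerates more misclassified training pairs in exchange for a wider margin.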

Legal references

Published on: 02/10/2023

The Integration into Active Life (Insertion dans la vie active – IVA) survey, for those leaving school-based vocational training, and the Professional Integration of Apprentices (Insertion professionnelle des apprentis – IPA) survey.

Of around 60%.

See the legal references at the end of the article.

InserJeunes has received financing from the Public Action Transformation Fund.

For further information on the DSN, see (Humbert-Bottin, 2018).

The INE, introduced in 2017, is a unique identifier for each pupil.

The Apprentice Training Information System (Système d’information de la formation des apprentis – SIFA) for apprentices, the Consolidated Academic Statistical Information System (Système d’information statistique consolidé académique – SYSCA) for school-based vocational students of the French Ministry of National Education and DeciEA for school-based vocational students of the French Ministry of Agriculture.

SIFA, SYSCA and DeciEA, plus pupils in the private sector without a contract with the SCOLEGE source and higher education via the Student Monitoring Information System (Système d’information sur le suivi des étudiants – SISE) and the wishes validated via Parcoursup in a nursing training institute.

In particular, taking into account the continuation in higher education.

In reality, some of those leaving training may in fact still be studying: for example, we do not identify the continuation of studies abroad.

To be precise, the Declaration of Movement of Labour (Déclaration de mouvement de main d’œuvre – DMMO) source, based on the Nominative Social Declaration (Déclaration sociale nominative – DSN), is used.

The notion of added value is a concept that has been discussed extensively in a previous article, see (Evain, 2020).

Outside the civil service, because public employment is not yet integrated into the DSN.

In other words, the observation unit is the individual (in this case, the pupil or apprentice).

which also removes the accents.

For example: Ç becomes C and Ï becomes I.
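A minimal sketch of this kind of accent removal in Python is shown below. It uses Unicode decomposition from the standard library; the function name is hypothetical and the article does not say which routine InserJeunes actually uses.

```python
import unicodedata


def strip_accents(text):
    """Decompose each character (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


examples = [strip_accents(s) for s in ("Ç", "Ï", "Loïc")]
```

Characters with no canonical decomposition, such as Œ, are left unchanged by this approach and would need an explicit replacement rule.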

The Official Geographical Code (Code officiel géographique - COG) identifies each commune in France.

Registration number in the National Directory for the Identification of Natural Persons (Répertoire national d’identification des personnes physiques – RNIPP).

If each row was analysed very quickly, say in one hundred thousandth of a second, the total processing time would be 273 days. Even running the calculations on a number of cores in parallel, this would still take too long.

I.e. 315,000/50,000.

For example, the bigrams of DWAYNE are DW, WA, AY, YN and NE and the trigrams of DUANE are DUA, UAN and ANE.
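The decomposition into n-grams can be sketched in a few lines of Python. The function name is hypothetical, and the comparison of shared bigrams at the end is only an illustration of why n-grams help to bring similar spellings together.

```python
def ngrams(word, n):
    """Return the consecutive substrings of length n, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


bigrams_dwayne = ngrams("DWAYNE", 2)
trigrams_duane = ngrams("DUANE", 3)

# Bigrams shared by the two spellings of the first name.
shared = set(ngrams("DWAYNE", 2)) & set(ngrams("DUANE", 2))
```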

As part of a project of the Entrepreneurs of General Interest (Entrepreneurs d’Intérêt Général – EIG) programme.

See (Enamorado, Fifield and Imai, 2019), page 362, Figure 3, “Running Time Comparison”.

For further details, see (Istat, 2020).

The project team has no knowledge of any machine learning libraries written in C, and bringing together bricks written in different languages requires a more substantial investment.

A specific language was thus created at that time for the matching domain.

The indexing is carried out in SQL on a PostgreSQL database, using the fuzzystrmatch module; the Jaro-Winkler and Levenshtein similarities are calculated with the Python jellyfish library; and the machine learning algorithms are built with the Python scikit-learn library.
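To make the Levenshtein similarity concrete, here is a self-contained sketch in pure Python. It is a stand-in for the jellyfish library mentioned above, not a copy of it, and the normalisation into a similarity between 0 and 1 is one common convention chosen here as an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


distance = levenshtein("DWAYNE", "DUANE")
similarity = levenshtein_similarity("DWAYNE", "DUANE")
```

Here two edits (W replaced by U, Y deleted) separate the two spellings, so the pair would still score fairly high despite not being identical.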

Further reading

CHRISTEN, Peter, 2006. A Comparison of Personal Name Matching: Techniques and Practical Issues. [online]. September 2006. The Australian National University Research Publications. Joint Computer Science Technical Report Series, TR-CS-06-02. [Accessed 27 May 2021].

CHRISTEN, Peter, 2012. Data matching. Concepts and techniques for record linkage, entity resolution and duplicate detection. 4 July 2012. Springer. ISBN 978-3-642-31163-5.

COLLIN, Christel and MARCHAL, Nathalie, 2021a. Six mois après leur sortie en 2019 du système éducatif, 41 % des lycéens professionnels sont en emploi salarié. [online]. February 2021. DEPP-MENJS. Note d’information n°21.06. [Accessed 27 May 2021].

COLLIN, Christel and MARCHAL, Nathalie, 2021b. Six mois après leur sortie en 2019 du système éducatif, 62 % des apprentis de niveau CAP à BTS sont en emploi salarié. [online]. February 2021. DEPP-MENJS. Note d’information n°21.07. [Accessed 27 May 2021].

COLLIN, Christel and MARCHAL, Nathalie, 2021c. Des lycéens professionnels et des apprentis mieux insérés 12 mois après leur sortie d’études en juillet 2020 que 6 mois après, malgré la crise. [online]. May 2021. DEPP-MENJS. Note d’information n°21.24. [Accessed 27 May 2021].

ENAMORADO, Ted, FIFIELD, Benjamin and IMAI, Kosuke, 2019. Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records. In: American Political Science Review. [online]. Volume 113, n°2, pp. 353-371. [Accessed 27 May 2021].

EUROSTAT, 2009. Insights on Data Integration Methodologies. [online]. ESSnet-ISAD workshop, Vienna, 29-30 May 2008, page 53. [Accessed 27 May 2021].

EVAIN, Franck, 2020. Indicateurs de valeur ajoutée des lycées. Du pilotage interne à la diffusion grand public. In: Courrier des statistiques. [online]. 31 December 2020. Insee. N°N5, pp. 74-94. [Accessed 27 May 2021].

FELLEGI, Ivan P. and SUNTER, Alan B., 1969. A theory for record linkage. In: Journal of the American Statistical Association. December 1969. Taylor & Francis Ltd. Volume 64, n°328, pp. 1183-1210.

HUMBERT-BOTTIN, Élisabeth, 2018. La déclaration sociale nominative. Nouvelle référence pour les échanges de données sociales des entreprises vers les administrations. In: Courrier des statistiques. [online]. 6 December 2018. Insee. N°N1, pp. 25-34. [Accessed 27 May 2021].

ISTAT, 2020. RELAIS (Record Linkage At Istat). In: Istat website. [online]. 19 November 2020. [Accessed 27 May 2021].

JABOT, Patrick and TREYENS, Pierre-Eric, 2018. Appariement de l’enquête Care par identification du plus proche écho. In: website of the 13th Journées de méthodologie statistique de l’Insee (JMS). [online]. 12-14 June 2018. [Accessed 27 May 2021].

JAMES, Gareth, WITTEN, Daniela, HASTIE, Trevor and TIBSHIRANI, Robert, 2013. An introduction to statistical learning with applications in R. Springer. ISBN 978-1-4614-7138-7.

KILSS, Beth and ALVEY, Wendy, 1985. Record Linkage Techniques − 1985. [online]. 1 December 1985. Workshop on Exact Matching Methodologies, Arlington, Virginia, May 9-10, 1985. [Accessed 27 May 2021].