Synthetic Data Generator

This project is intended to gather all base components of the Synthetic Data Generator v2 product, for ease of development.

This product is designed to be extendable, so it’s possible that not all domain-specific simulators and interpreters are kept in this project. However, it should be trivial to (temporarily) add them here for development purposes.

Overview

The project consists of three base components and five implementation-specific components. The base components are generic and can handle all sorts of data. The current implementation-specific components are built to handle a subset of the BRP.

System Context diagram for synthetic data generation

Base components

The base components consist of an orchestrator, a claim backend, and a NATS instance.

Orchestrator

The orchestrator controls and tracks all components. It consists of a backend and a frontend. The frontend controls the system by starting, seeding, or resetting it. The backend propagates these commands into the system through API calls or messages on the NATS queue.

Claim backend

The claim backend is a data store based on claims. A claim contains a single piece of data, such as a name or a birthday. More information about the claim backend can be found here.

Nats

NATS is a messaging system that keeps track of messages, objects, streams, and the subscribers that want to receive them.

Implementation specific components

The implementation-specific components consist of a seeder, a simulator, an interpreter, a backend, and a frontend.

Seeder

The seeder jump-starts the simulation and makes sure that it starts with data. For the FRP this means that an initial set of people is created, so that new people can be born.

Simulator

The simulator can create, change, or delete objects in the simulation. It receives the passage of simulated time from the orchestrator and can act upon it.

Interpreter

The interpreter receives events from the simulator and translates them into claims for the claim backend.

Backend

The backend pulls data out of the claim backend and translates it into a model that can be used by the frontend and the API.

Frontend

The frontend displays the data coming out of the backend.

Note

The claim backend that is currently part of the base components can be replaced by any kind of data storage. We have chosen this kind of data storage in combination with a different project that is currently being developed on the digilab platform. This project is called Uit Betrouwbare Bron, or UBB for short. More information can be found here.

Usage

The current FRP implementation can be used through a web app and an API. The web app is accessible at: link. You can click through the web pages and view information about the generated persons. The API is available at: https://sdg-frp-backend.apps.digilab.network. The OpenAPI spec of the API is accessible at: /openapi.yaml. Using curl, a request looks something like this:

curl --request POST 'https://sdg-frp-backend.apps.digilab.network/v0/personen' --header 'Content-Type: application/json' --data-raw '{
    "type": "RaadpleegMetBurgerservicenummer",
    "burgerservicenummer": ["819366080"],
    "fields": ["burgerservicenummer"]
}'

response:

{
  "type": "RaadpleegMetBurgerservicenummer",
  "personen": [
    {
      "burgerservicenummer": "819366080"
    }
  ]
}

The API is based on the official Haal Centraal specs. The spec can be found here.
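For scripted access, the same request can be built from Python. The following is a minimal sketch using only the standard library; the helper names build_person_query and fetch_persons are hypothetical and not part of the project, and the endpoint and payload are copied from the curl example above.

```python
import json
import urllib.request

# Base URL and path taken from the curl example in this README.
BASE_URL = "https://sdg-frp-backend.apps.digilab.network"


def build_person_query(bsn_list, fields):
    """Build (but do not send) the POST request for /v0/personen.

    Hypothetical helper: mirrors the curl example, nothing more.
    """
    payload = {
        "type": "RaadpleegMetBurgerservicenummer",
        "burgerservicenummer": list(bsn_list),
        "fields": list(fields),
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v0/personen",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_persons(bsn_list, fields):
    """Send the request and return the 'personen' list from the response."""
    request = build_person_query(bsn_list, fields)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["personen"]
```

Calling fetch_persons(["819366080"], ["burgerservicenummer"]) performs the same query as the curl command. Building the request separately from sending it makes it possible to inspect or test the payload without network access.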

Simulation parameters

To start the simulation, the simulator needs some parameters to know when to create new objects and what to do with them. The FRP simulator consists of two events and twelve datasets. The events are a birth event and a death event. By increasing or decreasing the probability of the birth or death event, we control the size of the population.

The datasets being used are:

First name females (src: https://www.svb.nl/nl/kindernamen/namen/meisjes-populariteit. Note: currently only recently popular names are used, generated names can be improved by also using historical names)
First name males (src: https://www.svb.nl/nl/kindernamen/namen/jongens-populariteit)
First name non-binary (src: https://www.babynamen.nl/namen/unisex/#h-de-100-populairste-unisex-namen)
Family names (src: https://nl.wikipedia.org/wiki/Lijst_van_meest_voorkomende_achternamen_van_Nederland (original source: https://cbgfamilienamen.nl/nfb/documenten/top100.pdf). IMPROVE: add number of occurrences per name and use this for random name selection)
Genders (src: according to https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/rapporten/2023/07/07/tk-bijlage-1-onderzoeksrapport-rutgers-kiezen-voor-een-x/tk-bijlage-1-onderzoeksrapport-rutgers-kiezen-voor-een-x.pdf, ~3.7% of respondents are (very) likely to set their gender to “X” when this becomes possible via the municipality, and another 0.4% did not want to answer this question. We therefore use a guesstimate of 3.7% + 0.4%/2 = 3.9% as seed weight for gender X and assume that the other genders are divided according to the data from https://www.cbs.nl/nl-nl/visualisaties/dashboard-bevolking/mannen-en-vrouwen (measured on January 1st, 2022). IMPROVE: improve these weights when more data is available?)
Municipalities with the number of babies born (src: https://www.cbs.nl/nl-nl/maatwerk/2024/10/voorlopige-bevolkingsaantallen-per-gemeente)
Street names
House letters (src: https://www.kadaster.nl/-/kosteloze-download-bag-2-0-extract)
House number additions
Birth places abroad (src: https://publicaties.rvig.nl/Landelijke_tabellen/Landelijke_tabellen_32_t_m_61_excl_tabel_35/Landelijke_Tabellen_32_t_m_61_in_csv_formaat/Tabel_34_Landen_gesorteerd_op_omschrijving and https://en.wikipedia.org/wiki/List_of_largest_cities#List, based on UN DESA 2018 counts/estimations)
Birth probability (src: https://opendata.cbs.nl/#/CBS/nl/dataset/37201/table?dl=94E25 and https://opendata.cbs.nl/#/CBS/nl/dataset/03759ned/table?dl=94E28)
Death probability (src: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/37360ned/table?fromstatweb)
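
The gender seed weights described in the list above can be worked out as follows. This is only a sketch of the arithmetic, not the project's actual code: the function names are hypothetical, and the male and female counts are left as arguments because the CBS figures are not reproduced in this README.

```python
import random

# Shares quoted from the Rutgers report in the dataset list above.
X_LIKELY = 0.037      # respondents (very) likely to choose "X"
X_NO_ANSWER = 0.004   # respondents who did not want to answer


def gender_weights(male_count, female_count):
    """Return seed weights for X, male and female that sum to 1.

    Half of the 'no answer' group is attributed to X (3.7% + 0.4%/2 = 3.9%);
    the remainder is split by the male/female counts passed in.
    """
    x_weight = X_LIKELY + X_NO_ANSWER / 2
    remaining = 1.0 - x_weight
    total = male_count + female_count
    return {
        "X": x_weight,
        "male": remaining * male_count / total,
        "female": remaining * female_count / total,
    }


def pick_gender(weights, rng):
    """Draw one gender according to the given weights (hypothetical helper)."""
    genders = list(weights)
    return rng.choices(genders, weights=[weights[g] for g in genders], k=1)[0]
```

For example, gender_weights(1, 1) uses an illustrative 1:1 split, which is an assumption and not the CBS data, and yields a weight of 0.039 for X with the remaining 0.961 divided equally.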

With just the datasets, it is hard to arrive at the exact probability numbers that are used in the simulation. These numbers can be found below.

Birth probability

We take the probability, given the age of a female* person in years, that this person will bear a living child within one year. The data is interpolated with a Gaussian bell curve fit (e.g. using https://mycurvefit.com/). This results in the following data points:

Age	Probability
12.5	0.00020557
22.5	0.00858504
27.5	0.03612038
32.5	0.05931208
37.5	0.03316359
42.5	0.00722938
60	0.00219446

Resulting parameters: mean = 32.29439, standard deviation = 4.904455, factor ‘a’ = 0.05897603

* Here ‘female’ denotes someone who, when having a child, is indicated as ‘mother’ by the CBS.
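
The fitted curve can be evaluated as in the sketch below. It assumes the standard Gaussian form a * exp(-(age - mean)^2 / (2 * sd^2)) with the parameters listed above; the README does not state the exact formula used by the simulator, and the function names are hypothetical.

```python
import math
import random

# Fitted parameters quoted in the text above.
MEAN_AGE = 32.29439
STD_DEV = 4.904455
FACTOR_A = 0.05897603


def birth_probability(age_years):
    """Yearly probability of bearing a living child at the given age.

    Assumes the standard Gaussian bell curve form with peak height FACTOR_A.
    """
    z = (age_years - MEAN_AGE) / STD_DEV
    return FACTOR_A * math.exp(-0.5 * z * z)


def gives_birth(age_years, rng):
    """Hypothetical helper: one Bernoulli draw per person per simulated year."""
    return rng.random() < birth_probability(age_years)
```

Under this assumption the curve peaks at FACTOR_A at the mean age and stays close to the data points around the peak, for instance roughly 0.037 at age 27.5 against a data point of 0.036.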

Death probability

The probability of death is based on this dataset: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/37360ned/table?fromstatweb. We take the 2022 data and use the average of male and female subjects. Since people in general do not live beyond 120 years (https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), we add an extra data point with a probability of death of 1 at 120 years. This results in the following data points for the probability of death within a year, given the person’s age in years:

Age	Probability
0	0.00445
21	0.00032
61	0.005845
81	0.048975
120	1

With curve fitting (e.g. using a tool like https://mycurvefit.com/), assuming a power curve, we obtain the following equation: y = 1.119401e-16 * x^7.671776

Note: The current probability does not compensate for COVID-19 deaths; see https://www.cbs.nl/nl-nl/nieuws/2024/06/sterfte-in-2023-afgenomen for some details.

Note: An alternative would be to define a random life span per agent at birth and to check at each step whether that life span has been reached. However, little information seems to be available about the standard deviation of the life span, except that it was ‘approximately 1 year’ around 2008.
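
The power curve above can be evaluated as in the sketch below. The clamping to the range 0 to 1, the handling of negative ages, and the function names are assumptions added for illustration; the README only gives the fitted equation.

```python
import random

# Coefficients of the fitted power curve y = 1.119401e-16 * x^7.671776.
COEFFICIENT = 1.119401e-16
EXPONENT = 7.671776


def death_probability(age_years):
    """Probability of dying within a year at the given age.

    The result is clamped to at most 1 (an assumption), since the raw
    curve exceeds 1 beyond roughly 120 years.
    """
    age = max(age_years, 0.0)  # guard: negative ages are treated as 0
    return min(1.0, COEFFICIENT * age ** EXPONENT)


def dies_this_year(age_years, rng):
    """Hypothetical helper: one Bernoulli draw per person per simulated year."""
    return rng.random() < death_probability(age_years)
```

The curve reproduces the data points at 81 and 120 years closely. It evaluates to 0 at age 0, so it does not reproduce the 0.00445 data point for newborns.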

Running the project

The project can be run by following the guide in the CONTRIBUTING.md file.

Developer documentation

If you would like to contribute to this project, consult the CONTRIBUTING.md file.

Deployment

The project can be deployed by using the deployment files in the deploy folder. The current deployment is set up to deploy to a k8s cluster using kustomize. The kustomize setup for creating a sandbox environment can be found in sandbox/kustomization.yaml.

License

Licensed under the EUPL v1.2.

You can find more information about the license here.