Synthetic Data Generator
This project is intended to gather all base components of the Synthetic Data Generator v2 product, for ease of development.
This product is designed to be extendable, so it’s possible that not all domain-specific simulators and interpreters are kept in this project. However, it should be trivial to (temporarily) add them here for development purposes.
Overview
The project consists of 3 base components and 5 implementation-specific components. The base components are generic and can handle all sorts of data. The current implementation-specific components are built to handle a subsection of the BRP.
C4Context
  title System Context diagram for synthetic data generation
  Boundary(b0, "Implementation specific") {
    System(s4, "Seeder")
    System(s5, "Simulator")
    System(s6, "Interpreter")
    Boundary(b0b0, "Fictief BRP") {
      System(s7, "Backend")
      System(s8, "Frontend")
    }
  }
  Boundary(b1, "Base Components") {
    System(s3, "Nats")
    System(s2, "Claim Backend")
    Boundary(b1b0, "Orchestrator") {
      System(s0, "Frontend", "The UI")
      System(s1, "Backend", "The system in control")
    }
  }
  Rel(s0, s1, "", "http")
  BiRel(s4, s3, "events/commands", "tcp/ip")
  Rel(s1, s4, "", "http")
  Rel(s1, s3, "commands", "tcp/ip")
  BiRel(s5, s3, "events/commands", "tcp/ip")
  BiRel(s3, s6, "events/commands", "tcp/ip")
  Rel(s6, s2, "", "http")
  Rel(s7, s2, "", "http")
  Rel(s8, s7, "", "http")
  UpdateLayoutConfig($c4ShapeInRow="3", $c4BoundaryInRow="1")
Base components
The base components consist of an orchestrator, a claim backend and a Nats instance.
Orchestrator
The orchestrator controls and tracks all components. It consists of a backend and a frontend. The frontend can control the system by starting, seeding or resetting it. The backend propagates these commands into the system through API calls or messages in the Nats queue.
Claim backend
The claim backend is a data store based on claims. A claim contains a single piece of data, such as a name or a birthday. More information about the claim backend can be found: here
Nats
Nats is a messaging system that keeps track of messages, objects and streams, and delivers them by subject to the components that want to receive them.
Implementation specific components
The implementation-specific components consist of a seeder, simulator, interpreter, backend and frontend.
Seeder
The seeder jumpstarts the simulation by making sure that it starts with data. For the FRP this means that an initial set of people is created, so that new people can be born.
Simulator
The simulator can create, change or delete objects in the simulation. It receives the progression of time from the orchestrator and can act upon it.
Interpreter
The interpreter receives events from the simulator and translates them into claims for the claim backend.
Backend
The backend pulls data out of the claim backend and translates it into a model that can be used.
Frontend
The frontend displays the data coming out of the backend.
Note
The claim backend, currently one of the base components, can be replaced by any sort of data storage. We have chosen this kind of data storage in combination with a different project that is currently being developed on the digilab platform. This project is called uit betrouwbare bron, or UBB for short. More information can be found: here.
Usage
The current FRP implementation can be used through a web app and an API. The web app is accessible on: link. You can click through the web pages and view information about the generated persons.
It is also possible to access the API. The API is available on: https://sdg-frp-backend.apps.digilab.network. The OpenAPI spec of the API is accessible on: /openapi.yaml.
Using curl a request would look something like:
curl --request POST 'https://sdg-frp-backend.apps.digilab.network/v0/personen' --header 'Content-Type: application/json' --data-raw '{
"type": "RaadpleegMetBurgerservicenummer",
"burgerservicenummer": ["819366080"],
"fields": ["burgerservicenummer"]
}'
response:
{
"type": "RaadpleegMetBurgerservicenummer",
"personen": [
{
"burgerservicenummer": "819366080"
}
]
}
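As a minimal sketch, the same request could be made from Python using only the standard library. The function names `build_personen_request` and `fetch_personen` are hypothetical and not part of the project; the URL, path and payload are taken from the curl example above.

```python
import json
import urllib.request

# Base URL as given in the text above.
BASE_URL = "https://sdg-frp-backend.apps.digilab.network"


def build_personen_request(bsn: str, fields: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "type": "RaadpleegMetBurgerservicenummer",
        "burgerservicenummer": [bsn],
        "fields": fields,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v0/personen",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_personen(bsn: str, fields: list[str]) -> list[dict]:
    """Send the request and return the "personen" list from the response."""
    request = build_personen_request(bsn, fields)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["personen"]
```

Calling `fetch_personen("819366080", ["burgerservicenummer"])` would send the same request as the curl command, assuming the service is reachable.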
The API is based on the official haal-centraal specs. The spec can be found here.
Simulation parameters
To start the simulation, the simulator needs some parameters to know when to create new objects and what to do with them. The FRP simulator consists of two events and twelve datasets. The events are a birth event and a death event. This lets us control the size of the population by increasing or decreasing the probability of the birth or death event.
The datasets being used are:
- First name females (src: https://www.svb.nl/nl/kindernamen/namen/meisjes-populariteit. Note: currently only recently popular names are used, generated names can be improved by also using historical names)
- First name males (src: https://www.svb.nl/nl/kindernamen/namen/jongens-populariteit)
- First name non-binary (src: https://www.babynamen.nl/namen/unisex/#h-de-100-populairste-unisex-namen)
- Family names (src: https://nl.wikipedia.org/wiki/Lijst_van_meest_voorkomende_achternamen_van_Nederland (original source: https://cbgfamilienamen.nl/nfb/documenten/top100.pdf). IMPROVE: add number of occurrences per name and use this for random name selection)
- Genders (src: according to https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/rapporten/2023/07/07/tk-bijlage-1-onderzoeksrapport-rutgers-kiezen-voor-een-x/tk-bijlage-1-onderzoeksrapport-rutgers-kiezen-voor-een-x.pdf, ~3.7% of respondents are (very) likely to set their gender to “X” when possible via the municipality and another 0.4% did not want to answer this question. We therefore use a guesstimate of 3.7% + 0.4%/2 = 3.9% as seed weight for gender X and assume that the other genders are divided according to the data from https://www.cbs.nl/nl-nl/visualisaties/dashboard-bevolking/mannen-en-vrouwen (measured on January 1st, 2022). IMPROVE: improve these weights when more data is available?)
- Municipalities with the number of babies born (src: https://www.cbs.nl/nl-nl/maatwerk/2024/10/voorlopige-bevolkingsaantallen-per-gemeente)
- Street names
- House letters (src: https://www.kadaster.nl/-/kosteloze-download-bag-2-0-extract)
- House number additions
- Birth places abroad (src: https://publicaties.rvig.nl/Landelijke_tabellen/Landelijke_tabellen_32_t_m_61_excl_tabel_35/Landelijke_Tabellen_32_t_m_61_in_csv_formaat/Tabel_34_Landen_gesorteerd_op_omschrijving and https://en.wikipedia.org/wiki/List_of_largest_cities#List, based on UN DESA 2018 counts/estimations)
- Birth probability (src: https://opendata.cbs.nl/#/CBS/nl/dataset/37201/table?dl=94E25 and https://opendata.cbs.nl/#/CBS/nl/dataset/03759ned/table?dl=94E28)
- Death probability (src: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/37360ned/table?fromstatweb)
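The seed weight for gender X described in the list above can be sketched as follows. This is an illustration only: the function names and the labels `"male"`, `"female"` and `"X"` are hypothetical, and the male and female shares are left as parameters because the CBS figures are not reproduced in this document.

```python
import random

# Seed weight for gender X as derived in the list above: 3.7% + 0.4% / 2.
WEIGHT_X = 0.037 + 0.004 / 2


def gender_weights(male_share: float, female_share: float) -> dict[str, float]:
    """Spread the non-X weight over male and female in proportion to the shares."""
    remaining = 1.0 - WEIGHT_X
    total = male_share + female_share
    return {
        "male": remaining * male_share / total,
        "female": remaining * female_share / total,
        "X": WEIGHT_X,
    }


def pick_gender(weights: dict[str, float], rng: random.Random) -> str:
    """Draw one gender, using the weights as relative probabilities."""
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
```

The shares passed to `gender_weights` would come from the CBS dataset mentioned above; only their ratio matters, since the function normalises them.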
With just the datasets it is hard to arrive at the exact probability numbers that are used in the simulation. These numbers can be found below.
Birth probability
We take the probability, given the age of a female* person in years, that this person will bear a living child within one year, interpolated with a Gaussian bell curve fit (e.g. using https://mycurvefit.com/). This results in the following data points:
Age | Probability |
---|---|
12.5 | 0.00020557 |
22.5 | 0.00858504 |
27.5 | 0.03612038 |
32.5 | 0.05931208 |
37.5 | 0.03316359 |
42.5 | 0.00722938 |
60 | 0.00219446 |
Resulting parameters: mean = 32.29439, standard deviation = 4.904455, factor ‘a’ = 0.05897603
\* Here ‘female’ denotes someone who is, when having a child, indicated as ‘mother’ by the CBS.
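As a rough sketch, the fitted parameters can be turned into a function of age. This assumes the fit has the common Gaussian form a * exp(-(age - mean)^2 / (2 * sd^2)); the function name `birth_probability` is hypothetical and not taken from the project.

```python
import math

# Fitted parameters quoted in the text above.
MEAN = 32.29439
STD_DEV = 4.904455
FACTOR_A = 0.05897603


def birth_probability(age: float) -> float:
    """Probability of bearing a living child within one year at a given age.

    Assumption: the fit has the form a * exp(-(age - mean)^2 / (2 * sd^2)).
    """
    exponent = -((age - MEAN) ** 2) / (2 * STD_DEV ** 2)
    return FACTOR_A * math.exp(exponent)
```

Under this assumption the curve peaks at the mean age, where it equals factor ‘a’, and is symmetric around that age.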
Death probability
The probability of death is based on this dataset: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/37360ned/table?fromstatweb. We take the 2022 data and use the average of male and female subjects. Then, since people in general do not live beyond 120 years (https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), we add an extra data point with a probability of death of 1 at 120 years. This results in the following data points for the probability of death within a year, given the person’s age in years:
Age | Probability |
---|---|
0 | 0.00445 |
21 | 0.00032 |
61 | 0.005845 |
81 | 0.048975 |
120 | 1 |
With curve fitting (e.g. using a tool like https://mycurvefit.com/), assuming a power curve, we obtain the following equation: y = 1.119401e-16 * x^7.671776
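A minimal sketch of how this equation could be evaluated per age is shown below. The function name `death_probability` and the clamp to 1 are assumptions added for illustration; only the coefficient and exponent come from the equation above.

```python
# Coefficient and exponent of the fitted power curve quoted above.
COEFFICIENT = 1.119401e-16
EXPONENT = 7.671776


def death_probability(age: float) -> float:
    """Probability of dying within one year, from the fitted power curve."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Assumption: clamp to 1.0, because the power curve keeps growing
    # beyond the last data point at age 120.
    return min(1.0, COEFFICIENT * age ** EXPONENT)
```

Note that a power curve of this form yields 0 at age 0, so it does not reproduce the data point for newborns in the table above.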
Note: The current probability does not compensate for COVID-19 deaths, see https://www.cbs.nl/nl-nl/nieuws/2024/06/sterfte-in-2023-afgenomen for some details.
Note: An alternative would be to define a random life span per agent at birth and to check at each step whether or not the life span has been reached. However, little information seems to be available about the standard deviation of the life span, except that it was ‘approximately 1 year’ around 2008.
Running the project
The project can be run by following the guide in the CONTRIBUTING.md file.
Developer documentation
If you would like to contribute to this project, consult the CONTRIBUTING.md file.
Deployment
The project can be deployed by using the deployment files in the deploy folder. The current deployment is set up to create a k8s cluster using kustomize.
The kustomize setup for creating a sandbox environment can be found in sandbox/kustomization.yaml.
License
Copyright © VNG Realisatie 2024