Duncan’s Word Data Generator for velut




Testing

Open your browser’s console, then click “Run tests”. The tests confirm that the behaviour of my data-generating functions is as I expect.

About

Note: If you’re not me, you’re unlikely to have much use for this Word Data Generator.

The Word Data Generator generates the data for my Latin rhyming dictionary, velut; specifically it generates the “words” data. For each word and corresponding list of lemmata, this generates fields such as the phonetic representation, the number of syllables, the rhyming-part, and strings that words are sorted on.

This webpage is nice for showing you what the generator does. You can input several words (and the lemma or lemmata for each) in the first box, click “Generate Json”, and see the output in the second box. The resultant Json can be downloaded or copied to the clipboard.

But the generator can also be used outside of the browser, by running a single JavaScript file named generator.js in Node. When it runs in Node, it reads its input from a hardcoded filepath and saves its output to another hardcoded filepath. (It also saves the output in smaller files, one for each batch of 50,000 words.) Once the output is on my hard drive, I have a script to upload it to the MongoDB database that the velut website uses.
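A minimal sketch of how one script might tell whether it is running in Node or in a browser. The function name isRunningInNode is hypothetical and the check shown is a common convention, not necessarily the one generator.js uses.

```javascript
// Hypothetical helper: generator.js may detect its environment differently.
function isRunningInNode() {
  // Node exposes a global `process` object with `versions.node` set;
  // browsers normally do not.
  return (
    typeof process !== "undefined" &&
    process.versions != null &&
    process.versions.node != null
  );
}

if (isRunningInNode()) {
  console.log("Node: would read the input file and write the output files.");
} else {
  console.log("Browser: would wait for a button click instead.");
}
```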

Input and output format

Each line of input must be a word, whitespace, and the space-separated list of lemmata. The “Load sample” button will give you some examples; the examples use a tab for the whitespace between the word and the lemmata, because you get tabs when pasting from Excel cells. But you can use a normal space (or several) if you prefer.
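The input format described above could be parsed roughly as follows. This is a sketch under assumptions: parseInputLine and parseInput are hypothetical names, and skipping blank lines is a guess rather than documented behaviour.

```javascript
// Hypothetical helpers; these names are not taken from generator.js.
function parseInputLine(line) {
  // Split on any run of whitespace, so a tab, one space, or several spaces
  // all work. The first token is the word; the rest are its lemmata.
  const [word, ...lemmata] = line.trim().split(/\s+/);
  return { word, lemmata };
}

function parseInput(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "") // assumption: blank lines are ignored
    .map(parseInputLine);
}

// One line uses a tab (as pasted from Excel), the other uses spaces.
console.log(parseInput("vocābulōrum\tvocābulum\nest   sum edō"));
```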

The Json generated does not have commas separating the objects, or square brackets around the entire array. This is not the standard Json format, but is the format required by mongoimport (which is the tool my script uses to import into the database).
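To illustrate the format, here is a sketch that serialises objects one after another with no commas between them and no enclosing brackets. The function name toConcatenatedJson is hypothetical, and the sample objects are simplified placeholders: storing Lemmata as an array is an assumption, and the real output has more fields.

```javascript
// Hypothetical helper; the sample objects are not velut's real schema.
function toConcatenatedJson(objects) {
  // One Json document per line: no enclosing [ ] and no commas between documents.
  return objects.map((object) => JSON.stringify(object)).join("\n");
}

const sample = [
  { Word: "est", Lemmata: ["sum", "edō"] },
  { Word: "vocābulōrum", Lemmata: ["vocābulum"] },
];
console.log(toConcatenatedJson(sample));
```

By contrast, JSON.stringify(sample) on the whole array would produce the standard form, with square brackets and commas.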

The velut Excel file & how I’ve replaced parts of it

Although the velut website uses a MongoDB database, and this page produces Json data for the MongoDB database, I privately have a large Excel file for generating and storing the data in velut. This webpage is intended to replace a sheet called “wordsform” in that Excel file. The sheet can generate all the data from words and lemmata entered into the second and third columns — which is why the second and third fields of the output are Word and Lemmata, the two columns that don’t have Excel formulae.

One of the differences between the “wordsform” sheet and this Word Data Generator is that in the sheet the output data are in Excel cells, but in the generator they’re in Json format. Copying data from Excel makes them tab-delimited. To convert the tab-delimited data to Json, I use my Json Generator, which is a separate webpage. I need the Json Generator less now that I have this Word Data Generator, because the data from this are already in Json format. (The Json Generator is still useful for other sheets in the Excel file.)

The benefit of running generator.js in Node is that I can process all my Latin words (more than 120 thousand) in about twenty seconds. If I tried to process that quantity of data in the browser, my browser would freeze, unsurprisingly! Likewise, Excel would probably crash if I tried to use the “wordsform” sheet to regenerate all the data.

Version control

I track the data-files in Git so I can check whether a change to my code has (inadvertently or deliberately) altered the output. But I don’t track the file that contains all the output — it’s huge. Instead, the Node-only code splits the data into batches of 50,000 words and saves the batches as files, and Git tracks those files.
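The batching step can be sketched as below. The function name splitIntoBatches is hypothetical, and writing each batch to its own file is left out; only the batch size of 50,000 comes from the description above.

```javascript
// Hypothetical helper: splits an array into consecutive batches.
function splitIntoBatches(items, batchSize = 50000) {
  const batches = [];
  for (let start = 0; start < items.length; start += batchSize) {
    // slice() stops at the end of the array, so the last batch may be smaller.
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Placeholder data standing in for the real word list.
const placeholderWords = Array.from({ length: 120001 }, (_, i) => `word${i}`);
const batchSizes = splitIntoBatches(placeholderWords).map((batch) => batch.length);
console.log(batchSizes); // [ 50000, 50000, 20001 ]
```

Because each batch holds a fixed range of positions in the list, a change to one word's data touches only the file for its batch, which keeps the diffs manageable.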

Checking the output in Node

I can also use Node to check the output against all the “words” data I previously generated. The code for this check is at the end of generator.js. When I ran it against all the “words” data I had from Excel, everything matched, except for some cases where I had bugs in the Excel formulae, which I have corrected in the JavaScript. These changes of behaviour are listed in the next section.
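A check of this kind might look like the sketch below, which compares two outputs line by line and reports where they differ. The function name findMismatchedLines is hypothetical, and the actual check at the end of generator.js may work differently; this version assumes both outputs hold one Json object per line, in the same order.

```javascript
// Hypothetical helper: compares expected and actual output line by line.
function findMismatchedLines(expectedText, actualText) {
  const expectedLines = expectedText.split("\n");
  const actualLines = actualText.split("\n");
  const lineCount = Math.max(expectedLines.length, actualLines.length);
  const mismatches = [];
  for (let i = 0; i < lineCount; i++) {
    // A line missing from one side is undefined, so it counts as a mismatch.
    if (expectedLines[i] !== actualLines[i]) {
      mismatches.push({
        lineNumber: i + 1,
        expected: expectedLines[i],
        actual: actualLines[i],
      });
    }
  }
  return mismatches;
}

// Placeholder data: the second line differs between the two outputs.
const previousOutput = '{"Word":"a"}\n{"Word":"b"}';
const currentOutput = '{"Word":"a"}\n{"Word":"c"}';
console.log(findMismatchedLines(previousOutput, currentOutput));
```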

Behaviour changes between my Excel and JavaScript code

Some changes would have been noticeable because of inaccuracies on the pages on the velut website for these words. For example, coiēns was scanned as –– instead of ⏑⏑–.

Other bug-fixes do not change anything displayed on the velut website. But I wanted all the code here to be correct, even when not (yet) used directly.

Behaviour I might change in the future

Using Excel to create velut (and my lack of formal training in software development) led me to do some things with the Word Data Generator that I wouldn’t otherwise have done.

Testing in the browser

If you’re not me, you won’t have access to all the input data, nor will you have access to the data from Excel that I compare the output to in Node. But you can run some tests yourself in your browser’s console by clicking the “Run tests” button above. These tests run the following:

My current workflow for managing velut

The Word Data Generator is reliable enough that I’ve begun using it for real. The velut website uses the data it generates.

Nonetheless, I’m still using the “wordsform” sheet, because I don’t want to break other parts of the Excel file. Much of my Excel file still relies on me having all the data in Excel, not in the Json format that this page produces. Eventually enough of the Excel file will be obsolete that I can stop storing all my data in Excel. Only then will I stop using the “wordsform” sheet.

It’s all part of my long-term project of converting my Excel file into websites and webpages that are easier to share and maintain. I’m very much in a transition period of using the Excel file for some things and my newer websites/webpages for others. But the Word Data Generator is another step in the process. At the moment, the whole velut project is very convoluted; in the future, it won’t be as bad.