<center>
# Monadical Project Proposal for Kiwix.org [example]
*2020-03-18 by @ Monadical.com*
</center>
---
**Summary:** `Node/Python binding library refactoring and implementation work.`
**Proposal by:** Nick Sweeting, Juan Diego Caballero @ [Monadical Consulting](https://monadical.com)
**For client:** [Kelson Engelhart](https://www.linkedin.com/in/emmanuelengelhart/) @ Kiwix.org / OpenZIM.org
---
**Contents:**
[TOC]
## Quick Contact Info
> Schedule a call any time to chat with us:
https://calendly.com/nicksweeting/choose-a-time
**Monadical:**
- **Nick Sweeting** *(Co-Founder)*
nick@monadical.com
`+1 (503) 741-9577`
- **Max McCrea** *(Co-Founder)*
max@monadical.com
**Kiwix.org / OpenZIM.org:**
- **Emmanuel "Kelson" Engelhart** *(Co-Founder)*
kelson@kiwix.org
https://github.com/kelson42
https://kiwixoffline.slack.com
---
## Project Overview
### Project 1: Refactoring `node-libzim` from `nbind` to `N-API`
- Old `nbind` dependency is currently preventing upgrade from Node v8 to v13
- Newer `N-API` package provides nicer binding architecture
- Aim for a seamless build & install process, with no breaking changes to the existing API
#### Development Timeline & Budget
This is an estimate of the development time it will take to achieve each project milestone.
- ~**1 week:** upgrade node-libzim to use N-API, w/ docker build process
- ~**1 week:** improve test suite, implement CI/CD, and write documentation
- **(optional extra):** expose more of the remaining `libzim` methods that aren't yet available in the current version
:::info
**Total:** 1-2 weeks of full-time work with 1 senior developer on the project.
:::
There's a chance we can do it faster because the API implementation work itself is very simple, but this estimate includes a safety margin for ramping up on the ZIM format, N-API, and the Node binding ecosystem in general.
This estimate **assumes 1 full-time senior developer is working on this project**. If the project is done part-time instead, each billing cycle will be cheaper but more total development time will likely be needed to complete the project.
#### Risks & Concerns
**These are the biggest factors that influence the delivery timeline:**
- Difficulties with passing large buffers across the language boundary while maintaining atomicity
- Difficulty with making the data types and `async/await` calls threadsafe in C++
- If needed: making it compile on macOS without Docker (there's currently only 1 error; this may be easy to fix, but we're not sure yet)
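
One way to contain the thread-safety risk listed above is to serialize calls on the JavaScript side, so the native layer only ever sees one write in flight. The sketch below is an illustration under that assumption, not the planned implementation: `SerialQueue` and `fakeNativeWrite` are hypothetical names, with `fakeNativeWrite` standing in for a non-threadsafe call into the C++ binding.

```javascript
// Hypothetical helper: runs async tasks strictly one at a time, in call order.
class SerialQueue {
  constructor() {
    this.tail = Promise.resolve();
  }

  run(task) {
    // Start this task only after every previously queued task has settled.
    const result = this.tail.then(() => task());
    // Swallow rejections on the internal chain so one failure does not block
    // later tasks; the caller still sees the rejection through `result`.
    this.tail = result.catch(() => {});
    return result;
  }
}

async function demo() {
  const queue = new SerialQueue();
  const written = [];
  let inFlight = 0;
  let maxInFlight = 0;

  // Stand-in for a native write that must never overlap with another one.
  const fakeNativeWrite = async (url) => {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise((resolve) => setTimeout(resolve, 5));
    written.push(url);
    inFlight -= 1;
  };

  // Callers fire all writes concurrently; the queue runs them back to back.
  const urls = ['A/one', 'A/two', 'A/three'];
  await Promise.all(urls.map((url) => queue.run(() => fakeNativeWrite(url))));

  return { written, maxInFlight };
}

demo().then((result) => console.log(result));
```

The trade-off is throughput: writes no longer overlap, which is acceptable only if the native call itself is the bottleneck or cannot be made safe to run concurrently.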
#### Final Deliverables
- Working `node-libzim` library usable by `mwoffliner` on Node v12/13 without `nbind`
- Minimal API disruption for existing users
- Build process that works seamlessly and deterministically in Docker
- End-user installation that works seamlessly with `yarn`/`node` on Linux and Docker (macOS later, not now)
- CI/CD that tests/builds/releases automatically upon pull requests / merges (just for the bindings, not all of libzim)
- Documentation for end-users and developers
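
The "minimal API disruption" deliverable can be checked mechanically. As a rough sketch, the snippet below compares a set of exported classes against the method names used in the usage example under Technical Overview. `missingMethods`, `StubCreator`, and `StubReader` are hypothetical; in a real check the stubs would be replaced by the actual exports of the rebuilt package.

```javascript
// Method names taken from the usage example in this proposal.
const EXPECTED_API = {
  ZimCreator: ['addArticle', 'finalise'],
  ZimReader: [
    'getCountArticles',
    'getArticleById',
    'getArticleByUrl',
    'suggest',
    'search',
    'destroy',
  ],
};

// Returns a list like ['ZimReader.search'] for every expected method that the
// given exports object does not provide.
function missingMethods(exportsObject, expected = EXPECTED_API) {
  const missing = [];
  for (const [className, methods] of Object.entries(expected)) {
    const cls = exportsObject[className];
    for (const method of methods) {
      const present =
        typeof cls === 'function' && typeof cls.prototype[method] === 'function';
      if (!present) {
        missing.push(`${className}.${method}`);
      }
    }
  }
  return missing;
}

// Hypothetical stand-ins for the real bindings; StubReader deliberately
// omits `search` to show what a regression would look like.
class StubCreator {
  async addArticle() {}
  async finalise() {}
}

class StubReader {
  async getCountArticles() {}
  async getArticleById() {}
  async getArticleByUrl() {}
  async suggest() {}
  async destroy() {}
}

console.log(missingMethods({ ZimCreator: StubCreator, ZimReader: StubReader }));
// → [ 'ZimReader.search' ]
```

A check like this only guards method names, not argument or return shapes, so it complements the existing test suite and does not replace it.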
---
### Project 2: Creating the `python-libzim` package
- Will become the canonical Python-facing API for creating and interacting with ZIM files, used both internally by the scrapers and externally by users consuming ZIM files
- Will start by exposing the core `libzim` API used by `sotoki` and `mwoffliner` (described below), with potential to expand to the rest of the `libzim` functionality
- Can be partly property-based auto-generated Cython code, based on `libzim`'s `.h` files
- Should be threadsafe, usable with `async/await`, `multiprocessing`, or `threading`
- Should be built and tested w/ CI/CD and good documentation
#### Development Timeline & Budget
This is an estimate of the development time it will take to achieve each project milestone.
- ~**1 week:** Setting up tooling, defining cross-language data types, linking main `libzim` callsites with Cython/pybind, researching most efficient multiprocessing message-passing approaches
- ~**1-2 weeks:** Implementing main ZimReader & ZimWriter calls with tests, adding Suggest/Search/lookup functionality, checking `sotoki` compatibility
- ~**1 week:** Implementing `async/await` compatibility, improving test suite, documentation, CI/CD process for developers
- **(optional):** Implementing additional API surface from `libzim` beyond simple ZIM creation/reading used by `sotoki`
:::info
**Total:** 2-4 weeks of full-time work with 1 senior developer on the project.
(we've already started implementing a bit of this though, so we have a bit of a head-start)
:::
These estimates assume we already have some ramp-up experience from doing the `node-libzim` bindings, and that **1 solo full-time senior developer is working on this project**.
#### Risks & Concerns
**These are the biggest factors that influence the delivery timeline:**
- Whether additional API needs to be exposed beyond the calls used by `mwoffliner`/`sotoki`
- Whether we need to use something other than Cython / pybind11 for compatibility reasons
- Making it threadsafe for both `multiprocessing` and `async/await` use-cases, whether by using message-passing/queues or shared memory with locks (only needed if atomicity is required; otherwise we can skip this work)
#### Final Deliverables
- Working [`python-libzim`](https://github.com/openzim/python-libzim) PyPI package for `sotoki` that doesn't use `zimwriterfs`
- `pip install ...` process that works on Linux and Docker (macOS later, not now)
- C++ compilation process that works seamlessly in Docker
- Tests that build & run in CI automatically upon pull requests / merges (+release if needed) (just for the bindings, not all of libzim)
    - https://github.com/openzim/node-libzim/blob/master/src/test.ts
    - https://github.com/openzim/mwoffliner/blob/master/test/e2e/bm.e2e.test.ts
    - https://hub.docker.com/r/openzim/mwoffliner using this as well
    - and any new ones needed
- Documentation for both end-users and developers
- Update `sotoki` to work with the new `python-libzim` (migrating away from `zimwriterfs`)
---
## Technical Overview
### Node Bindings
The C++ `libzim` has 52 classes, but `mwoffliner` and `sotoki` only need to use 3 of them.
This is the API we'll make sure works based on the existing tests here:
https://github.com/openzim/node-libzim/blob/master/src/test.ts
https://github.com/openzim/mwoffliner/blob/master/test/e2e/bm.e2e.test.ts
This API is basically the same as what's needed for `sotoki` (based on what we can tell so far), so we'll aim to keep it relatively consistent between packages.
```javascript
const path = require('path');
const faker = require('faker');
// Package name assumed here; adjust to however node-libzim is installed locally
const { ZimCreator, ZimArticle, ZimReader } = require('@openzim/libzim');

async function main() {
  // Create a ZIM file
  const creator = new ZimCreator({
    fileName: 'test.zim',
    welcome: 'welcome',
    fullTextIndexLanguage: 'eng',
    minChunkSize: 2048,
  }, {});

  // Generate some article content and a placeholder image payload
  const imageContent = Buffer.from(faker.image.image());
  const articleContent = faker.lorem.paragraphs(3);
  const articleTitle = faker.lorem.words(faker.random.number({ min: 1, max: 4 }));
  const articleUrl = articleTitle.replace(/ /g, '_');

  // Record an article
  const article = new ZimArticle({ url: articleUrl, data: articleContent, title: articleTitle, mimeType: 'text/html', shouldIndex: true, ns: 'A' });
  await creator.addArticle(article);

  // Record an image (the URL is illustrative)
  const imgUrl = 'test_image.png';
  const image = new ZimArticle({ url: imgUrl, data: imageContent, mimeType: 'image/png', ns: 'I' });
  await creator.addArticle(image);

  // Record a redirect that points at the article (the redirect's own URL is illustrative)
  const redirectFromUrl = `${articleUrl}_redirect`;
  const redirect = new ZimArticle({ url: redirectFromUrl, redirectUrl: articleUrl, data: '', title: articleTitle, mimeType: 'text/html', shouldIndex: true, ns: 'A' });
  await creator.addArticle(redirect);

  // Write the file out once, after all entries have been added
  await creator.finalise();

  // Read back a ZIM file
  const zimFile = new ZimReader(path.join(__dirname, '../test.zim'));

  const numberOfArticles = await zimFile.getCountArticles();
  console.info(`Count Articles:`, numberOfArticles);

  const firstArticle = await zimFile.getArticleById(0);
  console.info(`First Article:`, firstArticle);

  const suggestResults = await zimFile.suggest('laborum');
  console.info(`Suggest Results:`, suggestResults);

  const searchResults = await zimFile.search('rem');
  console.info(`Search Results:`, searchResults);

  const readArticleContent = await zimFile.getArticleByUrl('A/laborum');
  console.info(`Article by url (laborum):`, readArticleContent);

  await zimFile.destroy();
}

main().catch(console.error);
```
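
The example above passes article `data` both as a string (HTML) and as a `Buffer` (image). One possible way for the binding's JavaScript wrapper to handle both before handing bytes to C++ is to normalize everything to a `Buffer`, avoiding copies where the input is already binary. This is only a sketch: `toBuffer` is a hypothetical helper, not an existing `node-libzim` function.

```javascript
// Hypothetical helper: normalize article data to a Buffer before it crosses
// into native code.
function toBuffer(data) {
  if (Buffer.isBuffer(data)) {
    return data; // already a Buffer: reuse it, no copy
  }
  if (typeof data === 'string') {
    return Buffer.from(data, 'utf8'); // encodes the string into a new Buffer
  }
  if (data instanceof Uint8Array) {
    // Wrap the same underlying memory instead of copying it.
    return Buffer.from(data.buffer, data.byteOffset, data.byteLength);
  }
  throw new TypeError('article data must be a string, Buffer, or Uint8Array');
}

const html = toBuffer('<p>héllo</p>');
const image = toBuffer(Buffer.from([0x89, 0x50, 0x4e, 0x47]));
console.log(html.length, image.length);
```

Note that the byte length of the encoded string can exceed its character count, so any size passed to the native side should come from the `Buffer`, not from the original string.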
#### Resources
- https://nodejs.org/api/n-api.html
- https://github.com/nodejs/node-addon-api
- might be possible to auto-generate N-API bindings from labeled libzim code to save on maintenance burden going forward: https://github.com/Geode-solutions/genepi
- https://codemerx.com/blog/asynchronous-c-addon-for-node-js-with-n-api-and-node-addon-api/
- https://medium.com/the-node-js-collection/n-api-next-generation-node-js-apis-for-native-modules-169af5235b06
- https://github.com/openzim/node-libzim/blob/master/src/test.ts
- https://medium.com/@atulanand94/beginners-guide-to-writing-nodejs-addons-using-c-and-n-api-node-addon-api-9b3b718a9a7f
- https://github.com/openzim/node-libzim/tree/master/src
- https://github.com/openzim/mwoffliner/blob/master/test/e2e/bm.e2e.test.ts
### Python Bindings
> Sotoki will likely only need the same 3 libzim classes that mwoffliner uses.
This is a summary of the API we'll have to write for an initial `python-libzim` implementation. The code is virtually the same as the Node version above, so we'll omit the code examples here.
https://github.com/openzim/python-libzim
#### 1. Article
- [x] ZimArticle constructor (that supports taking binary blobs of arbitrary MIME types)
#### 2. ZimCreator
- [x] ZimCreator constructor (we've already got this working)
- [x] addArticle (in-progress, a little tricky to deal with passing article blob object)
- [ ] setMetadata (pretty easy, can just take some key/value pairs)
- [x] finalise (super easy)
#### 3. ZimReader
- [x] ZimReader constructor (we've already got this working)
- [x] destroy (can be a method in python land, or we can hook it up to `__del__`?)
- [x] getCountArticles (we've already got this working)
- [x] getArticleByUrl (super easy, not much work to implement)
- [x] getArticleById (same here)
- [x] suggest (easy to call, but returns an iterator that may need some work to translate)
- [x] search (easy to call, but returns an iterator that may need some work to translate)
Our work on this so far was inspired by Matthew G.'s existing code from here:
https://framagit.org/mgautierfr/pyzim/-/blob/master/pyzim/zim_wrapper.pxd
#### Resources
- https://framagit.org/mgautierfr/pyzim
- https://github.com/pediapress/pyzim
- https://github.com/jarondl/pyzimmer/blob/master/pyzimmer/zim_writer.py
- https://github.com/cython/cython/wiki/AutoPxd
- https://www.youtube.com/watch?v=YReJ3pSnNDo
- https://github.com/openzim/zim-tools/blob/master/src/zimrecreate.cpp (Article's Abstract Class wrapper)
- [Paul Ganssle -- Build your Python Extensions with Rust!](https://www.youtube.com/watch?v=1KC43QuJKlE) (also talks about C++)
---
## Timeline: Next Steps
- [x] 1. Introductory calls with Kiwix & Monadical teams
- [x] 2. Begin project technical research and ramp-up
- [ ] 3. Review development timeline, budget, and team structure proposal
<a href="https://calendly.com/nicksweeting/choose-a-time" class="btn btn-success btn-sm">Click to schedule this call >></a>
- [ ] 4. Confirm timeline, deliverables, billing details, and sign software development contract
- [ ] 5. Begin development of `node-libzim`
    - [ ] check in often via Slack + calls as needed
    - [ ] submit WIP and final PRs as code is ready for review
    - [ ] help ensure smooth release and provide support & bugfixing assistance
- [ ] 6. Begin development of `python-libzim`
    - [ ] same as above
---
<center>
<img src="https://monadical.com/static/logo-black.png" style="height: 80px"/><br/>
Monadical.com | Full-Stack Consultancy
</center>