<center>

# Monadical Project Proposal for Kiwix.org [example]

*2020-03-18 by @ Monadical.com*

</center>

---

**Summary:** `Node/Python binding library refactoring and implementation work.`

**Proposal by:** Nick Sweeting, Juan Diego Caballero @ [Monadical Consulting](https://monadical.com)

**For client:** [Kelson Engelhart](https://www.linkedin.com/in/emmanuelengelhart/) @ Kiwix.org / OpenZIM.org

---

**Contents:**

[TOC]

## Quick Contact Info

> Schedule a call any time to chat with us: https://calendly.com/nicksweeting/choose-a-time

**Monadical:**

- **Nick Sweeting** *(Co-Founder)*
  nick@monadical.com
  `+1 (503) 741-9577`
- **Max McCrea** *(Co-Founder)*
  max@monadical.com

**Kiwix.org / OpenZIM.org:**

- **Emmanuel "Kelson" Engelhart** *(Co-Founder)*
  kelson@kiwix.org
  https://github.com/kelson42
  https://kiwixoffline.slack.com

---

## Project Overview

### Project 1: Refactoring `node-libzim` from `nbind` to `N-API`

- The old `nbind` dependency is currently preventing an upgrade from Node v8 to v13
- The newer `N-API` package provides a nicer binding architecture
- Strive to make the build & install process perfect, w/ no breaking changes to the existing API

#### Development Timeline & Budget

This is an estimate of the development time it will take to achieve each project milestone.

- ~**1 week:** upgrade node-libzim to use N-API, w/ docker build process
- ~**1 week:** improve test suite, implement CI/CD, and write documentation
- **(optional extra):** extend it with more of the remaining libzim methods that aren't yet exposed in the current version

:::info
**Total:** 1-2 weeks of full-time work with 1 senior developer on the project.
:::

There's a chance we can do it faster because the API implementation work itself is very simple, but this estimate includes safety margin time to ramp up on the ZIM format, N-API, and the Node binding ecosystem in general.

This estimate **assumes 1 full-time senior developer is working on this project**. If the project is done part-time instead, each billing cycle will be cheaper but more total development time will likely be needed to complete the project.

#### Risks & Concerns

**These are the biggest factors that influence the delivery timeline:**

- Difficulties with passing large buffers across the language boundary while maintaining atomicity
- Difficulty with making the data types and `async/await` calls threadsafe in C++
- If needed: making it compile on macOS without Docker (there's currently only 1 error; this may be easy, but we're not sure yet)

#### Final Deliverables

- Working `node-libzim` library usable by `mwoffliner` on Node v12/13 without `nbind`
- Minimal API disruption for existing users
- Build process that works seamlessly and deterministically in Docker
- End-user installation that works seamlessly with `yarn`/`node` on Linux and Docker (macOS later, not now)
- CI/CD that tests/builds/releases automatically upon pull requests / merges (just for the bindings, not all of libzim)
- Documentation for end-users and developers

---

### Project 2: Creating the `python-libzim` package

- Will become the canonical Python-facing API for creating and interacting with ZIM files, used both internally by the scrapers and externally by users consuming ZIM files
- Will start by exposing the core `libzim` API used by `sotoki` and `mwoffliner` (described below), with potential to expand to the rest of the `libzim` functionality
- Can be partly property-based auto-generated Cython code, based on `libzim`'s `.h` files
- Should be threadsafe, usable with `async/await`, `multiprocessing`, or `threading`
- Should be built and tested w/ CI/CD and good documentation

#### Development Timeline & Budget

This is an estimate of the development time it will take to achieve each project milestone.
- ~**1 week:** Setting up tooling, defining cross-language data types, linking main `libzim` callsites with Cython/pybind11, researching the most efficient multiprocessing message-passing approaches
- ~**1-2 weeks:** Implementing main ZimReader & ZimWriter calls with tests, adding Suggest/Search/lookup functionality, checking `sotoki` compatibility
- ~**1 week:** Implementing `async/await` compatibility, improving test suite, documentation, CI/CD process for developers
- **(optional):** Implementing additional API surface from `libzim` beyond simple ZIM creation/reading used by `sotoki`

:::info
**Total:** 2-4 weeks of full-time work with 1 senior developer on the project.
(We've already started implementing a bit of this, so we have a bit of a head start.)
:::

These estimates assume we already have some ramp-up experience from doing the `node-libzim` bindings, and that **1 solo full-time senior developer is working on this project**.

#### Risks & Concerns

**These are the biggest factors that influence the delivery timeline:**

- Whether additional API needs to be exposed beyond the calls used by `mwoffliner`/`sotoki`
- Whether we need to use something other than Cython / pybind11 for compatibility reasons
- Making it threadsafe for both `multiprocessing` and `async/await` use-cases, whether by using message-passing/queues or shared memory with locks (only if atomicity is needed; if not, we can skip this work)

#### Final Deliverables

- Working [`python-libzim`](https://github.com/openzim/python-libzim) PyPI package for `sotoki` that doesn't use `zimwriterfs`
- `pip install ...` process that works across Linux and Docker (macOS not now)
- C++ compilation process that works seamlessly in Docker
- Tests that build & run in CI automatically upon pull requests / merges (+release if needed) (just for the bindings, not all of libzim)
  - https://github.com/openzim/node-libzim/blob/master/src/test.ts
  - https://github.com/openzim/mwoffliner/blob/master/test/e2e/bm.e2e.test.ts
  - https://hub.docker.com/r/openzim/mwoffliner (using this as well)
  - and any new ones needed
- Documentation for both end-users and developers
- Update `sotoki` to work with the new `python-libzim` (update from `zimwriterfs`)

---

## Technical Overview

### Node Bindings

The C++ `libzim` has 52 classes, but `mwoffliner` and `sotoki` only need to use 3 of them.

This is the API we'll make sure works, based on the existing tests here:

https://github.com/openzim/node-libzim/blob/master/src/test.ts
https://github.com/openzim/mwoffliner/blob/master/test/e2e/bm.e2e.test.ts

This API is basically the same as what's needed for `sotoki` (based on what we can tell so far), so we'll aim to keep it relatively consistent between packages.

```javascript
// Assumption: the bindings are installed as `@openzim/libzim`; adjust the module path to your setup.
const path = require('path');
const faker = require('faker');
const { ZimArticle, ZimCreator, ZimReader } = require('@openzim/libzim');

async function main() {
  // Create a ZIM file
  const creator = new ZimCreator({
    fileName: 'test.zim',
    welcome: 'welcome',
    fullTextIndexLanguage: 'eng',
    minChunkSize: 2048,
  }, {});

  // Generate some fake content and images
  const imageContent = Buffer.from(faker.image.image());
  const articleContent = faker.lorem.paragraphs(3);
  const articleTitle = faker.lorem.words(faker.random.number({ min: 1, max: 4 }));
  const articleUrl = articleTitle.replace(/ /g, '_');
  const imgUrl = 'test_image.png'; // hypothetical value: not defined in the original snippet
  const redirectUrl = `${articleUrl}_redirect`; // hypothetical value: kept distinct from articleUrl to avoid a URL collision

  // Record an article
  const article = new ZimArticle({ url: articleUrl, data: articleContent, title: articleTitle, mimeType: 'text/html', shouldIndex: true, ns: 'A' });
  await creator.addArticle(article);

  // Record an image
  const image = new ZimArticle({ url: imgUrl, data: imageContent, mimeType: 'image/png', ns: 'I' });
  await creator.addArticle(image);

  // Record a redirect that points at the article
  const redirect = new ZimArticle({ url: redirectUrl, redirectUrl: articleUrl, data: '', title: articleTitle, mimeType: 'text/html', shouldIndex: true, ns: 'A' });
  await creator.addArticle(redirect);

  // Finalise once, after all entries have been added
  await creator.finalise();

  // Read back a ZIM file
  const zimFile = new ZimReader(path.join(__dirname, '../test.zim'));

  const numberOfArticles = await zimFile.getCountArticles();
  console.info(`Count Articles:`, numberOfArticles);

  const firstArticle = await zimFile.getArticleById(0);
  console.info(`First Article:`, firstArticle);

  const suggestResults = await zimFile.suggest('laborum');
  console.info(`Suggest Results:`, suggestResults);

  const searchResults = await zimFile.search('rem');
  console.info(`Search Results:`, searchResults);

  const readArticleContent = await zimFile.getArticleByUrl('A/laborum');
  console.info(`Article by url (laborum):`, readArticleContent);

  await zimFile.destroy();
}

main().catch(console.error);
```

#### Resources

- https://nodejs.org/api/n-api.html
- https://github.com/nodejs/node-addon-api
- It might be possible to auto-generate N-API bindings from labeled libzim code to reduce the maintenance burden going forward: https://github.com/Geode-solutions/genepi
- https://codemerx.com/blog/asynchronous-c-addon-for-node-js-with-n-api-and-node-addon-api/
- https://medium.com/the-node-js-collection/n-api-next-generation-node-js-apis-for-native-modules-169af5235b06
- https://github.com/openzim/node-libzim/blob/master/src/test.ts
- https://medium.com/@atulanand94/beginners-guide-to-writing-nodejs-addons-using-c-and-n-api-node-addon-api-9b3b718a9a7f
- https://github.com/openzim/node-libzim/tree/master/src
- https://github.com/openzim/mwoffliner/blob/master/test/e2e/bm.e2e.test.ts

### Python Bindings

> Sotoki will likely only need the same 3 libzim classes that mwoffliner uses.

This is a summary of the API we'll have to write for an initial `python-libzim` implementation. The code is virtually the same as the Node version above, so we'll omit the code examples here.

https://github.com/openzim/python-libzim

#### 1. Article

- [x] ZimArticle constructor (that supports taking binary blobs of arbitrary MIME types)

#### 2. ZimCreator

- [x] ZimCreator constructor (we've already got this working)
- [x] addArticle (in-progress, a little tricky to deal with passing the article blob object)
- [ ] setMetadata (pretty easy, can just take some key/value pairs)
- [x] finalise (super easy)

#### 3. ZimReader

- [x] ZimReader constructor (we've already got this working)
- [x] destroy (can be a method in Python land, or we can hook it up to `__del__`?)
- [x] getCountArticles (we've already got this working)
- [x] getArticleByUrl (super easy, not much work to implement)
- [x] getArticleById (same here)
- [x] suggest (easy to call, but returns an iterator that may need some work to translate)
- [x] search (easy to call, but returns an iterator that may need some work to translate)

Our work on this so far was inspired by Matthew G.'s existing code from here: https://framagit.org/mgautierfr/pyzim/-/blob/master/pyzim/zim_wrapper.pxd

#### Resources

- https://framagit.org/mgautierfr/pyzim
- https://github.com/pediapress/pyzim
- https://github.com/jarondl/pyzimmer/blob/master/pyzimmer/zim_writer.py
- https://github.com/cython/cython/wiki/AutoPxd
- https://www.youtube.com/watch?v=YReJ3pSnNDo
- https://github.com/openzim/zim-tools/blob/master/src/zimrecreate.cpp (Article's Abstract Class wrapper)
- [Paul Ganssle -- Build your Python Extensions with Rust!](https://www.youtube.com/watch?v=1KC43QuJKlE) (also talks about C++)

---

## Timeline: Next Steps

- [x] 1. Introductory calls with Kiwix & Monadical teams
- [x] 2. Begin project technical research and ramp-up
- [ ] 3. Review development timeline, budget, and team structure proposal &nbsp; <a href="https://calendly.com/nicksweeting/choose-a-time" class="btn btn-success btn-sm">Click to schedule this call >></a>
- [ ] 4. Confirm timeline, deliverables, billing details, and sign software development contract
- [ ] 5. Begin development of `node-libzim`
    - [ ] check in often via Slack + calls as needed
    - [ ] submit WIP and final PRs as code is ready for review
    - [ ] help ensure smooth release and provide support & bugfixing assistance
- [ ] 6. Begin development of `python-libzim`
    - [ ] same as above

---

<center>
<img src="https://monadical.com/static/logo-black.png" style="height: 80px"/><br/>
Monadical.com | Full-Stack Consultancy
</center>