Google Summer of Code 2023 - AboutCode

Report for GSoC'23

About Me

I am Jay Kumar, and I have been working with AboutCode throughout the summer on making ScanCode Toolkit more portable and accessible to a wider range of users. It has been an enriching experience that combined my passion for programming with the opportunity to contribute to open source software.

About ScanCode Toolkit

ScanCode detects and normalizes origin, dependencies, licensing and other related information in your code:

  • by parsing package manifests and dependencies locking files to a normalized metadata model and assigning each an identifying Package URL,

  • by detecting license tags, notices and texts in text and binaries using the world's most comprehensive database of license texts and notices and a unique combination of techniques,

  • by recognizing copyright statements using an advanced natural language parsing grammar and detecting other origin clues (such as emails, URLs, and authors).

Project Overview

  • ScanCode Toolkit (SCTK) uses pyahocorasick and intbitset for license detection and matching, and lxml for detecting licenses and metadata in XML files.

  • All three are currently implemented as Python extension modules written in C or Cython.

  • These libraries are important for ScanCode's performance, but they need to be compiled and installed in a specific way, which may not be possible on every machine.

  • This can make ScanCode less portable and harder to use for users who do not have the necessary tools or libraries installed.

  • To address this, the proposal is to create fallback dependencies written in pure Python that ScanCode can use when the required C extensions are not available, making it easier to install and use on a wider range of machines.
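The fallback idea can be sketched as a small import helper. This is only an illustration under stated assumptions, not how ScanCode Toolkit actually wires its dependencies: the helper name `import_first_available` is hypothetical, and the demo uses a deliberately missing module name plus the standard-library `json` module so that it runs anywhere.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper: candidates are tried in order, so a compiled
    extension can be listed first and a pure-Python fallback second.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is not installed (or failed to load); try the next.
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Demo with stand-in names: the first module does not exist, so the
# standard-library "json" module is returned as the "fallback".
backend = import_first_available("hypothetical_c_extension_xyz", "json")
print(backend.__name__)
```

In the setting described above, the candidates would be a compiled library followed by its pure-Python replacement, provided both expose a compatible interface. The module names the fallbacks are importable under are not stated here, so they are left out of the sketch.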

Issue References

Idea: https://github.com/nexB/aboutcode/wiki/GSOC-2023#create-pure-python-fallback-dependencies

Issue: https://github.com/nexB/scancode-toolkit/issues/3210

Achieved Goals

  1. ahocode: This is the fallback dependency for pyahocorasick, which is at the heart of license detection. It is built on top of the ahocorapy library. The basic implementation of ahocode is complete, but because of a bug or logic error that has not yet been identified, its integration with ScanCode Toolkit fails and still has to be completed.

  2. bitcode: This is the fallback dependency for intbitset. It uses Python's built-in set type and set operations to perform bitset operations, which are used for license matching and for building indices. The basic implementation of bitcode is complete, and it is almost fully integrated with ScanCode Toolkit, with nearly all tests passing.

  3. sanexml: This is the fallback dependency for lxml's etree module. It uses Python's built-in xml module and BeautifulSoup4 for lenient XML parsing, which is used for license and metadata parsing. The basic implementation of sanexml is complete, and it has been successfully integrated with ScanCode Toolkit with all tests passing.

  4. ScanCode Toolkit integration: Integrating all three libraries with ScanCode Toolkit. This is complete for sanexml, nearly complete for bitcode, and still pending for ahocode.
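To make the bitcode approach concrete, here is a minimal sketch of a bitset-like type backed by Python's built-in set. It is not the actual bitcode implementation and does not claim to cover the intbitset API: the class name `SetBackedBitset` and the sample values are hypothetical and only illustrate how set operations can stand in for bitset operations.

```python
class SetBackedBitset:
    """A small set of non-negative integers backed by a built-in set.

    Hypothetical illustration only; the real bitcode library may differ.
    """

    def __init__(self, values=()):
        self._items = set()
        for value in values:
            self.add(value)

    def add(self, value):
        # A bitset can only represent non-negative integer positions.
        if not isinstance(value, int) or value < 0:
            raise ValueError("only non-negative integers are supported")
        self._items.add(value)

    def __contains__(self, value):
        return value in self._items

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        # Iterate in ascending order, as walking the bits of a bitset would.
        return iter(sorted(self._items))

    def __and__(self, other):
        # Intersection: positions set in both operands.
        return SetBackedBitset(self._items & other._items)

    def __or__(self, other):
        # Union: positions set in either operand.
        return SetBackedBitset(self._items | other._items)

    def __sub__(self, other):
        # Difference: positions set in self but not in other.
        return SetBackedBitset(self._items - other._items)

    def __eq__(self, other):
        return isinstance(other, SetBackedBitset) and self._items == other._items

    def __repr__(self):
        return f"SetBackedBitset({sorted(self._items)})"


# Made-up token ids standing in for a scanned text and a license rule.
query = SetBackedBitset([1, 2, 3, 5, 8])
rule = SetBackedBitset([2, 3, 4, 5])
common = query & rule
print(list(common))        # [2, 3, 5]
print(len(query | rule))   # 6
```

A set-backed type like this trades the memory compactness and speed of a compiled bitset for portability, which matches the goal of a fallback that is only used when the compiled library is unavailable.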

Future Goals

For the immediate future, my aim is to:

  1. Find the memory leak or logic error in ahocode that ultimately leads to a MemoryError while creating the automaton.

  2. Find the bug in bitcode that causes a few tests to fail.

In the long term, I would like to further optimize these libraries and contribute more to AboutCode.