@mustansir14 mustansir14 commented Dec 11, 2025

Description:

This PR adds GitLab project details (project ID, project name, and project owner name) to the metadata of each chunk.

Achieving this wasn't straightforward: the chunks are generated by the git source, which calls an injected SourceMetadataFunc to create the metadata, and the signature of that function does not include the GitLab project ID.

The solution implemented here is to maintain a repoToProjectCache map that stores the project details for each repo. The cache is populated as we enumerate the repos, and the callback function reads from it to populate the project details.

The only concerning part of this solution is memory usage. To put the impact in perspective, I found a public organization with ~3000 projects and ran benchmarks before and after making the changes. (I know this isn't anywhere close to the largest organizations we've come across, but it's the biggest public one I could find.)
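As a rough illustration of the idea, here is a minimal sketch of such a cache. It is not the code from this PR: the projectDetails fields, the method names, the mutex, and the sourceMetadata helper are all assumptions made for the example. Only the name repoToProjectCache and the fact that the callback receives the repository come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// projectDetails is a hypothetical value type; the field names are assumptions.
type projectDetails struct {
	ID    int
	Name  string
	Owner string
}

// repoToProjectCache maps a repository URL to the details of its project.
// The RWMutex is an assumption: it keeps the sketch safe if enumeration and
// chunk generation run on different goroutines.
type repoToProjectCache struct {
	mu    sync.RWMutex
	cache map[string]projectDetails
}

func newRepoToProjectCache() *repoToProjectCache {
	return &repoToProjectCache{cache: make(map[string]projectDetails)}
}

// set records the project details for a repo; called during enumeration.
func (c *repoToProjectCache) set(repoURL string, p projectDetails) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[repoURL] = p
}

// get returns the cached details for a repo, if any.
func (c *repoToProjectCache) get(repoURL string) (projectDetails, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.cache[repoURL]
	return p, ok
}

// sourceMetadata stands in for the injected callback: it only receives the
// repository URL, so it looks the project up in the cache.
func sourceMetadata(c *repoToProjectCache, repository string) string {
	p, ok := c.get(repository)
	if !ok {
		return fmt.Sprintf("repo=%s (no project details cached)", repository)
	}
	return fmt.Sprintf("repo=%s project_id=%d project_name=%s project_owner=%s",
		repository, p.ID, p.Name, p.Owner)
}

func main() {
	c := newRepoToProjectCache()

	// Enumeration phase: one entry per discovered repo (placeholder values).
	repo := "https://gitlab.example.com/group/repo.git"
	c.set(repo, projectDetails{ID: 42, Name: "repo", Owner: "group"})

	// Scan phase: the callback resolves the project from the repo URL alone.
	fmt.Println(sourceMetadata(c, repo))
	fmt.Println(sourceMetadata(c, "https://gitlab.example.com/other/unknown.git"))
}
```

The cache holds one small entry per repo, which is why its memory cost grows linearly with the number of enumerated projects.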

Results before making changes:

Total memory usage: 43234992 bytes (43.23 MB)

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkMemoryUsage$ github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab

2025-12-11T18:58:01+05:00	info-0	context	starting projects enumeration	{"list_options": {"pagination":"keyset","order_by":"id","sort":"asc"}, "all_available": false}
2025-12-11T18:59:01+05:00	info-0	context	Enumerated GitLab projects	{"count": 2930}
Memory used during benchmark:
Allocated (Alloc): 43234992 bytes
Total Allocated (TotalAlloc): 43234992 bytes
Memory obtained from system (Sys): 17629184 bytes
goos: linux
goarch: amd64
pkg: github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkMemoryUsage-12    	       1	59642136351 ns/op	42916920 B/op	  246843 allocs/op
PASS
ok  	github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab	60.146s

Results after this change:

Total memory usage: 44666216 bytes (44.66 MB)

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkMemoryUsage$ github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab

goos: linux
goarch: amd64
pkg: github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
=== RUN   BenchmarkMemoryUsage
BenchmarkMemoryUsage
2025-12-11T18:38:02+05:00       info-0  context starting projects enumeration   {"list_options": {"pagination":"keyset","order_by":"id","sort":"asc"}, "all_available": false}
2025-12-11T18:39:09+05:00       info-0  context Enumerated GitLab projects      {"count": 2930}
Memory used during benchmark:
Allocated (Alloc): 44666216 bytes
Total Allocated (TotalAlloc): 44666216 bytes
Memory obtained from system (Sys): 21823488 bytes
BenchmarkMemoryUsage-12                1        66827846825 ns/op       44350440 B/op     261384 allocs/op
PASS
ok      github.com/trufflesecurity/trufflehog/v3/pkg/sources/gitlab     67.355s
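To make the comparison concrete, the difference between the two runs above can be worked out from the reported Alloc totals and the enumerated project count. The sketch below only does that arithmetic; the overhead helper is a hypothetical name, and since each figure comes from a single benchmark iteration the result should be read as a rough estimate rather than a precise per-project cost.

```go
package main

import "fmt"

// overhead returns the absolute increase in bytes, the relative increase in
// percent, and the increase per project. It assumes after >= before.
func overhead(before, after, projects uint64) (delta uint64, pct, perProject float64) {
	delta = after - before
	pct = float64(delta) / float64(before) * 100
	perProject = float64(delta) / float64(projects)
	return delta, pct, perProject
}

func main() {
	// Figures taken from the two benchmark runs: Alloc before, Alloc after,
	// and the number of enumerated GitLab projects.
	delta, pct, per := overhead(43234992, 44666216, 2930)

	fmt.Printf("delta: %d bytes (%.2f MB)\n", delta, float64(delta)/1e6)
	fmt.Printf("relative increase: %.2f%%\n", pct)
	fmt.Printf("per project: %.0f bytes\n", per)
}
```

That works out to an increase of 1431224 bytes (about 1.43 MB), roughly 3.3% over the baseline, or a little under 500 bytes per project.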

It's important to note that this doesn't perform a full scan (I tried that first, but it would take hours), so I tweaked the code to expose the callback function and called it directly after enumerating the repos. This is the code used for benchmarking:

func BenchmarkMemoryUsage(b *testing.B) {
	// Enable memory allocation reporting
	b.ReportAllocs()

	// Run a specific function and measure memory usage
	var mStart, mEnd runtime.MemStats

	// Get memory stats before execution
	runtime.ReadMemStats(&mStart)

	connection := &sourcespb.GitLab{
		Endpoint: "https://gitlab.eclipse.org",
	}

	s := Source{}

	conn, err := anypb.New(connection)
	if err != nil {
		b.Fatal(err)
	}

	err = s.Init(context.Background(), "benchmark", 0, 0, false, conn, 100)
	if err != nil {
		// Fatalf stops the benchmark immediately, replacing Errorf + return.
		b.Fatalf("Source.Init() error = %v", err)
	}

	feature.UseSimplifiedGitlabEnumeration.Store(true)

	// Run the benchmark loop
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		reporter := &sourcestest.TestReporter{}
		err = s.Enumerate(context.Background(), reporter)
		if err != nil {
			// The call under test is Enumerate, so name it in the message.
			b.Fatalf("Source.Enumerate() error = %v", err)
		}

		for _, unit := range reporter.Units {
			repository, _ := unit.SourceUnitID()
			_ = s.cfg.SourceMetadataFunc("test file", "test email", "test commit", "test timestamp", repository, "test local path", 1)
		}
	}

	// Get memory stats after execution
	runtime.ReadMemStats(&mEnd)

	// Calculate the memory usage
	allocDiff := mEnd.Alloc - mStart.Alloc
	totalAllocDiff := mEnd.TotalAlloc - mStart.TotalAlloc
	sysDiff := mEnd.Sys - mStart.Sys

	// Log the memory stats
	fmt.Printf("Memory used during benchmark:\n")
	fmt.Printf("Allocated (Alloc): %d bytes\n", allocDiff)
	fmt.Printf("Total Allocated (TotalAlloc): %d bytes\n", totalAllocDiff)
	fmt.Printf("Memory obtained from system (Sys): %d bytes\n", sysDiff)
}

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@mustansir14 mustansir14 requested a review from a team December 11, 2025 14:32
@mustansir14 mustansir14 requested a review from a team as a code owner December 11, 2025 14:32