Storing records

Save records into a local SQLite store across calls, read them back with db query, and manage the page cache.

linkedin has no bulk crawler: no sitemap seed, no queue, no drain. To build a dataset, pass --save on the commands that support it, run them as many times as you like, and read the accumulated records back out with db query. Everything lands in one SQLite file under your data dir.

Saving records

profile, company, and jobs --hydrate all take --save, which upserts each record into the store. Run them across a session and the store fills up:

linkedin company microsoft --save
linkedin company github --save
linkedin profile williamhgates --save
linkedin jobs "golang engineer" --hydrate --save -n 50

Because it is an upsert, re-running a command updates the existing row rather than duplicating it, so you can refresh a record later by saving it again (add --refresh to force a fresh fetch first).
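To make the upsert behaviour concrete, here is a minimal sketch using Python's sqlite3 module. The table name, columns, and (kind, id) key are hypothetical, chosen only for illustration; the real store's schema is not described on this page. It assumes SQLite 3.24 or newer for ON CONFLICT.

```python
import json
import sqlite3

# Hypothetical schema for illustration; not the actual layout of linkedin.db.
SCHEMA = """
CREATE TABLE IF NOT EXISTS records (
    kind TEXT NOT NULL,
    id   TEXT NOT NULL,
    data TEXT NOT NULL,
    PRIMARY KEY (kind, id)
)
"""

# Insert a new row, or overwrite the data of the row with the same key.
UPSERT = """
INSERT INTO records (kind, id, data) VALUES (?, ?, ?)
ON CONFLICT (kind, id) DO UPDATE SET data = excluded.data
"""


def save(conn, kind, record_id, record):
    """Upsert one record, keyed by (kind, id)."""
    conn.execute(UPSERT, (kind, record_id, json.dumps(record)))
    conn.commit()


def count(conn):
    """Return the number of stored rows."""
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def load(conn, kind, record_id):
    """Return the stored record as a dict, or None if it is absent."""
    row = conn.execute(
        "SELECT data FROM records WHERE kind = ? AND id = ?", (kind, record_id)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Placeholder records: saving the same key twice leaves a single, updated row.
save(conn, "company", "example-co", {"note": "first save"})
save(conn, "company", "example-co", {"note": "second save"})
print(count(conn), load(conn, "company", "example-co"))
```

The second save replaces the first instead of adding a row, which is the property that makes re-running a --save command safe.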

Reading the store back

db works with the local SQLite store:

linkedin db path     # print the store file path
linkedin db count    # how many records are stored
linkedin db query    # read stored records back out

db query emits the stored records through the same formatter as everything else, so you can shape and pipe them:

linkedin db query --output jsonl | jq -r .name
linkedin db query --fields name,industry,employees
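If jq is not to hand, the JSONL output can be consumed from a short script. The sketch below assumes one JSON object per line (the usual JSONL convention) and uses the name, industry, and employees fields from the example above; the sample rows are placeholders, not real output, and the helper names are hypothetical.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-blank line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def project(records, fields):
    """Keep only the named fields; a field missing from a record becomes None."""
    for record in records:
        yield {name: record.get(name) for name in fields}


# Placeholder input standing in for the output of `linkedin db query --output jsonl`.
sample = io.StringIO(
    '{"name": "Example Co", "industry": "Software", "employees": 10, "extra": true}\n'
    "\n"
    '{"name": "Other Co"}\n'
)
for row in project(read_jsonl(sample), ["name", "industry", "employees"]):
    print(row)
```

In real use, replace the sample stream with sys.stdin and pipe linkedin db query --output jsonl into the script.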

A small dataset, end to end

Building a small dataset of a company and a batch of job postings looks like this:

linkedin company microsoft --posts --save
linkedin jobs "golang engineer" --location "United States" --hydrate --save -n 100
linkedin db query --output jsonl > dataset.jsonl

The page cache

Separate from the store, every fetch goes through an on-disk cache (content-addressed, gzip-compressed), so a repeat run does not re-fetch pages that have not changed. cache manages it:

linkedin cache path     # the cache directory path
linkedin cache info     # location, file count, size
linkedin cache clear    # remove every cached page

The cache TTL defaults to 24 hours. Bypass it for one run with --no-cache, or force a re-fetch with --refresh.
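For intuition, a gzip-compressed on-disk cache with a TTL can be sketched as follows. This is not the tool's implementation: keying files by a SHA-256 of the URL, using the file's modification time for expiry, and writing the fresh copy back on a refresh are all assumptions made for the example, and every function name here is hypothetical.

```python
import gzip
import hashlib
import os
import tempfile
import time

TTL_SECONDS = 24 * 60 * 60  # the 24-hour default mentioned above


def cache_file(cache_dir, url):
    """Map a URL to a file name via SHA-256 (assumed keying scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".gz")


def put(cache_dir, url, body):
    """Store the response body (bytes) gzip-compressed."""
    os.makedirs(cache_dir, exist_ok=True)
    with gzip.open(cache_file(cache_dir, url), "wb") as fh:
        fh.write(body)


def get(cache_dir, url, ttl=TTL_SECONDS, now=None):
    """Return the cached body, or None if it is absent or older than ttl seconds."""
    path = cache_file(cache_dir, url)
    if not os.path.exists(path):
        return None
    now = time.time() if now is None else now
    if now - os.path.getmtime(path) > ttl:
        return None
    with gzip.open(path, "rb") as fh:
        return fh.read()


def fetch(cache_dir, url, download, refresh=False):
    """Serve from the cache unless refresh is set; store whatever gets downloaded."""
    if not refresh:
        cached = get(cache_dir, url)
        if cached is not None:
            return cached
    body = download(url)
    put(cache_dir, url, body)
    return body


with tempfile.TemporaryDirectory() as cache_dir:
    fake_download = lambda url: b"<html>placeholder page</html>"
    fetch(cache_dir, "https://example.com/page", fake_download)  # downloads and stores
    print(get(cache_dir, "https://example.com/page") is not None)
```

The refresh parameter mirrors the idea behind --refresh: skip the cached copy, fetch again, and keep the new result for later runs.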

Where state lives

The store and cache both live under the data dir; point that elsewhere with --data-dir or LINKEDIN_DATA_DIR. The store file is fixed at <data-dir>/linkedin.db, so to keep one corpus per project, point --data-dir at a per-project directory. See configuration.
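The per-project layout can be sketched as a small path resolver. The precedence shown (flag first, then LINKEDIN_DATA_DIR, then a default) is an assumption based on common CLI convention and is not confirmed here; the default directory is passed in because this page does not state it, and the function names are hypothetical. Only the linkedin.db file name comes from the text above.

```python
import os
from pathlib import Path


def resolve_data_dir(flag_value, env, default_dir):
    """Pick the data dir: --data-dir value, then LINKEDIN_DATA_DIR, then the default.

    The order is an assumption for this sketch, not documented behaviour.
    """
    if flag_value:
        return Path(flag_value)
    if env.get("LINKEDIN_DATA_DIR"):
        return Path(env["LINKEDIN_DATA_DIR"])
    return Path(default_dir)


def store_path(data_dir):
    """The store file is fixed at <data-dir>/linkedin.db."""
    return Path(data_dir) / "linkedin.db"


# One corpus per project: give each project its own data dir.
# "projects/acme/data" and "default-data" are placeholder paths.
project_dir = resolve_data_dir("projects/acme/data", os.environ, "default-data")
print(store_path(project_dir))
```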