Docker_Loki

A test bed for evaluating Loki for LT Operations. Not a mature installation.

This Docker contains Loki (queryable log database) and Alloy (ingestion agent). Typically these would be run in a single Docker Compose together with Grafana, since they are all interdependent, but for testing and evaluation I have left Grafana running in its own environment (https://github.com/LivTel/Docker_SDBgrafana).

The Grafana docs include an example of merging the entire stack into a single compose file: see https://grafana.com/docs/loki/latest/setup/install/docker/ (`wget https://raw.githubusercontent.com/grafana/loki/v3.4.1/production/docker-compose.yaml -O docker-compose.yaml`). We are not using that here.

Uses the official Docker images of Loki and Alloy from Grafana.

The only things here are the compose file and a single-server config for each of Loki and Alloy. The compose mounts the external NAS and writes all data there. That means all Loki databases persist externally and you can stop and start this Docker whenever you like without loss of data.
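For illustration, the relevant shape of the compose file is roughly as below. Service names, image tags, config filenames and container-side paths here are assumptions; the authoritative file is compose.yml in this repo.

```yaml
services:
  loki:
    image: grafana/loki:latest            # official Grafana image
    ports:
      - "3100:3100"
    volumes:
      # single-server config plus persistent storage on the external NAS
      - ./loki-config.yaml:/etc/loki/local-config.yaml:ro
      - /mnt/newarchive1/Dockershare/Docker_Loki/loki:/loki

  alloy:
    image: grafana/alloy:latest           # official Grafana image
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy:ro
      # ingestion directory: drop unzipped vlm files in here
      - /mnt/newarchive1/Dockershare/Docker_Loki/vlm:/vlm:ro
```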

Instructions

```shell
git clone https://github.com/LivTel/Docker_Loki
cd Docker_Loki
docker compose up -d
```

Alloy will immediately start ingesting anything it can find in the directory specified in config.alloy.
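As a sketch, the tail-and-push pipeline in config.alloy looks something like the following. The component labels, the container-side `/vlm` path and the push URL are assumptions; the real config.alloy in this repo is authoritative.

```river
// Watch the ingestion directory for vlm files.
local.file_match "vlm" {
  path_targets = [{ __path__ = "/vlm/*" }]
}

// Tail matched files and forward each line onwards.
loki.source.file "vlm" {
  targets    = local.file_match.vlm.targets
  forward_to = [loki.write.default.receiver]
}

// Push to the Loki container started by the same compose.
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```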

This was developed on ltvmhost5. If you start it on some other host, further configuration may be required, for example:

  • adding the eng user to the docker group for permission to run Docker
  • configuring a reverse proxy in the front-end web server
  • mounting the external RAID for the binds in compose.yml

More information about how ltvmhost5 was configured to run Docker is on the GrafanaSDB wiki page.

External storage mount points are defined in compose.yml.

Ingesting more data

The location where Alloy looks for log files to ingest is defined in config.alloy. Currently that is the big external RAID NAS, which is not where the vlm files are stored. You need to manually copy the vlm files that you want ingested over from sdbserver.

Tests suggest it is better to drip-feed the vlm files into that ingestion directory one per minute, in time-sequential order. Loki/Alloy can ingest many files in parallel in a batch, but at some point it chokes and starts dropping messages. When I ingest a year's worth (365) of vlms at once, the Loki logs are full of errors. When I ingest one at a time, there are few or none.
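A minimal sketch of such a drip feed, assuming the vlm filenames sort into time order; the staging and ingestion paths in the example are illustrative:

```shell
#!/bin/sh
# drip_feed SRC DEST [DELAY]: copy files from SRC into DEST one at a
# time, in sorted (time-sequential) filename order, pausing between
# copies so Loki/Alloy can drain each file before the next arrives.
drip_feed() {
    src=$1; dest=$2; delay=${3:-60}
    for f in $(ls "$src" | sort); do
        cp "$src/$f" "$dest/"
        sleep "$delay"
    done
}

# Example (paths are illustrative):
# drip_feed /tmp/vlm_staging /mnt/newarchive1/Dockershare/Docker_Loki/vlm
```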

Ingesting 2022 as a single file, with the individual vlm files cat-ed together, did not work either: it mostly worked really well, but left a months-long gap in the middle of the year. Odd. It needs trying again with a different year.

Deleting

There are docs for how to do it (https://grafana.com/docs/loki/latest/reference/loki-http-api/#request-log-deletion), but I have not yet got it to work. On the other hand, re-ingesting takes about one minute per vlm, so it only takes a few hours to delete the entire /mnt/newarchive1/Dockershare/Docker_Loki and re-ingest them all.
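For reference, a deletion request per those docs would look roughly like this. The Loki URL and the `{job="vlm"}` selector are assumptions, and deletion has to be enabled in the compactor config, which may be why it has not worked here yet.

```shell
#!/bin/sh
# Sketch of a log-deletion request against Loki's HTTP API.
# -G puts the parameters in the query string while -X POST keeps the
# method the delete endpoint requires.
LOKI_URL=${LOKI_URL:-http://localhost:3100}
loki_delete() {
    curl -G -X POST "$LOKI_URL/loki/api/v1/delete" \
         --data-urlencode "query=$1" \
         --data-urlencode "start=$2" \
         --data-urlencode "end=$3"
}

# Example: delete everything labelled job="vlm" for calendar year 2022.
# loki_delete '{job="vlm"}' 2022-01-01T00:00:00Z 2023-01-01T00:00:00Z
```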

Still to be done

  • Automatically scrape new vlm as they appear on sdbserver
  • Automate 'this year' in the timestamp. Does Alloy default to now if we do not specify a year?

Notes on things that do and do not work

Ingestion:

Not yet automated for daily ingestion. To ingest a vlm, just unzip it into /mnt/newarchive1/Dockershare/Docker_Loki/vlm.

Possible Missing Data:

I ingested 2.6e6 lines of var-log-messages log files, but the Loki database appears to hold fewer than that, so some lines seem to be getting lost. That will need investigation to find out which log lines are not getting ingested. This seems to be the ingester choking when I give it a whole year's worth to ingest at once; I think a random subset of messages has been lost. Everything probably needs re-ingesting in an ordered manner. See "Ingesting more data" above.

Repeats:

My current parser does not handle these properly. It thinks the software process name is "last" and does not obviously flag to the user that this message has appeared hundreds of times. All you will see in Loki is the first instance, which gets parsed properly. This is rather unfortunate, and I do not yet have an elegant fix for it.

```
May 8 20:14:34 mcc.lt.com Ept: <1e0020> More than 1 bit is set, state of OID 0x59 cannot be determined.
May 8 20:15:05 mcc.lt.com last message repeated 31 times
May 8 20:17:06 mcc.lt.com last message repeated 121 times
May 8 20:22:52 mcc.lt.com last message repeated 346 times
```

Process PIDs:

A small number of processes log both their name and their process ID (PID). That is a problem because Loki sees every instance of the process as a different label: instead of seeing all sshd processes as sshd, it sees thousands of different sshd processes, and every time anyone logs into the computer it creates a new label. I think it is more useful to lump all the sshd messages together under a single "sshd" label. I have 'fixed' this by ignoring the PID number, i.e., all sshd messages are being stored in Loki as "sshd[xxx]" and the PID number has been completely lost. I think that is probably going to be OK. I doubt those PID numbers will ever be useful to us in the current context, but be aware I have discarded that particular datum. This affects sshd, bootp, telnetd, ftpd, inetd and ntpd messages.

```
Jun 13 14:02:02 node<<1>> sshd[10722]: log: Connection from 192.168.1.30 port 57168
Jun 13 14:02:02 node<<1>> sshd[10722]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[10725]: log: executing remote command as user maintain
Jun 13 14:02:02 node<<1>> sshd[10722]: log: Closing connection to 192.168.1.30
Jun 13 14:02:02 node<<1>> sshd[10727]: log: Connection from 192.168.1.30 port 57169
Jun 13 14:02:02 node<<1>> sshd[10727]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[10730]: log: executing remote command as user maintain
```

but in Loki, we have

```
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: Connection from 192.168.1.30 port 57168
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: executing remote command as user maintain
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: Closing connection to 192.168.1.30
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: Connection from 192.168.1.30 port 57169
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: executing remote command as user maintain
```
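This kind of PID stripping can be expressed as a replace stage in the Alloy pipeline. The sketch below assumes lines pass through a `loki.process` component; the component labels, the forward target and the exact regex are illustrative, not the config actually in this repo.

```river
loki.process "vlm" {
  forward_to = [loki.write.default.receiver]

  // Replace the digits inside e.g. "sshd[10722]" with "xxx", so all
  // instances of a daemon collapse onto one value, as described above.
  stage.replace {
    expression = "(?:sshd|bootp|telnetd|ftpd|inetd|ntpd)\\[(\\d+)\\]"
    replace    = "xxx"
  }
}
```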

New Year:

Overnight on the night of 31st Dec / New Year, the logs are not correctly ingested. If you look at the log format, there is no year in the timestamp! I know how to fix this, but it is only that one night. I will fix it at some point, but for now just do not use 31st Dec.
