Docker_Loki

A test bed for evaluating Loki for LT Operations. Not a mature installation.

This Docker contains Loki (queryable log database) and Alloy (ingestion agent). Typically these would be run in a single Docker Compose together with Grafana, since they are all interdependent, but for testing and evaluation I have left Grafana running in its own environment (https://github.com/LivTel/Docker_SDBgrafana).

The Grafana docs include an example of merging the entire stack into a single compose file: see https://grafana.com/docs/loki/latest/setup/install/docker/ (`wget https://raw.githubusercontent.com/grafana/loki/v3.4.1/production/docker-compose.yaml -O docker-compose.yaml`). We are not using that here.

Uses the official Docker images of Loki and Alloy from Grafana.

The only things here are the compose file and a single-server config for each of Loki and Alloy. The compose mounts the external NAS and writes all data there. That means all Loki databases persist externally and you can stop and start this Docker whenever you like without loss of data.
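For illustration, the relevant shape of the compose file is roughly as below. Service names, image tags, config filenames and container-side paths here are assumptions; the authoritative file is compose.yml in this repo.

```yaml
services:
  loki:
    image: grafana/loki:latest            # official Grafana image
    ports:
      - "3100:3100"
    volumes:
      # single-server config plus persistent storage on the external NAS
      - ./loki-config.yaml:/etc/loki/local-config.yaml:ro
      - /mnt/newarchive1/Dockershare/Docker_Loki/loki:/loki

  alloy:
    image: grafana/alloy:latest           # official Grafana image
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy:ro
      # ingestion directory: drop unzipped vlm files in here
      - /mnt/newarchive1/Dockershare/Docker_Loki/vlm:/vlm:ro
```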

Instructions

```shell
git clone https://github.com/LivTel/Docker_Loki
cd Docker_Loki
docker compose up -d
```

Alloy will immediately start ingesting anything it can find in the directory specified in config.alloy.
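As a sketch, the tail-and-push pipeline in config.alloy looks something like the following. The component labels, the container-side `/vlm` path and the push URL are assumptions; the real config.alloy in this repo is authoritative.

```river
// Watch the ingestion directory for vlm files.
local.file_match "vlm" {
  path_targets = [{ __path__ = "/vlm/*" }]
}

// Tail matched files and forward each line onwards.
loki.source.file "vlm" {
  targets    = local.file_match.vlm.targets
  forward_to = [loki.write.default.receiver]
}

// Push to the Loki container started by the same compose.
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```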

This was developed on ltvmhost5. If you start it on some other host, further configuration may be required, for example:

  • adding the eng user to the docker group for permission to run Docker
  • configuring a reverse proxy in the front-end web server
  • mounting the external RAID for the binds in compose.yml

More information about how ltvmhost5 was configured to run Docker is on the GrafanaSDB wiki page.

External storage mount points are defined in compose.yml.

Ingesting more data

The location where Alloy looks for log files to ingest is defined in config.alloy. Currently that is the big external RAID NAS, which is not where the vlm files are stored. You need to manually copy the vlm files that you want ingested over from sdbserver.

Tests suggest it is better to drip-feed the vlm files into that ingestion directory one per minute, in time-sequential order. Loki/Alloy can ingest many files in parallel in a batch, but at some point it chokes and starts dropping messages. When I ingest a year's worth (365) of vlms at once, the Loki logs are full of errors. When I ingest one at a time, there are few or none.
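A minimal sketch of such a drip feed, assuming the vlm filenames sort into time order; the staging and ingestion paths in the example are illustrative:

```shell
#!/bin/sh
# drip_feed SRC DEST [DELAY]: copy files from SRC into DEST one at a
# time, in sorted (time-sequential) filename order, pausing between
# copies so Loki/Alloy can drain each file before the next arrives.
drip_feed() {
    src=$1; dest=$2; delay=${3:-60}
    for f in $(ls "$src" | sort); do
        cp "$src/$f" "$dest/"
        sleep "$delay"
    done
}

# Example (paths are illustrative):
# drip_feed /tmp/vlm_staging /mnt/newarchive1/Dockershare/Docker_Loki/vlm
```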

Ingesting 2022 as a single file, with the individual vlm files cat-ed together, did not work either: it mostly worked really well, but left a months-long gap in the middle of the year. Odd. It needs trying again with a different year.

Deleting

There are docs for how to do it (https://grafana.com/docs/loki/latest/reference/loki-http-api/#request-log-deletion), but I have not yet got it to work. On the other hand, re-ingesting takes about one minute per vlm, so it only takes a few hours to delete the entire /mnt/newarchive1/Dockershare/Docker_Loki and re-ingest them all.
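For reference, a deletion request per those docs would look roughly like this. The Loki URL and the `{job="vlm"}` selector are assumptions, and deletion has to be enabled in the compactor config, which may be why it has not worked here yet.

```shell
#!/bin/sh
# Sketch of a log-deletion request against Loki's HTTP API.
# -G puts the parameters in the query string while -X POST keeps the
# method the delete endpoint requires.
LOKI_URL=${LOKI_URL:-http://localhost:3100}
loki_delete() {
    curl -G -X POST "$LOKI_URL/loki/api/v1/delete" \
         --data-urlencode "query=$1" \
         --data-urlencode "start=$2" \
         --data-urlencode "end=$3"
}

# Example: delete everything labelled job="vlm" for calendar year 2022.
# loki_delete '{job="vlm"}' 2022-01-01T00:00:00Z 2023-01-01T00:00:00Z
```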

Still to be done

  • Automatically scrape new vlm as they appear on sdbserver
  • Automate 'this year' in the timestamp. Does Alloy default to now if we do not specify a year?

Notes on things that do and do not work

Ingestion:

Not yet automated for daily ingestion. To ingest a vlm, just unzip it into /mnt/newarchive1/Dockershare/Docker_Loki/vlm.

Possible Missing Data:

I ingested 2.6e6 lines of var-log-messages log files, but the Loki database appears to hold fewer than that, so some lines seem to be getting lost. That will need investigation to find out which log lines are not getting ingested. This seems to be the ingester choking when I give it a whole year's worth to ingest at once; I think a random subset of messages has been lost. Everything probably needs re-ingesting in an ordered manner. See "Ingesting more data" above.

Repeats:

My current parser does not handle these properly. It thinks the software process name is "last" and does not obviously flag to the user that this message has appeared hundreds of times. All you will see in Loki is the first instance, which gets parsed properly. This is rather unfortunate, and I do not yet have an elegant fix for it.

```
May 8 20:14:34 mcc.lt.com Ept: <1e0020> More than 1 bit is set, state of OID 0x59 cannot be determined.
May 8 20:15:05 mcc.lt.com last message repeated 31 times
May 8 20:17:06 mcc.lt.com last message repeated 121 times
May 8 20:22:52 mcc.lt.com last message repeated 346 times
```

Process PIDs:

A small number of processes log both their name and their process ID (PID). That is a problem because Loki sees every instance of the process as a different label: instead of seeing all sshd processes as sshd, it sees thousands of different sshd processes, and every time anyone logs into the computer it creates a new label. I think it is more useful to lump all the sshd messages together under a single "sshd" label. I have 'fixed' this by ignoring the PID number, i.e., all sshd messages are being stored in Loki as "sshd[xxx]" and the PID number has been completely lost. I think that is probably going to be OK. I doubt those PID numbers will ever be useful to us in the current context, but be aware I have discarded that particular datum. This affects sshd, bootp, telnetd, ftpd, inetd and ntpd messages.

```
Jun 13 14:02:02 node<<1>> sshd[10722]: log: Connection from 192.168.1.30 port 57168
Jun 13 14:02:02 node<<1>> sshd[10722]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[10725]: log: executing remote command as user maintain
Jun 13 14:02:02 node<<1>> sshd[10722]: log: Closing connection to 192.168.1.30
Jun 13 14:02:02 node<<1>> sshd[10727]: log: Connection from 192.168.1.30 port 57169
Jun 13 14:02:02 node<<1>> sshd[10727]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[10730]: log: executing remote command as user maintain
```

but in Loki, we have

```
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: Connection from 192.168.1.30 port 57168
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: executing remote command as user maintain
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: Closing connection to 192.168.1.30
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: Connection from 192.168.1.30 port 57169
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: RSA authentication for maintain accepted.
Jun 13 14:02:02 node<<1>> sshd[xxx]: log: executing remote command as user maintain
```
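This kind of PID stripping can be expressed as a replace stage in the Alloy pipeline. The sketch below assumes lines pass through a `loki.process` component; the component labels, the forward target and the exact regex are illustrative, not the config actually in this repo.

```river
loki.process "vlm" {
  forward_to = [loki.write.default.receiver]

  // Replace the digits inside e.g. "sshd[10722]" with "xxx", so all
  // instances of a daemon collapse onto one value, as described above.
  stage.replace {
    expression = "(?:sshd|bootp|telnetd|ftpd|inetd|ntpd)\\[(\\d+)\\]"
    replace    = "xxx"
  }
}
```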

New Year:

Overnight on the night of 31st Dec / New Year, the logs are not correctly ingested. If you look at the log format, there is no year in the timestamp! I know how to fix this, but it is only that one night. I will fix it at some point, but for now just do not use 31st Dec.
